How do I troubleshoot common jsoup errors?

Jsoup is a powerful Java library for parsing and manipulating HTML documents. While the library is robust and reliable, developers may still encounter a variety of errors when working with it. This guide covers the most common issues and provides practical solutions to help you troubleshoot them effectively.

Connection-Related Errors

1. Connection Timeouts (SocketTimeoutException)

Connection timeouts occur when the server takes too long to respond or when network connectivity is poor.

Error Message: java.net.SocketTimeoutException: Read timed out

Solutions:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.net.SocketTimeoutException;

// Basic timeout configuration
try {
    Document doc = Jsoup.connect("https://example.com")
        .timeout(30 * 1000) // 30 seconds timeout
        .get();
} catch (SocketTimeoutException e) {
    System.err.println("Connection timed out: " + e.getMessage());
    // Implement retry logic
}

// Advanced connection configuration
Document doc = Jsoup.connect("https://example.com")
    .timeout(15000)           // 15 second timeout
    .userAgent("Mozilla/5.0") // Set user agent
    .followRedirects(true)    // Follow redirects
    .maxBodySize(0)          // No limit on body size
    .get();

2. HTTP Status Errors (HttpStatusException)

Servers return HTTP error codes when requests fail or are rejected.

Common Error Codes:

  • 403 Forbidden
  • 404 Not Found
  • 429 Too Many Requests
  • 500 Internal Server Error

Solutions:

import org.jsoup.HttpStatusException;

try {
    Document doc = Jsoup.connect("https://example.com/page").get();
} catch (HttpStatusException e) {
    int statusCode = e.getStatusCode();
    String statusMessage = e.getMessage();

    switch (statusCode) {
        case 403:
            System.err.println("Access forbidden. Try different user agent or headers.");
            break;
        case 404:
            System.err.println("Page not found: " + e.getUrl());
            break;
        case 429:
            System.err.println("Rate limited. Implement delays between requests.");
            break;
        default:
            System.err.println("HTTP Error " + statusCode + ": " + statusMessage);
    }
}

// Ignore HTTP errors and process response anyway
Document doc = Jsoup.connect("https://example.com")
    .ignoreHttpErrors(true)
    .get();

3. SSL/TLS Certificate Issues

SSL handshake failures occur with invalid or self-signed certificates.

Error Message: javax.net.ssl.SSLHandshakeException

Solutions:

// Disable SSL validation (development only; this method was
// removed in jsoup 1.12.1, so it only works on 1.11.x and earlier)
Document doc = Jsoup.connect("https://self-signed-example.com")
    .validateTLSCertificates(false)
    .get();

// On current jsoup versions, supply a custom SSLSocketFactory that
// trusts your certificate via Connection.sslSocketFactory(...) instead.

// Production-safe approach with custom SSL context
System.setProperty("javax.net.ssl.trustStore", "/path/to/truststore.jks");
System.setProperty("javax.net.ssl.trustStorePassword", "password");

Document doc = Jsoup.connect("https://example.com").get();

Parsing and Data Extraction Errors

4. Malformed HTML Parsing Issues

Jsoup handles most malformed HTML, but extreme cases may cause issues.

Solutions:

import org.jsoup.parser.Parser;

// Use XML parser for strict parsing
Document doc = Jsoup.connect("https://example.com")
    .parser(Parser.xmlParser())
    .get();

// Use HTML parser for lenient parsing (default)
Document doc = Jsoup.connect("https://example.com")
    .parser(Parser.htmlParser())
    .get();

// Parse HTML string with error recovery
String malformedHtml = "<div><p>Unclosed paragraph<div>Another div</div>";
Document doc = Jsoup.parse(malformedHtml);

5. CSS Selector Syntax Errors

Invalid CSS selectors throw Selector.SelectorParseException.

Common Issues:

  • Invalid pseudo-selectors
  • Malformed attribute selectors
  • Unsupported CSS3 selectors

Solutions:

import org.jsoup.select.Selector;

try {
    // Valid selectors
    Elements divs = doc.select("div.content");
    Elements links = doc.select("a[href^=https]");
    Elements items = doc.select("ul > li:nth-child(odd)");

    // Note: an invalid selector throws Selector.SelectorParseException
    // (caught below) rather than returning an empty Elements
    Elements parsed = Selector.select("div.invalid::pseudo", doc);
} catch (Selector.SelectorParseException e) {
    System.err.println("Invalid selector syntax: " + e.getMessage());

    // Fallback to simpler selector
    Elements fallback = doc.select("div");
}

// Debug selectors by testing incrementally
Elements test1 = doc.select("div");
Elements test2 = doc.select("div.content");
Elements test3 = doc.select("div.content > p");

6. Element Not Found or Empty Results

When selectors return no elements, the issue may be dynamic content or incorrect selectors.

Debugging Steps:

// Check if document loaded properly
if (doc == null || doc.html().isEmpty()) {
    System.err.println("Document is empty or null");
    return;
}

// Debug element selection
Elements target = doc.select("div.content");
if (target.isEmpty()) {
    // Try broader selectors
    Elements allDivs = doc.select("div");
    System.out.println("Found " + allDivs.size() + " div elements");

    // Print document structure for debugging
    System.out.println("Document title: " + doc.title());
    System.out.println("Body content preview: " + 
        doc.body().text().substring(0, Math.min(200, doc.body().text().length())));
}

// Check for JavaScript-rendered content
if (doc.select("noscript").size() > 0) {
    System.out.println("Page may require JavaScript rendering");
}
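The noscript check above is only a hint. As a jsoup-independent sketch (the 50% threshold is an illustrative assumption, not a rule), you can also estimate how much of the raw markup is script rather than visible content:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RenderHeuristic {
    private static final Pattern SCRIPT =
        Pattern.compile("<script\\b[^>]*>.*?</script>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns true when script tags dominate the raw markup - a hint
    // (not proof) that the page renders its content client-side.
    public static boolean looksJsRendered(String rawHtml) {
        Matcher m = SCRIPT.matcher(rawHtml);
        int scriptChars = 0;
        while (m.find()) {
            scriptChars += m.end() - m.start();
        }
        return rawHtml.length() > 0 && (double) scriptChars / rawHtml.length() > 0.5;
    }

    public static void main(String[] args) {
        String heavy = "<html><body><div id=app></div>"
            + "<script>var data = \"" + "x".repeat(500) + "\";</script></body></html>";
        String light = "<html><body><p>Plain server-rendered article text.</p></body></html>";
        System.out.println(looksJsRendered(heavy)); // true
        System.out.println(looksJsRendered(light)); // false
    }
}
```

If the heuristic fires, a headless browser (rather than jsoup alone) is usually needed to see the real content.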

Memory and Performance Issues

7. OutOfMemoryError

Large documents can cause memory issues.

Solutions:

// Limit body size
Document doc = Jsoup.connect("https://large-site.com")
    .maxBodySize(1024 * 1024) // 1MB limit
    .get();

// Process documents in streaming fashion
Connection connection = Jsoup.connect("https://example.com");
Connection.Response response = connection.execute();

if (response.statusCode() == 200) {
    // Parse only specific parts
    Document doc = response.parse();
    Elements importantParts = doc.select("div.content");

    // Clear unnecessary elements
    doc.select("script, style, img").remove();
}

JVM Configuration:

# Increase heap size
java -Xmx2048m -XX:+UseG1GC YourApplication

# Monitor GC activity (JDK 8 flags; on JDK 9+ use -Xlog:gc* instead)
java -XX:+PrintGCDetails -XX:+PrintGCTimeStamps YourApplication
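Alongside the GC flags, a quick in-process check with Runtime (standard JDK, no jsoup required; the 80% threshold is an illustrative assumption) can flag heap pressure before you parse another large document:

```java
public class HeapCheck {
    // Fraction of the maximum heap currently in use, in [0.0, 1.0].
    public static double heapUsage() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return (double) used / rt.maxMemory();
    }

    public static void main(String[] args) {
        double usage = heapUsage();
        System.out.printf("Heap usage: %.1f%%%n", usage * 100);
        if (usage > 0.8) {
            System.out.println("Heap nearly full - consider maxBodySize limits");
        }
    }
}
```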

Advanced Troubleshooting Techniques

8. Request Blocking and Anti-Bot Measures

Modern websites implement sophisticated blocking mechanisms.

Solutions:

// Realistic browser simulation
Document doc = Jsoup.connect("https://protected-site.com")
    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
    .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
    .header("Accept-Language", "en-US,en;q=0.5")
    .header("Accept-Encoding", "gzip, deflate")
    .header("Connection", "keep-alive")
    .referrer("https://google.com")
    .get();

// Implement delays between requests
Thread.sleep(2000 + (int)(Math.random() * 3000)); // 2-5 second delay

// Session management with cookies
Map<String, String> cookies = new HashMap<>();
Connection.Response loginResponse = Jsoup.connect("https://site.com/login")
    .data("username", "user")
    .data("password", "pass")
    .method(Connection.Method.POST)
    .execute();

cookies.putAll(loginResponse.cookies());

Document protectedPage = Jsoup.connect("https://site.com/protected")
    .cookies(cookies)
    .get();

9. Encoding and Character Set Issues

Handle different character encodings properly.

Solutions:

// Specify encoding explicitly
Document doc = Jsoup.connect("https://international-site.com")
    .header("Accept-Charset", "UTF-8")
    .get();

// Parse raw bytes or a stream with an explicit charset
// (a plain String can go straight to Jsoup.parse(html, baseUri))
String html = "..."; // HTML content as string
Document doc = Jsoup.parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)),
    "UTF-8", "https://example.com");

// Inspect the charset jsoup detected; parse() applies it automatically
Connection.Response response = Jsoup.connect("https://example.com").execute();
String charset = response.charset();
if (charset == null) {
    charset = "UTF-8"; // fallback for logging purposes
}
System.out.println("Detected charset: " + charset);
Document doc = response.parse(); // decodes using the detected charset
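That null check can be factored into a small helper that also verifies the JVM actually supports the declared name (pure JDK, no jsoup required; the charset names in main are just examples):

```java
import java.nio.charset.Charset;

public class CharsetResolver {
    // Returns the declared charset if the JVM supports it, else UTF-8.
    // Assumes the declared name is syntactically legal (letters, digits,
    // hyphens); Charset.isSupported throws on illegal names.
    public static Charset resolve(String declared) {
        if (declared != null && Charset.isSupported(declared)) {
            return Charset.forName(declared);
        }
        return Charset.forName("UTF-8");
    }

    public static void main(String[] args) {
        System.out.println(resolve("ISO-8859-1"));    // ISO-8859-1
        System.out.println(resolve(null));            // UTF-8
        System.out.println(resolve("no-such-charset")); // UTF-8
    }
}
```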

Error Handling Best Practices

Comprehensive Error Handling Template

import org.jsoup.*;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.net.SocketTimeoutException;

public class RobustJsoupScraper {

    public Document fetchDocument(String url, int maxRetries) {
        int attempts = 0;

        while (attempts < maxRetries) {
            try {
                return Jsoup.connect(url)
                    .timeout(30000)
                    .userAgent("Mozilla/5.0 (compatible; JavaScraper/1.0)")
                    .header("Accept", "text/html,application/xhtml+xml")
                    .followRedirects(true)
                    .ignoreHttpErrors(false)
                    .get();

            } catch (SocketTimeoutException e) {
                System.err.println("Timeout on attempt " + (attempts + 1) + ": " + e.getMessage());

            } catch (HttpStatusException e) {
                if (e.getStatusCode() == 429) {
                    // Rate limited - wait longer
                    try {
                        Thread.sleep(5000 * (attempts + 1));
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        break;
                    }
                } else {
                    System.err.println("HTTP error " + e.getStatusCode() + ": " + e.getMessage());
                    break; // Don't retry on other HTTP errors
                }

            } catch (IOException e) {
                System.err.println("Connection error: " + e.getMessage());

            } catch (Exception e) {
                System.err.println("Unexpected error: " + e.getMessage());
                e.printStackTrace();
                break;
            }

            attempts++;

            // Exponential backoff
            try {
                Thread.sleep(1000 * attempts);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }

        return null; // All attempts failed
    }
}

Debugging Tools and Techniques

Enable Detailed Logging

// jsoup has no built-in debug flag, so wrap calls with your own logging
Logger logger = LoggerFactory.getLogger(YourClass.class); // SLF4J

try {
    Document doc = Jsoup.connect(url).get();
    logger.info("Successfully fetched: {} (size: {} chars)", url, doc.html().length());
} catch (Exception e) {
    logger.error("Failed to fetch: {}", url, e);
}

Document Analysis

// Analyze document structure
public void analyzeDocument(Document doc) {
    System.out.println("Title: " + doc.title());
    System.out.println("Base URI: " + doc.baseUri());
    System.out.println("Total elements: " + doc.getAllElements().size());
    System.out.println("Scripts: " + doc.select("script").size());
    System.out.println("Stylesheets: " + doc.select("link[rel=stylesheet]").size());
    System.out.println("Forms: " + doc.select("form").size());

    // Check for common frameworks
    if (doc.select("[ng-app], [data-ng-app]").size() > 0) {
        System.out.println("Angular detected - content may be dynamic");
    }
    if (doc.select("[id^=react], [data-react]").size() > 0) {
        System.out.println("React detected - content may be dynamic");
    }
}

Prevention and Best Practices

  1. Always handle exceptions appropriately for your use case
  2. Implement retry logic with exponential backoff for transient failures
  3. Respect robots.txt and website terms of service
  4. Use appropriate delays between requests to avoid overwhelming servers
  5. Monitor your scraping for changes in website structure
  6. Keep jsoup updated to benefit from bug fixes and improvements
  7. Test selectors thoroughly before deploying to production
  8. Implement logging to track errors and performance
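
The backoff in item 2 can be isolated into a small, testable helper. The base delay and cap below are illustrative assumptions; "full jitter" (a uniformly random delay up to the exponential ceiling) helps keep many concurrent clients from retrying in lockstep:

```java
import java.util.concurrent.ThreadLocalRandom;

public class Backoff {
    private static final long BASE_MS = 1_000;  // illustrative base delay
    private static final long CAP_MS = 30_000;  // illustrative ceiling

    // Full-jitter exponential backoff: a random delay in
    // [0, min(cap, base * 2^attempt)], with attempt starting at 0.
    public static long delayMillis(int attempt) {
        long exp = BASE_MS * (1L << Math.min(attempt, 20)); // clamp to avoid overflow
        long ceiling = Math.min(CAP_MS, exp);
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 5; attempt++) {
            System.out.println("Attempt " + attempt + ": sleep " + delayMillis(attempt) + " ms");
        }
    }
}
```

Call Thread.sleep(Backoff.delayMillis(attempt)) between retries instead of the fixed 1000 * attempts used in the template above.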

By following these troubleshooting techniques and best practices, you'll be able to handle most jsoup errors effectively and build robust web scraping applications.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
