jsoup is a Java library for parsing and manipulating HTML documents. It is robust, but fetching and parsing real-world pages still produces a recurring set of errors. This guide covers the most common ones and shows how to diagnose and fix each.
Connection-Related Errors
1. Connection Timeouts (SocketTimeoutException)
Connection timeouts occur when the server takes too long to respond or when network connectivity is poor.
Error Message: java.net.SocketTimeoutException: Read timed out
Solutions:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.net.SocketTimeoutException;

// Basic timeout configuration
try {
    Document doc = Jsoup.connect("https://example.com")
        .timeout(30 * 1000) // 30-second timeout
        .get();
} catch (SocketTimeoutException e) {
    System.err.println("Connection timed out: " + e.getMessage());
    // Implement retry logic
} catch (IOException e) {
    System.err.println("Request failed: " + e.getMessage());
}

// Advanced connection configuration
Document doc = Jsoup.connect("https://example.com")
    .timeout(15000) // 15-second timeout
    .userAgent("Mozilla/5.0") // Set user agent
    .followRedirects(true) // Follow redirects (the default)
    .maxBodySize(0) // No limit on body size
    .get();
2. HTTP Status Errors (HttpStatusException)
jsoup throws HttpStatusException when the server answers with an HTTP error status (such as 4xx or 5xx) and ignoreHttpErrors is left at its default of false.
Common Error Codes:
- 403 Forbidden
- 404 Not Found
- 429 Too Many Requests
- 500 Internal Server Error
Solutions:
import org.jsoup.HttpStatusException;

try {
    Document doc = Jsoup.connect("https://example.com/page").get();
} catch (HttpStatusException e) {
    int statusCode = e.getStatusCode();
    String statusMessage = e.getMessage();
    switch (statusCode) {
        case 403:
            System.err.println("Access forbidden. Try different user agent or headers.");
            break;
        case 404:
            System.err.println("Page not found: " + e.getUrl());
            break;
        case 429:
            System.err.println("Rate limited. Implement delays between requests.");
            break;
        default:
            System.err.println("HTTP Error " + statusCode + ": " + statusMessage);
    }
}

// Ignore HTTP errors and process response anyway
Document doc = Jsoup.connect("https://example.com")
    .ignoreHttpErrors(true)
    .get();
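The error codes listed in this section do not all deserve the same reaction: some are worth retrying after a delay, while others will fail again no matter how often the request is repeated. The sketch below is one possible way to encode that decision using only the JDK. The class name `StatusCodePolicy` and the particular set of retryable codes are assumptions made for illustration, not part of jsoup; adjust them to the sites you target.

```java
import java.util.Set;

final class StatusCodePolicy {
    // Assumption: these codes usually signal a temporary condition.
    // Codes such as 403 and 404 are treated as permanent failures.
    private static final Set<Integer> RETRYABLE = Set.of(408, 429, 500, 502, 503, 504);

    private StatusCodePolicy() {}

    /** Returns true when a request that failed with this status is worth retrying. */
    static boolean isRetryable(int statusCode) {
        return RETRYABLE.contains(statusCode);
    }

    public static void main(String[] args) {
        // Print the decision for the codes discussed in this section
        for (int code : new int[] {403, 404, 429, 500}) {
            System.out.println(code + " retryable: " + isRetryable(code));
        }
    }
}
```

Inside a catch block for HttpStatusException, a call such as `StatusCodePolicy.isRetryable(e.getStatusCode())` could decide whether to wait and retry or to give up immediately.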
3. SSL/TLS Certificate Issues
SSL handshake failures occur with invalid or self-signed certificates.
Error Message: javax.net.ssl.SSLHandshakeException
Solutions:
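// Disable SSL validation (development only).
// Note: validateTLSCertificates(false) was deprecated and then removed in jsoup 1.12.1,
// so this call only compiles against older releases. Newer releases accept a custom
// socket factory through Connection.sslSocketFactory(...) instead.
Document doc = Jsoup.connect("https://self-signed-example.com")
    .validateTLSCertificates(false)
    .get();

// Production-safe approach: trust the certificate through a JVM trust store.
// Set these properties before the first HTTPS connection is made.
System.setProperty("javax.net.ssl.trustStore", "/path/to/truststore.jks");
System.setProperty("javax.net.ssl.trustStorePassword", "password");
Document doc = Jsoup.connect("https://example.com").get();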
Parsing and Data Extraction Errors
4. Malformed HTML Parsing Issues
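jsoup's HTML parser repairs most malformed markup (unclosed tags, bad nesting, a missing body) much as a browser does, so the resulting tree can differ from what the raw source suggests. Choosing the right parser avoids most surprises.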
Solutions:
import org.jsoup.parser.Parser;

// XML parser: keeps the markup as written, without HTML-specific fixes
// (no implied <html>/<body> elements); suited to XML documents and feeds
Document doc = Jsoup.connect("https://example.com")
    .parser(Parser.xmlParser())
    .get();

// HTML parser: lenient, browser-style error recovery (the default)
Document doc = Jsoup.connect("https://example.com")
    .parser(Parser.htmlParser())
    .get();

// Parse an HTML string; the unclosed <p> and outer <div> are closed automatically
String malformedHtml = "<div><p>Unclosed paragraph<div>Another div</div>";
Document doc = Jsoup.parse(malformedHtml);
5. CSS Selector Syntax Errors
Invalid CSS selectors throw Selector.SelectorParseException.
Common Issues:
- Invalid pseudo-selectors
- Malformed attribute selectors
- Unsupported CSS3 selectors
Solutions:
import org.jsoup.select.Elements;
import org.jsoup.select.Selector;

try {
    // Valid selectors
    Elements divs = doc.select("div.content");
    Elements links = doc.select("a[href^=https]");
    Elements items = doc.select("ul > li:nth-child(odd)");

    // Test selector validity: an unsupported pseudo-selector such as this one is
    // expected to throw SelectorParseException rather than return an empty result
    if (Selector.select("div.invalid::pseudo", doc).isEmpty()) {
        System.out.println("No elements found");
    }
} catch (Selector.SelectorParseException e) {
    System.err.println("Invalid selector syntax: " + e.getMessage());
    // Fallback to simpler selector
    Elements fallback = doc.select("div");
}

// Debug selectors by testing incrementally
Elements test1 = doc.select("div");
Elements test2 = doc.select("div.content");
Elements test3 = doc.select("div.content > p");
6. Element Not Found or Empty Results
When selectors return no elements, the issue may be dynamic content or incorrect selectors.
Debugging Steps:
// Check if document loaded properly
if (doc == null || doc.html().isEmpty()) {
    System.err.println("Document is empty or null");
    return;
}

// Debug element selection
Elements target = doc.select("div.content");
if (target.isEmpty()) {
    // Try broader selectors
    Elements allDivs = doc.select("div");
    System.out.println("Found " + allDivs.size() + " div elements");

    // Print document structure for debugging
    System.out.println("Document title: " + doc.title());
    String bodyText = doc.body().text();
    System.out.println("Body content preview: " +
        bodyText.substring(0, Math.min(200, bodyText.length())));
}

// Heuristic check for JavaScript-rendered content: a <noscript> fallback often
// means the page relies on scripts, which jsoup does not execute
if (!doc.select("noscript").isEmpty()) {
    System.out.println("Page may require JavaScript rendering");
}
Memory and Performance Issues
7. OutOfMemoryError
By default jsoup builds the whole document tree in memory, so very large pages can exhaust the heap.
Solutions:
import org.jsoup.Connection;

// Limit body size
Document doc = Jsoup.connect("https://large-site.com")
    .maxBodySize(1024 * 1024) // 1MB limit; larger responses are truncated
    .get();

// Check the response before parsing, then keep only what you need.
// Note: response.parse() still builds the full document in memory; this is not true streaming.
Connection connection = Jsoup.connect("https://example.com");
Connection.Response response = connection.execute();
if (response.statusCode() == 200) {
    Document doc = response.parse();
    // Select only the parts you need
    Elements importantParts = doc.select("div.content");
    // Remove unnecessary elements so they can be garbage-collected
    doc.select("script, style, img").remove();
}
JVM Configuration:
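# Increase heap size
java -Xmx2048m -XX:+UseG1GC YourApplication
# Monitor memory usage (JDK 8 and earlier)
java -XX:+PrintGCDetails -XX:+PrintGCTimeStamps YourApplication
# Monitor memory usage (JDK 9 and later, unified logging)
java "-Xlog:gc*" YourApplication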
Advanced Troubleshooting Techniques
8. Request Blocking and Anti-Bot Measures
Modern websites implement sophisticated blocking mechanisms.
Solutions:
import java.util.HashMap;
import java.util.Map;

// Realistic browser simulation
Document doc = Jsoup.connect("https://protected-site.com")
    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
    .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
    .header("Accept-Language", "en-US,en;q=0.5")
    .header("Accept-Encoding", "gzip, deflate")
    .header("Connection", "keep-alive")
    .referrer("https://google.com")
    .get();

// Implement delays between requests (Thread.sleep throws InterruptedException)
Thread.sleep(2000 + (int)(Math.random() * 3000)); // 2-5 second delay

// Session management with cookies
Map<String, String> cookies = new HashMap<>();
Connection.Response loginResponse = Jsoup.connect("https://site.com/login")
    .data("username", "user")
    .data("password", "pass")
    .method(Connection.Method.POST)
    .execute();
cookies.putAll(loginResponse.cookies());

// Reuse the session cookies on later requests
Document protectedPage = Jsoup.connect("https://site.com/protected")
    .cookies(cookies)
    .get();
9. Encoding and Character Set Issues
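Garbled characters usually mean the page was decoded with the wrong charset. jsoup takes the charset from the Content-Type header or a meta tag and otherwise falls back to UTF-8; override it when the server reports it incorrectly.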
Solutions:
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

// Ask the server for UTF-8 (a request header only; it does not change how jsoup decodes)
Document doc = Jsoup.connect("https://international-site.com")
    .header("Accept-Charset", "UTF-8")
    .get();

// Parse with specific charset
String html = "..."; // HTML content as string
Document doc = Jsoup.parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)),
    "UTF-8", "https://example.com");

// Handle encoding detection: fall back to UTF-8 when the server declares no charset
Connection.Response response = Jsoup.connect("https://example.com").execute();
String charset = response.charset();
if (charset == null) {
    response.charset("UTF-8"); // fallback, applied before parsing
}
Document doc = response.parse();
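When the bytes come from somewhere other than a live connection (a cache or a file, for example), there is no Connection.Response to ask for the charset, and it has to be read from a stored Content-Type value instead. The following is a rough sketch of that fallback logic using only the JDK. The class `CharsetFallback` and its method `fromContentType` are hypothetical helpers introduced here, not jsoup API, and the UTF-8 default is an assumption that mirrors the fallback used above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetFallback {
    // Matches the charset parameter in values like "text/html; charset=ISO-8859-1"
    private static final Pattern CHARSET =
        Pattern.compile("(?i)\\bcharset\\s*=\\s*\"?([^\\s;\"']+)");

    private CharsetFallback() {}

    /** Returns the declared charset if the JVM supports it, otherwise UTF-8. */
    static Charset fromContentType(String contentType) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                String name = m.group(1);
                try {
                    if (Charset.isSupported(name)) {
                        return Charset.forName(name);
                    }
                } catch (IllegalArgumentException e) {
                    // Syntactically illegal charset name: fall through to the default
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=ISO-8859-1"));
        System.out.println(fromContentType("text/html"));
    }
}
```

The name of the returned charset can then be passed as the charset argument of Jsoup.parse(InputStream, String, String).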
Error Handling Best Practices
Comprehensive Error Handling Template
import org.jsoup.*;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.net.SocketTimeoutException;

public class RobustJsoupScraper {

    public Document fetchDocument(String url, int maxRetries) {
        int attempts = 0;
        while (attempts < maxRetries) {
            try {
                return Jsoup.connect(url)
                    .timeout(30000)
                    .userAgent("Mozilla/5.0 (compatible; JavaScraper/1.0)")
                    .header("Accept", "text/html,application/xhtml+xml")
                    .followRedirects(true)
                    .ignoreHttpErrors(false)
                    .get();
            } catch (SocketTimeoutException e) {
                System.err.println("Timeout on attempt " + (attempts + 1) + ": " + e.getMessage());
            } catch (HttpStatusException e) {
                if (e.getStatusCode() == 429) {
                    // Rate limited - wait longer
                    try {
                        Thread.sleep(5000L * (attempts + 1));
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        break;
                    }
                } else {
                    System.err.println("HTTP error " + e.getStatusCode() + ": " + e.getMessage());
                    break; // Don't retry on other HTTP errors
                }
            } catch (IOException e) {
                System.err.println("Connection error: " + e.getMessage());
            } catch (Exception e) {
                System.err.println("Unexpected error: " + e.getMessage());
                e.printStackTrace();
                break;
            }
            attempts++;
            // Exponential backoff: 1s, 2s, 4s, ... (shift capped so the delay stays bounded)
            try {
                Thread.sleep(1000L << Math.min(attempts - 1, 6));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return null; // All attempts failed
    }
}
Debugging Tools and Techniques
Enable Detailed Logging
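// Add to your application.
// Note: "org.jsoup.debug" is not a documented jsoup option and may have no effect;
// jsoup ships without a logging framework, so log around your own calls.
System.setProperty("org.jsoup.debug", "true");

// Custom logging (assumes SLF4J: org.slf4j.Logger and org.slf4j.LoggerFactory)
Logger logger = LoggerFactory.getLogger(YourClass.class);
try {
    Document doc = Jsoup.connect(url).get();
    logger.info("Successfully fetched: {} (size: {} chars)", url, doc.html().length());
} catch (Exception e) {
    logger.error("Failed to fetch: {}", url, e);
}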
Document Analysis
// Analyze document structure
public void analyzeDocument(Document doc) {
    System.out.println("Title: " + doc.title());
    System.out.println("Base URI: " + doc.baseUri());
    System.out.println("Total elements: " + doc.getAllElements().size());
    System.out.println("Scripts: " + doc.select("script").size());
    System.out.println("Stylesheets: " + doc.select("link[rel=stylesheet]").size());
    System.out.println("Forms: " + doc.select("form").size());

    // Check for common frameworks (heuristics only)
    if (!doc.select("[ng-app], [data-ng-app]").isEmpty()) {
        System.out.println("Angular detected - content may be dynamic");
    }
    // [^data-react] matches any attribute whose name starts with "data-react",
    // such as data-reactroot or data-reactid
    if (!doc.select("[id^=react], [^data-react]").isEmpty()) {
        System.out.println("React detected - content may be dynamic");
    }
}
Prevention and Best Practices
- Always handle exceptions appropriately for your use case
- Implement retry logic with exponential backoff for transient failures
- Respect robots.txt and website terms of service
- Use appropriate delays between requests to avoid overwhelming servers
- Monitor your scraping for changes in website structure
- Keep jsoup updated to benefit from bug fixes and improvements
- Test selectors thoroughly before deploying to production
- Implement logging to track errors and performance
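The retry advice in the list above depends on how the wait time grows between attempts. As a minimal sketch, assuming a 0-based attempt counter, the helper below doubles a base delay on every attempt, caps it, and adds random jitter so that parallel scrapers do not retry in lockstep. `Backoff`, `delayMillis`, and `withJitter` are illustrative names, not part of jsoup, and the base and cap values in `main` are arbitrary.

```java
import java.util.concurrent.ThreadLocalRandom;

final class Backoff {
    private Backoff() {}

    /** Delay for a 0-based attempt: base, 2*base, 4*base, ... limited to capMillis. */
    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        // Bound the shift so typical base values cannot overflow a long
        int shift = Math.min(Math.max(attempt, 0), 20);
        return Math.min(baseMillis << shift, capMillis);
    }

    /** Picks a random delay between half of the given delay and the full delay. */
    static long withJitter(long delayMillis) {
        long half = delayMillis / 2;
        return half + ThreadLocalRandom.current().nextLong(delayMillis - half + 1);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 5; attempt++) {
            long delay = delayMillis(attempt, 1000, 30_000);
            System.out.println("attempt " + attempt + ": at most " + delay
                + " ms, chosen " + withJitter(delay) + " ms");
        }
    }
}
```

A retry loop could call `Thread.sleep(Backoff.withJitter(Backoff.delayMillis(attempt, 1000, 30_000)))` between attempts; the cap keeps a long run of failures from producing unbounded waits.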
By following these troubleshooting techniques and best practices, you'll be able to handle most jsoup errors effectively and build robust web scraping applications.