When web scraping with jsoup, you'll inevitably encounter HTTP errors like 404 Not Found, 500 Internal Server Error, or 403 Forbidden. Proper error handling is crucial for building robust scrapers that can gracefully handle failures and continue operating reliably.
Understanding jsoup HTTP Exceptions
Jsoup throws different types of exceptions for various error scenarios:
- HttpStatusException: thrown for HTTP error status codes (4xx and 5xx); a subclass of IOException
- IOException: general I/O errors (network timeouts, connection failures)
- IllegalArgumentException: invalid URLs or malformed requests
- UnsupportedMimeTypeException: non-HTML content types; also a subclass of IOException, so catch it before the general IOException
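If you prefer to inspect status codes without relying on exceptions, jsoup can also return the response regardless of status via ignoreHttpErrors(true). A minimal sketch of this alternative (the URL is a placeholder):

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class StatusCheck {
    public static void main(String[] args) throws IOException {
        // ignoreHttpErrors(true) suppresses HttpStatusException so we can
        // examine the status code ourselves
        Connection.Response response = Jsoup.connect("https://example.com/page")
                .ignoreHttpErrors(true)
                .timeout(10000)
                .execute();
        if (response.statusCode() == 200) {
            Document doc = response.parse(); // only parse successful responses
            System.out.println("Title: " + doc.title());
        } else {
            System.err.println("Got HTTP " + response.statusCode()
                    + " " + response.statusMessage());
        }
    }
}

This pattern is handy when a non-200 response is an expected outcome rather than an exceptional one.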
Basic Error Handling
Here's a basic example of catching and handling HTTP errors:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.HttpStatusException;
import org.jsoup.UnsupportedMimeTypeException;
import java.io.IOException;

public class BasicErrorHandling {
    public static void main(String[] args) {
        String url = "https://example.com/page";
        try {
            Document doc = Jsoup.connect(url)
                    .timeout(10000) // 10 second timeout
                    .get();
            System.out.println("Title: " + doc.title());
        } catch (HttpStatusException e) {
            System.err.println("HTTP Error " + e.getStatusCode() + " for URL: " + e.getUrl());
            handleHttpError(e.getStatusCode(), e.getUrl());
        } catch (UnsupportedMimeTypeException e) {
            System.err.println("Unsupported content type: " + e.getMimeType());
        } catch (IOException e) {
            System.err.println("Network error: " + e.getMessage());
        } catch (IllegalArgumentException e) {
            System.err.println("Invalid URL: " + e.getMessage());
        }
    }

    private static void handleHttpError(int statusCode, String url) {
        switch (statusCode) {
            case 404:
                System.out.println("Page not found - check URL validity");
                break;
            case 403:
                System.out.println("Access forbidden - may need authentication");
                break;
            case 429:
                System.out.println("Rate limited - implement retry with backoff");
                break;
            case 500:
            case 502:
            case 503:
                System.out.println("Server error - retry may help");
                break;
            default:
                System.out.println("Unexpected HTTP error: " + statusCode);
        }
    }
}
Advanced Error Handling with Retry Logic
For production scrapers, implement retry mechanisms with exponential backoff:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.HttpStatusException;
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class RobustScraper {
    private static final int MAX_RETRIES = 3;
    private static final int BASE_DELAY_MS = 1000;

    public static Document scrapeWithRetry(String url) throws IOException {
        IOException lastException = null;
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            try {
                return Jsoup.connect(url)
                        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
                        .timeout(15000)
                        .followRedirects(true)
                        .ignoreHttpErrors(false) // let exceptions be thrown
                        .get();
            } catch (HttpStatusException e) {
                lastException = e;
                // Don't retry client errors (4xx), only server errors (5xx) and rate limits
                if (e.getStatusCode() >= 400 && e.getStatusCode() < 500 && e.getStatusCode() != 429) {
                    throw e; // don't retry 4xx errors (except 429)
                }
                System.out.println("Attempt " + (attempt + 1) + " failed with status "
                        + e.getStatusCode() + ". Retrying...");
            } catch (IOException e) {
                lastException = e;
                System.out.println("Attempt " + (attempt + 1) + " failed: " + e.getMessage());
            }
            // Wait before retrying (exponential backoff: 1s, 2s, 4s, ...)
            if (attempt < MAX_RETRIES - 1) {
                try {
                    long delay = BASE_DELAY_MS * (long) Math.pow(2, attempt);
                    TimeUnit.MILLISECONDS.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IOException("Interrupted during retry delay", ie);
                }
            }
        }
        throw lastException; // all retries failed
    }
}
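Usage is then straightforward: the caller only handles the final failure. A brief sketch (the demo class name and URL are illustrative, and it assumes the RobustScraper class above is on the classpath):

import org.jsoup.nodes.Document;
import java.io.IOException;

public class RobustScraperDemo {
    public static void main(String[] args) {
        try {
            // transient failures are retried inside scrapeWithRetry
            Document doc = RobustScraper.scrapeWithRetry("https://example.com/page");
            System.out.println("Scraped: " + doc.title());
        } catch (IOException e) {
            // thrown only after retries are exhausted, or for non-retryable 4xx errors
            System.err.println("Giving up: " + e.getMessage());
        }
    }
}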
Handling Specific HTTP Status Codes
Different status codes require different strategies:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.HttpStatusException;
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class StatusCodeHandler {
    public static Document handleSpecificErrors(String url) {
        try {
            return Jsoup.connect(url).get();
        } catch (HttpStatusException e) {
            switch (e.getStatusCode()) {
                case 301:
                case 302:
                    // Only reached if followRedirects(false) is set; by default
                    // jsoup follows redirects automatically. Note e.getUrl() is the
                    // requested URL, not the redirect target.
                    System.out.println("Redirect status returned for: " + e.getUrl());
                    break;
                case 401:
                    System.err.println("Authentication required");
                    // Implement login logic here
                    break;
                case 403:
                    System.err.println("Access forbidden - try different User-Agent or headers");
                    try {
                        return retryWithDifferentHeaders(url);
                    } catch (IOException retryEx) {
                        System.err.println("Retry with new headers failed: " + retryEx.getMessage());
                    }
                    break;
                case 404:
                    System.err.println("Page not found");
                    return null; // or handle gracefully
                case 429:
                    System.err.println("Rate limited - implementing backoff");
                    try {
                        return handleRateLimit(url);
                    } catch (IOException retryEx) {
                        System.err.println("Retry after backoff failed: " + retryEx.getMessage());
                    }
                    break;
                case 503:
                    System.err.println("Service temporarily unavailable");
                    // Implement exponential backoff
                    break;
            }
        } catch (IOException e) {
            System.err.println("Connection error: " + e.getMessage());
        }
        return null;
    }

    private static Document retryWithDifferentHeaders(String url) throws IOException {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36")
                .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
                .header("Accept-Language", "en-US,en;q=0.5")
                .header("Accept-Encoding", "gzip, deflate")
                .header("DNT", "1")
                .header("Connection", "keep-alive")
                .header("Upgrade-Insecure-Requests", "1")
                .get();
    }

    private static Document handleRateLimit(String url) throws IOException {
        try {
            // Wait longer for rate limits
            TimeUnit.SECONDS.sleep(30);
            return Jsoup.connect(url).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted during rate limit wait", e);
        }
    }
}
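Rather than always waiting a fixed 30 seconds on a 429, many servers include a Retry-After header. Here is a minimal sketch of honoring it (the class and method names are illustrative, and it assumes the header carries a delay in seconds rather than an HTTP date):

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class RetryAfterAware {
    public static Document fetchRespectingRetryAfter(String url)
            throws IOException, InterruptedException {
        Connection.Response res = Jsoup.connect(url)
                .ignoreHttpErrors(true) // inspect the status ourselves
                .execute();
        if (res.statusCode() == 429) {
            String retryAfter = res.header("Retry-After"); // may be null
            long waitSeconds = 30; // fallback delay (assumption)
            if (retryAfter != null) {
                try {
                    waitSeconds = Long.parseLong(retryAfter.trim());
                } catch (NumberFormatException ignored) {
                    // Retry-After can also be an HTTP date; keep the fallback
                }
            }
            TimeUnit.SECONDS.sleep(waitSeconds);
            return Jsoup.connect(url).get();
        }
        return res.parse();
    }
}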
Bulk Scraping with Error Handling
When scraping multiple URLs, handle errors for individual URLs without stopping the entire process:
import org.jsoup.nodes.Document;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

public class BulkScraper {
    public static Map<String, ScrapingResult> scrapeUrls(List<String> urls) {
        Map<String, ScrapingResult> results = new ConcurrentHashMap<>();
        for (String url : urls) {
            try {
                // Reuse the retry logic from RobustScraper above
                Document doc = RobustScraper.scrapeWithRetry(url);
                results.put(url, new ScrapingResult(doc, null));
                // Add delay between requests to be respectful
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                results.put(url, new ScrapingResult(null, e));
                break; // stop processing if the thread is interrupted
            } catch (Exception e) {
                results.put(url, new ScrapingResult(null, e));
                System.err.println("Failed to scrape " + url + ": " + e.getMessage());
            }
        }
        return results;
    }

    static class ScrapingResult {
        final Document document;
        final Exception error;

        ScrapingResult(Document document, Exception error) {
            this.document = document;
            this.error = error;
        }

        boolean isSuccess() {
            return document != null && error == null;
        }
    }
}
Best Practices for Robust Error Handling
- Use Proper Timeouts: Set reasonable connection and read timeouts to prevent hanging
- Implement Retry Logic: Retry on temporary failures (5xx errors, network issues) but not on client errors (4xx)
- Rate Limiting: Respect server limits and implement delays between requests
- User-Agent Rotation: Use realistic User-Agent strings to avoid being blocked (see the sketch after this list)
- Log Everything: Maintain detailed logs for debugging and monitoring
- Graceful Degradation: Continue processing other URLs even if some fail
- Respect robots.txt: Check and follow website scraping policies
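To illustrate the rotation point above, here is a minimal sketch of picking a User-Agent per request. The class name is illustrative and the User-Agent strings are examples, not a vetted list; in practice, keep them current:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class UserAgentRotation {
    // Example User-Agent strings (assumption: rotate among realistic browsers)
    private static final List<String> USER_AGENTS = List.of(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36");

    public static Document fetchWithRandomAgent(String url) throws IOException {
        String agent = USER_AGENTS.get(
                ThreadLocalRandom.current().nextInt(USER_AGENTS.size()));
        return Jsoup.connect(url)
                .userAgent(agent)
                .timeout(15000)
                .get();
    }
}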
Error Prevention Strategies
// Configure jsoup connection for better reliability
Document doc = Jsoup.connect(url)
        .userAgent("Mozilla/5.0 (compatible; WebScraper/1.0)")
        .timeout(15000)
        .maxBodySize(1024 * 1024) // 1MB limit
        .followRedirects(true)
        .ignoreContentType(false)
        .ignoreHttpErrors(false)
        .get();
// Note: TLS certificate validation is on by default; the old
// validateTLSCertificates() method has been removed from current jsoup versions
By implementing comprehensive error handling, your jsoup web scrapers will be more reliable, maintainable, and respectful of target websites. Always ensure your scraping activities comply with website terms of service and applicable laws.