When scraping web content using jsoup, you may encounter HTTP errors such as 404 Not Found, 500 Internal Server Error, or other status codes indicating a problem with your request or with the server you're trying to access. To handle these errors properly, you should catch and process the exceptions that may arise during the execution of your scraping code.
Here's a step-by-step guide to handling HTTP errors with jsoup in Java:
1. Try-Catch Block: Enclose your jsoup connection code within a try-catch block to handle exceptions.
2. HttpStatusException: Catch HttpStatusException specifically to get information about the HTTP error, such as the status code.
3. IOException: Catch IOException to handle other input/output errors that are not related to HTTP status codes.
4. Handle Other Exceptions: You might also want to catch other exceptions, such as IllegalArgumentException for invalid URLs.
Here's an example of how you can handle HTTP errors when scraping with jsoup:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.HttpStatusException;

import java.io.IOException;

public class JsoupHttpErrorHandling {
    public static void main(String[] args) {
        String url = "http://example.com/nonexistentpage";
        try {
            Document doc = Jsoup.connect(url).get();
            // Proceed with parsing the document as needed
            System.out.println(doc.title());
        } catch (HttpStatusException e) {
            // Specific handling for HTTP status errors (4xx/5xx responses)
            System.out.println("HTTP error code: " + e.getStatusCode());
            System.out.println("URL: " + e.getUrl());
        } catch (IOException e) {
            // Handle other I/O errors (network failures, timeouts, etc.)
            System.out.println("I/O error: " + e.getMessage());
        } catch (Exception e) {
            // Handle any other exceptions, such as invalid URLs
            System.out.println("Error: " + e.getMessage());
        }
    }
}
In this example, if the page does not exist (e.g., 404 Not Found), an HttpStatusException will be thrown, and you can use the exception object to get the status code and the URL that caused the error.
Remember that java.io.IOException must be imported at the top of your file alongside the jsoup classes, as shown in the imports above.
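If you prefer to inspect status codes without relying on exceptions, jsoup can also be told to ignore HTTP error statuses so you can examine the response directly. Below is a minimal sketch of that approach, reusing the example URL from above; the class name is only for illustration.

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class JsoupStatusCheck {
    public static void main(String[] args) throws IOException {
        String url = "http://example.com/nonexistentpage";

        // ignoreHttpErrors(true) tells jsoup not to throw HttpStatusException
        // for 4xx/5xx responses, so the status code can be examined directly.
        Connection.Response response = Jsoup.connect(url)
                .ignoreHttpErrors(true)
                .execute();

        if (response.statusCode() == 200) {
            Document doc = response.parse();
            System.out.println(doc.title());
        } else {
            System.out.println("Request failed with HTTP " + response.statusCode()
                    + " " + response.statusMessage());
        }
    }
}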
Best Practices for Error Handling:
- Graceful Degradation: When an HTTP error is caught, consider implementing a fallback mechanism. For example, if the main content page fails, try to scrape alternative pages or provide a default message to the user.
- Log Errors: Instead of, or in addition to, printing error messages to the console, log them to a file or a logging service. This way, you can review them later and monitor the health of your scraper.
- Rate Limiting and Retrying: If your requests are being blocked or failing due to rate limits, consider implementing a retry mechanism with exponential backoff, and ensure you respect the website's robots.txt rules and terms of service (see the sketch after this list).
- User-Agent Strings: Some websites block requests that do not appear to come from a browser. Set a User-Agent string that mimics a real browser to avoid being blocked.
- Timeouts and Delays: Implement timeouts to prevent your scraper from hanging indefinitely on a request. Additionally, adding delays between requests helps avoid overwhelming the target server and reduces the risk of being blocked.
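To illustrate several of these practices together, here is a minimal sketch of a retry loop with exponential backoff that also sets a browser-like User-Agent string, applies a request timeout, and logs failures via java.util.logging. The URL, retry count, backoff values, and User-Agent string are placeholders chosen for illustration, not recommendations.

import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.logging.Logger;

public class PoliteScraper {
    private static final Logger LOGGER = Logger.getLogger(PoliteScraper.class.getName());

    // Fetches a page, retrying on failure with exponential backoff.
    // maxRetries and the initial delay are illustrative values only.
    static Document fetchWithRetry(String url, int maxRetries) throws IOException, InterruptedException {
        long delayMillis = 1000; // start with a 1-second pause between attempts
        IOException lastError = null;

        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                return Jsoup.connect(url)
                        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)") // browser-like UA
                        .timeout(10_000) // give up on a single request after 10 seconds
                        .get();
            } catch (HttpStatusException e) {
                // In practice you may only want to retry transient errors (e.g., 429 or 5xx).
                LOGGER.warning("Attempt " + attempt + " failed with HTTP " + e.getStatusCode() + " for " + url);
                lastError = e;
            } catch (IOException e) {
                LOGGER.warning("Attempt " + attempt + " failed: " + e.getMessage());
                lastError = e;
            }
            Thread.sleep(delayMillis); // delay between requests to avoid hammering the server
            delayMillis *= 2;          // exponential backoff
        }
        throw lastError; // all attempts failed
    }

    public static void main(String[] args) throws Exception {
        Document doc = fetchWithRetry("http://example.com/nonexistentpage", 3);
        System.out.println(doc.title());
    }
}

Note that jsoup does not consult robots.txt for you; honoring it, along with any required delays between requests, remains your responsibility.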
Always scrape responsibly and ethically, ensuring that your actions comply with the website's terms of service and legal regulations regarding data collection.