How do I handle timeouts and retries with jsoup connections?

Handling timeouts and retries with jsoup is crucial for building robust web scraping applications. This guide covers how to configure timeouts, implement retry mechanisms, and handle connection failures gracefully.

Understanding Timeouts in jsoup

Jsoup exposes a single timeout setting, Connection.timeout(int millis), which in current versions caps the total request duration: the time to connect to the server plus the time to read the response. The default is 30 seconds, and a value of 0 disables the timeout entirely. This one setting covers what many HTTP clients split into two values:

  • Connection timeout: time to establish the connection
  • Read timeout: time to read data from the server

Basic Timeout Configuration

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.net.SocketTimeoutException;
import java.io.IOException;

public class JsoupTimeoutExample {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://example.com")
                    .timeout(10000)          // Total timeout: 10 seconds
                    .userAgent("Mozilla/5.0") // Always set user agent
                    .get();

            System.out.println("Title: " + doc.title());
        } catch (SocketTimeoutException e) {
            System.err.println("Request timed out: " + e.getMessage());
        } catch (IOException e) {
            System.err.println("Connection failed: " + e.getMessage());
        }
    }
}

Advanced Timeout Settings

Document doc = Jsoup.connect("https://example.com")
        .timeout(0)                    // No timeout (infinite)
        .maxBodySize(1024 * 1024)     // 1MB max response size
        .followRedirects(true)        // Follow redirects
        .ignoreHttpErrors(true)       // Don't throw on HTTP errors
        .get();
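
One caveat: with ignoreHttpErrors(true), jsoup will not throw org.jsoup.HttpStatusException for 4xx/5xx responses, so the status code has to be checked manually. A minimal sketch using execute(), which returns the raw response before parsing:

Connection.Response response = Jsoup.connect("https://example.com")
        .timeout(10000)
        .ignoreHttpErrors(true)
        .execute();

if (response.statusCode() == 200) {
    Document doc = response.parse();  // parse only after confirming success
    System.out.println(doc.title());
} else {
    System.err.println("HTTP error: " + response.statusCode());
}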

Implementing Retry Logic

Simple Retry with a Fixed Delay

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class SimpleRetryExample {
    private static final int MAX_RETRIES = 3;
    private static final int RETRY_DELAY = 2000; // 2 seconds

    public static Document fetchWithRetry(String url) throws IOException {
        IOException lastException = null;

        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                return Jsoup.connect(url)
                        .timeout(10000)
                        .userAgent("Mozilla/5.0 (compatible; JavaBot/1.0)")
                        .get();

            } catch (IOException e) {
                lastException = e;
                System.out.printf("Attempt %d failed: %s%n", attempt, e.getMessage());

                if (attempt < MAX_RETRIES) {
                    try {
                        Thread.sleep(RETRY_DELAY);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new IOException("Retry interrupted", ie);
                    }
                }
            }
        }

        throw new IOException("Failed after " + MAX_RETRIES + " attempts", lastException);
    }
}
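
Calling the helper is straightforward; a minimal usage sketch (the URL is a placeholder):

try {
    Document doc = SimpleRetryExample.fetchWithRetry("https://example.com");
    System.out.println("Fetched: " + doc.title());
} catch (IOException e) {
    System.err.println("All retries failed: " + e.getMessage());
}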

Advanced Retry with Exponential Backoff

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.net.ConnectException;
import java.util.Random;

public class ExponentialBackoffRetry {
    private static final int MAX_RETRIES = 5;
    private static final long BASE_DELAY = 1000; // 1 second
    private static final Random random = new Random();

    public static Document fetchWithExponentialBackoff(String url) throws IOException {
        IOException lastException = null;

        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                return Jsoup.connect(url)
                        .timeout(15000)
                        .userAgent("Mozilla/5.0 (compatible; JavaBot/1.0)")
                        .header("Accept", "text/html,application/xhtml+xml")
                        .get();

            } catch (SocketTimeoutException | ConnectException e) {
                lastException = e;
                System.out.printf("Attempt %d failed (%s): %s%n", 
                    attempt, e.getClass().getSimpleName(), e.getMessage());

                if (attempt < MAX_RETRIES) {
                    long delay = calculateBackoffDelay(attempt);
                    System.out.printf("Retrying in %d ms...%n", delay);

                    try {
                        Thread.sleep(delay);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new IOException("Retry interrupted", ie);
                    }
                }
            } catch (IOException e) {
                // Don't retry other IO errors (e.g. org.jsoup.HttpStatusException for 403/404 responses)
                throw e;
            }
        }

        throw new IOException("Failed after " + MAX_RETRIES + " attempts", lastException);
    }

    private static long calculateBackoffDelay(int attempt) {
        // Exponential backoff with jitter: base * 2^(attempt-1) + random 0-999 ms
        long exponentialDelay = (long) (BASE_DELAY * Math.pow(2, attempt - 1));
        long jitter = random.nextInt(1000);
        return exponentialDelay + jitter;
    }
}
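
With BASE_DELAY = 1000 ms and MAX_RETRIES = 5, the waits between attempts work out to roughly 1 s, 2 s, 4 s and 8 s, each plus up to one second of jitter, so a consistently failing URL is abandoned after about 15-19 seconds of total backoff.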

Production-Ready Retry Service

import org.jsoup.Connection;
import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.net.ConnectException;
import java.net.UnknownHostException;
import java.util.Set;
import java.util.concurrent.TimeUnit;

public class RobustJsoupClient {
    private static final int MAX_RETRIES = 3;
    private static final long BASE_DELAY_MS = 1000;
    private static final int TIMEOUT_MS = 15000;

    // HTTP status codes that should trigger retries
    private static final Set<Integer> RETRYABLE_STATUS_CODES = Set.of(
        408, 429, 500, 502, 503, 504
    );

    // Exception types that should trigger retries
    private static final Set<Class<? extends Exception>> RETRYABLE_EXCEPTIONS = Set.of(
        SocketTimeoutException.class,
        ConnectException.class,
        UnknownHostException.class
    );

    public Document fetch(String url) throws IOException {
        return fetch(url, MAX_RETRIES);
    }

    public Document fetch(String url, int maxRetries) throws IOException {
        IOException lastException = null;

        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                Connection.Response response = Jsoup.connect(url)
                        .timeout(TIMEOUT_MS)
                        .userAgent("Mozilla/5.0 (compatible; JavaBot/1.0)")
                        .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
                        .header("Accept-Language", "en-US,en;q=0.5")
                        .header("Accept-Encoding", "gzip, deflate")
                        .header("Connection", "keep-alive")
                        .ignoreHttpErrors(true) // check status codes ourselves; otherwise execute() throws before we can
                        .execute();

                // Retry transient server-side failures; fail fast on everything else
                int statusCode = response.statusCode();
                if (RETRYABLE_STATUS_CODES.contains(statusCode)) {
                    throw new RetryableStatusException("Retryable HTTP status: " + statusCode);
                }
                if (statusCode >= 400) {
                    // Client errors (403, 404, ...) won't fix themselves
                    throw new HttpStatusException("Non-retryable HTTP status", statusCode, url);
                }

                return response.parse();

            } catch (IOException e) {
                lastException = e;

                boolean shouldRetry = shouldRetryException(e) && attempt < maxRetries;

                System.out.printf("Attempt %d failed: %s%s%n", 
                    attempt, e.getMessage(), 
                    shouldRetry ? " - retrying..." : " - giving up");

                if (shouldRetry) {
                    waitBeforeRetry(attempt);
                } else {
                    break;
                }
            }
        }

        throw new IOException(
            String.format("Failed to fetch %s after %d attempts", url, maxRetries), 
            lastException
        );
    }

    private boolean shouldRetryException(IOException e) {
        return e instanceof RetryableStatusException
                || RETRYABLE_EXCEPTIONS.stream()
                        .anyMatch(retryableClass -> retryableClass.isInstance(e));
    }

    private void waitBeforeRetry(int attempt) {
        try {
            long delay = BASE_DELAY_MS * (1L << (attempt - 1)); // 2^(attempt-1)
            TimeUnit.MILLISECONDS.sleep(delay);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("Retry interrupted", e);
        }
    }

    // Marker type so retryable HTTP statuses pass the shouldRetryException check
    private static class RetryableStatusException extends IOException {
        RetryableStatusException(String message) {
            super(message);
        }
    }
}

Usage Examples

import org.jsoup.nodes.Document;
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        RobustJsoupClient client = new RobustJsoupClient();

        try {
            Document doc = client.fetch("https://example.com");
            System.out.println("Successfully fetched: " + doc.title());

            // Extract data
            doc.select("h1").forEach(h1 -> 
                System.out.println("Heading: " + h1.text())
            );

        } catch (IOException e) {
            System.err.println("Failed to fetch page: " + e.getMessage());
        }
    }
}

Best Practices

  1. Set appropriate timeouts: 10-30 seconds for most use cases
  2. Use exponential backoff: Prevents overwhelming servers
  3. Add jitter: Reduces thundering herd problems
  4. Limit retry attempts: 3-5 retries maximum
  5. Set User-Agent: Many sites block requests without proper headers
  6. Handle specific exceptions: Only retry transient failures
  7. Respect rate limits: Add delays between requests (see the sketch after this list)
  8. Monitor retry rates: High retry rates indicate systemic issues
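
For point 7, a minimal politeness-delay sketch using the RobustJsoupClient from above (the one-second gap and the URL list are placeholder values; tune them to the target site):

static void crawlPolitely(List<String> urls) throws InterruptedException {
    RobustJsoupClient client = new RobustJsoupClient();

    for (String url : urls) {
        try {
            Document doc = client.fetch(url);
            System.out.println(url + " -> " + doc.title());
        } catch (IOException e) {
            System.err.println(url + " failed: " + e.getMessage());
        }
        Thread.sleep(1000); // fixed one-second gap between requests
    }
}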

Common Timeout Values

  • Fast APIs: 5-10 seconds
  • Regular websites: 15-30 seconds
  • Slow/heavy pages: 30-60 seconds
  • File downloads: No timeout or very high (300+ seconds); see the sketch below
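
A sketch of how those values map onto jsoup calls (the URLs are placeholders; maxBodySize(0) disables jsoup's response-size cap and ignoreContentType(true) allows non-HTML responses, both of which matter for large downloads):

// Fast API endpoint: fail fast
Document api = Jsoup.connect("https://example.com/api")
        .timeout(5000)
        .get();

// Large file: no size cap, raw bytes instead of a parsed Document
byte[] data = Jsoup.connect("https://example.com/big.zip")
        .timeout(300000)          // 300 seconds
        .maxBodySize(0)           // 0 disables the response-size limit
        .ignoreContentType(true)  // jsoup otherwise rejects non-HTML content types
        .execute()
        .bodyAsBytes();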

Remember to adjust timeout and retry settings based on your specific use case and the characteristics of the websites you're scraping.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
