What is the performance of jsoup when scraping large websites?

Jsoup is a high-performance Java library for HTML parsing and web scraping. When scraping large websites, understanding its performance characteristics is crucial for building efficient and scalable scrapers.

Performance Overview

Jsoup excels at HTML parsing speed, typically processing 1-2MB HTML documents in 50-100ms on modern hardware. However, performance varies significantly based on several factors:

Key Performance Factors

1. HTML Parsing Speed

Jsoup uses an optimized HTML5-compliant parser that handles malformed HTML gracefully. Parse time scales linearly with document size:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PerformanceTest {
    public static void main(String[] args) {
        long startTime = System.currentTimeMillis();

        try {
            Document doc = Jsoup.connect("https://example.com")
                .timeout(30000)
                .get();

            // Note: get() includes the network fetch, so this measures fetch + parse
            long elapsed = System.currentTimeMillis() - startTime;
            System.out.println("Fetch + parse time: " + elapsed + "ms");
            System.out.println("Document size: " + doc.html().length() + " chars");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
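
Because get() bundles the network round-trip with parsing, the timing above mixes the two. To isolate parse time, you can fetch the raw response first with execute() and then time parse() on its own. A minimal sketch:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseOnlyTiming {
    public static void main(String[] args) throws Exception {
        // Fetch first, so network latency is excluded from the measurement
        Connection.Response response = Jsoup.connect("https://example.com")
            .timeout(30000)
            .execute();

        long start = System.nanoTime();
        Document doc = response.parse(); // Parse the already-downloaded body
        long parseMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("Parse-only time: " + parseMs + "ms");
        System.out.println("Title: " + doc.title());
    }
}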

2. Memory Usage Optimization

Jsoup loads entire documents into memory as DOM trees. For large documents (>10MB), this can consume significant RAM:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Memory-efficient approach for large documents
public class MemoryOptimizedScraping {
    public static void extractData(String url) {
        try {
            // Configure connection with limits
            Document doc = Jsoup.connect(url)
                .maxBodySize(5 * 1024 * 1024) // Limit to 5MB
                .timeout(30000)
                .get();

            // Extract only needed data immediately
            Elements targetData = doc.select("div.content");

            // Process and clear references
            processData(targetData);
            doc = null; // Help GC

        } catch (Exception e) {
            System.err.println("Error processing: " + url);
        }
    }

    private static void processData(Elements elements) {
        for (Element element : elements) {
            System.out.println(element.text());
        }
    }
}

3. Selector Performance

CSS selector complexity significantly impacts traversal speed:

// Fast selectors
Elements fast1 = doc.select("div.product");           // Class selector
Elements fast2 = doc.select("#main-content");         // ID selector
Elements fast3 = doc.select("article > h2");          // Direct child

// Slower selectors
Elements slow1 = doc.select("div:contains(product)"); // Text contains
Elements slow2 = doc.select("*[data-id]");           // Universal with attribute
Elements slow3 = doc.select("div div div span");     // Deep traversal
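
To check these differences against your own markup, you can time each query directly. Selector costs vary with document shape, so treat this as a rough harness rather than a benchmark; the URL and selectors below are placeholders:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorTiming {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://example.com").get();

        String[] selectors = {
            "div.product",           // fast: class selector
            "div:contains(product)", // slow: scans element text
            "div div div span"       // slow: deep descendant traversal
        };

        for (String selector : selectors) {
            long start = System.nanoTime();
            int matches = doc.select(selector).size();
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("%-24s %d matches in %dus%n", selector, matches, micros);
        }
    }
}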

Large-Scale Scraping Implementation

Concurrent Processing with Thread Pool

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.concurrent.*;
import java.util.List;
import java.util.ArrayList;

public class ConcurrentJsoupScraper {
    private final ExecutorService executor;
    private final int threadCount;

    public ConcurrentJsoupScraper(int threadCount) {
        this.threadCount = threadCount;
        this.executor = Executors.newFixedThreadPool(threadCount);
    }

    public void scrapeUrls(List<String> urls) {
        List<Future<String>> futures = new ArrayList<>();

        for (String url : urls) {
            Future<String> future = executor.submit(() -> {
                try {
                    // Add delay to respect rate limits
                    Thread.sleep(1000);

                    Document doc = Jsoup.connect(url)
                        .userAgent("Mozilla/5.0 (compatible; Bot/1.0)")
                        .timeout(15000)
                        .followRedirects(true)
                        .get();

                    return extractContent(doc);
                } catch (Exception e) {
                    System.err.println("Failed to scrape: " + url);
                    return null;
                }
            });
            futures.add(future);
        }

        // Collect results
        for (Future<String> future : futures) {
            try {
                String result = future.get(30, TimeUnit.SECONDS);
                if (result != null) {
                    System.out.println(result);
                }
            } catch (Exception e) {
                System.err.println("Task failed: " + e.getMessage());
            }
        }
    }

    private String extractContent(Document doc) {
        return doc.select("h1, h2, p").text();
    }

    public void shutdown() {
        executor.shutdown();
    }
}
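
A short usage sketch, with an illustrative URL list:

import java.util.List;

public class ScraperMain {
    public static void main(String[] args) {
        ConcurrentJsoupScraper scraper = new ConcurrentJsoupScraper(4);
        try {
            scraper.scrapeUrls(List.of(
                "https://example.com/page1",
                "https://example.com/page2"
            ));
        } finally {
            scraper.shutdown(); // Always release the thread pool
        }
    }
}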

Connection Pool and Configuration

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class OptimizedJsoupClient {
    private static final String USER_AGENT = 
        "Mozilla/5.0 (compatible; WebScraper/1.0)";

    public static Connection.Response fetchWithRetry(String url, int maxRetries) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            try {
                Connection.Response response = Jsoup.connect(url)
                    .userAgent(USER_AGENT)
                    .timeout(30000)
                    .followRedirects(true)
                    // jsoup validates TLS certificates and sends
                    // Accept-Encoding: gzip by default, so neither needs to be set here
                    .header("Accept", "text/html,application/xhtml+xml")
                    .header("Accept-Language", "en-US,en;q=0.5")
                    .header("Connection", "keep-alive")
                    .execute();

                if (response.statusCode() == 200) {
                    return response;
                }
            } catch (Exception e) {
                System.err.println("Attempt " + (attempt + 1) + " failed: " + e.getMessage());

                if (attempt < maxRetries - 1) {
                    try {
                        Thread.sleep(2000L << attempt); // Exponential backoff: 2s, 4s, 8s...
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        break;
                    }
                }
            }
        }
        return null;
    }
}
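
The returned Connection.Response still has to be parsed into a Document before you can run selectors; a short usage sketch:

import org.jsoup.Connection;
import org.jsoup.nodes.Document;

public class FetchExample {
    public static void main(String[] args) throws Exception {
        Connection.Response response =
            OptimizedJsoupClient.fetchWithRetry("https://example.com", 3);
        if (response != null) {
            Document doc = response.parse(); // Parse the buffered body into a DOM
            System.out.println(doc.title());
        }
    }
}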

Performance Optimization Strategies

1. Efficient Data Extraction

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class EfficientExtraction {
    public static void extractProductData(Document doc) {
        // Use specific selectors instead of broad searches
        Elements products = doc.select("div.product-item");

        for (Element product : products) {
            // Extract data in one pass; selectFirst returns null when nothing matches
            // (Java has no ?. operator, so null checks are explicit)
            Element titleEl = product.selectFirst("h3.title");
            Element priceEl = product.selectFirst("span.price");
            Element imageEl = product.selectFirst("img");

            String name = titleEl != null ? titleEl.text() : null;
            String price = priceEl != null ? priceEl.text() : null;
            String image = imageEl != null ? imageEl.attr("src") : null;

            // Process immediately instead of storing all in memory
            processProduct(name, price, image);
        }
    }

    private static void processProduct(String name, String price, String image) {
        // Save to database, file, or API
        System.out.println("Product: " + name + " - " + price);
    }
}

2. Memory Management

// Configure JVM for large-scale scraping
// -Xmx4g -XX:+UseG1GC -XX:MaxGCPauseMillis=200

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.List;

public class MemoryEfficientScraper {
    public void processBatch(List<String> urls) {
        for (int i = 0; i < urls.size(); i++) {
            String url = urls.get(i);
            try {
                Document doc = Jsoup.connect(url)
                    .maxBodySize(2 * 1024 * 1024) // 2MB limit
                    .get();

                // Process immediately
                processDocument(doc);

                // Clear reference to help GC
                doc = null;

                // Periodic GC hint for large batches
                if (i % 100 == 0) {
                    System.gc();
                }

            } catch (Exception e) {
                System.err.println("Error: " + e.getMessage());
            }
        }
    }

    private void processDocument(Document doc) {
        // Extract and persist whatever the batch needs
        System.out.println(doc.title());
    }
}

Performance Benchmarks

Based on typical usage patterns:

  • Small pages (< 100KB): 10-50ms parse time
  • Medium pages (100KB-1MB): 50-200ms parse time
  • Large pages (1-5MB): 200-1000ms parse time
  • Memory usage: ~3-5x document size in RAM
  • Concurrent throughput: 50-200 pages/second (depending on network and site)
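
These figures are approximate; they depend on hardware, document structure, and JVM warm-up. You can reproduce the parse-time numbers for your own workload with a quick harness like this one, which parses synthetic HTML of increasing size:

import org.jsoup.Jsoup;

public class ParseBenchmark {
    public static void main(String[] args) {
        int[] paragraphCounts = {1_000, 10_000, 50_000};

        for (int count : paragraphCounts) {
            // Build a synthetic document of roughly known size
            StringBuilder html = new StringBuilder("<html><body>");
            for (int i = 0; i < count; i++) {
                html.append("<p class=\"row\">paragraph ").append(i).append("</p>");
            }
            html.append("</body></html>");

            long start = System.nanoTime();
            Jsoup.parse(html.toString());
            long ms = (System.nanoTime() - start) / 1_000_000;

            System.out.printf("%,d chars parsed in %dms%n", html.length(), ms);
        }
    }
}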

Best Practices for Large Websites

  1. Implement Rate Limiting
   // Use libraries like Guava RateLimiter
   RateLimiter rateLimiter = RateLimiter.create(2.0); // 2 requests per second
   rateLimiter.acquire(); // Block until a permit is available, before each request
  2. Handle Errors Gracefully
   try {
       Document doc = Jsoup.connect(url).get();
   } catch (HttpStatusException e) {
       if (e.getStatusCode() == 429) {
           // Rate limited - back off before retrying
           Thread.sleep(5000);
       }
   } catch (SocketTimeoutException e) {
       // Increase timeout or retry
   }
  3. Monitor Performance
   long startTime = System.nanoTime();
   Document doc = Jsoup.connect(url).get();
   long duration = (System.nanoTime() - startTime) / 1_000_000;
   System.out.println("Fetch + parse time: " + duration + "ms");
  4. Use Connection Pooling for high-volume scraping
  5. Implement Caching to avoid re-processing identical content (see the sketch after this list)
  6. Respect robots.txt and implement proper delays
  7. Use Proxies to distribute load and avoid IP blocking
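
For the caching point above, even an in-memory map keyed by URL avoids refetching and reparsing duplicates within a run. A minimal sketch; a production crawler would bound the cache and honor HTTP caching headers (ETag, Last-Modified):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachingFetcher {
    // Parsed documents keyed by URL; unbounded, for illustration only
    private final Map<String, Document> cache = new ConcurrentHashMap<>();

    public Document fetch(String url) throws IOException {
        Document cached = cache.get(url);
        if (cached != null) {
            return cached; // Skip the network round-trip and the parse
        }
        Document doc = Jsoup.connect(url).timeout(15000).get();
        cache.put(url, doc);
        return doc;
    }
}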

Jsoup provides excellent performance for most web scraping tasks, but success with large websites requires careful attention to concurrency, memory management, and respectful scraping practices.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
