What is the performance of jsoup when scraping large websites?

Jsoup is a high-performance Java library for HTML parsing and web scraping. When scraping large websites, understanding its performance characteristics is crucial for building efficient and scalable scrapers.

Performance Overview

Jsoup excels at HTML parsing speed, typically processing 1-2MB HTML documents in 50-100ms on modern hardware. However, performance varies significantly based on several factors:

Key Performance Factors

1. HTML Parsing Speed

Jsoup uses an optimized HTML5-compliant parser that handles malformed HTML gracefully. Parse time scales linearly with document size:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PerformanceTest {
    public static void main(String[] args) {
        long startTime = System.currentTimeMillis();

        try {
            Document doc = Jsoup.connect("https://example.com")
                .timeout(30000)
                .get();

            // Note: get() includes the network fetch, so this measures fetch + parse
            long elapsed = System.currentTimeMillis() - startTime;
            System.out.println("Fetch + parse time: " + elapsed + "ms");
            System.out.println("Document size: " + doc.html().length() + " chars");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
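
Because get() bundles the network round-trip with parsing, the timing above mixes the two. To isolate parse time, you can fetch the raw response first with execute() and then time parse() on its own. A minimal sketch:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseOnlyTiming {
    public static void main(String[] args) throws Exception {
        // Fetch first, so network latency is excluded from the measurement
        Connection.Response response = Jsoup.connect("https://example.com")
            .timeout(30000)
            .execute();

        long start = System.nanoTime();
        Document doc = response.parse(); // Parse the already-downloaded body
        long parseMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("Parse-only time: " + parseMs + "ms");
        System.out.println("Title: " + doc.title());
    }
}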

2. Memory Usage Optimization

Jsoup loads entire documents into memory as DOM trees. For large documents (>10MB), this can consume significant RAM:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Memory-efficient approach for large documents
public class MemoryOptimizedScraping {
    public static void extractData(String url) {
        try {
            // Configure connection with limits
            Document doc = Jsoup.connect(url)
                .maxBodySize(5 * 1024 * 1024) // Limit to 5MB
                .timeout(30000)
                .get();

            // Extract only needed data immediately
            Elements targetData = doc.select("div.content");

            // Process and clear references
            processData(targetData);
            doc = null; // Help GC

        } catch (Exception e) {
            System.err.println("Error processing: " + url);
        }
    }

    private static void processData(Elements elements) {
        for (Element element : elements) {
            System.out.println(element.text());
        }
    }
}

3. Selector Performance

CSS selector complexity significantly impacts traversal speed:

// Fast selectors
Elements fast1 = doc.select("div.product");           // Class selector
Elements fast2 = doc.select("#main-content");         // ID selector
Elements fast3 = doc.select("article > h2");          // Direct child

// Slower selectors
Elements slow1 = doc.select("div:contains(product)"); // Text contains
Elements slow2 = doc.select("*[data-id]");           // Universal with attribute
Elements slow3 = doc.select("div div div span");     // Deep traversal
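
To check these differences against your own markup, you can time each query directly. Selector costs vary with document shape, so treat this as a rough harness rather than a benchmark; the URL and selectors below are placeholders:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorTiming {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://example.com").get();

        String[] selectors = {
            "div.product",           // fast: class selector
            "div:contains(product)", // slow: scans element text
            "div div div span"       // slow: deep descendant traversal
        };

        for (String selector : selectors) {
            long start = System.nanoTime();
            int matches = doc.select(selector).size();
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("%-24s %d matches in %dus%n", selector, matches, micros);
        }
    }
}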

Large-Scale Scraping Implementation

Concurrent Processing with Thread Pool

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.concurrent.*;
import java.util.List;
import java.util.ArrayList;

public class ConcurrentJsoupScraper {
    private final ExecutorService executor;
    private final int threadCount;

    public ConcurrentJsoupScraper(int threadCount) {
        this.threadCount = threadCount;
        this.executor = Executors.newFixedThreadPool(threadCount);
    }

    public void scrapeUrls(List<String> urls) {
        List<Future<String>> futures = new ArrayList<>();

        for (String url : urls) {
            Future<String> future = executor.submit(() -> {
                try {
                    // Add delay to respect rate limits
                    Thread.sleep(1000);

                    Document doc = Jsoup.connect(url)
                        .userAgent("Mozilla/5.0 (compatible; Bot/1.0)")
                        .timeout(15000)
                        .followRedirects(true)
                        .get();

                    return extractContent(doc);
                } catch (Exception e) {
                    System.err.println("Failed to scrape: " + url);
                    return null;
                }
            });
            futures.add(future);
        }

        // Collect results
        for (Future<String> future : futures) {
            try {
                String result = future.get(30, TimeUnit.SECONDS);
                if (result != null) {
                    System.out.println(result);
                }
            } catch (Exception e) {
                System.err.println("Task failed: " + e.getMessage());
            }
        }
    }

    private String extractContent(Document doc) {
        return doc.select("h1, h2, p").text();
    }

    public void shutdown() {
        executor.shutdown();
    }
}
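
A short usage sketch, with an illustrative URL list:

import java.util.List;

public class ScraperMain {
    public static void main(String[] args) {
        ConcurrentJsoupScraper scraper = new ConcurrentJsoupScraper(4);
        try {
            scraper.scrapeUrls(List.of(
                "https://example.com/page1",
                "https://example.com/page2"
            ));
        } finally {
            scraper.shutdown(); // Always release the thread pool
        }
    }
}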

Connection Pool and Configuration

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class OptimizedJsoupClient {
    private static final String USER_AGENT = 
        "Mozilla/5.0 (compatible; WebScraper/1.0)";

    public static Connection.Response fetchWithRetry(String url, int maxRetries) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            try {
                Connection.Response response = Jsoup.connect(url)
                    .userAgent(USER_AGENT)
                    .timeout(30000)
                    .followRedirects(true)
                    // jsoup validates TLS certificates and sends
                    // Accept-Encoding: gzip by default, so neither needs to be set here
                    .header("Accept", "text/html,application/xhtml+xml")
                    .header("Accept-Language", "en-US,en;q=0.5")
                    .header("Connection", "keep-alive")
                    .execute();

                if (response.statusCode() == 200) {
                    return response;
                }
            } catch (Exception e) {
                System.err.println("Attempt " + (attempt + 1) + " failed: " + e.getMessage());

                if (attempt < maxRetries - 1) {
                    try {
                        Thread.sleep(2000L << attempt); // Exponential backoff: 2s, 4s, 8s...
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        break;
                    }
                }
            }
        }
        return null;
    }
}
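
The returned Connection.Response still has to be parsed into a Document before you can run selectors; a short usage sketch:

import org.jsoup.Connection;
import org.jsoup.nodes.Document;

public class FetchExample {
    public static void main(String[] args) throws Exception {
        Connection.Response response =
            OptimizedJsoupClient.fetchWithRetry("https://example.com", 3);
        if (response != null) {
            Document doc = response.parse(); // Parse the buffered body into a DOM
            System.out.println(doc.title());
        }
    }
}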

Performance Optimization Strategies

1. Efficient Data Extraction

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class EfficientExtraction {
    public static void extractProductData(Document doc) {
        // Use specific selectors instead of broad searches
        Elements products = doc.select("div.product-item");

        for (Element product : products) {
            // Extract data in one pass; selectFirst returns null when nothing matches
            // (Java has no ?. operator, so null checks are explicit)
            Element titleEl = product.selectFirst("h3.title");
            Element priceEl = product.selectFirst("span.price");
            Element imageEl = product.selectFirst("img");

            String name = titleEl != null ? titleEl.text() : null;
            String price = priceEl != null ? priceEl.text() : null;
            String image = imageEl != null ? imageEl.attr("src") : null;

            // Process immediately instead of storing all in memory
            processProduct(name, price, image);
        }
    }

    private static void processProduct(String name, String price, String image) {
        // Save to database, file, or API
        System.out.println("Product: " + name + " - " + price);
    }
}

2. Memory Management

// Configure JVM for large-scale scraping
// -Xmx4g -XX:+UseG1GC -XX:MaxGCPauseMillis=200

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.List;

public class MemoryEfficientScraper {
    public void processBatch(List<String> urls) {
        for (int i = 0; i < urls.size(); i++) {
            String url = urls.get(i);
            try {
                Document doc = Jsoup.connect(url)
                    .maxBodySize(2 * 1024 * 1024) // 2MB limit
                    .get();

                // Process immediately
                processDocument(doc);

                // Clear reference to help GC
                doc = null;

                // Periodic GC hint for large batches
                if (i % 100 == 0) {
                    System.gc();
                }

            } catch (Exception e) {
                System.err.println("Error: " + e.getMessage());
            }
        }
    }

    private void processDocument(Document doc) {
        // Extract and persist whatever the batch needs
        System.out.println(doc.title());
    }
}

Performance Benchmarks

Based on typical usage patterns:

  • Small pages (< 100KB): 10-50ms parse time
  • Medium pages (100KB-1MB): 50-200ms parse time
  • Large pages (1-5MB): 200-1000ms parse time
  • Memory usage: ~3-5x document size in RAM
  • Concurrent throughput: 50-200 pages/second (depending on network and site)
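
These figures are approximate; they depend on hardware, document structure, and JVM warm-up. You can reproduce the parse-time numbers for your own workload with a quick harness like this one, which parses synthetic HTML of increasing size:

import org.jsoup.Jsoup;

public class ParseBenchmark {
    public static void main(String[] args) {
        int[] paragraphCounts = {1_000, 10_000, 50_000};

        for (int count : paragraphCounts) {
            // Build a synthetic document of roughly known size
            StringBuilder html = new StringBuilder("<html><body>");
            for (int i = 0; i < count; i++) {
                html.append("<p class=\"row\">paragraph ").append(i).append("</p>");
            }
            html.append("</body></html>");

            long start = System.nanoTime();
            Jsoup.parse(html.toString());
            long ms = (System.nanoTime() - start) / 1_000_000;

            System.out.printf("%,d chars parsed in %dms%n", html.length(), ms);
        }
    }
}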

Best Practices for Large Websites

  1. Implement Rate Limiting
   // Use libraries like Guava RateLimiter
   RateLimiter rateLimiter = RateLimiter.create(2.0); // 2 requests per second
   rateLimiter.acquire(); // Block until a permit is available, before each request
  2. Handle Errors Gracefully
   try {
       Document doc = Jsoup.connect(url).get();
   } catch (HttpStatusException e) {
       if (e.getStatusCode() == 429) {
           // Rate limited - back off before retrying
           Thread.sleep(5000);
       }
   } catch (SocketTimeoutException e) {
       // Increase timeout or retry
   }
  3. Monitor Performance
   long startTime = System.nanoTime();
   Document doc = Jsoup.connect(url).get();
   long duration = (System.nanoTime() - startTime) / 1_000_000;
   System.out.println("Fetch + parse time: " + duration + "ms");
  4. Use Connection Pooling for high-volume scraping
  5. Implement Caching to avoid re-processing identical content (see the sketch after this list)
  6. Respect robots.txt and implement proper delays
  7. Use Proxies to distribute load and avoid IP blocking
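
For the caching point above, even an in-memory map keyed by URL avoids refetching and reparsing duplicates within a run. A minimal sketch; a production crawler would bound the cache and honor HTTP caching headers (ETag, Last-Modified):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachingFetcher {
    // Parsed documents keyed by URL; unbounded, for illustration only
    private final Map<String, Document> cache = new ConcurrentHashMap<>();

    public Document fetch(String url) throws IOException {
        Document cached = cache.get(url);
        if (cached != null) {
            return cached; // Skip the network round-trip and the parse
        }
        Document doc = Jsoup.connect(url).timeout(15000).get();
        cache.put(url, doc);
        return doc;
    }
}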

Jsoup provides excellent performance for most web scraping tasks, but success with large websites requires careful attention to concurrency, memory management, and respectful scraping practices.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
