
What is the maximum file size jsoup can handle?

Jsoup has no hard-coded maximum file size for parsing, but two practical constraints apply: available heap memory, since the whole document is held as a DOM tree, and, when fetching over HTTP, the connection's default maxBodySize, which truncates responses beyond roughly 1-2 MB (depending on the jsoup version) unless you raise or disable it. The real-world limit therefore depends on your JVM heap size, document complexity, and parsing requirements. Understanding these limitations and implementing proper optimization strategies is crucial for handling large HTML documents effectively.
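A minimal sketch of lifting the connection-level cap (the exact default maxBodySize varies by jsoup version; heap memory still bounds what you can parse):

// maxBodySize(0) disables the download cap entirely; the parsed DOM
// is still limited by available heap memory.
Document doc = Jsoup.connect("https://example.com/big-page.html")
    .maxBodySize(0)
    .get();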

Memory-Based Limitations

Jsoup loads the entire HTML document into memory as a DOM tree, which means the practical file size limit is determined by the factors below (a rough feasibility check is sketched after this list):

  • Available heap memory: Typically 25-30% of your JVM heap size
  • Document complexity: More nested elements consume more memory
  • Parser overhead: Jsoup's internal structures add memory overhead
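As a rough illustration of the heuristics above (the 4x DOM expansion factor is an assumption, not a jsoup guarantee), you can sanity-check whether a file is likely to fit in memory before parsing it:

import java.io.File;

public class HeapBudgetCheck {

    // Assumption: a parsed DOM often occupies several times the raw HTML size.
    private static final int ASSUMED_DOM_EXPANSION = 4;

    public static boolean likelyFitsInHeap(File htmlFile) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        long estimatedDomSize = htmlFile.length() * ASSUMED_DOM_EXPANSION;
        // Keep the estimated DOM well below the total heap so the rest
        // of the application still has room to work.
        return estimatedDomSize < maxHeap / 4;
    }
}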

Typical Size Guidelines

// Small documents (< 1MB): No issues
Document doc = Jsoup.connect("https://example.com/small-page.html").get();

// Medium documents (1-10MB): Usually manageable with default settings
Document doc = Jsoup.connect("https://example.com/medium-page.html")
    .maxBodySize(10 * 1024 * 1024) // 10MB limit
    .get();

// Large documents (10-100MB): Requires heap tuning
// JVM args: -Xmx2g -Xms1g
Document doc = Jsoup.connect("https://example.com/large-page.html")
    .maxBodySize(100 * 1024 * 1024) // 100MB limit
    .get();

Configuring Memory Limits

Setting Maximum Body Size

Jsoup provides a maxBodySize() method to prevent downloading excessively large documents:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class LargeDocumentHandler {
    public static void main(String[] args) {
        try {
            // Set maximum download size to 50MB
            Document doc = Jsoup.connect("https://example.com/large-file.html")
                .maxBodySize(50 * 1024 * 1024) // 50MB
                .timeout(30000) // 30 second timeout
                .get();

            System.out.println("Document loaded successfully");
            System.out.println("Title: " + doc.title());

        } catch (IOException e) {
            System.err.println("Error loading document: " + e.getMessage());
        }
    }
}

JVM Heap Configuration

For processing large documents, configure appropriate JVM settings:

# Start your Java application with increased heap size
java -Xmx4g -Xms2g -XX:+UseG1GC YourJsoupApplication

# For very large documents (>100MB)
java -Xmx8g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 YourApp

Handling Large Files Efficiently

Streaming Approach for Large Documents

When dealing with very large HTML files, parse from an input stream rather than reading the entire file into a String first. Note that jsoup still builds the full DOM tree in memory, so this lowers peak memory usage but is not truly incremental parsing:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.jsoup.Connection;
import java.io.*;
import java.util.zip.GZIPInputStream;

public class StreamingParser {

    public static Document parseFromFile(String filePath) throws IOException {
        try (FileInputStream fis = new FileInputStream(filePath);
             BufferedInputStream bis = new BufferedInputStream(fis)) {

            // For compressed files
            if (filePath.endsWith(".gz")) {
                try (GZIPInputStream gzis = new GZIPInputStream(bis)) {
                    return Jsoup.parse(gzis, "UTF-8", "");
                }
            }

            return Jsoup.parse(bis, "UTF-8", "");
        }
    }

    public static void processLargeDocument(String url) {
        try {
            // Execute the request first so response headers can be inspected before parsing
            Connection connection = Jsoup.connect(url)
                .maxBodySize(0) // Unlimited download size
                .timeout(60000);

            Connection.Response response = connection.execute();

            String contentLength = response.header("Content-Length");
            if (contentLength != null
                    && Long.parseLong(contentLength) > 100L * 1024 * 1024) { // >100MB
                System.out.println("Warning: Large document detected");
                // Consider alternative processing approach
            }

            Document doc = response.parse();
            processDocumentInChunks(doc);

        } catch (IOException e) {
            System.err.println("Error: " + e.getMessage());
        }
    }

    private static void processDocumentInChunks(Document doc) {
        // Process elements in batches to reduce memory usage
        Elements allElements = doc.getAllElements();
        int batchSize = 1000;

        for (int i = 0; i < allElements.size(); i += batchSize) {
            int end = Math.min(i + batchSize, allElements.size());
            Elements batch = new Elements(allElements.subList(i, end));

            // Process this batch
            processBatch(batch);

            // Optional: Force garbage collection
            if (i % (batchSize * 10) == 0) {
                System.gc();
            }
        }
    }

    private static void processBatch(Elements batch) {
        // Your processing logic here
        for (Element element : batch) {
            // Extract required data
            String text = element.text();
            String tagName = element.tagName();
            // Process as needed
        }
    }
}

Memory-Efficient Parsing Strategies

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class MemoryEfficientParser {

    public static void parseSelectiveContent(String url) throws IOException {
        Document doc = Jsoup.connect(url)
            .maxBodySize(20 * 1024 * 1024) // 20MB limit
            .get();

        // Extract only needed elements to reduce memory footprint
        Elements articles = doc.select("article, .content, main");

        // Remove unnecessary elements early
        doc.select("script, style, nav, footer").remove();

        // Process specific sections
        for (Element article : articles) {
            processArticle(article);
            // Clear processed content to free memory
            article.remove();
        }
    }

    private static void processArticle(Element article) {
        Element titleElement = article.select("h1, h2").first();
        String title = titleElement != null ? titleElement.text() : "";
        String content = article.select("p").text();

        // Process and store data
        System.out.println("Title: " + title);
        System.out.println("Content length: " + content.length());
    }
}

Alternative Approaches for Very Large Files

Using SAX Parser for Extremely Large Documents

For documents exceeding memory constraints, and whose markup is well-formed XML or XHTML, consider SAX (Simple API for XML) parsing, which processes the document as a stream of events instead of building a tree in memory:

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.Attributes;

public class SAXBasedParser extends DefaultHandler {
    private StringBuilder currentElement = new StringBuilder();
    private boolean inTargetElement = false;

    @Override
    public void startElement(String uri, String localName, 
                           String qName, Attributes attributes) {
        if ("div".equals(qName) && "content".equals(attributes.getValue("class"))) {
            inTargetElement = true;
            currentElement = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inTargetElement) {
            currentElement.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (inTargetElement && "div".equals(qName)) {
            // Process the extracted content
            processContent(currentElement.toString());
            inTargetElement = false;
        }
    }

    private void processContent(String content) {
        // Handle extracted content
        System.out.println("Processed content: " + content.substring(0, 
            Math.min(100, content.length())) + "...");
    }

    public static void parseVeryLargeFile(String filePath) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser parser = factory.newSAXParser();
            parser.parse(filePath, new SAXBasedParser());
        } catch (Exception e) {
            System.err.println("SAX parsing error: " + e.getMessage());
        }
    }
}

Performance Monitoring and Optimization

Memory Usage Monitoring

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import java.io.IOException;

public class MemoryMonitor {

    public static void monitorMemoryUsage(String operationName) {
        Runtime runtime = Runtime.getRuntime();
        long usedMemory = runtime.totalMemory() - runtime.freeMemory();
        long maxMemory = runtime.maxMemory();

        System.out.printf("%s - Memory usage: %d MB / %d MB (%.1f%%)%n", 
            operationName,
            usedMemory / (1024 * 1024),
            maxMemory / (1024 * 1024),
            (double) usedMemory / maxMemory * 100);
    }

    public static void parseWithMonitoring(String url) throws IOException {
        monitorMemoryUsage("Before parsing");

        Document doc = Jsoup.connect(url)
            .maxBodySize(50 * 1024 * 1024)
            .get();

        monitorMemoryUsage("After parsing");

        // Process document
        Elements elements = doc.getAllElements();
        monitorMemoryUsage("After element selection");

        // Clean up
        doc = null;
        System.gc();
        monitorMemoryUsage("After cleanup");
    }
}

Best Practices for Large Document Handling

1. Set Appropriate Limits

Connection connection = Jsoup.connect(url)
    .maxBodySize(100 * 1024 * 1024) // 100MB maximum
    .timeout(60000) // 60 second timeout
    .userAgent("Mozilla/5.0 (compatible; WebScraper/1.0)")
    .followRedirects(true);

2. Use Selective Parsing

// Parse only the needed parts
Document doc = Jsoup.connect(url).get();
Elements targetContent = doc.select("main, article, .content");

// Remove unnecessary elements early
doc.select("script, style, nav, header, footer, .sidebar").remove();

3. Process in Batches

public static void processBatchedElements(Elements elements, int batchSize) {
    for (int i = 0; i < elements.size(); i += batchSize) {
        int end = Math.min(i + batchSize, elements.size());
        Elements batch = new Elements(elements.subList(i, end));

        // Process batch
        for (Element element : batch) {
            // Your processing logic
        }

        // Optional memory cleanup
        if (i % (batchSize * 5) == 0) {
            System.gc();
        }
    }
}

Practical File Size Recommendations

The following are general rules of thumb rather than hard limits:

  • Under 1MB: No special configuration needed
  • 1-10MB: Set maxBodySize() and monitor memory usage
  • 10-50MB: Increase JVM heap size (-Xmx2g or higher)
  • 50-100MB: Use memory-efficient parsing strategies
  • Over 100MB: Consider streaming parsers or browser automation tools

For JavaScript-heavy content that requires rendering, browser automation tools like Puppeteer may be more suitable than jsoup. For large downloads that run long, proper timeout configuration and a simple retry strategy also become essential; see the sketch below.
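A minimal sketch of a timeout-plus-retry wrapper (the attempt count, timeout, and backoff values are arbitrary assumptions, not jsoup defaults):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class RetryingFetcher {

    public static Document fetchWithRetry(String url, int maxAttempts)
            throws IOException, InterruptedException {
        IOException lastError = new IOException("Failed to fetch " + url);
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return Jsoup.connect(url)
                    .timeout(60_000)               // generous timeout for large pages
                    .maxBodySize(50 * 1024 * 1024) // assumed 50MB cap
                    .get();
            } catch (IOException e) {
                lastError = e;
                Thread.sleep(1_000L * attempt);    // simple linear backoff between attempts
            }
        }
        throw lastError;
    }
}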

When to Consider Alternatives

For documents larger than 100-200MB or when memory is severely constrained, consider these alternatives:

  1. HTML streaming parsers: For processing HTML as a stream rather than loading into memory
  2. Browser automation tools: When dealing with JavaScript-heavy content that requires rendering
  3. Specialized XML parsers: For XML-based content that can leverage SAX or StAX parsing (a StAX sketch follows this list)
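As a hedged illustration of the StAX option (this only applies to well-formed XML or XHTML input; the file path and the "title" element are placeholders):

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.FileInputStream;

public class StaxExtractor {

    public static void extractTitles(String xmlFilePath) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream(xmlFilePath)) {
            // Pull-based parsing: only the current event is held in memory.
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    // getElementText() reads the text content up to the matching end tag
                    System.out.println("Title: " + reader.getElementText());
                }
            }
            reader.close();
        }
    }
}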

Understanding memory management best practices is crucial when working with large documents in production environments.

Summary

Jsoup's maximum file size is primarily limited by available JVM heap memory rather than any hard-coded restriction, apart from the connection-level maxBodySize default, which you can raise or disable. For optimal performance:

  • Small files (< 1MB): No special configuration needed
  • Medium files (1-10MB): Set appropriate maxBodySize() limits
  • Large files (10-100MB): Increase JVM heap size and use memory-efficient parsing
  • Very large files (> 100MB): Consider alternative parsing strategies or streaming approaches

By following these guidelines and implementing proper memory management techniques, you can effectively handle documents of various sizes while maintaining application stability and performance.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
