Jsoup is a high-performance Java library for HTML parsing and web scraping. When scraping large websites, understanding its performance characteristics is crucial for building efficient and scalable scrapers.
Performance Overview
Jsoup excels at HTML parsing speed, typically parsing a 1MB document in roughly 50-200ms on modern hardware. However, performance varies significantly based on several factors:
Key Performance Factors
1. HTML Parsing Speed
Jsoup uses an optimized HTML5-compliant parser that handles malformed HTML gracefully. Parse time scales roughly linearly with document size. Note that the example below times the full fetch-and-parse cycle, so network latency is included:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PerformanceTest {
    public static void main(String[] args) {
        long startTime = System.currentTimeMillis();
        try {
            Document doc = Jsoup.connect("https://example.com")
                    .timeout(30000)
                    .get();
            long elapsed = System.currentTimeMillis() - startTime;
            System.out.println("Fetch + parse time: " + elapsed + "ms");
            System.out.println("Document size: " + doc.html().length() + " chars");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
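To verify the parsing cost in isolation, download the HTML first and time only the parse call. A minimal sketch (the URL is a placeholder):

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseOnlyTiming {
    public static void main(String[] args) throws Exception {
        // Download first, so the timed section covers parsing only
        Connection.Response response = Jsoup.connect("https://example.com").execute();
        String html = response.body();

        long start = System.nanoTime();
        Document doc = Jsoup.parse(html, "https://example.com");
        long parseMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Parse-only time: " + parseMs + "ms for " + html.length() + " chars");
    }
}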
2. Memory Usage Optimization
Jsoup loads entire documents into memory as DOM trees. For large documents (>10MB), this can consume significant RAM:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Memory-efficient approach for large documents
public class MemoryOptimizedScraping {
    public static void extractData(String url) {
        try {
            // Configure connection with limits
            Document doc = Jsoup.connect(url)
                    .maxBodySize(5 * 1024 * 1024) // Limit to 5MB
                    .timeout(30000)
                    .get();
            // Extract only needed data immediately
            Elements targetData = doc.select("div.content");
            // Process right away; the Document becomes eligible for GC once
            // it and the Elements referencing it go out of scope
            processData(targetData);
        } catch (Exception e) {
            System.err.println("Error processing: " + url);
        }
    }

    private static void processData(Elements elements) {
        for (Element element : elements) {
            System.out.println(element.text());
        }
    }
}
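When the HTML is already on disk, you can parse directly from the file rather than reading it into a String first, which avoids holding the raw markup and the DOM in memory at the same time. A small sketch (the file path is hypothetical):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.File;
import java.io.IOException;

public class FileParsing {
    public static void main(String[] args) throws IOException {
        File input = new File("/tmp/page.html"); // hypothetical path
        // "UTF-8" is assumed here; pass null to let jsoup sniff the charset
        Document doc = Jsoup.parse(input, "UTF-8", "https://example.com/");
        System.out.println(doc.title());
    }
}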
3. Selector Performance
CSS selector complexity significantly impacts traversal speed:
// Fast selectors
Elements fast1 = doc.select("div.product"); // Class selector
Elements fast2 = doc.select("#main-content"); // ID selector
Elements fast3 = doc.select("article > h2"); // Direct child
// Slower selectors
Elements slow1 = doc.select("div:contains(product)"); // Text contains
Elements slow2 = doc.select("*[data-id]"); // Universal with attribute
Elements slow3 = doc.select("div div div span"); // Deep traversal
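Scoping also matters: running a query against a subtree visits far fewer nodes than querying the whole document, and selectFirst stops at the first match instead of collecting every one. A brief sketch (the selectors are illustrative):

// Narrow the search space before running detailed queries
Element main = doc.selectFirst("#main-content");
if (main != null) {
    // Queries against a subtree visit far fewer nodes than doc.select(...)
    Elements titles = main.select("h2.title");
    // selectFirst stops traversal at the first hit
    Element firstPrice = main.selectFirst("span.price");
}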
Large-Scale Scraping Implementation
Concurrent Processing with Thread Pool
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.concurrent.*;
import java.util.List;
import java.util.ArrayList;

public class ConcurrentJsoupScraper {
    private final ExecutorService executor;

    public ConcurrentJsoupScraper(int threadCount) {
        this.executor = Executors.newFixedThreadPool(threadCount);
    }

    public void scrapeUrls(List<String> urls) {
        List<Future<String>> futures = new ArrayList<>();
        for (String url : urls) {
            Future<String> future = executor.submit(() -> {
                try {
                    // Crude per-thread delay; for precise limits use a shared
                    // rate limiter (see Best Practices below)
                    Thread.sleep(1000);
                    Document doc = Jsoup.connect(url)
                            .userAgent("Mozilla/5.0 (compatible; Bot/1.0)")
                            .timeout(15000)
                            .followRedirects(true)
                            .get();
                    return extractContent(doc);
                } catch (Exception e) {
                    System.err.println("Failed to scrape: " + url);
                    return null;
                }
            });
            futures.add(future);
        }
        // Collect results
        for (Future<String> future : futures) {
            try {
                String result = future.get(30, TimeUnit.SECONDS);
                if (result != null) {
                    System.out.println(result);
                }
            } catch (Exception e) {
                System.err.println("Task failed: " + e.getMessage());
            }
        }
    }

    private String extractContent(Document doc) {
        return doc.select("h1, h2, p").text();
    }

    public void shutdown() {
        executor.shutdown();
    }
}
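A typical invocation (the URLs are placeholders):

ConcurrentJsoupScraper scraper = new ConcurrentJsoupScraper(4);
scraper.scrapeUrls(List.of(
        "https://example.com/page1",
        "https://example.com/page2"));
scraper.shutdown();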
Connection Configuration and Retry Logic
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class OptimizedJsoupClient {
    private static final String USER_AGENT =
            "Mozilla/5.0 (compatible; WebScraper/1.0)";

    public static Connection.Response fetchWithRetry(String url, int maxRetries) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            try {
                Connection.Response response = Jsoup.connect(url)
                        .userAgent(USER_AGENT)
                        .timeout(30000)
                        .followRedirects(true)
                        .header("Accept", "text/html,application/xhtml+xml")
                        .header("Accept-Language", "en-US,en;q=0.5")
                        .header("Accept-Encoding", "gzip, deflate")
                        .header("Connection", "keep-alive")
                        .execute();
                if (response.statusCode() == 200) {
                    return response;
                }
            } catch (Exception e) {
                System.err.println("Attempt " + (attempt + 1) + " failed: " + e.getMessage());
                if (attempt < maxRetries - 1) {
                    try {
                        Thread.sleep(2000L * (1L << attempt)); // Exponential backoff: 2s, 4s, 8s, ...
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        break;
                    }
                }
            }
        }
        return null;
    }
}
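Because execute() returns a raw Connection.Response, remember to call parse() to get a Document. A quick usage sketch (exception handling left to the caller):

Connection.Response response = OptimizedJsoupClient.fetchWithRetry("https://example.com", 3);
if (response != null) {
    Document doc = response.parse(); // parse the buffered body into a DOM
    System.out.println(doc.title());
}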
Performance Optimization Strategies
1. Efficient Data Extraction
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class EfficientExtraction {
    public static void extractProductData(Document doc) {
        // Use specific selectors instead of broad searches
        Elements products = doc.select("div.product-item");
        for (Element product : products) {
            // Extract data in one pass; selectFirst returns null when
            // nothing matches, so guard before dereferencing
            Element nameEl = product.selectFirst("h3.title");
            Element priceEl = product.selectFirst("span.price");
            Element imageEl = product.selectFirst("img");
            String name = nameEl != null ? nameEl.text() : null;
            String price = priceEl != null ? priceEl.text() : null;
            String image = imageEl != null ? imageEl.attr("src") : null;
            // Process immediately instead of storing all in memory
            processProduct(name, price, image);
        }
    }

    private static void processProduct(String name, String price, String image) {
        // Save to database, file, or API
        System.out.println("Product: " + name + " - " + price);
    }
}
2. Memory Management
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.List;

// Configure the JVM for large-scale scraping, e.g.:
// -Xmx4g -XX:+UseG1GC -XX:MaxGCPauseMillis=200
public class MemoryEfficientScraper {
    public void processBatch(List<String> urls) {
        for (int i = 0; i < urls.size(); i++) {
            String url = urls.get(i);
            try {
                Document doc = Jsoup.connect(url)
                        .maxBodySize(2 * 1024 * 1024) // 2MB limit
                        .get();
                // Process immediately; doc goes out of scope each iteration
                processDocument(doc);
                // Optional GC hint for very large batches (the JVM may ignore it)
                if (i > 0 && i % 100 == 0) {
                    System.gc();
                }
            } catch (Exception e) {
                System.err.println("Error: " + e.getMessage());
            }
        }
    }

    private void processDocument(Document doc) {
        // Extract and persist whatever the batch needs
        System.out.println(doc.title());
    }
}
Performance Benchmarks
Based on typical usage patterns:
- Small pages (< 100KB): 10-50ms parse time
- Medium pages (100KB-1MB): 50-200ms parse time
- Large pages (1-5MB): 200-1000ms parse time
- Memory usage: ~3-5x document size in RAM
- Concurrent throughput: 50-200 pages/second (depending on network and site)
Best Practices for Large Websites
- Implement Rate Limiting
// Use libraries like Guava RateLimiter
RateLimiter rateLimiter = RateLimiter.create(2.0); // 2 requests per second
rateLimiter.acquire(); // Before each request
- Handle Errors Gracefully
// needs: org.jsoup.HttpStatusException, java.net.SocketTimeoutException
try {
    Document doc = Jsoup.connect(url).get();
} catch (HttpStatusException e) {
    if (e.getStatusCode() == 429) {
        // Rate limited - back off before retrying
        Thread.sleep(5000); // the enclosing method must handle InterruptedException
    }
} catch (SocketTimeoutException e) {
    // Timed out - increase the timeout or retry
}
- Monitor Performance
long startTime = System.nanoTime();
Document doc = Jsoup.connect(url).get();
long durationMs = (System.nanoTime() - startTime) / 1_000_000;
System.out.println("Fetch + parse time: " + durationMs + "ms");
- Reuse Connections for high-volume scraping; jsoup's default HttpURLConnection transport can keep connections to the same host alive between requests
- Implement Caching to avoid re-fetching and re-parsing identical content (see the sketch after this list)
- Respect robots.txt and implement proper delays
- Use Proxies to distribute load and avoid IP blocking
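For caching, a minimal in-memory version keyed by URL could look like the sketch below; the TTL and class names are illustrative, and a production crawler might honor ETag/Last-Modified headers instead:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachingFetcher {
    private static final long TTL_MILLIS = 10 * 60 * 1000; // illustrative 10-minute TTL

    private static class CacheEntry {
        final Document doc;
        final long fetchedAt;
        CacheEntry(Document doc, long fetchedAt) {
            this.doc = doc;
            this.fetchedAt = fetchedAt;
        }
    }

    private final Map<String, CacheEntry> cache = new ConcurrentHashMap<>();

    public Document fetch(String url) throws IOException {
        CacheEntry entry = cache.get(url);
        if (entry != null && System.currentTimeMillis() - entry.fetchedAt < TTL_MILLIS) {
            return entry.doc; // cache hit: skip both network and parsing
        }
        Document doc = Jsoup.connect(url).get();
        cache.put(url, new CacheEntry(doc, System.currentTimeMillis()));
        return doc;
    }
}

Caching whole Document objects trades memory for speed; for large crawls, consider caching the extracted data rather than the DOM.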
Jsoup provides excellent performance for most web scraping tasks, but success with large websites requires careful attention to concurrency, memory management, and respectful scraping practices.