Can jsoup be used in a multithreaded application?

Yes, jsoup can be used safely in multithreaded applications, but only if you follow the right patterns. jsoup's core objects (Document, Element, Connection) are not thread-safe, so safety has to come from how you structure access to them rather than from the library itself.

Thread Safety Guidelines

1. Use Separate Document Instances Per Thread

Each thread should work with its own Document object. Never share a jsoup object across threads unless every thread's access is strictly read-only.

// Safe: Each thread gets its own Document
Runnable task = () -> {
    Document doc = Jsoup.connect("https://example.com").get();
    // Process doc safely within this thread
};

// Unsafe: Sharing mutable Document across threads
Document sharedDoc = Jsoup.connect("https://example.com").get();
// Multiple threads modifying sharedDoc = race conditions

2. Read-Only Access is Thread-Safe

Once a Document has been fully constructed and is no longer modified by any thread, multiple threads can generally read from it concurrently without synchronization. Note that jsoup does not formally guarantee this, so the conservative pattern is to treat the shared Document as strictly immutable and publish it safely (for example, via a final field, or by handing it to worker threads only after parsing completes).

// Parse once, read from multiple threads
Document document = Jsoup.parse(htmlContent);

// Safe: Multiple threads reading concurrently
Runnable readTask = () -> {
    String title = document.title();
    Elements links = document.select("a[href]");
    // Pure reads on a fully built, unmodified Document
};

3. Synchronize Modifications

If you must modify a shared Document, use proper synchronization mechanisms.

public class ThreadSafeDocumentWrapper {
    private final Document document;
    private final Object lock = new Object();

    public ThreadSafeDocumentWrapper(Document document) {
        this.document = document;
    }

    public void safeModification(String selector, String newText) {
        synchronized (lock) {
            // Elements has no text(String) setter; set text element by element
            for (Element element : document.select(selector)) {
                element.text(newText);
            }
        }
    }

    public String safeRead(String selector) {
        // Reads must also hold the lock once writers exist,
        // or they may observe the tree mid-modification
        synchronized (lock) {
            return document.select(selector).text();
        }
    }
}
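When reads greatly outnumber writes, a ReentrantReadWriteLock lets readers proceed concurrently while still serializing writers. A sketch of the same wrapper idea with that trade-off (the class name is illustrative):

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadWriteDocumentWrapper {
    private final Document document;
    private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

    public ReadWriteDocumentWrapper(Document document) {
        this.document = document;
    }

    // Writers hold the exclusive write lock
    public void setText(String selector, String newText) {
        rwLock.writeLock().lock();
        try {
            for (Element element : document.select(selector)) {
                element.text(newText);
            }
        } finally {
            rwLock.writeLock().unlock();
        }
    }

    // Multiple readers may hold the read lock at the same time
    public String getText(String selector) {
        rwLock.readLock().lock();
        try {
            return document.select(selector).text();
        } finally {
            rwLock.readLock().unlock();
        }
    }
}
```

This favors read-heavy workloads; if writes dominate, the plain `synchronized` wrapper above is simpler and performs about the same.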

Practical Examples

Basic Thread Pool Example

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.List;
import java.util.ArrayList;

public class JsoupThreadPoolExample {
    private static final List<String> URLS = List.of(
        "https://example.com",
        "https://httpbin.org/html",
        "https://quotes.toscrape.com"
    );

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(3);
        List<Future<String>> futures = new ArrayList<>();

        // Submit tasks to thread pool
        for (String url : URLS) {
            Future<String> future = executor.submit(() -> {
                try {
                    Document doc = Jsoup.connect(url)
                        .userAgent("Mozilla/5.0")
                        .timeout(5000)
                        .get();

                    return String.format("Thread %s processed %s: %s",
                        Thread.currentThread().getName(),
                        url,
                        doc.title());
                } catch (Exception e) {
                    return "Error processing " + url + ": " + e.getMessage();
                }
            });
            futures.add(future);
        }

        // Collect results
        for (Future<String> future : futures) {
            System.out.println(future.get());
        }

        executor.shutdown();
    }
}

Producer-Consumer Pattern

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class JsoupProducerConsumer {
    private static final BlockingQueue<String> urlQueue = new LinkedBlockingQueue<>();
    private static final BlockingQueue<Document> resultQueue = new LinkedBlockingQueue<>();

    static class UrlProducer implements Runnable {
        @Override
        public void run() {
            try {
                urlQueue.put("https://example.com");
                urlQueue.put("https://httpbin.org/html");
                urlQueue.put("STOP"); // Sentinel value
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    static class DocumentProcessor implements Runnable {
        @Override
        public void run() {
            try {
                while (true) {
                    String url = urlQueue.take();
                    if ("STOP".equals(url)) {
                        break;
                    }

                    try {
                        // Each thread gets its own Document
                        Document doc = Jsoup.connect(url).get();
                        resultQueue.put(doc);
                    } catch (java.io.IOException e) {
                        // A single failed fetch shouldn't kill the whole consumer
                        System.err.println("Failed to fetch " + url + ": " + e.getMessage());
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Thread producer = new Thread(new UrlProducer());
        Thread processor = new Thread(new DocumentProcessor());

        producer.start();
        processor.start();

        // Wait for production and processing to finish before draining results;
        // polling immediately could find an empty queue and skip everything
        producer.join();
        processor.join();

        Document doc;
        while ((doc = resultQueue.poll()) != null) {
            System.out.println("Processed: " + doc.title());
        }
    }
}

Thread-Local Storage for Configuration

import org.jsoup.Jsoup;
import org.jsoup.Connection;

public class JsoupThreadLocalExample {
    // Requires jsoup 1.14.1+ for Jsoup.newSession(); Jsoup.connect("") would throw
    private static final ThreadLocal<Connection> sessionCache =
        ThreadLocal.withInitial(() ->
            Jsoup.newSession()
                .userAgent("Mozilla/5.0")
                .timeout(5000)
                .followRedirects(true)
        );

    public static Document fetchDocument(String url) throws Exception {
        // newRequest() copies the session's settings into a fresh, independent request
        return sessionCache.get().newRequest().url(url).get();
    }

    public static void main(String[] args) {
        Runnable task = () -> {
            try {
                // Each thread uses its own connection instance
                Document doc = fetchDocument("https://example.com");
                System.out.println(Thread.currentThread().getName() + 
                    ": " + doc.title());
            } catch (Exception e) {
                e.printStackTrace();
            }
        };

        // Start multiple threads
        for (int i = 0; i < 3; i++) {
            new Thread(task).start();
        }
    }
}

Best Practices

  1. Prefer Thread Confinement: Keep Document objects within single threads whenever possible
  2. Use Thread Pools: Manage thread lifecycle with ExecutorService instead of creating threads manually
  3. Handle Exceptions Properly: Network operations can fail; implement proper error handling
  4. Set Timeouts: Always configure connection timeouts to prevent hanging threads
  5. Monitor Resource Usage: Be mindful of memory usage when processing large documents in parallel
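Several of these practices can be combined in one small harness. A sketch (it parses local HTML strings rather than fetching over the network so the concurrency pattern stays in focus; the class name and inputs are illustrative):

```java
import org.jsoup.Jsoup;
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.Collectors;

public class BoundedParsePool {
    public static List<String> titles(List<String> htmlPages) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4); // bounded pool instead of raw threads
        try {
            List<Callable<String>> tasks = htmlPages.stream()
                .map(html -> (Callable<String>) () -> Jsoup.parse(html).title()) // one Document per task
                .collect(Collectors.toList());
            // An overall timeout prevents hung tasks from blocking the batch forever
            List<Future<String>> futures = pool.invokeAll(tasks, 10, TimeUnit.SECONDS);
            return futures.stream().map(f -> {
                try {
                    return f.get();
                } catch (Exception e) { // per-task failures don't sink the whole batch
                    return "error: " + e.getMessage();
                }
            }).collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }
}
```

For real network fetches, the same structure applies; add `.timeout(...)` on each `Jsoup.connect(url)` call as discussed above.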

What to Avoid

  • Sharing mutable Document or Element objects across threads without synchronization
  • Modifying shared jsoup objects from multiple threads simultaneously
  • Creating excessive threads without proper management
  • Ignoring connection timeouts in multithreaded scenarios
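When worker threads genuinely need to mutate the same parsed page, one way to stay within these rules is Document.clone(), which produces an independent deep copy of the tree, so each thread mutates its own copy:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ClonePerThread {
    public static void main(String[] args) throws InterruptedException {
        Document original = Jsoup.parse("<p id=greeting>hello</p>");

        Runnable worker = () -> {
            // clone() makes an independent deep copy; mutations never touch 'original'
            Document copy = original.clone();
            copy.getElementById("greeting")
                .text("hello from " + Thread.currentThread().getName());
            System.out.println(copy.getElementById("greeting").text());
        };

        Thread t1 = new Thread(worker);
        Thread t2 = new Thread(worker);
        t1.start(); t2.start();
        t1.join(); t2.join();

        // The shared original is untouched
        System.out.println(original.getElementById("greeting").text()); // prints "hello"
    }
}
```

Cloning costs memory proportional to the document, so prefer it for modest pages rather than very large ones.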

By following these patterns and guidelines, you can safely leverage jsoup's HTML parsing capabilities in multithreaded Java applications while maintaining both performance and thread safety.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
