Yes, jsoup can be used safely in multithreaded applications, provided you follow a few thread-safety patterns. jsoup's core data structures, such as Document and Element, are not thread-safe, so safety comes from how you confine or synchronize access to them.
Thread Safety Guidelines
1. Use Separate Document Instances Per Thread
Each thread should work with its own Document object. Do not share jsoup objects across threads unless every thread only reads from them.
// Safe: each thread gets its own Document
Runnable task = () -> {
    try {
        Document doc = Jsoup.connect("https://example.com").get();
        // Process doc within this thread only
    } catch (IOException e) {
        // get() throws a checked IOException, which Runnable.run() cannot propagate
        e.printStackTrace();
    }
};
// Unsafe: sharing a mutable Document across threads
Document sharedDoc = Jsoup.connect("https://example.com").get();
// Multiple threads modifying sharedDoc = race conditions
2. Read-Only Access is Thread-Safe
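Once a Document is fully constructed and safely handed to the reading threads (for example, parsed before those threads start), multiple threads can in practice read from it simultaneously without synchronization, as long as no thread modifies it.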
// Parse once, read from multiple threads
Document document = Jsoup.parse(htmlContent);
// Safe: multiple threads reading concurrently
Runnable readTask = () -> {
    String title = document.title();
    Elements links = document.select("a[href]");
    // Read operations are safe while nothing modifies the document
};
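An alternative that avoids sharing the Document at all is to copy the values other threads need into an immutable object on the parsing thread and share only that. The sketch below assumes a hypothetical PageSummary class (not part of jsoup) and uses only the JDK so it compiles on its own; with jsoup, the constructor arguments might come from document.title() and document.select("a[href]").eachAttr("href").

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Immutable snapshot of the values other threads need (hypothetical class). */
final class PageSummary {
    private final String title;
    private final List<String> links;

    PageSummary(String title, List<String> links) {
        this.title = title;
        // Defensive copy, so later changes to the caller's list are not visible here
        this.links = Collections.unmodifiableList(new ArrayList<>(links));
    }

    String title() {
        return title;
    }

    /** Returns an unmodifiable list; attempts to modify it throw UnsupportedOperationException. */
    List<String> links() {
        return links;
    }
}
```

Because every field is final and the list cannot be changed after construction, a PageSummary can be passed to any number of threads without locking, and the Document itself can stay confined to the thread that parsed it.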
3. Synchronize Modifications
If you must modify a shared Document, protect it with a synchronization mechanism such as a lock.
public class ThreadSafeDocumentWrapper {
    private final Document document;
    private final Object lock = new Object();
    public ThreadSafeDocumentWrapper(Document document) {
        this.document = document;
    }
    public void safeModification(String selector, String newText) {
        synchronized (lock) {
            // Set the text of every matching element
            for (Element element : document.select(selector)) {
                element.text(newText);
            }
        }
    }
    public String safeRead(String selector) {
        // Reads take the same lock: an unsynchronized read could
        // observe the tree in the middle of a modification
        synchronized (lock) {
            return document.select(selector).text();
        }
    }
}
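When reads greatly outnumber modifications, a read-write lock is one possible refinement of the wrapper above: readers can run concurrently with each other but never alongside a writer. The following is a minimal sketch, not a jsoup API. ReadWriteGuard is a hypothetical helper, kept generic so it compiles with the JDK alone; with jsoup the guarded type would be Document.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Consumer;
import java.util.function.Function;

/** Guards a single mutable object with a read-write lock (hypothetical helper). */
final class ReadWriteGuard<T> {
    private final T target;
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    ReadWriteGuard(T target) {
        this.target = target;
    }

    /** Runs a read-only function; several readers may hold the read lock at once. */
    <R> R read(Function<? super T, ? extends R> reader) {
        lock.readLock().lock();
        try {
            return reader.apply(target);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Runs a modification; the write lock excludes all readers and other writers. */
    void write(Consumer<? super T> writer) {
        lock.writeLock().lock();
        try {
            writer.accept(target);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

With a Document, usage might look like guard.write(d -> d.title("New title")) and guard.read(d -> d.title()). The reader should return plain values such as strings rather than live Element or Elements objects, since those would escape the lock. This design also relies on concurrent reads being safe among themselves, the same assumption as in guideline 2.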
Practical Examples
Basic Thread Pool Example
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.List;
import java.util.ArrayList;
public class JsoupThreadPoolExample {
    private static final List<String> URLS = List.of(
        "https://example.com",
        "https://httpbin.org/html",
        "https://quotes.toscrape.com"
    );
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(3);
        List<Future<String>> futures = new ArrayList<>();
        // Submit tasks to thread pool
        for (String url : URLS) {
            Future<String> future = executor.submit(() -> {
                try {
                    Document doc = Jsoup.connect(url)
                        .userAgent("Mozilla/5.0")
                        .timeout(5000)
                        .get();
                    return String.format("Thread %s processed %s: %s",
                        Thread.currentThread().getName(),
                        url,
                        doc.title());
                } catch (Exception e) {
                    return "Error processing " + url + ": " + e.getMessage();
                }
            });
            futures.add(future);
        }
        // Collect results
        for (Future<String> future : futures) {
            System.out.println(future.get());
        }
        executor.shutdown();
    }
}
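A variation on the example above separates the pool handling from the fetching, which makes the concurrency logic testable without network access. This is a sketch under assumptions: ParallelFetcher and its fetcher parameter are hypothetical names, and with jsoup the fetcher would be a function that calls Jsoup.connect(url).get() and returns a value such as the title.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelFetcher {
    /** Applies fetcher to every URL on a fixed pool and returns the results in input order. */
    static List<String> fetchAll(List<String> urls, Function<String, String> fetcher, int threads) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> fetcher.apply(url));
            }
            List<String> results = new ArrayList<>();
            // invokeAll blocks until every task has completed and preserves input order
            for (Future<String> future : executor.invokeAll(tasks)) {
                try {
                    results.add(future.get());
                } catch (ExecutionException e) {
                    // One failed task is recorded without affecting the others
                    results.add("Error: " + e.getCause());
                }
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for fetches", e);
        } finally {
            executor.shutdown(); // runs even if waiting for the results fails
        }
    }
}
```

Placing shutdown() in a finally block means the pool's threads are released even when collecting results throws, which the straight-line version above does not guarantee.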
Producer-Consumer Pattern
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
public class JsoupProducerConsumer {
    private static final BlockingQueue<String> urlQueue = new LinkedBlockingQueue<>();
    private static final BlockingQueue<Document> resultQueue = new LinkedBlockingQueue<>();
    static class UrlProducer implements Runnable {
        @Override
        public void run() {
            try {
                urlQueue.put("https://example.com");
                urlQueue.put("https://httpbin.org/html");
                urlQueue.put("STOP"); // Sentinel value
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
    static class DocumentProcessor implements Runnable {
        @Override
        public void run() {
            try {
                while (true) {
                    String url = urlQueue.take();
                    if ("STOP".equals(url)) {
                        break;
                    }
                    try {
                        // Each fetch produces a fresh Document owned by this thread
                        Document doc = Jsoup.connect(url).get();
                        resultQueue.put(doc);
                    } catch (IOException e) {
                        // A failed fetch should not stop the consumer
                        System.err.println("Failed to fetch " + url + ": " + e.getMessage());
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
    public static void main(String[] args) throws Exception {
        Thread producer = new Thread(new UrlProducer());
        Thread processor = new Thread(new DocumentProcessor());
        producer.start();
        processor.start();
        // Wait for both threads to finish before draining the results;
        // polling right after start() would usually find the queue still empty
        producer.join();
        processor.join();
        Document doc;
        while ((doc = resultQueue.poll()) != null) {
            System.out.println("Processed: " + doc.title());
        }
    }
}
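The single "STOP" sentinel above stops only one consumer. With several consumers, a common approach is to enqueue one sentinel per consumer. The sketch below illustrates that with the JDK alone. PoisonPillPipeline and its worker parameter are hypothetical, and with jsoup the worker would fetch and process a URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

final class PoisonPillPipeline {
    private static final String STOP = "STOP"; // sentinel; real inputs must not equal it

    /** Processes inputs on the given number of consumer threads; result order is unspecified. */
    static List<String> run(List<String> inputs, int consumers, Function<String, String> worker) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        ConcurrentLinkedQueue<String> results = new ConcurrentLinkedQueue<>();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < consumers; i++) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        String item = queue.take();
                        if (STOP.equals(item)) {
                            break; // this consumer has taken its own sentinel
                        }
                        results.add(worker.apply(item));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads.add(t);
            t.start();
        }
        try {
            for (String input : inputs) {
                queue.put(input);
            }
            // One sentinel per consumer, so each consumer takes exactly one and exits
            for (int i = 0; i < consumers; i++) {
                queue.put(STOP);
            }
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while running the pipeline", e);
        }
        return new ArrayList<>(results);
    }
}
```

Results are collected only after every consumer has been joined, so the returned list is complete even though the order in which items were processed is not defined.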
Thread-Local Storage for Configuration
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupThreadLocalExample {
    // Jsoup.connect("") fails because it requires a valid URL, so the shared
    // settings live in a session instead (Jsoup.newSession() needs jsoup 1.14.1+)
    private static final ThreadLocal<Connection> sessionCache =
        ThreadLocal.withInitial(() ->
            Jsoup.newSession()
                .userAgent("Mozilla/5.0")
                .timeout(5000)
                .followRedirects(true)
        );
    public static Document fetchDocument(String url) throws Exception {
        // Each request inherits the settings of this thread's session
        Connection session = sessionCache.get();
        return session.newRequest().url(url).get();
    }
    public static void main(String[] args) {
        Runnable task = () -> {
            try {
                // Each thread uses its own session instance
                Document doc = fetchDocument("https://example.com");
                System.out.println(Thread.currentThread().getName() +
                    ": " + doc.title());
            } catch (Exception e) {
                e.printStackTrace();
            }
        };
        // Start multiple threads
        for (int i = 0; i < 3; i++) {
            new Thread(task).start();
        }
    }
}
Best Practices
- Prefer Thread Confinement: Keep Document objects within single threads whenever possible
- Use Thread Pools: Manage thread lifecycle with ExecutorService instead of creating threads manually
- Handle Exceptions Properly: Network operations can fail; implement proper error handling
- Set Timeouts: Always configure connection timeouts to prevent hanging threads
- Monitor Resource Usage: Be mindful of memory usage when processing large documents in parallel
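For the thread pool advice in the list above, calling shutdown() alone does not wait for running tasks to finish. A minimal sketch of a two-phase shutdown, following the pattern described in the ExecutorService documentation, might look like this. PoolShutdown and shutdownGracefully are hypothetical names.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

final class PoolShutdown {
    /** Returns true if the pool terminated within the allotted time. */
    static boolean shutdownGracefully(ExecutorService pool, long timeout, TimeUnit unit) {
        pool.shutdown(); // stop accepting new tasks; already submitted tasks still run
        try {
            if (pool.awaitTermination(timeout, unit)) {
                return true;
            }
            pool.shutdownNow(); // interrupt tasks that are still running
            return pool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return false;
        }
    }
}
```

Blocking network reads do not necessarily respond to interruption, which is one more reason to set a connection timeout on every request so that shutdownNow() is not the only way a stuck task can end.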
What to Avoid
- Sharing mutable Document or Element objects across threads without synchronization
- Modifying shared jsoup objects from multiple threads simultaneously
- Creating excessive threads without proper management
- Ignoring connection timeouts in multithreaded scenarios
By following these patterns and guidelines, you can safely leverage jsoup's HTML parsing capabilities in multithreaded Java applications while maintaining both performance and thread safety.