What are the Most Popular Java Libraries for Web Scraping?

Java offers several powerful libraries for web scraping, each with unique strengths and use cases. Whether you're scraping static HTML content or dealing with JavaScript-heavy sites, there's a Java library suited for your needs. This comprehensive guide covers the most popular options with practical examples and implementation details.

1. JSoup - The HTML Parser Champion

JSoup is the most popular Java library for parsing and manipulating HTML documents. It's lightweight, fast, and perfect for scraping static content.

Key Features

  • CSS selector support
  • DOM manipulation capabilities
  • Clean API similar to jQuery
  • Built-in data cleaning and validation
  • Excellent performance for static content

Installation

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.16.1</version>
</dependency>

Basic JSoup Example

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JSoupScraper {
    public static void main(String[] args) throws IOException {
        // Connect and parse the webpage
        Document doc = Jsoup.connect("https://example.com")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .get();

        // Extract title
        String title = doc.title();
        System.out.println("Title: " + title);

        // Extract all links using CSS selectors
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            System.out.println("Link: " + link.attr("href"));
            System.out.println("Text: " + link.text());
        }

        // Extract specific elements by class
        Elements articles = doc.select(".article-content");
        for (Element article : articles) {
            System.out.println("Article: " + article.text());
        }
    }
}

Advanced JSoup Features

// Handle forms and POST requests
Document postDoc = Jsoup.connect("https://example.com/search")
    .data("query", "web scraping")
    .data("type", "all")
    .post();

// Set custom headers and cookies
Document customDoc = Jsoup.connect("https://api.example.com")
    .header("Accept", "application/json")
    .cookie("session", "abc123")
    .timeout(10000)
    .get();
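
The feature list above also mentions JSoup's built-in data cleaning. Below is a minimal sketch of sanitizing scraped HTML with Jsoup.clean and a Safelist; the input string is just an illustration:

import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class JsoupCleanExample {
    public static void main(String[] args) {
        // Untrusted HTML scraped from a page (hypothetical input)
        String dirtyHtml = "<p>Hello <script>alert('xss')</script><b>world</b></p>";

        // Safelist.basic() keeps simple text-formatting tags and strips everything else,
        // including the script tag above
        String cleanHtml = Jsoup.clean(dirtyHtml, Safelist.basic());

        System.out.println(cleanHtml);
    }
}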

2. HtmlUnit - The Headless Browser

HtmlUnit is a headless web browser for Java that supports JavaScript execution, making it ideal for dynamic content scraping.

Key Features

  • JavaScript support
  • Cookie management
  • Form submission capabilities
  • AJAX request handling
  • HTTP authentication support

Installation

<dependency>
    <groupId>org.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>3.5.0</version>
</dependency>

HtmlUnit Example

import java.io.IOException;
import java.util.List;

import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlElement;
import org.htmlunit.html.HtmlPage;

public class HtmlUnitScraper {
    public static void main(String[] args) throws IOException {
        try (final WebClient webClient = new WebClient()) {
            // Configure the client
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            // Get the page
            final HtmlPage page = webClient.getPage("https://example.com");

            // Wait for JavaScript to execute
            webClient.waitForBackgroundJavaScript(10000);

            // Extract content
            String title = page.getTitleText();
            System.out.println("Title: " + title);

            // Find elements by XPath
            List<HtmlElement> elements = page.getByXPath("//div[@class='content']");
            for (HtmlElement element : elements) {
                System.out.println("Content: " + element.getTextContent());
            }
        }
    }
}
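
Form submission is one of HtmlUnit's key features. Here is a brief sketch of filling in and submitting a search form; the form and field names ("search", "q", "submit") are hypothetical and depend on the target page:

import java.io.IOException;

import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlForm;
import org.htmlunit.html.HtmlPage;
import org.htmlunit.html.HtmlSubmitInput;
import org.htmlunit.html.HtmlTextInput;

public class HtmlUnitFormExample {
    public static void main(String[] args) throws IOException {
        try (WebClient webClient = new WebClient()) {
            HtmlPage page = webClient.getPage("https://example.com/search");

            // Look up the form and its fields by name (names are hypothetical)
            HtmlForm form = page.getFormByName("search");
            HtmlTextInput queryField = form.getInputByName("q");
            HtmlSubmitInput submitButton = form.getInputByName("submit");

            // Type the query and submit; HtmlUnit returns the resulting page
            queryField.type("web scraping");
            HtmlPage resultsPage = submitButton.click();

            System.out.println("Results title: " + resultsPage.getTitleText());
        }
    }
}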

3. Selenium WebDriver - The Full Browser Solution

Selenium WebDriver provides complete browser automation capabilities, perfect for complex JavaScript-heavy sites and user interaction simulation.

Key Features

  • Full browser automation
  • Multiple browser support (Chrome, Firefox, Safari)
  • Advanced user interaction simulation
  • Screenshot capabilities
  • Extensive wait conditions

Installation

<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>4.15.0</version>
</dependency>

Selenium WebDriver Example

import java.time.Duration;
import java.util.List;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.By;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.WebDriverWait;
import org.openqa.selenium.support.ui.ExpectedConditions;

public class SeleniumScraper {
    public static void main(String[] args) {
        // Configure Chrome options
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless"); // Run in headless mode
        options.addArguments("--no-sandbox");
        options.addArguments("--disable-dev-shm-usage");

        WebDriver driver = new ChromeDriver(options);
        WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));

        try {
            // Navigate to the page
            driver.get("https://example.com");

            // Wait for specific element to load
            WebElement element = wait.until(
                ExpectedConditions.presenceOfElementLocated(
                    By.className("dynamic-content")
                )
            );

            // Extract data
            String title = driver.getTitle();
            System.out.println("Title: " + title);

            // Find multiple elements
            List<WebElement> links = driver.findElements(By.tagName("a"));
            for (WebElement link : links) {
                System.out.println("Link: " + link.getAttribute("href"));
                System.out.println("Text: " + link.getText());
            }

            // Interact with forms
            WebElement searchBox = driver.findElement(By.name("search"));
            searchBox.sendKeys("web scraping");
            searchBox.submit();

        } finally {
            driver.quit();
        }
    }
}
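
The screenshot capability mentioned in the feature list is also straightforward to use. A minimal sketch that saves the current page as a PNG (the output path is just an example):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.openqa.selenium.OutputType;
import org.openqa.selenium.TakesScreenshot;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class SeleniumScreenshot {
    public static void main(String[] args) throws IOException {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");

        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://example.com");

            // Capture the visible page as a temporary PNG file
            File screenshot = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);

            // Copy it to a location of your choice
            Files.copy(screenshot.toPath(), Path.of("page.png"));
        } finally {
            driver.quit();
        }
    }
}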

4. OkHttp + JSoup Combination

OkHttp is an excellent HTTP client that pairs well with JSoup for more control over network requests.

Installation

<dependency>
    <groupId>com.squareup.okhttp3</groupId>
    <artifactId>okhttp</artifactId>
    <version>4.12.0</version>
</dependency>

OkHttp + JSoup Example

import java.io.IOException;
import java.util.concurrent.TimeUnit;

import okhttp3.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class OkHttpJSoupScraper {
    public static void main(String[] args) throws IOException {
        OkHttpClient client = new OkHttpClient.Builder()
            .connectTimeout(30, TimeUnit.SECONDS)
            .readTimeout(30, TimeUnit.SECONDS)
            .build();

        Request request = new Request.Builder()
            .url("https://example.com")
            .addHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
            .addHeader("Accept", "text/html,application/xhtml+xml")
            .build();

        try (Response response = client.newCall(request).execute()) {
            if (response.isSuccessful()) {
                String html = response.body().string();
                Document doc = Jsoup.parse(html);

                // Process the document
                String title = doc.title();
                System.out.println("Title: " + title);
            }
        }
    }
}
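
OkHttp also gives you fine-grained control over request bodies. A short sketch of submitting a form POST and parsing the response with JSoup; the endpoint and field names are hypothetical:

import java.io.IOException;

import okhttp3.FormBody;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.RequestBody;
import okhttp3.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class OkHttpPostExample {
    public static void main(String[] args) throws IOException {
        OkHttpClient client = new OkHttpClient();

        // Build an application/x-www-form-urlencoded body
        RequestBody formBody = new FormBody.Builder()
            .add("query", "web scraping")
            .add("type", "all")
            .build();

        Request request = new Request.Builder()
            .url("https://example.com/search")
            .post(formBody)
            .build();

        try (Response response = client.newCall(request).execute()) {
            if (response.isSuccessful() && response.body() != null) {
                Document doc = Jsoup.parse(response.body().string());
                System.out.println("Result title: " + doc.title());
            }
        }
    }
}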

5. Apache HttpClient

Apache HttpClient provides robust HTTP functionality for complex scraping scenarios requiring advanced features like connection pooling and authentication.

Installation

<dependency>
    <groupId>org.apache.httpcomponents.client5</groupId>
    <artifactId>httpclient5</artifactId>
    <version>5.2.1</version>
</dependency>

Apache HttpClient Example

import java.io.IOException;

import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.client5.http.classic.methods.HttpGet;
import org.apache.hc.core5.http.io.entity.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HttpClientScraper {
    public static void main(String[] args) throws IOException {
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            HttpGet request = new HttpGet("https://example.com");
            request.addHeader("User-Agent", "Java Scraper");

            String response = httpClient.execute(request, response1 -> {
                return EntityUtils.toString(response1.getEntity());
            });

            // Parse with JSoup
            Document doc = Jsoup.parse(response);
            System.out.println("Title: " + doc.title());
        }
    }
}
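
The connection pooling mentioned above is handled by a pooling connection manager. A minimal sketch with illustrative pool sizes:

import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManager;

public class PooledHttpClientFactory {
    public static CloseableHttpClient create() {
        // Pool sizes are illustrative; tune them for your workload
        PoolingHttpClientConnectionManager connectionManager =
            new PoolingHttpClientConnectionManager();
        connectionManager.setMaxTotal(50);           // total connections across all hosts
        connectionManager.setDefaultMaxPerRoute(10); // connections per host

        return HttpClients.custom()
            .setConnectionManager(connectionManager)
            .build();
    }
}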

Library Comparison and Use Cases

When to Use Each Library

| Library | Best For | JavaScript Support | Learning Curve | Performance |
|---------|----------|--------------------|----------------|-------------|
| JSoup | Static HTML parsing | No | Easy | High |
| HtmlUnit | Dynamic content with JS | Yes | Medium | Medium |
| Selenium | Complex interactions | Yes | Medium-Hard | Low |
| OkHttp + JSoup | HTTP control + parsing | No | Medium | High |
| Apache HttpClient | Enterprise applications | No | Medium | High |

Performance Considerations

For high-performance scraping, consider these optimization strategies:

// Connection pooling with OkHttp
OkHttpClient client = new OkHttpClient.Builder()
    .connectionPool(new ConnectionPool(50, 5, TimeUnit.MINUTES))
    .build();

// Parallel processing with CompletableFuture
List<CompletableFuture<String>> futures = urls.stream()
    .map(url -> CompletableFuture.supplyAsync(() -> scrapeUrl(url)))
    .collect(Collectors.toList());

List<String> results = futures.stream()
    .map(CompletableFuture::join)
    .collect(Collectors.toList());
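
The scrapeUrl helper referenced above is not defined in this snippet; one possible JSoup-based implementation could look like this:

// Hypothetical helper used by the CompletableFuture example above
static String scrapeUrl(String url) {
    try {
        return Jsoup.connect(url)
            .userAgent("Mozilla/5.0 (compatible; YourBot/1.0)")
            .timeout(10_000)
            .get()
            .title();
    } catch (IOException e) {
        // Return a marker value so one failed URL doesn't break the whole batch
        return "ERROR: " + url;
    }
}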

Best Practices for Java Web Scraping

1. Respect Rate Limits

// Add delays between requests
Thread.sleep(1000); // 1 second delay

// Use proper user agents
String userAgent = "Mozilla/5.0 (compatible; YourBot/1.0)";
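Putting both ideas together, a simple sequential loop with a fixed delay between requests might look like this (the urls list and the one-second delay are just examples):

// Scrape a list of URLs with a fixed delay between requests
for (String url : urls) {
    Document doc = Jsoup.connect(url)
        .userAgent("Mozilla/5.0 (compatible; YourBot/1.0)")
        .get();
    System.out.println(url + " -> " + doc.title());

    Thread.sleep(1000); // be polite: wait one second before the next request
}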

2. Handle Errors Gracefully

try {
    Document doc = Jsoup.connect(url).get();
    // Process document
} catch (IOException e) {
    logger.error("Failed to scrape URL: " + url, e);
    // Implement retry logic
}
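
The retry logic hinted at in the comment can be as simple as a bounded loop with a delay between attempts; a rough sketch with illustrative limits:

// Simple bounded retry with a fixed delay (parameters are illustrative)
Document doc = null;
int maxAttempts = 3;
for (int attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
        doc = Jsoup.connect(url).timeout(10_000).get();
        break; // success, stop retrying
    } catch (IOException e) {
        logger.warn("Attempt " + attempt + " of " + maxAttempts + " failed for " + url);
        if (attempt == maxAttempts) {
            throw e; // give up after the last attempt
        }
        Thread.sleep(2000); // wait before retrying
    }
}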

3. Configure Connection Settings

// Configure JSoup with custom settings
Connection connection = Jsoup.connect(url)
    .timeout(10000)
    .maxBodySize(1024 * 1024) // 1MB limit
    .followRedirects(true);

Conclusion

Java offers robust options for web scraping, from simple HTML parsing with JSoup to complex browser automation with Selenium. Choose JSoup for static content, HtmlUnit for JavaScript-enabled sites with moderate complexity, and Selenium for full browser automation needs. For enterprise applications requiring advanced HTTP features, combine OkHttp or Apache HttpClient with JSoup for optimal performance and control.

The key to successful Java web scraping is selecting the right tool for your specific use case and implementing proper error handling, rate limiting, and resource management practices.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
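
If you prefer to stay in Java, the first curl call above could be made with the JDK's built-in HTTP client roughly like this; remember to URL-encode the parameters and supply your own API key:

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class WebScrapingAiExample {
    public static void main(String[] args) throws Exception {
        String targetUrl = URLEncoder.encode("https://example.com", StandardCharsets.UTF_8);
        String question = URLEncoder.encode("What is the main topic?", StandardCharsets.UTF_8);
        String apiKey = "YOUR_API_KEY";

        String endpoint = "https://api.webscraping.ai/ai/question"
            + "?url=" + targetUrl
            + "&question=" + question
            + "&api_key=" + apiKey;

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.body());
    }
}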
