How can I parse HTML content using JSoup in Java?
JSoup is a powerful Java library for parsing HTML content, manipulating DOM elements, and extracting data from web pages. It provides a convenient API similar to jQuery for selecting and manipulating HTML elements, making it an excellent choice for web scraping and HTML processing tasks in Java applications.
What is JSoup?
JSoup is an open-source Java library that parses real-world HTML into a DOM tree using an HTML5-compliant parser. It provides a fluent API for finding, extracting, and manipulating data, handles malformed HTML gracefully, and is particularly useful for web scraping, data extraction, and HTML cleaning tasks.
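As a minimal sketch of both behaviors, the example below parses a fragment with unclosed tags and sanitizes untrusted markup with Jsoup.clean(). It assumes JSoup 1.14 or later, where the Safelist class replaced the older Whitelist:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

public class MalformedHtmlExample {
    public static void main(String[] args) {
        // JSoup repairs unclosed tags and builds a well-formed DOM tree
        Document doc = Jsoup.parse("<p>Unclosed paragraph<div>Stray div");
        System.out.println(doc.body().html());

        // Jsoup.clean() strips any markup not permitted by the given Safelist;
        // the <script> tag and its contents are removed entirely
        String unsafe = "<p>Hello <script>alert('xss')</script><b>world</b></p>";
        String safe = Jsoup.clean(unsafe, Safelist.basic());
        System.out.println(safe); // <p>Hello <b>world</b></p>
    }
}
```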
Setting Up JSoup
Maven Dependency
Add JSoup to your Maven project by including this dependency in your pom.xml:

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>
```
Gradle Dependency
For Gradle projects, add this to your build.gradle:

```groovy
implementation 'org.jsoup:jsoup:1.17.2'
```
Basic HTML Parsing
Parsing HTML from String
The most basic way to use JSoup is parsing HTML content from a string:
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BasicJSoupExample {
    public static void main(String[] args) {
        String html = "<html><head><title>Sample Page</title></head>"
                + "<body><p class='content'>Hello, World!</p>"
                + "<div id='main'><a href='https://example.com'>Link</a></div>"
                + "</body></html>";

        // Parse the HTML string into a Document
        Document doc = Jsoup.parse(html);

        // Extract the page title
        String title = doc.title();
        System.out.println("Title: " + title);

        // Extract text content via a CSS selector
        String content = doc.select("p.content").text();
        System.out.println("Content: " + content);
    }
}
```
Parsing HTML from URL
JSoup can directly fetch and parse HTML from web URLs:
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.IOException;

public class URLParsingExample {
    public static void main(String[] args) {
        try {
            // Fetch and parse HTML from a URL
            Document doc = Jsoup.connect("https://example.com")
                    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
                    .timeout(5000)
                    .get();

            // Extract data
            String title = doc.title();
            Elements links = doc.select("a[href]");

            System.out.println("Page title: " + title);
            System.out.println("Number of links: " + links.size());
        } catch (IOException e) {
            System.err.println("Error fetching page: " + e.getMessage());
        }
    }
}
```
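Jsoup.connect() returns a Connection whose settings can be chained before the request is sent. A brief sketch of a few commonly used options (the header and referrer values here are purely illustrative):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class ConnectionOptionsExample {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://example.com")
                .header("Accept-Language", "en-US") // add a custom request header
                .referrer("https://google.com")     // set the Referer header
                .maxBodySize(1024 * 1024)           // cap the response body at 1 MB
                .followRedirects(true)              // follow 3xx redirects (the default)
                .get();

        System.out.println(doc.title());
    }
}
```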
Parsing HTML from File
You can also parse HTML content from local files:
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.IOException;

public class FileParsingExample {
    public static void main(String[] args) {
        try {
            File input = new File("path/to/your/file.html");
            // Pass the charset explicitly, or null to let JSoup detect it
            // from the file's BOM or <meta charset> tag
            Document doc = Jsoup.parse(input, "UTF-8");

            // Process the document
            String title = doc.title();
            System.out.println("Document title: " + title);
        } catch (IOException e) {
            System.err.println("Error reading file: " + e.getMessage());
        }
    }
}
```
CSS Selectors and Element Selection
JSoup supports powerful CSS selectors for finding elements:
Basic Selectors
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorExample {
    public static void main(String[] args) {
        String html = "<html><body>"
                + "<div class='container'>"
                + "<h1 id='title'>Main Title</h1>"
                + "<p class='text'>First paragraph</p>"
                + "<p class='text highlight'>Second paragraph</p>"
                + "<ul><li>Item 1</li><li>Item 2</li></ul>"
                + "</div></body></html>";

        Document doc = Jsoup.parse(html);

        // Select by tag
        Elements paragraphs = doc.select("p");

        // Select by class
        Elements textElements = doc.select(".text");

        // Select by ID
        Element title = doc.select("#title").first();

        // Select by tag and class combined
        Elements highlighted = doc.select("p.highlight");

        // Complex selectors: descendants of div.container
        Elements listItems = doc.select("div.container ul li");

        System.out.println("Paragraphs found: " + paragraphs.size());
        System.out.println("Title text: " + (title != null ? title.text() : "Not found"));
    }
}
```
Advanced Selectors
```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class AdvancedSelectors {
    public static void demonstrateSelectors(Document doc) {
        // Attribute selectors
        Elements linksWithHref = doc.select("a[href]");
        Elements externalLinks = doc.select("a[href^=http]"); // href starts with "http"
        Elements pdfLinks = doc.select("a[href$=.pdf]");      // href ends with ".pdf"

        // Pseudo-selectors
        Element firstParagraph = doc.select("p:first-child").first();
        Elements evenTableRows = doc.select("tr:nth-child(even)");

        // Combinators
        Elements directChildren = doc.select("div > p"); // direct children only
        Elements descendants = doc.select("div p");      // descendants at any depth
        Elements siblings = doc.select("h1 + p");        // immediately following sibling

        // Text content selectors
        Elements containsText = doc.select("p:contains(specific text)");
        Elements matchesRegex = doc.select("p:matches(\\d+)"); // text matches a regex
    }
}
```
Data Extraction and Manipulation
Extracting Text and Attributes
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DataExtraction {
    public static void main(String[] args) {
        String html = "<div><a href='https://example.com' title='Example Site'>Visit Example</a>"
                + "<img src='image.jpg' alt='Sample Image' width='300'>"
                + "<p data-id='123'>This is a paragraph with custom data.</p></div>";

        Document doc = Jsoup.parse(html);

        // Extract text content and attributes from a link
        Element link = doc.select("a").first();
        if (link != null) {
            String linkText = link.text();
            String href = link.attr("href");
            String title = link.attr("title");
            System.out.println("Link text: " + linkText);
            System.out.println("Link URL: " + href);
            System.out.println("Link title: " + title);
        }

        // Extract image attributes
        Element image = doc.select("img").first();
        if (image != null) {
            String src = image.attr("src");
            String alt = image.attr("alt");
            String width = image.attr("width");
            System.out.println("Image source: " + src);
            System.out.println("Alt text: " + alt);
            System.out.println("Width: " + width);
        }

        // Extract custom data attributes
        Element paragraph = doc.select("p[data-id]").first();
        if (paragraph != null) {
            String dataId = paragraph.attr("data-id");
            String text = paragraph.text();
            System.out.println("Data ID: " + dataId);
            System.out.println("Paragraph text: " + text);
        }
    }
}
```
Modifying HTML Content
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HTMLModification {
    public static void main(String[] args) {
        String html = "<html><body><div id='content'><p>Original content</p></div></body></html>";
        Document doc = Jsoup.parse(html);

        // Modify text content
        Element content = doc.select("#content p").first();
        if (content != null) {
            content.text("Modified content");
        }

        // Add new elements
        Element contentDiv = doc.select("#content").first();
        if (contentDiv != null) {
            contentDiv.append("<p>New paragraph added</p>");
            contentDiv.prepend("<h2>Added Title</h2>");
        }

        // Modify attributes
        Element paragraph = doc.select("p").first();
        if (paragraph != null) {
            paragraph.attr("class", "modified");
            paragraph.attr("data-modified", "true");
        }

        // Remove elements whose text contains "Modified"
        // (the original paragraph's text was changed above)
        Elements toRemove = doc.select("p:contains(Modified)");
        toRemove.remove();

        // Output the modified HTML
        System.out.println(doc.html());
    }
}
```
Web Scraping with JSoup
Complete Web Scraping Example
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class WebScrapingExample {

    static class Article {
        private final String title;
        private final String url;
        private final String summary;

        public Article(String title, String url, String summary) {
            this.title = title;
            this.url = url;
            this.summary = summary;
        }

        @Override
        public String toString() {
            return "Article{title='" + title + "', url='" + url + "', summary='" + summary + "'}";
        }
    }

    public static List<Article> scrapeArticles(String url) {
        List<Article> articles = new ArrayList<>();
        try {
            Document doc = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
                    .timeout(10000)
                    .followRedirects(true)
                    .get();

            Elements articleElements = doc.select("article, .article, .post");

            for (Element article : articleElements) {
                String title = extractTitle(article);
                String articleUrl = extractUrl(article);
                String summary = extractSummary(article);

                if (title != null && !title.isEmpty()) {
                    articles.add(new Article(title, articleUrl, summary));
                }
            }
        } catch (IOException e) {
            System.err.println("Error scraping articles: " + e.getMessage());
        }
        return articles;
    }

    private static String extractTitle(Element article) {
        Element titleElement = article.select("h1, h2, h3, .title, .headline").first();
        return titleElement != null ? titleElement.text().trim() : null;
    }

    private static String extractUrl(Element article) {
        Element linkElement = article.select("a[href]").first();
        // absUrl() resolves relative links against the document's base URI,
        // which Jsoup.connect() sets to the fetched URL
        return linkElement != null ? linkElement.absUrl("href") : null;
    }

    private static String extractSummary(Element article) {
        Element summaryElement = article.select("p, .summary, .excerpt").first();
        return summaryElement != null ? summaryElement.text().trim() : "";
    }

    public static void main(String[] args) {
        List<Article> articles = scrapeArticles("https://example-news-site.com");
        articles.forEach(System.out::println);
    }
}
```
Advanced JSoup Features
Handling Forms and POST Requests
```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.Map;

public class FormHandling {
    public static void submitForm() {
        try {
            // First, get the form page
            Document formPage = Jsoup.connect("https://example.com/login")
                    .get();

            // Extract any hidden form fields or CSRF tokens
            String csrfToken = formPage.select("input[name=_token]").attr("value");

            // Submit form data
            Connection.Response response = Jsoup.connect("https://example.com/login")
                    .data("username", "your_username")
                    .data("password", "your_password")
                    .data("_token", csrfToken)
                    .method(Connection.Method.POST)
                    .execute();

            // Get cookies from response
            Map<String, String> cookies = response.cookies();

            // Use cookies for subsequent requests
            Document loggedInPage = Jsoup.connect("https://example.com/dashboard")
                    .cookies(cookies)
                    .get();

            System.out.println("Logged in successfully");
        } catch (IOException e) {
            System.err.println("Form submission failed: " + e.getMessage());
        }
    }
}
```
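If you need the page returned by the POST itself (for example, to check for an error message), the Connection.Response can be parsed directly. A small sketch against the same hypothetical login endpoint:

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class ParseResponseExample {
    public static void main(String[] args) throws IOException {
        Connection.Response response = Jsoup.connect("https://example.com/login")
                .data("username", "your_username")
                .method(Connection.Method.POST)
                .execute();

        // parse() turns the response body into a Document,
        // so you can inspect the page the POST returned
        Document result = response.parse();
        System.out.println("Response title: " + result.title());
        System.out.println("Status: " + response.statusCode());
    }
}
```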
Error Handling and Best Practices
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.net.SocketTimeoutException;

public class RobustScraping {
    public static Document safeConnect(String url, int maxRetries) {
        int retries = 0;
        while (retries < maxRetries) {
            try {
                return Jsoup.connect(url)
                        .userAgent("Mozilla/5.0 (compatible; JavaBot)")
                        .timeout(10000)
                        .followRedirects(true)
                        .ignoreHttpErrors(true)
                        .get();
            } catch (SocketTimeoutException e) {
                retries++;
                System.err.println("Timeout occurred, retry " + retries + "/" + maxRetries);
                if (retries < maxRetries) {
                    try {
                        Thread.sleep(2000); // Wait before retrying
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        break;
                    }
                }
            } catch (IOException e) {
                System.err.println("Connection failed: " + e.getMessage());
                break;
            }
        }
        return null;
    }

    public static void safeExtraction(Document doc) {
        if (doc == null) {
            System.err.println("Document is null, cannot extract data");
            return;
        }

        // Safe element selection with empty checks
        Elements titles = doc.select("h1");
        if (!titles.isEmpty()) {
            String title = titles.first().text();
            System.out.println("Title: " + title);
        } else {
            System.out.println("No title found");
        }

        // Safe attribute extraction
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            String href = link.attr("href");
            String text = link.text();
            if (!href.isEmpty() && !text.isEmpty()) {
                System.out.println("Link: " + text + " -> " + href);
            }
        }
    }
}
```
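Because ignoreHttpErrors(true) suppresses exceptions for 4xx/5xx responses, it is worth checking the status code explicitly before trusting the parsed document. A minimal sketch using execute():

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class StatusCheckExample {
    public static void main(String[] args) throws IOException {
        Connection.Response response = Jsoup.connect("https://example.com")
                .ignoreHttpErrors(true) // don't throw on 4xx/5xx
                .execute();

        if (response.statusCode() == 200) {
            Document doc = response.parse();
            System.out.println("Fetched: " + doc.title());
        } else {
            System.err.println("HTTP error: " + response.statusCode()
                    + " " + response.statusMessage());
        }
    }
}
```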
Performance Optimization
Efficient Memory Usage
```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

public class PerformanceOptimization {
    public static void optimizedParsing() {
        String html = "<!-- Large HTML content -->";

        // Parse with an explicit parser; Parser.htmlParser() is the default,
        // and Parser.xmlParser() is available for XML documents
        Document doc = Jsoup.parse(html, "", Parser.htmlParser());

        // Process elements in batches so intermediate results can be
        // garbage-collected between batches, reducing peak memory usage
        Elements elements = doc.select("div");
        int batchSize = 100;

        for (int i = 0; i < elements.size(); i += batchSize) {
            int end = Math.min(i + batchSize, elements.size());
            Elements batch = new Elements(elements.subList(i, end));
            processBatch(batch);
        }
    }

    private static void processBatch(Elements batch) {
        for (Element element : batch) {
            // Process individual elements
            String text = element.text();
            // Do something with the text
        }
    }
}
```
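A related micro-optimization: selectFirst() stops at the first match, whereas select(...).first() collects every matching element before taking the first. A short sketch of the difference:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectFirstExample {
    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div><p>a</p><p>b</p><p>c</p></div>");

        // Collects all three <p> elements, then takes the first
        Element viaSelect = doc.select("p").first();

        // Short-circuits after the first <p>; returns null if nothing matches
        Element viaSelectFirst = doc.selectFirst("p");

        System.out.println(viaSelect.text() + " == " + viaSelectFirst.text());
    }
}
```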
Comparison with Other Technologies
While JSoup is excellent for server-side HTML parsing in Java, it does not execute JavaScript. For JavaScript-heavy websites, you may need a browser-automation tool such as Puppeteer to render dynamic content or handle interactive authentication, after which the resulting HTML can still be handed to JSoup for parsing.
Conclusion
JSoup is a powerful and flexible library for HTML parsing in Java applications. It provides an intuitive API for selecting, extracting, and manipulating HTML content, making it ideal for web scraping, data extraction, and HTML processing tasks. Key benefits include:
- Easy to use: jQuery-like syntax for element selection
- Robust parsing: Handles malformed HTML gracefully
- Rich API: Comprehensive methods for data extraction and manipulation
- Network support: Built-in HTTP client for web scraping
- Memory efficient: Optimized for processing large HTML documents
Whether you're building a web scraper, processing HTML files, or extracting data from web pages, JSoup provides the tools you need to work effectively with HTML content in Java applications.
Remember to always respect robots.txt files, implement proper rate limiting, and follow ethical web scraping practices when using JSoup for web scraping projects.
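As a starting point for rate limiting, a minimal sketch of a polite crawler that pauses between requests (the one-second delay and URLs are illustrative; check each site's robots.txt and terms for its actual rules):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.List;

public class PoliteCrawler {
    public static void main(String[] args) throws IOException, InterruptedException {
        List<String> urls = List.of(
                "https://example.com/page1",
                "https://example.com/page2");

        for (String url : urls) {
            Document doc = Jsoup.connect(url).get();
            System.out.println(url + " -> " + doc.title());
            Thread.sleep(1000); // rate limit: pause between requests
        }
    }
}
```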