How to Parse HTML from a String with Jsoup

When working with web scraping or HTML processing in Java, you often need to parse HTML content that you already have as a string rather than fetching it from a URL. Jsoup provides powerful methods to parse HTML from strings, making it easy to work with HTML content stored in variables, files, or received from APIs.

Basic HTML String Parsing

The simplest way to parse HTML from a string in Jsoup is using the Jsoup.parse() method:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HtmlStringParser {
    public static void main(String[] args) {
        String html = "<html><head><title>Sample Page</title></head>" +
                     "<body><h1>Welcome</h1><p class='content'>This is a paragraph.</p></body></html>";

        // Parse the HTML string
        Document doc = Jsoup.parse(html);

        // Extract elements
        String title = doc.title();
        Element heading = doc.selectFirst("h1");
        Elements paragraphs = doc.select("p.content");

        System.out.println("Title: " + title);
        System.out.println("Heading: " + heading.text());
        System.out.println("Paragraph: " + paragraphs.first().text());
    }
}

Advanced Parsing with Base URI

When parsing HTML strings that contain relative URLs, you should specify a base URI to resolve these URLs correctly:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class BaseUriParser {
    public static void main(String[] args) {
        String html = "<html><body>" +
                     "<a href='/page1'>Link 1</a>" +
                     "<a href='../page2'>Link 2</a>" +
                     "<img src='images/logo.png' alt='Logo'>" +
                     "</body></html>";

        // Parse with base URI for resolving relative URLs
        String baseUri = "https://example.com/current/";
        Document doc = Jsoup.parse(html, baseUri);

        // Get absolute URLs
        Elements links = doc.select("a[href]");
        Elements images = doc.select("img[src]");

        System.out.println("Links:");
        for (Element link : links) {
            System.out.println("- " + link.attr("abs:href"));
        }

        System.out.println("Images:");
        for (Element img : images) {
            System.out.println("- " + img.attr("abs:src"));
        }
    }
}

Parsing HTML Fragments

Sometimes you need to parse HTML fragments that don't contain the full document structure. Jsoup handles this gracefully:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class FragmentParser {
    public static void main(String[] args) {
        // HTML fragment without <html> or <body> tags
        String htmlFragment = "<div class='container'>" +
                             "<h2>Product List</h2>" +
                             "<ul>" +
                             "<li data-id='1'>Product A - $29.99</li>" +
                             "<li data-id='2'>Product B - $39.99</li>" +
                             "</ul>" +
                             "</div>";

        // Jsoup automatically wraps fragments in proper HTML structure
        Document doc = Jsoup.parse(htmlFragment);

        // Extract product information
        Elements products = doc.select("li[data-id]");

        System.out.println("Products found:");
        for (Element product : products) {
            String id = product.attr("data-id");
            String text = product.text();
            System.out.println("ID: " + id + ", Details: " + text);
        }
    }
}
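For body-level fragments like the one above, jsoup also provides Jsoup.parseBodyFragment(), which treats the input as the contents of the <body> element. A minimal sketch:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BodyFragmentParser {
    public static void main(String[] args) {
        String htmlFragment = "<li data-id='1'>Product A - $29.99</li>";

        // parseBodyFragment wraps the input in a full document shell,
        // placing the fragment inside <body>
        Document doc = Jsoup.parseBodyFragment(htmlFragment);

        System.out.println("ID: " + doc.body().select("li").attr("data-id"));
        System.out.println("Text: " + doc.body().select("li").text());
    }
}
```

Compared to plain Jsoup.parse(), this makes the intent explicit and keeps head-level tags in the fragment from being hoisted into <head>.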

Parsing HTML from Files

You can also parse HTML content that you've read from files:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FileHtmlParser {
    public static void main(String[] args) {
        try {
            // Read HTML content from file (an explicit charset avoids
            // platform-default surprises)
            String htmlContent = new String(Files.readAllBytes(Paths.get("sample.html")),
                StandardCharsets.UTF_8);

            // Parse the HTML string
            Document doc = Jsoup.parse(htmlContent);

            // Process the document
            System.out.println("Page title: " + doc.title());
            System.out.println("Meta description: " + 
                doc.select("meta[name=description]").attr("content"));

        } catch (IOException e) {
            System.err.println("Error reading file: " + e.getMessage());
        }
    }
}
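If the file is on disk anyway, you can skip the manual read and let jsoup handle it with Jsoup.parse(File, charsetName). A minimal sketch, using the same hypothetical sample.html:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.File;
import java.io.IOException;

public class DirectFileParser {
    public static void main(String[] args) {
        try {
            // jsoup reads the file and decodes it with the given charset
            Document doc = Jsoup.parse(new File("sample.html"), "UTF-8");
            System.out.println("Page title: " + doc.title());
        } catch (IOException e) {
            System.err.println("Error reading file: " + e.getMessage());
        }
    }
}
```

This variant also sets the document's base URI to the file location, which helps when the file contains relative links.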

Working with Malformed HTML

One of Jsoup's strengths is handling malformed HTML gracefully. It automatically fixes common issues:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class MalformedHtmlParser {
    public static void main(String[] args) {
        // Malformed HTML with unclosed tags and invalid nesting
        String malformedHtml = "<html><body>" +
                              "<div><p>Unclosed paragraph" +
                              "<span>Nested span</div>" +
                              "<img src='image.jpg'>" +
                              "</body>";

        // Jsoup fixes the structure automatically
        Document doc = Jsoup.parse(malformedHtml);

        // Output the cleaned HTML
        System.out.println("Cleaned HTML:");
        System.out.println(doc.html());

        // Extract elements normally
        System.out.println("\nParagraph text: " + doc.select("p").text());
        System.out.println("Image source: " + doc.select("img").attr("src"));
    }
}

Parsing with Custom Parser Settings

For more control over the parsing process, you can use custom parser settings:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.ParseSettings;
import org.jsoup.parser.Parser;

public class CustomParserSettings {
    public static void main(String[] args) {
        String html = "<html><body><p>Content with &nbsp; entities</p></body></html>";

        // Parse with the XML parser, which preserves the markup as-is
        // instead of normalizing it into an HTML document structure
        Document xmlDoc = Jsoup.parse(html, "", Parser.xmlParser());

        // Parse with the HTML parser (default)
        Document htmlDoc = Jsoup.parse(html);

        System.out.println("XML parser result: " + xmlDoc.select("p").text());
        System.out.println("HTML parser result: " + htmlDoc.select("p").text());

        // Preserve tag and attribute case (ParseSettings is immutable,
        // so pass the preserveCase preset to the parser rather than
        // mutating its settings)
        Parser customParser = Parser.htmlParser().settings(ParseSettings.preserveCase);

        Document customDoc = Jsoup.parse(html, "", customParser);
        System.out.println("Custom parser result: " + customDoc.select("p").text());
    }
}

Practical Example: Processing API Response

Here's a real-world example of parsing HTML content received from an API:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.HashMap;
import java.util.Map;

public class ApiResponseParser {
    public static void main(String[] args) {
        // Simulate HTML content received from an API
        String apiResponse = "<div class='article'>" +
                           "<h1>How to Use Web Scraping APIs</h1>" +
                           "<div class='metadata'>" +
                           "<span class='author'>John Doe</span>" +
                           "<span class='date'>2024-01-15</span>" +
                           "</div>" +
                           "<div class='content'>" +
                           "<p>Web scraping APIs provide powerful tools...</p>" +
                           "<p>They can handle <a href='/javascript-rendering'>JavaScript rendering</a>...</p>" +
                           "</div>" +
                           "</div>";

        // Parse and extract structured data
        Document doc = Jsoup.parse(apiResponse);
        Map<String, String> articleData = parseArticle(doc);

        // Display extracted data
        articleData.forEach((key, value) -> 
            System.out.println(key + ": " + value));
    }

    private static Map<String, String> parseArticle(Document doc) {
        Map<String, String> data = new HashMap<>();

        // Extract article title
        Element title = doc.selectFirst("h1");
        if (title != null) {
            data.put("title", title.text());
        }

        // Extract metadata
        Element author = doc.selectFirst(".metadata .author");
        Element date = doc.selectFirst(".metadata .date");

        if (author != null) data.put("author", author.text());
        if (date != null) data.put("date", date.text());

        // Extract content paragraphs
        Elements paragraphs = doc.select(".content p");
        StringBuilder content = new StringBuilder();
        for (Element p : paragraphs) {
            content.append(p.text()).append(" ");
        }
        data.put("content", content.toString().trim());

        // Extract the first content link; note that Elements.attr()
        // returns the attribute value of the first matching element only
        Elements links = doc.select(".content a[href]");
        if (!links.isEmpty()) {
            data.put("firstLink", links.attr("href"));
        }

        return data;
    }
}

Error Handling and Best Practices

When parsing HTML strings, guard against bad input. Jsoup.parse() repairs malformed markup rather than throwing, so the cases to handle explicitly are null and empty strings:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SafeHtmlParser {
    public static Document safeParseHtml(String html) {
        // Jsoup.parse() repairs malformed markup instead of throwing,
        // so the main inputs to guard against are null and empty strings
        if (html == null || html.trim().isEmpty()) {
            return new Document("");
        }
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        String[] testCases = {
            "<html><body><h1>Valid HTML</h1></body></html>",
            null,
            "",
            "<invalid>Unclosed tag",
            "Plain text without HTML tags"
        };

        for (String html : testCases) {
            Document doc = safeParseHtml(html);
            Elements headings = doc.select("h1");

            System.out.println("Input: " + (html != null ? html.substring(0, Math.min(html.length(), 30)) : "null"));
            System.out.println("Headings found: " + headings.size());
            System.out.println("---");
        }
    }
}

Performance Considerations

When parsing large amounts of HTML content, consider these performance tips:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class PerformanceOptimized {
    public static void main(String[] args) {
        String largeHtml = generateLargeHtmlString();

        // For better performance with large documents
        long startTime = System.currentTimeMillis();

        // Use parser settings for better memory usage
        Parser parser = Parser.htmlParser();
        parser.setTrackErrors(0); // setTrackErrors takes a max error count; 0 disables tracking

        Document doc = parser.parseInput(largeHtml, "");

        long endTime = System.currentTimeMillis();
        System.out.println("Parsing took: " + (endTime - startTime) + "ms");

        // Extract only what you need
        System.out.println("Document has " + doc.select("*").size() + " elements");
    }

    private static String generateLargeHtmlString() {
        StringBuilder html = new StringBuilder("<html><body>");
        for (int i = 0; i < 1000; i++) {
            html.append("<div class='item-").append(i).append("'>")
                .append("<h3>Item ").append(i).append("</h3>")
                .append("<p>Description for item ").append(i).append("</p>")
                .append("</div>");
        }
        html.append("</body></html>");
        return html.toString();
    }
}

Integration with Web Scraping Workflows

Parsing HTML from strings is particularly useful when working with JavaScript rendering solutions or when you need to process HTML content obtained through other means. You can also combine string parsing with browser automation tools for comprehensive web scraping solutions.
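As a sketch of that pattern, the snippet below hands rendered HTML to jsoup for extraction; fetchRenderedHtml() is a hypothetical stand-in for whatever rendering step your workflow uses (a browser automation tool or a rendering API):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class WorkflowSketch {
    public static void main(String[] args) {
        // In a real workflow this string would come from a JavaScript
        // rendering step; here it is stubbed out for illustration
        String renderedHtml = fetchRenderedHtml();

        // Parse with a base URI so relative links resolve correctly
        Document doc = Jsoup.parse(renderedHtml, "https://example.com/");
        System.out.println("Extracted heading: " + doc.select("h1").text());
    }

    // Hypothetical placeholder for the rendering step
    private static String fetchRenderedHtml() {
        return "<html><body><h1>Rendered content</h1></body></html>";
    }
}
```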

Conclusion

Jsoup's string parsing capabilities make it an excellent choice for processing HTML content in Java applications. Whether you're working with API responses, file content, or fragments of HTML, Jsoup provides robust parsing with automatic error correction and a powerful selection API. The key methods to remember are:

  • Jsoup.parse(html) for basic string parsing
  • Jsoup.parse(html, baseUri) for resolving relative URLs
  • Custom parser settings for specialized requirements
  • Proper error handling for production applications

By following these patterns and best practices, you can efficiently parse and extract data from HTML strings in your Java applications while maintaining code reliability and performance.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
