How to Parse HTML from a String with Jsoup
When working with web scraping or HTML processing in Java, you often need to parse HTML content that you already have as a string rather than fetching it from a URL. Jsoup provides powerful methods to parse HTML from strings, making it easy to work with HTML content stored in variables, files, or received from APIs.
Basic HTML String Parsing
The simplest way to parse HTML from a string is the Jsoup.parse() method:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HtmlStringParser {
    public static void main(String[] args) {
        String html = "<html><head><title>Sample Page</title></head>" +
                "<body><h1>Welcome</h1><p class='content'>This is a paragraph.</p></body></html>";

        // Parse the HTML string
        Document doc = Jsoup.parse(html);

        // Extract elements
        String title = doc.title();
        Element heading = doc.selectFirst("h1");
        Elements paragraphs = doc.select("p.content");

        System.out.println("Title: " + title);
        System.out.println("Heading: " + heading.text());
        System.out.println("Paragraph: " + paragraphs.first().text());
    }
}
Advanced Parsing with Base URI
When parsing HTML strings that contain relative URLs, you should specify a base URI to resolve these URLs correctly:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class BaseUriParser {
    public static void main(String[] args) {
        String html = "<html><body>" +
                "<a href='/page1'>Link 1</a>" +
                "<a href='../page2'>Link 2</a>" +
                "<img src='images/logo.png' alt='Logo'>" +
                "</body></html>";

        // Parse with base URI for resolving relative URLs
        String baseUri = "https://example.com/current/";
        Document doc = Jsoup.parse(html, baseUri);

        // Get absolute URLs
        Elements links = doc.select("a[href]");
        Elements images = doc.select("img[src]");

        System.out.println("Links:");
        for (Element link : links) {
            System.out.println("- " + link.attr("abs:href"));
        }

        System.out.println("Images:");
        for (Element img : images) {
            System.out.println("- " + img.attr("abs:src"));
        }
    }
}
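If you prefer a method call over the abs: attribute prefix, Element.absUrl("href") returns the same resolved value and yields an empty string when the URL cannot be made absolute.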
Parsing HTML Fragments
Sometimes you need to parse HTML fragments that don't contain the full document structure. Jsoup handles this gracefully:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class FragmentParser {
    public static void main(String[] args) {
        // HTML fragment without <html> or <body> tags
        String htmlFragment = "<div class='container'>" +
                "<h2>Product List</h2>" +
                "<ul>" +
                "<li data-id='1'>Product A - $29.99</li>" +
                "<li data-id='2'>Product B - $39.99</li>" +
                "</ul>" +
                "</div>";

        // Jsoup automatically wraps fragments in proper HTML structure
        Document doc = Jsoup.parse(htmlFragment);

        // Extract product information
        Elements products = doc.select("li[data-id]");
        System.out.println("Products found:");
        for (Element product : products) {
            String id = product.attr("data-id");
            String text = product.text();
            System.out.println("ID: " + id + ", Details: " + text);
        }
    }
}
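For snippets that represent body content only, jsoup also offers Jsoup.parseBodyFragment(), which parses the input into the body of a new document. Here is a minimal sketch of that variant (the class name is just illustrative):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BodyFragmentParser {
    public static void main(String[] args) {
        String fragment = "<p>Hello <b>fragment</b> world</p>";

        // parseBodyFragment treats the input as body content and wraps it in a full document
        Document doc = Jsoup.parseBodyFragment(fragment);

        // The parsed nodes live inside the generated <body> element
        System.out.println(doc.body().html());
    }
}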
Parsing HTML from Files
You can also parse HTML content that you've read from files:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FileHtmlParser {
    public static void main(String[] args) {
        try {
            // Read HTML content from file
            String htmlContent = new String(Files.readAllBytes(Paths.get("sample.html")));

            // Parse the HTML string
            Document doc = Jsoup.parse(htmlContent);

            // Process the document
            System.out.println("Page title: " + doc.title());
            System.out.println("Meta description: " +
                    doc.select("meta[name=description]").attr("content"));
        } catch (IOException e) {
            System.err.println("Error reading file: " + e.getMessage());
        }
    }
}
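If you don't need the intermediate string, jsoup can also read and parse a file in one step with Jsoup.parse(File, charsetName). A brief sketch, assuming the same sample.html file and UTF-8 encoding:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.File;
import java.io.IOException;

public class DirectFileParser {
    public static void main(String[] args) {
        try {
            // Read and parse the file directly; UTF-8 is assumed here
            Document doc = Jsoup.parse(new File("sample.html"), "UTF-8");
            System.out.println("Page title: " + doc.title());
        } catch (IOException e) {
            System.err.println("Error reading file: " + e.getMessage());
        }
    }
}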
Working with Malformed HTML
One of Jsoup's strengths is handling malformed HTML gracefully. It automatically fixes common issues:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class MalformedHtmlParser {
    public static void main(String[] args) {
        // Malformed HTML with unclosed tags and invalid nesting
        String malformedHtml = "<html><body>" +
                "<div><p>Unclosed paragraph" +
                "<span>Nested span</div>" +
                "<img src='image.jpg'>" +
                "</body>";

        // Jsoup fixes the structure automatically
        Document doc = Jsoup.parse(malformedHtml);

        // Output the cleaned HTML
        System.out.println("Cleaned HTML:");
        System.out.println(doc.html());

        // Extract elements normally
        System.out.println("\nParagraph text: " + doc.select("p").text());
        System.out.println("Image source: " + doc.select("img").attr("src"));
    }
}
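Note that doc.html() returns a pretty-printed version of the corrected markup. If you want a more compact serialization, the document's output settings can be adjusted; a small sketch (the class name is illustrative):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CompactOutput {
    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div><p>Unclosed paragraph<span>Nested span</div>");

        // Disable pretty-printing so the corrected HTML is serialized without extra whitespace
        doc.outputSettings().prettyPrint(false);
        System.out.println(doc.html());
    }
}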
Parsing with Custom Parser Settings
For more control over the parsing process, you can use custom parser settings:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.ParseSettings;
import org.jsoup.parser.Parser;

public class CustomParserSettings {
    public static void main(String[] args) {
        String html = "<html><body><p>Content with entities</p></body></html>";

        // Parse with XML parser for stricter parsing
        Document xmlDoc = Jsoup.parse(html, "", Parser.xmlParser());

        // Parse with HTML parser (default)
        Document htmlDoc = Jsoup.parse(html);

        System.out.println("XML parser result: " + xmlDoc.select("p").text());
        System.out.println("HTML parser result: " + htmlDoc.select("p").text());

        // Custom settings for preserving tag and attribute case
        Parser customParser = Parser.htmlParser().settings(ParseSettings.preserveCase);
        Document customDoc = Jsoup.parse(html, "", customParser);
        System.out.println("Case-preserving parser result: " + customDoc.select("p").text());
    }
}
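The Parser object can also report what it had to correct. The sketch below enables error tracking with setTrackErrors(), which takes the maximum number of errors to record, and then prints each recorded ParseError:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.ParseError;
import org.jsoup.parser.Parser;

public class ParseErrorTracking {
    public static void main(String[] args) {
        String html = "<html><body><p>Unclosed paragraph<span>Oops</div></body></html>";

        // Record up to 10 parse errors while building the document
        Parser parser = Parser.htmlParser().setTrackErrors(10);
        Document doc = Jsoup.parse(html, "", parser);

        // Each ParseError describes a correction the parser made
        for (ParseError error : parser.getErrors()) {
            System.out.println(error);
        }
        System.out.println("Parsed text: " + doc.text());
    }
}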
Practical Example: Processing API Response
Here's a real-world example of parsing HTML content received from an API:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.HashMap;
import java.util.Map;

public class ApiResponseParser {
    public static void main(String[] args) {
        // Simulate HTML content received from an API
        String apiResponse = "<div class='article'>" +
                "<h1>How to Use Web Scraping APIs</h1>" +
                "<div class='metadata'>" +
                "<span class='author'>John Doe</span>" +
                "<span class='date'>2024-01-15</span>" +
                "</div>" +
                "<div class='content'>" +
                "<p>Web scraping APIs provide powerful tools...</p>" +
                "<p>They can handle <a href='/javascript-rendering'>JavaScript rendering</a>...</p>" +
                "</div>" +
                "</div>";

        // Parse and extract structured data
        Document doc = Jsoup.parse(apiResponse);
        Map<String, String> articleData = parseArticle(doc);

        // Display extracted data
        articleData.forEach((key, value) ->
                System.out.println(key + ": " + value));
    }

    private static Map<String, String> parseArticle(Document doc) {
        Map<String, String> data = new HashMap<>();

        // Extract article title
        Element title = doc.selectFirst("h1");
        if (title != null) {
            data.put("title", title.text());
        }

        // Extract metadata
        Element author = doc.selectFirst(".metadata .author");
        Element date = doc.selectFirst(".metadata .date");
        if (author != null) data.put("author", author.text());
        if (date != null) data.put("date", date.text());

        // Extract content paragraphs
        Elements paragraphs = doc.select(".content p");
        StringBuilder content = new StringBuilder();
        for (Element p : paragraphs) {
            content.append(p.text()).append(" ");
        }
        data.put("content", content.toString().trim());

        // Extract links
        Elements links = doc.select(".content a[href]");
        if (!links.isEmpty()) {
            data.put("links", links.attr("href"));
        }

        return data;
    }
}
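Keep in mind that calling attr("href") on an Elements collection returns the value from the first matching element only; to collect every link, iterate the collection or use Elements.eachAttr("abs:href"), which returns a list of values.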
Error Handling and Best Practices
When parsing HTML strings, it's important to handle potential errors gracefully:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SafeHtmlParser {
    public static Document safeParseHtml(String html) {
        try {
            if (html == null || html.trim().isEmpty()) {
                return new Document("");
            }
            return Jsoup.parse(html);
        } catch (Exception e) {
            System.err.println("Error parsing HTML: " + e.getMessage());
            return new Document("");
        }
    }

    public static void main(String[] args) {
        String[] testCases = {
                "<html><body><h1>Valid HTML</h1></body></html>",
                null,
                "",
                "<invalid>Unclosed tag",
                "Plain text without HTML tags"
        };

        for (String html : testCases) {
            Document doc = safeParseHtml(html);
            Elements headings = doc.select("h1");
            System.out.println("Input: " + (html != null ? html.substring(0, Math.min(html.length(), 30)) : "null"));
            System.out.println("Headings found: " + headings.size());
            System.out.println("---");
        }
    }
}
Performance Considerations
When parsing large amounts of HTML content, consider these performance tips:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class PerformanceOptimized {
    public static void main(String[] args) {
        String largeHtml = generateLargeHtmlString();

        // Measure how long parsing a large document takes
        long startTime = System.currentTimeMillis();

        // Error tracking is off by default; setTrackErrors takes the maximum
        // number of errors to record, so 0 keeps it disabled for best performance
        Parser parser = Parser.htmlParser();
        parser.setTrackErrors(0);
        Document doc = parser.parseInput(largeHtml, "");

        long endTime = System.currentTimeMillis();
        System.out.println("Parsing took: " + (endTime - startTime) + "ms");

        // Extract only what you need
        System.out.println("Document has " + doc.select("*").size() + " elements");
    }

    private static String generateLargeHtmlString() {
        StringBuilder html = new StringBuilder("<html><body>");
        for (int i = 0; i < 1000; i++) {
            html.append("<div class='item-").append(i).append("'>")
                    .append("<h3>Item ").append(i).append("</h3>")
                    .append("<p>Description for item ").append(i).append("</p>")
                    .append("</div>");
        }
        html.append("</body></html>");
        return html.toString();
    }
}
Integration with Web Scraping Workflows
Parsing HTML from strings is particularly useful when working with JavaScript rendering solutions or when you need to process HTML content obtained through other means. You can also combine string parsing with browser automation tools for comprehensive web scraping solutions.
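For example, you might render a page with a headless browser and hand the resulting markup to Jsoup for extraction. A minimal sketch, assuming Selenium WebDriver and ChromeDriver are available on the classpath and using a placeholder URL:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class BrowserToJsoup {
    public static void main(String[] args) {
        // Run Chrome without a visible window
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new");
        WebDriver driver = new ChromeDriver(options);
        try {
            String url = "https://example.com/";
            driver.get(url);

            // The rendered HTML, after JavaScript has run, is just a string
            String renderedHtml = driver.getPageSource();

            // Hand it to Jsoup, passing the page URL as the base URI
            Document doc = Jsoup.parse(renderedHtml, url);
            System.out.println("Title: " + doc.title());
        } finally {
            driver.quit();
        }
    }
}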
Conclusion
Jsoup's string parsing capabilities make it an excellent choice for processing HTML content in Java applications. Whether you're working with API responses, file content, or fragments of HTML, Jsoup provides robust parsing with automatic error correction and a powerful selection API. The key methods to remember are:
- Jsoup.parse(html) for basic string parsing
- Jsoup.parse(html, baseUri) for resolving relative URLs
- Custom parser settings for specialized requirements
- Proper error handling for production applications
By following these patterns and best practices, you can efficiently parse and extract data from HTML strings in your Java applications while maintaining code reliability and performance.