How do I modify HTML content using jsoup?
jsoup is a Java library for parsing, manipulating, and modifying HTML documents. Unlike browser automation tools such as Puppeteer, it does not execute JavaScript or render pages; it works directly on the parsed HTML tree, which makes it well suited to server-side HTML processing and to scraping static content.
Setting Up jsoup
First, add jsoup to your Java project:
Maven
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.16.1</version>
</dependency>
Gradle
implementation 'org.jsoup:jsoup:1.16.1'
Basic HTML Modification Operations
1. Modifying Text Content
The most common HTML modification is changing the text content of elements:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
// Parse HTML from string
String html = "<html><body><h1>Old Title</h1><p class='content'>Old paragraph</p></body></html>";
Document doc = Jsoup.parse(html);
// Modify text content
Element title = doc.select("h1").first();
title.text("New Title");
// Modify paragraph content
Element paragraph = doc.select("p.content").first();
paragraph.text("New paragraph content");
System.out.println(doc.html());
2. Modifying HTML Content
You can also modify the inner HTML of elements:
Element paragraph = doc.select("p.content").first();
paragraph.html("<strong>Bold text</strong> and <em>italic text</em>");
// Or append HTML content
paragraph.append(" <span style='color: red;'>Additional content</span>");
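The difference between text() and html() is easy to miss: text() treats its argument as literal text and escapes any markup, while html() parses the argument into child elements. A minimal sketch of that contrast follows; the class name, method names, and sample markup are illustrative assumptions, not part of the original example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; names and sample markup are illustrative only.
final class TextVsHtmlDemo {

    // Sets the paragraph content with text(): the argument stays literal text.
    static Element withText(String content) {
        Document doc = Jsoup.parse("<p class='content'>Old paragraph</p>");
        Element p = doc.selectFirst("p.content");
        p.text(content);
        return p;
    }

    // Sets the paragraph content with html(): the argument is parsed into child nodes.
    static Element withHtml(String content) {
        Document doc = Jsoup.parse("<p class='content'>Old paragraph</p>");
        Element p = doc.selectFirst("p.content");
        p.html(content);
        return p;
    }

    public static void main(String[] args) {
        String input = "<strong>Bold</strong> text";
        System.out.println(withText(input).outerHtml()); // markup is escaped in the output
        System.out.println(withHtml(input).outerHtml()); // markup becomes a real <strong> element
    }
}
```

As a rule of thumb, text() is the safer default for values that come from users or other untrusted sources, since nothing in the value can turn into markup.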
3. Modifying Attributes
jsoup makes it easy to modify element attributes:
// Change attribute values
Element link = doc.select("a").first();
if (link != null) {
    link.attr("href", "https://new-url.com");
    link.attr("target", "_blank");
    link.addClass("external-link");
    // Remove attributes (kept inside the null check to avoid a NullPointerException)
    link.removeAttr("onclick");
}
// Working with CSS classes
Element div = doc.select("div").first();
if (div != null) {
    div.addClass("new-class");
    div.removeClass("old-class");
    div.toggleClass("toggle-class");
}
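When the same change applies to every match, the attribute and class methods can also be called on the Elements collection itself. This avoids null checks entirely, because an empty result set simply does nothing. The sketch below assumes a hypothetical helper named openLinksInNewTab; the attribute values mirror the ones used above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class; the helper name is an assumption for illustration.
final class BulkAttributeDemo {

    // Applies attribute and class changes to every matching link.
    // If nothing matches, the chained calls are no-ops.
    static String openLinksInNewTab(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a[href]")
           .attr("target", "_blank")
           .addClass("external-link")
           .removeAttr("onclick");
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<a href='/x' onclick='track()'>x</a> <a href='/y'>y</a>";
        System.out.println(openLinksInNewTab(html));
    }
}
```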
Advanced HTML Modifications
1. Adding New Elements
You can create and add new elements to the document:
// Create new elements
Element newDiv = doc.createElement("div");
newDiv.addClass("new-section");
newDiv.text("This is a new section");
// Add to document
Element body = doc.body();
body.appendChild(newDiv);
// Insert at specific position
Element firstParagraph = doc.select("p").first();
if (firstParagraph != null) {
    Element newParagraph = doc.createElement("p");
    newParagraph.text("Inserted before first paragraph");
    firstParagraph.before(newParagraph);
}
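jsoup also offers shorthand methods that create and insert an element in one call. The following sketch, built on assumed sample markup and IDs, uses prependElement, appendElement, and after to place content at the start, at the end, and right after an existing element.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; the markup and IDs are assumptions for illustration.
final class InsertPositionsDemo {

    static Document build() {
        Document doc = Jsoup.parse("<div id='box'><p id='first'>First</p></div>");
        Element box = doc.getElementById("box");
        Element first = doc.getElementById("first");

        box.prependElement("h2").text("Heading");               // becomes the first child of #box
        box.appendElement("p").attr("id", "last").text("Last"); // becomes the last child of #box
        first.after("<p id='second'>Second</p>");               // sibling placed right after #first
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(build().body().html());
    }
}
```

Both prependElement and appendElement return the newly created child, which is why the calls can be chained with attr and text.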
2. Complex Element Creation
For more complex HTML structures:
// Create a complex structure
Element article = doc.createElement("article");
article.addClass("blog-post");
Element header = doc.createElement("header");
Element title = doc.createElement("h2");
title.text("Article Title");
header.appendChild(title);
Element content = doc.createElement("div");
content.addClass("article-content");
content.html("<p>Article content goes here.</p>");
Element footer = doc.createElement("footer");
footer.html("<small>Published on <time>2024-01-01</time></small>");
// Assemble the structure
article.appendChild(header);
article.appendChild(content);
article.appendChild(footer);
// Add to document
doc.body().appendChild(article);
3. Removing Elements
jsoup provides several methods to remove elements:
// Remove specific elements
doc.select("script").remove(); // Remove all script tags
doc.select(".advertisement").remove(); // Remove ads
// Remove by ID
Element elementToRemove = doc.getElementById("unwanted-element");
if (elementToRemove != null) {
    elementToRemove.remove();
}
// Clear content but keep the element
Element container = doc.select("div.container").first();
if (container != null) {
    container.empty(); // Removes all child nodes, including text
}
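Besides remove() and empty(), there is a third option, unwrap(), which deletes only the tag and keeps its children in place. The sketch below puts the three side by side; the selectors and sample markup are assumptions chosen for the demonstration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class; selectors and markup are illustrative assumptions.
final class RemovalDemo {

    static Document tidy(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("script").remove();          // deletes the elements and everything inside them
        doc.select("span.hl").unwrap();         // deletes only the tags; children move up to the parent
        doc.select("div.placeholder").empty();  // keeps the elements but deletes their child nodes
        return doc;
    }

    public static void main(String[] args) {
        String html = "<div id='c'><span class='hl'>kept <b>text</b></span><script>x()</script></div>"
                    + "<div class='placeholder'>old <i>stuff</i></div>";
        System.out.println(tidy(html).body().html());
    }
}
```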
Working with Forms
jsoup is particularly useful for modifying HTML forms:
// Modify form attributes
Element form = doc.select("form").first();
if (form != null) {
    form.attr("action", "/new-endpoint");
    form.attr("method", "POST");
}
// Modify input fields
Elements inputs = doc.select("input");
for (Element input : inputs) {
    if ("text".equals(input.attr("type"))) {
        input.attr("value", "default value");
        input.attr("placeholder", "Enter text here");
    }
}
// Add new form fields
if (form != null) {
    Element newInput = doc.createElement("input");
    newInput.attr("type", "hidden");
    newInput.attr("name", "csrf_token");
    newInput.attr("value", "abc123"); // placeholder; use a real per-session token
    form.appendChild(newInput);
}
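If the same document may be processed more than once, appending a hidden field unconditionally creates duplicates. One possible way to avoid that is an add-or-update helper, sketched below. The helper name setHiddenField is hypothetical, and the token values are placeholders.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; the helper name and token values are assumptions.
final class HiddenFieldDemo {

    // Adds or updates a hidden input in the first form.
    // Returns false when the document has no <form> element.
    static boolean setHiddenField(Document doc, String name, String value) {
        Element form = doc.selectFirst("form");
        if (form == null) {
            return false;
        }
        for (Element input : form.select("input[type=hidden]")) {
            if (name.equals(input.attr("name"))) {
                input.val(value); // update in place instead of adding a duplicate
                return true;
            }
        }
        form.appendElement("input")
            .attr("type", "hidden")
            .attr("name", name)
            .val(value);
        return true;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<form action='/old'><input type='text' name='q'></form>");
        setHiddenField(doc, "csrf_token", "abc123"); // first call adds the field
        setHiddenField(doc, "csrf_token", "def456"); // second call updates it
        System.out.println(doc.body().html());
    }
}
```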
Practical Examples
1. URL Rewriting
Modify all links in a document:
Elements links = doc.select("a[href]");
for (Element link : links) {
    String href = link.attr("href");
    if (href.startsWith("/") && !href.startsWith("//")) {
        // Convert root-relative URLs to absolute (protocol-relative "//host" URLs are left alone)
        link.attr("href", "https://example.com" + href);
    } else if (href.startsWith("http://")) {
        // Upgrade HTTP to HTTPS by replacing only the leading scheme
        link.attr("href", "https://" + href.substring("http://".length()));
    }
}
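Manual prefix checks only cover root-relative links. When the document's base URL is known, jsoup can resolve every kind of relative link itself: pass the base URI to Jsoup.parse and read the resolved value with absUrl. A minimal sketch, assuming a hypothetical helper named absolutize and example.com as the base:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; the helper name and URLs are assumptions.
final class AbsoluteLinksDemo {

    // Rewrites every link's href to its absolute form, resolved against baseUri.
    static Document absolutize(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        for (Element link : doc.select("a[href]")) {
            String abs = link.absUrl("href"); // empty string if the URL cannot be resolved
            if (!abs.isEmpty()) {
                link.attr("href", abs);
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        String html = "<a href='/about'>About</a> <a href='guide.html'>Guide</a>";
        System.out.println(absolutize(html, "https://example.com/docs/").body().html());
    }
}
```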
2. Image Processing
Modify image attributes for optimization:
Elements images = doc.select("img");
for (Element img : images) {
    // Add lazy loading
    img.attr("loading", "lazy");
    // Add responsive attributes (assumes a matching "@2x" file exists for each .jpg)
    String src = img.attr("src");
    if (src.endsWith(".jpg")) {
        String retina = src.substring(0, src.length() - ".jpg".length()) + "@2x.jpg";
        img.attr("srcset", src + " 1x, " + retina + " 2x");
    }
    // Ensure alt text exists (placeholder value; replace with a real description)
    if (img.attr("alt").isEmpty()) {
        img.attr("alt", "Image description");
    }
}
3. Content Sanitization
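Remove potentially harmful content. A manual denylist like the one below is easy to get wrong; for untrusted input, jsoup's allowlist-based Jsoup.clean(html, Safelist) is the safer choice:
// Requires: import org.jsoup.nodes.Attribute;
// Remove dangerous elements
doc.select("script, object, embed, iframe").remove();
// Clean attributes
Elements allElements = doc.select("*");
for (Element element : allElements) {
    // Remove every event handler attribute (onclick, onload, onerror, ...)
    for (Attribute attr : element.attributes().asList()) {
        if (attr.getKey().startsWith("on")) {
            element.removeAttr(attr.getKey());
        }
    }
    // Clean href attributes (case-insensitive, ignoring leading whitespace)
    String href = element.attr("href").trim();
    if (href.regionMatches(true, 0, "javascript:", 0, "javascript:".length())) {
        element.removeAttr("href");
    }
}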
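For input that comes from users or third parties, an allowlist is generally more robust than removing known-bad pieces one by one. jsoup ships a cleaner for this purpose. The sketch below uses Jsoup.clean with the predefined Safelist.basic(), which is available in recent jsoup versions including the 1.16.1 release used in this article. The choice of basic() is an assumption; pick or build the safelist that matches the markup you want to permit.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

// Hypothetical demo class; the choice of Safelist.basic() is an assumption.
final class SafelistDemo {

    // Keeps only the tags and attributes permitted by the safelist
    // and drops everything else, including scripts and event handlers.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String dirty = "<p onclick=\"steal()\">Hi <b>there</b></p>"
                     + "<script>steal()</script>"
                     + "<a href=\"javascript:steal()\">click</a>";
        System.out.println(sanitize(dirty));
    }
}
```

Note that Jsoup.clean expects a body fragment and returns one, so it suits user-generated snippets better than full pages.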
Best Practices and Performance Tips
1. Efficient Element Selection
Let the selector do the filtering instead of selecting broadly and filtering in Java; the code is shorter and usually does less work:
// Good: Specific selector
Element specificElement = doc.select("div.content > p.highlight").first();
// Less efficient: Broad selector with filtering
Elements allDivs = doc.select("div");
// ... then filter manually
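Two related techniques help here: selectFirst returns the first match directly instead of building a full result list, and calling select on an element restricts the search to that element's subtree. The sketch below combines both; the class name, method name, and markup are illustrative assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; names and markup are assumptions for illustration.
final class ScopedSelectDemo {

    // Counts highlighted paragraphs inside the first div.content only.
    static int highlightCount(Document doc) {
        Element content = doc.selectFirst("div.content"); // first match, or null if none
        if (content == null) {
            return 0;
        }
        return content.select("p.highlight").size(); // searches only inside that element
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
            "<div class='content'><p class='highlight'>inside</p><p>plain</p></div>"
          + "<p class='highlight'>outside</p>");
        System.out.println(highlightCount(doc));
    }
}
```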
2. Batch Operations
When making multiple modifications, work with the parsed document once:
// Parse once
Document doc = Jsoup.parse(htmlContent);
// Make all modifications
doc.select("a").attr("target", "_blank");
doc.select("img").attr("loading", "lazy");
doc.select("script").remove();
// Output once
String modifiedHtml = doc.html();
3. Memory Management
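jsoup builds the whole document tree in memory, so very large inputs can be expensive. When fetching over the network, you can cap how much of the response body is read:
// Cap the response body at 2 MB (jsoup also applies its own default limit)
Document doc = Jsoup.connect(url).maxBodySize(2 * 1024 * 1024).get();
// Newer jsoup releases also include a StreamParser for incremental parsing;
// check the release notes for the version you depend on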
Error Handling
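Jsoup.parse is lenient and does not throw on malformed HTML, so the failure you will hit most often is a selector that matches nothing, in which case first() returns null. Check for null before modifying an element: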
try {
    Document doc = Jsoup.parse(htmlContent);
    Element targetElement = doc.select("div.target").first();
    if (targetElement != null) {
        targetElement.text("Modified content");
    } else {
        System.out.println("Target element not found");
    }
    return doc.html();
} catch (Exception e) {
    System.err.println("Error modifying HTML: " + e.getMessage());
    return htmlContent; // Return original content on error
}
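The same idea can be packaged as a small reusable helper. The sketch below is one possible shape, with a hypothetical method name replaceText. It returns the input unchanged when nothing matches, and it also catches the unchecked exception jsoup throws when the selector string itself cannot be parsed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Selector;

// Hypothetical demo class; the helper name and fallback policy are assumptions.
final class SafeModifyDemo {

    // Replaces the text of the first element matching the selector.
    // Falls back to the original HTML if nothing matches or the selector is invalid.
    static String replaceText(String html, String selector, String newText) {
        try {
            Document doc = Jsoup.parse(html);
            Element target = doc.selectFirst(selector);
            if (target == null) {
                return html; // nothing matched: hand back the input unchanged
            }
            target.text(newText);
            return doc.body().html();
        } catch (Selector.SelectorParseException e) {
            // Thrown when the CSS query itself is malformed
            System.err.println("Invalid selector: " + e.getMessage());
            return html;
        }
    }

    public static void main(String[] args) {
        String html = "<div class='target'>old</div>";
        System.out.println(replaceText(html, "div.target", "Modified content"));
        System.out.println(replaceText(html, "div.missing", "ignored"));
    }
}
```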
Integration with Web Scraping
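jsoup can also fetch pages itself, so scraping and modification fit into one workflow. Because it does not run JavaScript, it sees only the HTML the server returns; pages that build their content client-side need a browser-based tool such as Puppeteer first, after which jsoup can post-process the resulting static HTML: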
// Typical web scraping + modification workflow
public String scrapeAndModify(String url) throws IOException {
    // Fetch the page
    Document doc = Jsoup.connect(url).get();
    // Extract specific content
    Element mainContent = doc.select("main.content").first();
    // Modify the content
    if (mainContent != null) {
        // Remove ads and scripts
        mainContent.select(".advertisement, script").remove();
        // Update links
        mainContent.select("a").attr("target", "_blank");
        // Add custom styling
        mainContent.addClass("processed-content");
    }
    return mainContent != null ? mainContent.html() : "";
}
Conclusion
jsoup covers the common HTML modification tasks in Java: changing text and inner HTML, editing attributes and classes, adding and removing elements, and rewriting links or form fields, all through a jQuery-like selector syntax. That makes it a practical fit for web scrapers, content processors, and HTML sanitizers.
Getting good results mostly comes down to three things: writing precise CSS selectors, batching changes on a single parsed document, and checking for missing elements before modifying them.