How do I modify HTML content using jsoup?

jsoup is a Java library for parsing, manipulating, and modifying HTML documents. Unlike browser automation tools such as Puppeteer, which render pages and execute JavaScript, jsoup works directly on the parsed HTML tree, which makes it well suited to server-side HTML processing and web scraping of static content.

Setting Up jsoup

First, add jsoup to your Java project:

Maven

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.16.1</version>
</dependency>

Gradle

implementation 'org.jsoup:jsoup:1.16.1'

Basic HTML Modification Operations

1. Modifying Text Content

The most common HTML modification is changing the text content of elements:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Parse HTML from string
String html = "<html><body><h1>Old Title</h1><p class='content'>Old paragraph</p></body></html>";
Document doc = Jsoup.parse(html);

// Modify text content
Element title = doc.select("h1").first();
title.text("New Title");

// Modify paragraph content
Element paragraph = doc.select("p.content").first();
paragraph.text("New paragraph content");

System.out.println(doc.html());

2. Modifying HTML Content

You can also modify the inner HTML of elements:

Element paragraph = doc.select("p.content").first();
paragraph.html("<strong>Bold text</strong> and <em>italic text</em>");

// Or append HTML content
paragraph.append(" <span style='color: red;'>Additional content</span>");
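The difference between text() and html() matters most when the new content contains markup or comes from an untrusted source: text() stores the value as literal text, while html() parses it into child nodes. A minimal sketch of that difference, assuming a hypothetical class name TextVsHtml and a made-up sample value:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TextVsHtml {
    private static final String HTML = "<p class='content'>Old paragraph</p>";

    // Sets the content with text(): any markup in the value is kept as literal text.
    static Element withText(String value) {
        Document doc = Jsoup.parse(HTML);
        Element paragraph = doc.selectFirst("p.content");
        paragraph.text(value);
        return paragraph;
    }

    // Sets the content with html(): markup in the value is parsed into child elements.
    static Element withHtml(String value) {
        Document doc = Jsoup.parse(HTML);
        Element paragraph = doc.selectFirst("p.content");
        paragraph.html(value);
        return paragraph;
    }

    public static void main(String[] args) {
        String value = "<strong>Bold</strong> text";
        System.out.println(withText(value).outerHtml());
        System.out.println(withHtml(value).outerHtml());
    }
}
```

As a rule of thumb, text() is the safer choice for user-supplied strings, since nothing in the value can become an element.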

3. Modifying Attributes

jsoup makes it easy to modify element attributes:

// Change attribute values
Element link = doc.select("a").first();
if (link != null) {
    link.attr("href", "https://new-url.com");
    link.attr("target", "_blank");
    link.addClass("external-link");

    // Remove attributes
    link.removeAttr("onclick");
}

// Working with CSS classes
Element div = doc.select("div").first();
if (div != null) {
    div.addClass("new-class");
    div.removeClass("old-class");
    div.toggleClass("toggle-class"); // Adds the class if absent, removes it if present
}

Advanced HTML Modifications

1. Adding New Elements

You can create and add new elements to the document:

// Create new elements
Element newDiv = doc.createElement("div");
newDiv.addClass("new-section");
newDiv.text("This is a new section");

// Add to document
Element body = doc.body();
body.appendChild(newDiv);

// Insert at specific position
Element firstParagraph = doc.select("p").first();
Element newParagraph = doc.createElement("p");
newParagraph.text("Inserted before first paragraph");
firstParagraph.before(newParagraph);

2. Complex Element Creation

For more complex HTML structures:

// Create a complex structure
Element article = doc.createElement("article");
article.addClass("blog-post");

Element header = doc.createElement("header");
Element title = doc.createElement("h2");
title.text("Article Title");
header.appendChild(title);

Element content = doc.createElement("div");
content.addClass("article-content");
content.html("<p>Article content goes here.</p>");

Element footer = doc.createElement("footer");
footer.html("<small>Published on <time>2024-01-01</time></small>");

// Assemble the structure
article.appendChild(header);
article.appendChild(content);
article.appendChild(footer);

// Add to document
doc.body().appendChild(article);
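The same structure can also be built with less bookkeeping by chaining appendElement, which creates a child, attaches it, and returns it. This is a sketch of an alternative style rather than a replacement for the approach above; the class name ArticleBuilder and the method buildArticle are hypothetical:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ArticleBuilder {
    // Builds the article subtree directly under <body> and returns the <article> element.
    static Element buildArticle(Document doc, String titleText) {
        Element article = doc.body().appendElement("article").addClass("blog-post");

        // Each appendElement call returns the newly created child, so calls can be nested.
        article.appendElement("header").appendElement("h2").text(titleText);

        article.appendElement("div")
               .addClass("article-content")
               .append("<p>Article content goes here.</p>");

        article.appendElement("footer")
               .append("<small>Published on <time>2024-01-01</time></small>");

        return article;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<html><body></body></html>");
        buildArticle(doc, "Article Title");
        System.out.println(doc.body().html());
    }
}
```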

3. Removing Elements

jsoup provides several methods to remove elements:

// Remove specific elements
doc.select("script").remove(); // Remove all script tags
doc.select(".advertisement").remove(); // Remove ads

// Remove by ID
Element elementToRemove = doc.getElementById("unwanted-element");
if (elementToRemove != null) {
    elementToRemove.remove();
}

// Clear content but keep the element
Element container = doc.select("div.container").first();
if (container != null) {
    container.empty(); // Removes all child nodes (elements and text)
}
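Besides remove() and empty(), jsoup also offers unwrap(), which deletes an element but keeps its children in place. The sketch below compares the three on the same input; the class name RemovalModes and the sample markup are illustrative assumptions:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RemovalModes {
    private static final String HTML =
        "<div id='box'><span class='hl'>keep <b>me</b></span></div>";

    // remove(): the span and everything inside it are deleted.
    static Document afterRemove() {
        Document doc = Jsoup.parse(HTML);
        doc.select("span.hl").remove();
        return doc;
    }

    // unwrap(): the span is deleted, its children move up into the parent div.
    static Document afterUnwrap() {
        Document doc = Jsoup.parse(HTML);
        doc.select("span.hl").unwrap();
        return doc;
    }

    // empty(): the span stays, but its child nodes are deleted.
    static Document afterEmpty() {
        Document doc = Jsoup.parse(HTML);
        doc.select("span.hl").empty();
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(afterRemove().body().html());
        System.out.println(afterUnwrap().body().html());
        System.out.println(afterEmpty().body().html());
    }
}
```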

Working with Forms

jsoup is particularly useful for modifying HTML forms:

// Modify form attributes
Element form = doc.select("form").first();
if (form != null) {
    form.attr("action", "/new-endpoint");
    form.attr("method", "POST");
}

// Modify input fields (Elements is org.jsoup.select.Elements)
Elements inputs = doc.select("input");
for (Element input : inputs) {
    if ("text".equals(input.attr("type"))) {
        input.attr("value", "default value");
        input.attr("placeholder", "Enter text here");
    }
}

// Add new form fields
if (form != null) {
    Element newInput = doc.createElement("input");
    newInput.attr("type", "hidden");
    newInput.attr("name", "csrf_token");
    newInput.attr("value", "abc123"); // Placeholder; use a real per-session token
    form.appendChild(newInput);
}

Practical Examples

1. URL Rewriting

Modify all links in a document:

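Elements links = doc.select("a[href]");
for (Element link : links) {
    String href = link.attr("href");
    if (href.startsWith("//")) {
        // Protocol-relative URL: only the scheme is missing
        link.attr("href", "https:" + href);
    } else if (href.startsWith("/")) {
        // Convert root-relative URLs to absolute
        link.attr("href", "https://example.com" + href);
    } else if (href.startsWith("http://")) {
        // Upgrade HTTP to HTTPS (replace only the scheme prefix)
        link.attr("href", "https://" + href.substring("http://".length()));
    }
}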
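Manual prefix checks only cover root-relative paths. When the document is parsed with a base URI, jsoup can resolve every kind of relative link itself through absUrl. A minimal sketch, assuming a hypothetical class name AbsoluteLinks and an example base URI:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsoluteLinks {
    // Parses the HTML against baseUri and rewrites every href to its absolute form.
    static Document absolutize(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        for (Element link : doc.select("a[href]")) {
            // absUrl returns an empty string when the URL cannot be resolved
            String absolute = link.absUrl("href");
            if (!absolute.isEmpty()) {
                link.attr("href", absolute);
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        String html = "<a href='/docs'>a</a><a href='page.html'>b</a>"
                    + "<a href='https://other.org/x'>c</a>";
        Document doc = absolutize(html, "https://example.com/blog/");
        for (Element link : doc.select("a")) {
            System.out.println(link.attr("href"));
        }
    }
}
```

Documents fetched with Jsoup.connect(url).get() already carry the page URL as their base URI, so absUrl works on them without extra setup.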

2. Image Processing

Modify image attributes for optimization:

Elements images = doc.select("img");
for (Element img : images) {
    // Add lazy loading
    img.attr("loading", "lazy");

    // Add responsive attributes (assumes an "@2x" variant exists for each .jpg)
    String src = img.attr("src");
    if (src.endsWith(".jpg")) {
        String src2x = src.substring(0, src.length() - ".jpg".length()) + "@2x.jpg";
        img.attr("srcset", src + " 1x, " + src2x + " 2x");
    }

    // Ensure an alt attribute exists; an empty alt is valid for decorative images
    if (!img.hasAttr("alt")) {
        img.attr("alt", "Image description");
    }
}

3. Content Sanitization

Remove potentially harmful content:

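import java.util.ArrayList;
import java.util.List;
import org.jsoup.nodes.Attribute;

// Remove dangerous elements
doc.select("script, object, embed, iframe").remove();

// Clean attributes
Elements allElements = doc.select("*");
for (Element element : allElements) {
    // Collect every inline event handler (onclick, onload, onerror, ...)
    List<String> handlers = new ArrayList<>();
    for (Attribute attribute : element.attributes()) {
        if (attribute.getKey().toLowerCase().startsWith("on")) {
            handlers.add(attribute.getKey());
        }
    }
    // Remove them after iterating, so the attribute list is not modified mid-loop
    for (String handler : handlers) {
        element.removeAttr(handler);
    }

    // Drop javascript: URLs, ignoring case and leading whitespace
    String href = element.attr("href").trim().toLowerCase();
    if (href.startsWith("javascript:")) {
        element.removeAttr("href");
    }
}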
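Removing known-bad tags and attributes by hand is easy to get wrong, because anything not on the list gets through. For untrusted input, jsoup's built-in cleaner takes the opposite approach and keeps only what a Safelist allows. A minimal sketch, assuming jsoup 1.14.2 or later (earlier versions name the class Whitelist) and a hypothetical class name SafelistSanitizer:

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class SafelistSanitizer {
    // Keeps only basic text formatting tags and links with safe protocols;
    // scripts, event handler attributes, and javascript: URLs are dropped.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String dirty = "<p onclick='x()'>Hi <a href='javascript:alert(1)'>link</a>"
                     + "<script>alert(1)</script></p>";
        System.out.println(sanitize(dirty));
    }
}
```

Safelist.basic() is only one of the presets. Whether it is strict or permissive enough depends on the application, so treat the choice here as an assumption.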

Best Practices and Performance Tips

1. Efficient Element Selection

Use specific selectors for better performance:

// Good: Specific selector
Element specificElement = doc.select("div.content > p.highlight").first();

// Less efficient: Broad selector with filtering
Elements allDivs = doc.select("div");
// ... then filter manually
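When only the first match is needed, selectFirst expresses that directly and returns null if nothing matches, so it pairs naturally with a null check. A short sketch, with FirstMatch and highlightText as hypothetical names:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FirstMatch {
    // Returns the text of the first highlighted paragraph, or "" if there is none.
    static String highlightText(String html) {
        Document doc = Jsoup.parse(html);
        Element hit = doc.selectFirst("div.content > p.highlight");
        return hit != null ? hit.text() : "";
    }

    public static void main(String[] args) {
        String html = "<div class='content'>"
                    + "<p class='highlight'>A</p><p class='highlight'>B</p></div>";
        System.out.println(highlightText(html));
    }
}
```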

2. Batch Operations

When making multiple modifications, work with the parsed document once:

// Parse once
Document doc = Jsoup.parse(htmlContent);

// Make all modifications
doc.select("a").attr("target", "_blank");
doc.select("img").attr("loading", "lazy");
doc.select("script").remove();

// Output once
String modifiedHtml = doc.html();

3. Memory Management

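jsoup builds the whole document tree in memory, so for large inputs avoid holding an extra copy of the raw HTML as a String:

// For very large documents, parse directly from a File or InputStream
// instead of reading the HTML into a String first, and cap the download
// size with Connection.maxBodySize(...) when fetching over HTTP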
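As a sketch of parsing from a stream: Jsoup.parse accepts an InputStream together with a charset name and a base URI. The resulting Document is still held fully in memory; what this avoids is a second copy of the raw HTML as a String. The class name StreamParsing is hypothetical, and the in-memory stream stands in for a file or network stream:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StreamParsing {
    // Parses HTML straight from the stream, decoding it as UTF-8.
    static Document parseStream(InputStream in, String baseUri) {
        try {
            return Jsoup.parse(in, "UTF-8", baseUri);
        } catch (IOException e) {
            // Rethrown unchecked to keep the example's call sites short
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "<p>Hello</p>".getBytes(StandardCharsets.UTF_8);
        Document doc = parseStream(new ByteArrayInputStream(bytes), "https://example.com/");
        System.out.println(doc.select("p").text());
    }
}
```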

Error Handling

Always include proper error handling when modifying HTML:

// modifyTarget is an illustrative method name
public String modifyTarget(String htmlContent) {
    try {
        Document doc = Jsoup.parse(htmlContent);

        Element targetElement = doc.select("div.target").first();
        if (targetElement != null) {
            targetElement.text("Modified content");
        } else {
            System.out.println("Target element not found");
        }

        return doc.html();
    } catch (Exception e) {
        System.err.println("Error modifying HTML: " + e.getMessage());
        return htmlContent; // Return original content on error
    }
}

Integration with Web Scraping

When combined with web scraping workflows, jsoup becomes even more powerful. While tools like Puppeteer handle dynamic DOM interactions, jsoup excels at post-processing static HTML content:

// Typical web scraping + modification workflow
public String scrapeAndModify(String url) throws IOException {
    // Fetch the page
    Document doc = Jsoup.connect(url).get();

    // Extract specific content
    Element mainContent = doc.select("main.content").first();

    // Modify the content
    if (mainContent != null) {
        // Remove ads and scripts
        mainContent.select(".advertisement, script").remove();

        // Update links
        mainContent.select("a").attr("target", "_blank");

        // Add custom styling
        mainContent.addClass("processed-content");
    }

    // outerHtml() includes the <main> tag itself, so the class added above is kept
    return mainContent != null ? mainContent.outerHtml() : "";
}

Conclusion

jsoup covers the full range of HTML modification tasks in Java, from simple text changes to building and removing whole subtrees, and its jQuery-like selector syntax keeps that code compact. This makes it a good fit for web scrapers, content processors, and HTML sanitizers.

The key to successful HTML modification with jsoup lies in understanding CSS selectors, parsing once and batching your modifications, and checking for missing elements before modifying them.
