How can I use jsoup to clean and sanitize HTML content?

HTML sanitization is a critical security practice when dealing with user-generated content or untrusted HTML sources. jsoup provides powerful built-in tools for cleaning and sanitizing HTML content, helping developers prevent XSS attacks and ensure safe HTML output. This guide covers comprehensive techniques for HTML sanitization using jsoup's Cleaner and Safelist (formerly Whitelist) classes.

Understanding HTML Sanitization

HTML sanitization involves removing or modifying potentially dangerous HTML elements, attributes, and content while preserving safe markup. This process is essential when:

  • Processing user-generated content
  • Displaying HTML from external sources
  • Preventing XSS (Cross-Site Scripting) attacks
  • Ensuring content compliance with security policies

Basic HTML Cleaning with jsoup

Using Predefined Safelists

jsoup provides several predefined safelists for common use cases:

import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class HtmlSanitizer {
    public static void main(String[] args) {
        String unsafeHtml = "<div><script>alert('XSS')</script><p>Safe content</p>" +
                           "<a href='javascript:alert(1)'>Click me</a></div>";

        // Basic safelist - allows text formatting tags and links (enforces rel="nofollow" on links)
        String cleanBasic = Jsoup.clean(unsafeHtml, Safelist.basic());
        System.out.println("Basic: " + cleanBasic);
        // Output (whitespace may differ): <p>Safe content</p><a rel="nofollow">Click me</a>

        // simpleText safelist - keeps only b, em, i, strong, u; other tags are dropped but their text is kept
        String cleanSimple = Jsoup.clean(unsafeHtml, Safelist.simpleText());
        System.out.println("Simple: " + cleanSimple);
        // Output: Safe contentClick me

        // basicWithImages safelist - allows basic formatting plus images
        String cleanWithImages = Jsoup.clean(unsafeHtml, Safelist.basicWithImages());
        System.out.println("With Images: " + cleanWithImages);
        // Output (whitespace may differ): <p>Safe content</p><a rel="nofollow">Click me</a>

        // Relaxed safelist - allows more HTML elements
        String cleanRelaxed = Jsoup.clean(unsafeHtml, Safelist.relaxed());
        System.out.println("Relaxed: " + cleanRelaxed);
        // Output (whitespace may differ): <div><p>Safe content</p><a>Click me</a></div>
    }
}
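
By default jsoup pretty-prints cleaned output, so the printed result can contain line breaks that the one-line output comments above do not show. If compact output matters, the four-argument overload of Jsoup.clean accepts output settings. The following is a minimal sketch; the class name CompactOutput and the sample HTML are illustrative only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

public class CompactOutput {
    // Cleans a body fragment and serializes the result without pretty-printing.
    public static String cleanCompact(String html) {
        Document.OutputSettings settings = new Document.OutputSettings().prettyPrint(false);
        // The empty string is the base URI; this sample has no relative links to resolve.
        return Jsoup.clean(html, "", Safelist.basic(), settings);
    }

    public static void main(String[] args) {
        String html = "<div><p>Hello</p><p>World</p><script>alert(1)</script></div>";
        System.out.println(cleanCompact(html));
    }
}
```

With pretty-printing disabled, jsoup adds no indentation or line breaks of its own, which makes the cleaned string easier to compare in tests.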

Available Predefined Safelists

  1. Safelist.none() - Removes all HTML, returns plain text
  2. Safelist.simpleText() - Allows only b, em, i, strong, u tags
  3. Safelist.basic() - Allows basic text formatting and links
  4. Safelist.basicWithImages() - Basic plus img tags
  5. Safelist.relaxed() - Comprehensive list for rich text content
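
The difference between the two strictest safelists in the list above can be seen on a single input. This sketch assumes nothing beyond the jsoup API already used in this guide; the class and method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class SafelistComparison {
    // Safelist.none(): every tag is dropped, only text nodes remain.
    public static String plainText(String html) {
        return Jsoup.clean(html, Safelist.none());
    }

    // Safelist.simpleText(): b, em, i, strong and u survive; other tags are dropped.
    public static String inlineOnly(String html) {
        return Jsoup.clean(html, Safelist.simpleText());
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b><script>alert(1)</script></p>";
        System.out.println("none:       " + plainText(html));
        System.out.println("simpleText: " + inlineOnly(html));
    }
}
```

In both cases the script element and its contents are removed, while the text of the dropped p element is kept.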

Creating Custom Safelists

For specific requirements, you can create custom safelists:

import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CustomSafelist {
    public static Safelist createCustomSafelist() {
        return new Safelist()
            // Allow specific tags
            .addTags("p", "div", "span", "h1", "h2", "h3", "h4", "h5", "h6")
            .addTags("strong", "em", "b", "i", "u")
            .addTags("ul", "ol", "li")
            .addTags("a", "img")
            .addTags("table", "thead", "tbody", "tr", "td", "th")

            // Allow specific attributes
            .addAttributes("a", "href", "title")
            .addAttributes("img", "src", "alt", "title", "width", "height")
            .addAttributes("div", "class", "id")
            .addAttributes("span", "class")
            .addAttributes("table", "class")

            // Restrict protocols for links and images
            .addProtocols("a", "href", "http", "https", "mailto")
            .addProtocols("img", "src", "http", "https", "data")

            // Preserve relative links
            .preserveRelativeLinks(true);
    }

    public static void main(String[] args) {
        String html = "<div class='content'>" +
                     "<a href='https://example.com'>Safe link</a>" +
                     "<a href='javascript:alert(1)'>Unsafe link</a>" +
                     "<img src='https://example.com/image.jpg' alt='Image'>" +
                     "<script>alert('XSS')</script>" +
                     "</div>";

        Safelist customSafelist = createCustomSafelist();
        String cleanHtml = Jsoup.clean(html, customSafelist);
        System.out.println(cleanHtml);
        // Output: <div class="content"><a href="https://example.com">Safe link</a>
        //         <a>Unsafe link</a><img src="https://example.com/image.jpg" alt="Image"></div>
    }
}
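
How relative links are treated depends on both preserveRelativeLinks and the base URI passed to Jsoup.clean. The sketch below contrasts the two settings when a base URI is supplied; the base URI https://example.com/ and the class name RelativeLinks are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class RelativeLinks {
    // Hypothetical base URI, used only to resolve relative links in this example.
    static final String BASE_URI = "https://example.com/";

    // Default behavior: relative links are rewritten to absolute URLs.
    public static String resolveLinks(String html) {
        return Jsoup.clean(html, BASE_URI, Safelist.basic());
    }

    // With preserveRelativeLinks(true): relative links are kept as written.
    public static String keepRelativeLinks(String html) {
        return Jsoup.clean(html, BASE_URI, Safelist.basic().preserveRelativeLinks(true));
    }

    public static void main(String[] args) {
        String html = "<a href='/docs/page'>Docs</a>";
        System.out.println(resolveLinks(html));
        System.out.println(keepRelativeLinks(html));
    }
}
```

In both variants the link is still checked against the allowed protocols after being resolved against the base URI, so a relative link cannot be used to smuggle in a disallowed scheme.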

Advanced Cleaning Techniques

Using Cleaner Class Directly

For more control over the cleaning process, use the Cleaner class:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;

public class AdvancedCleaning {
    public static void cleanDocument() {
        String html = "<html><head><title>Test</title></head>" +
                     "<body><div>Content</div><script>alert('XSS')</script></body></html>";

        Document dirtyDoc = Jsoup.parse(html);
        Safelist safelist = Safelist.relaxed();
        Cleaner cleaner = new Cleaner(safelist);

        Document cleanDoc = cleaner.clean(dirtyDoc);
        System.out.println(cleanDoc.html());
    }

    public static boolean isValid(String html, Safelist safelist) {
        Cleaner cleaner = new Cleaner(safelist);
        Document dirtyDoc = Jsoup.parse(html);
        return cleaner.isValid(dirtyDoc);
    }

    public static void main(String[] args) {
        String testHtml = "<p>Safe content</p><script>alert('XSS')</script>";

        // Check if HTML is valid against safelist
        boolean isValid = isValid(testHtml, Safelist.basic());
        System.out.println("Is valid: " + isValid); // Output: false

        cleanDocument();
    }
}
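
When the input is a body fragment rather than a full document, the static helper Jsoup.isValid(String, Safelist) performs the same check without constructing a Cleaner by hand. A minimal sketch, with a hypothetical class name and sample inputs:

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class FragmentValidation {
    // Returns true only if the fragment contains nothing the basic safelist would remove.
    public static boolean isAcceptable(String bodyHtml) {
        return Jsoup.isValid(bodyHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("<p>Safe <b>content</b></p>"));
        System.out.println(isAcceptable("<p>Safe</p><script>alert('XSS')</script>"));
        System.out.println(isAcceptable("<p onclick='steal()'>Click</p>"));
    }
}
```

Validation is useful for rejecting input or giving feedback in a form, but it is not a substitute for cleaning: content that will be rendered should still go through Jsoup.clean.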

Removing Specific Elements and Attributes

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ElementRemoval {
    public static String removeSpecificElements(String html) {
        Document doc = Jsoup.parse(html);

        // Remove all script tags
        doc.select("script").remove();

        // Remove all elements with specific classes
        doc.select(".advertisement").remove();
        doc.select(".tracking").remove();

        // Remove all style attributes
        Elements elementsWithStyle = doc.select("*[style]");
        for (Element element : elementsWithStyle) {
            element.removeAttr("style");
        }

        // Remove all onclick attributes
        Elements elementsWithOnclick = doc.select("*[onclick]");
        for (Element element : elementsWithOnclick) {
            element.removeAttr("onclick");
        }

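        // Remove dangerous href attributes. Normalize first: URL schemes are
        // case-insensitive and may be preceded by whitespace.
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            String href = link.attr("href").trim().toLowerCase(java.util.Locale.ROOT);
            if (href.startsWith("javascript:") || href.startsWith("data:")) {
                link.removeAttr("href");
            }
        }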

        return doc.html();
    }

    public static void main(String[] args) {
        String html = "<div class='content'>" +
                     "<p style='color: red;'>Content</p>" +
                     "<div class='advertisement'>Ad content</div>" +
                     "<a href='javascript:alert(1)' onclick='hack()'>Link</a>" +
                     "<script>alert('XSS')</script>" +
                     "</div>";

        String cleaned = removeSpecificElements(html);
        System.out.println(cleaned);
    }
}
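
Removing elements by selector, as above, is a denylist: anything not explicitly targeted (for example other event-handler attributes such as onmouseover) passes through. For untrusted input, one alternative is to start from a predefined safelist and narrow it with removeTags and removeAttributes. The sketch below is one possible arrangement; the class name and the choice of what to remove are illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class NarrowedSafelist {
    // Starts from relaxed() and takes away images and link titles.
    public static Safelist relaxedWithoutImages() {
        return Safelist.relaxed()
            .removeTags("img")
            .removeAttributes("a", "title");
    }

    public static String clean(String html) {
        return Jsoup.clean(html, relaxedWithoutImages());
    }

    public static void main(String[] args) {
        String html = "<p>Hi <img src='https://example.com/a.png'> "
                    + "<a href='https://example.com' title='t' onclick='x()'>link</a></p>";
        System.out.println(clean(html));
    }
}
```

Because the safelist is an allowlist, the onclick attribute is dropped without being named anywhere in the configuration.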

Content-Specific Sanitization

Sanitizing Rich Text Content

For rich text editors and content management systems:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;
import org.jsoup.select.Elements;

public class RichTextSanitizer {
    public static Safelist createRichTextSafelist() {
        Safelist safelist = Safelist.relaxed()
            // Add additional formatting tags (relaxed() already includes several of these)
            .addTags("code", "pre", "blockquote", "cite")
            .addTags("del", "ins", "mark", "sub", "sup")

            // Add more table-related tags
            .addTags("caption", "colgroup", "col")

            // Allow class and id as styling hooks
            .addAttributes("div", "class", "id")
            .addAttributes("span", "class", "id")
            .addAttributes("p", "class", "id")

            // Attribute names are matched literally, so "data-*" is not a wildcard.
            // List each data attribute you need; data-id here is only an example.
            .addAttributes(":all", "data-id")

            // Keep relative links (e.g. /docs/page) instead of dropping them
            .preserveRelativeLinks(true);

        // addAttributes takes a single tag name, so add the headings one by one
        for (String heading : new String[] {"h1", "h2", "h3", "h4", "h5", "h6"}) {
            safelist.addAttributes(heading, "class", "id");
        }
        return safelist;
    }

    public static String sanitizeRichText(String html) {
        // First pass: use custom safelist
        String cleaned = Jsoup.clean(html, createRichTextSafelist());

        // Second pass: additional processing
        Document doc = Jsoup.parse(cleaned);

        // Extra safeguard for code blocks. The safelist pass has already removed
        // script elements, so this should find nothing in practice.
        Elements codeBlocks = doc.select("pre code");
        for (Element code : codeBlocks) {
            code.select("script").remove();
        }

        return doc.body().html();
    }
}

Sanitizing User Comments

For user-generated comments with stricter rules:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;
import org.jsoup.safety.Safelist;
import org.jsoup.select.Elements;

public class CommentSanitizer {
    public static Safelist createCommentSafelist() {
        return new Safelist()
            .addTags("p", "br")
            .addTags("strong", "em", "b", "i")
            .addTags("a")
            .addAttributes("a", "href")
            .addProtocols("a", "href", "http", "https")
            .preserveRelativeLinks(false); // Disable relative links
    }

    public static String sanitizeComment(String comment) {
        // Clean with strict safelist
        String cleaned = Jsoup.clean(comment, createCommentSafelist());

        // Additional validation
        Document doc = Jsoup.parse(cleaned);

        // Limit link count to prevent spam: keep the first 3 links, unwrap the rest
        Elements links = doc.select("a");
        for (int i = 3; i < links.size(); i++) {
            links.get(i).unwrap(); // Remove tag but keep text
        }

        // Ensure reasonable length
        String text = doc.text();
        if (text.length() > 1000) {
            // Truncate if too long. doc.text() returns decoded text (e.g. "&lt;" becomes "<"),
            // so it must be escaped again before being returned as HTML.
            // Note that truncation discards all formatting.
            return Entities.escape(text.substring(0, 1000)) + "...";
        }

        return doc.body().html();
    }
}
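
For comment links, a common complement to limiting link count is forcing rel="nofollow" so that spam links gain no search ranking benefit. jsoup's Safelist.addEnforcedAttribute can add the attribute during cleaning. The following is a sketch under the assumption that no base URI is passed to Jsoup.clean; the class name CommentLinks is hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CommentLinks {
    // A small safelist for comments whose links always carry rel="nofollow".
    public static Safelist commentLinkSafelist() {
        return new Safelist()
            .addTags("p", "a")
            .addAttributes("a", "href")
            .addProtocols("a", "href", "http", "https")
            .addEnforcedAttribute("a", "rel", "nofollow");
    }

    public static String clean(String html) {
        return Jsoup.clean(html, commentLinkSafelist());
    }

    public static void main(String[] args) {
        System.out.println(clean("<p><a href='https://example.com/'>site</a></p>"));
    }
}
```

The enforced attribute is written by the cleaner itself, so it appears even when the submitted markup omits it or supplies a different rel value.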

Performance Considerations

Caching Cleaner Instances

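For high-volume applications, reuse Cleaner instances instead of building a new one on every call:

import java.util.concurrent.ConcurrentHashMap;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;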

public class CleanerCache {
    private static final ConcurrentHashMap<String, Cleaner> cleanerCache = 
        new ConcurrentHashMap<>();

    public static Cleaner getCleaner(String type) {
        return cleanerCache.computeIfAbsent(type, k -> {
            switch (k) {
                case "basic":
                    return new Cleaner(Safelist.basic());
                case "relaxed":
                    return new Cleaner(Safelist.relaxed());
                case "comment":
                    return new Cleaner(CommentSanitizer.createCommentSafelist());
                default:
                    return new Cleaner(Safelist.basic());
            }
        });
    }

    public static String cleanHtml(String html, String cleanerType) {
        Cleaner cleaner = getCleaner(cleanerType);
        Document dirtyDoc = Jsoup.parse(html);
        Document cleanDoc = cleaner.clean(dirtyDoc);
        return cleanDoc.body().html();
    }
}

Security Best Practices

Defense in Depth

When working with HTML sanitization, implementing multiple layers of security is crucial. While jsoup provides excellent HTML cleaning capabilities, combining it with other security measures creates a more robust defense system:

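import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;
import org.jsoup.select.Elements;

public class SecurityBestPractices {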
    public static String sanitizeWithValidation(String html) {
        // 1. Input validation
        if (html == null || html.trim().isEmpty()) {
            return "";
        }

        // 2. Length limitation
        if (html.length() > 10000) {
            throw new IllegalArgumentException("HTML content too large");
        }

        // 3. Basic pattern checking (optional pre-filter)
        if (html.toLowerCase().contains("<script") || 
            html.toLowerCase().contains("javascript:")) {
            // Log suspicious content
            System.out.println("Suspicious content detected");
        }

        // 4. jsoup cleaning
        String cleaned = Jsoup.clean(html, Safelist.relaxed());

        // 5. Post-processing validation
        Document doc = Jsoup.parse(cleaned);
        if (doc.select("a").size() > 10) {
            // Additional link validation
            validateLinks(doc);
        }

        // Return the re-serialized document so that changes made by validateLinks are kept
        return doc.body().html();
    }

    private static void validateLinks(Document doc) {
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            String href = link.attr("href");
            // Additional URL validation logic
            if (!isValidUrl(href)) {
                link.removeAttr("href");
            }
        }
    }

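    private static boolean isValidUrl(String url) {
        // Accept only absolute http(s) URLs; URL schemes are case-insensitive
        String lower = url.trim().toLowerCase(java.util.Locale.ROOT);
        return lower.startsWith("http://") || lower.startsWith("https://");
    }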
}

Common Pitfalls and Solutions

Handling Base64 Images

When dealing with data URLs for images, be cautious about allowing them:

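import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;
import org.jsoup.select.Elements;

public class Base64ImageHandler {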
    public static Safelist createImageSafelist() {
        return Safelist.basicWithImages()
            .addProtocols("img", "src", "http", "https")
            // Be careful with data URLs - validate them properly
            .addProtocols("img", "src", "data");
    }

    public static String sanitizeWithImageValidation(String html) {
        String cleaned = Jsoup.clean(html, createImageSafelist());
        Document doc = Jsoup.parse(cleaned);

        // Validate data URLs
        Elements dataImages = doc.select("img[src^=data:]");
        for (Element img : dataImages) {
            String src = img.attr("src");
            if (!isValidDataUrl(src)) {
                img.remove();
            }
        }

        return doc.body().html();
    }

    private static boolean isValidDataUrl(String dataUrl) {
        // Only allow image data URLs in specific formats (JPEG, PNG, GIF)
        String url = dataUrl.trim().toLowerCase(java.util.Locale.ROOT);
        return url.startsWith("data:image/jpeg") ||
               url.startsWith("data:image/png") ||
               url.startsWith("data:image/gif");
    }
}

Integration with Web Applications

When integrating HTML sanitization into web applications, consider using jsoup alongside your web framework. Keep in mind that jsoup parses static HTML and does not execute JavaScript, so for web scraping scenarios that involve content loaded dynamically after the initial page load, you might need to combine jsoup with browser automation tools.

Additionally, if you're working with applications that need to handle authentication before accessing content, ensure that your sanitization process accounts for session-based content that might contain user-specific data.

Conclusion

jsoup's HTML sanitization capabilities provide a robust foundation for cleaning untrusted HTML content. By using predefined safelists for common scenarios and creating custom safelists for specific requirements, developers can effectively prevent XSS attacks while preserving necessary HTML formatting.

Key takeaways for effective HTML sanitization:

  1. Choose appropriate safelists based on your content requirements
  2. Create custom safelists for specific use cases
  3. Implement defense in depth with multiple validation layers
  4. Cache cleaner instances for better performance
  5. Validate content both before and after cleaning
  6. Test thoroughly with various input scenarios

Remember that HTML sanitization is just one part of a comprehensive security strategy. Always combine it with proper input validation, output encoding, and other security measures appropriate for your application's threat model.
