How can I use jsoup to clean and sanitize HTML content?
HTML sanitization is a critical security practice when dealing with user-generated content or untrusted HTML sources. jsoup provides powerful built-in tools for cleaning and sanitizing HTML content, helping developers prevent XSS attacks and ensure safe HTML output. This guide covers comprehensive techniques for HTML sanitization using jsoup's Cleaner and Safelist (formerly Whitelist) classes.
Understanding HTML Sanitization
HTML sanitization involves removing or modifying potentially dangerous HTML elements, attributes, and content while preserving safe markup. This process is essential when:
- Processing user-generated content
- Displaying HTML from external sources
- Preventing XSS (Cross-Site Scripting) attacks
- Ensuring content compliance with security policies
Basic HTML Cleaning with jsoup
Using Predefined Safelists
jsoup provides several predefined safelists for common use cases:
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;
public class HtmlSanitizer {
public static void main(String[] args) {
String unsafeHtml = "<div><script>alert('XSS')</script><p>Safe content</p>" +
"<a href='javascript:alert(1)'>Click me</a></div>";
// Basic safelist - allows text formatting tags and links, and adds rel="nofollow" to links
String cleanBasic = Jsoup.clean(unsafeHtml, Safelist.basic());
System.out.println("Basic: " + cleanBasic);
// Output (line breaks omitted): <p>Safe content</p><a rel="nofollow">Click me</a>
// simpleText safelist - keeps only b, em, i, strong, u; other tags are removed but their text is kept
String cleanSimple = Jsoup.clean(unsafeHtml, Safelist.simpleText());
System.out.println("Simple: " + cleanSimple);
// Output: Safe contentClick me
// basicWithImages safelist - allows basic formatting plus images
String cleanWithImages = Jsoup.clean(unsafeHtml, Safelist.basicWithImages());
System.out.println("With Images: " + cleanWithImages);
// Output (line breaks omitted): <p>Safe content</p><a rel="nofollow">Click me</a>
// Relaxed safelist - allows more HTML elements, including div and tables
String cleanRelaxed = Jsoup.clean(unsafeHtml, Safelist.relaxed());
System.out.println("Relaxed: " + cleanRelaxed);
// Output (line breaks omitted): <div><p>Safe content</p><a>Click me</a></div>
}
}
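Note that `Jsoup.clean(String, Safelist)` pretty-prints its result, so real output may contain line breaks that compact output comments leave out. The following is a minimal sketch, assuming you want single-line output; the class and method names (`CompactClean`, `cleanCompact`) are illustrative and not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

class CompactClean {
    // Cleans a body fragment and serializes it without pretty-print line breaks.
    static String cleanCompact(String html, Safelist safelist) {
        Document.OutputSettings settings = new Document.OutputSettings().prettyPrint(false);
        // The empty string is the base URI; no relative URL resolution is attempted here.
        return Jsoup.clean(html, "", safelist, settings);
    }

    public static void main(String[] args) {
        String html = "<p>a</p><script>x()</script><p>b</p>";
        System.out.println(cleanCompact(html, Safelist.basic()));
    }
}
```

Disabling pretty-printing is mostly useful when you compare cleaned output in tests or store it where whitespace matters.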
Available Predefined Safelists
- Safelist.none() - Removes all HTML, returns plain text
- Safelist.simpleText() - Allows only b, em, i, strong, u tags
- Safelist.basic() - Allows basic text formatting and links
- Safelist.basicWithImages() - Basic plus img tags
- Safelist.relaxed() - Comprehensive list for rich text content
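To make the difference between the two strictest safelists concrete, here is a small sketch that runs the same input through `Safelist.none()` and `Safelist.simpleText()`. The class name `SafelistComparison` and its helper methods are hypothetical names used only for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class SafelistComparison {
    // Removes every tag and keeps only the text content.
    static String toPlainText(String html) {
        return Jsoup.clean(html, Safelist.none());
    }

    // Keeps b, em, i, strong and u; every other tag is dropped but its text stays.
    static String toSimpleText(String html) {
        return Jsoup.clean(html, Safelist.simpleText());
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>bold</b> <a href='https://example.com'>link</a></p>";
        System.out.println("none:       " + toPlainText(html));
        System.out.println("simpleText: " + toSimpleText(html));
    }
}
```

In both cases the text of removed elements is preserved; only the markup around it is stripped.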
Creating Custom Safelists
For specific requirements, you can create custom safelists:
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;
public class CustomSafelist {
public static Safelist createCustomSafelist() {
return new Safelist()
// Allow specific tags
.addTags("p", "div", "span", "h1", "h2", "h3", "h4", "h5", "h6")
.addTags("strong", "em", "b", "i", "u")
.addTags("ul", "ol", "li")
.addTags("a", "img")
.addTags("table", "thead", "tbody", "tr", "td", "th")
// Allow specific attributes
.addAttributes("a", "href", "title")
.addAttributes("img", "src", "alt", "title", "width", "height")
.addAttributes("div", "class", "id")
.addAttributes("span", "class")
.addAttributes("table", "class")
// Restrict protocols for links and images
.addProtocols("a", "href", "http", "https", "mailto")
.addProtocols("img", "src", "http", "https", "data")
// Preserve relative links
.preserveRelativeLinks(true);
}
public static void main(String[] args) {
String html = "<div class='content'>" +
"<a href='https://example.com'>Safe link</a>" +
"<a href='javascript:alert(1)'>Unsafe link</a>" +
"<img src='https://example.com/image.jpg' alt='Image'>" +
"<script>alert('XSS')</script>" +
"</div>";
Safelist customSafelist = createCustomSafelist();
String cleanHtml = Jsoup.clean(html, customSafelist);
System.out.println(cleanHtml);
// Output: <div class="content"><a href="https://example.com">Safe link</a>
// <a>Unsafe link</a><img src="https://example.com/image.jpg" alt="Image"></div>
}
}
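Relative links deserve a closer look. jsoup resolves a relative URL against a base URI to check its protocol, and depending on the jsoup version, relative URLs cleaned without a base URI may be discarded. The sketch below, which assumes a site served from `https://example.com/`, passes the base URI explicitly; `RelativeLinkDemo` and its method names are illustrative only.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class RelativeLinkDemo {
    // Keeps relative URLs exactly as written in the input.
    static String cleanKeepingRelative(String html, String baseUri) {
        Safelist safelist = Safelist.basic().preserveRelativeLinks(true);
        return Jsoup.clean(html, baseUri, safelist);
    }

    // Default behavior: relative URLs are rewritten to absolute ones.
    static String cleanMakingAbsolute(String html, String baseUri) {
        return Jsoup.clean(html, baseUri, Safelist.basic());
    }

    public static void main(String[] args) {
        String html = "<a href=\"/docs/page\">Docs</a>";
        String base = "https://example.com/"; // assumed site origin for this example
        System.out.println(cleanKeepingRelative(html, base));
        System.out.println(cleanMakingAbsolute(html, base));
    }
}
```

Passing the base URI keeps the behavior explicit rather than relying on version-specific defaults.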
Advanced Cleaning Techniques
Using Cleaner Class Directly
For more control over the cleaning process, use the Cleaner class:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;
public class AdvancedCleaning {
public static void cleanDocument() {
String html = "<html><head><title>Test</title></head>" +
"<body><div>Content</div><script>alert('XSS')</script></body></html>";
Document dirtyDoc = Jsoup.parse(html);
Safelist safelist = Safelist.relaxed();
Cleaner cleaner = new Cleaner(safelist);
Document cleanDoc = cleaner.clean(dirtyDoc);
System.out.println(cleanDoc.html());
}
public static boolean isValid(String html, Safelist safelist) {
Cleaner cleaner = new Cleaner(safelist);
Document dirtyDoc = Jsoup.parse(html);
return cleaner.isValid(dirtyDoc);
}
public static void main(String[] args) {
String testHtml = "<p>Safe content</p><script>alert('XSS')</script>";
// Check if HTML is valid against safelist
boolean isValid = isValid(testHtml, Safelist.basic());
System.out.println("Is valid: " + isValid); // Output: false
cleanDocument();
}
}
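Validation and cleaning can also be combined. The sketch below uses the `Jsoup.isValid(String, Safelist)` convenience method to reject input that contains disallowed markup, and still cleans accepted input so that enforced attributes and parser normalization are applied. The class `StrictInput` and the choice to throw `IllegalArgumentException` are assumptions made for this example, not a jsoup convention.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class StrictInput {
    // Rejects input containing anything outside the safelist, then normalizes what is accepted.
    static String requireValid(String bodyHtml, Safelist safelist) {
        if (!Jsoup.isValid(bodyHtml, safelist)) {
            throw new IllegalArgumentException("Input contains disallowed HTML");
        }
        // Clean even valid input: the stored form should be jsoup's normalized output.
        return Jsoup.clean(bodyHtml, safelist);
    }

    public static void main(String[] args) {
        System.out.println(requireValid("<p>Safe content</p>", Safelist.basic()));
        try {
            requireValid("<p>Hi</p><script>alert('XSS')</script>", Safelist.basic());
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Rejecting is a reasonable choice for form validation, where the user can correct the input; for content that must always be accepted, cleaning alone is usually enough.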
Removing Specific Elements and Attributes
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class ElementRemoval {
public static String removeSpecificElements(String html) {
Document doc = Jsoup.parse(html);
// Remove all script tags
doc.select("script").remove();
// Remove all elements with specific classes
doc.select(".advertisement").remove();
doc.select(".tracking").remove();
// Remove all style attributes
Elements elementsWithStyle = doc.select("*[style]");
for (Element element : elementsWithStyle) {
element.removeAttr("style");
}
// Remove all onclick attributes
Elements elementsWithOnclick = doc.select("*[onclick]");
for (Element element : elementsWithOnclick) {
element.removeAttr("onclick");
}
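// Remove dangerous href attributes. This is a denylist check, so it only
// catches the schemes named here; prefer a Safelist for untrusted input.
Elements links = doc.select("a[href]");
for (Element link : links) {
// Normalize first so values like " JavaScript:alert(1)" are also caught
String href = link.attr("href").trim().toLowerCase(java.util.Locale.ROOT);
if (href.startsWith("javascript:") || href.startsWith("data:")) {
link.removeAttr("href");
}
}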
return doc.html();
}
public static void main(String[] args) {
String html = "<div class='content'>" +
"<p style='color: red;'>Content</p>" +
"<div class='advertisement'>Ad content</div>" +
"<a href='javascript:alert(1)' onclick='hack()'>Link</a>" +
"<script>alert('XSS')</script>" +
"</div>";
String cleaned = removeSpecificElements(html);
System.out.println(cleaned);
}
}
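Removing `onclick` alone leaves other inline handlers such as `onmouseover` or `onerror` in place. As a hedged sketch, the following strips every attribute whose name starts with `on`; the class and method names are illustrative. This is still a denylist, so it complements a Safelist for trusted-but-messy markup rather than replacing one for untrusted input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class EventHandlerStripper {
    // Removes all on* attributes (onclick, onerror, ...) from every element in the fragment.
    static String stripEventHandlers(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        for (Element el : doc.getAllElements()) {
            // Collect names first so the attribute list is not modified while iterating.
            List<String> toRemove = new ArrayList<>();
            for (Attribute attr : el.attributes()) {
                if (attr.getKey().toLowerCase(Locale.ROOT).startsWith("on")) {
                    toRemove.add(attr.getKey());
                }
            }
            for (String key : toRemove) {
                el.removeAttr(key);
            }
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<a href='#top' onclick='hack()' onmouseover='track()'>Link</a>";
        System.out.println(stripEventHandlers(html));
    }
}
```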
Content-Specific Sanitization
Sanitizing Rich Text Content
For rich text editors and content management systems:
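import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;
import org.jsoup.select.Elements;
public class RichTextSanitizer {
public static Safelist createRichTextSafelist() {
return Safelist.relaxed()
// Add additional formatting tags (those already in relaxed() are harmless to repeat)
.addTags("code", "pre", "blockquote", "cite")
.addTags("del", "ins", "mark", "sub", "sup")
// Add more table-related tags
.addTags("caption", "colgroup", "col")
// Allow class and id as styling hooks
.addAttributes("div", "class", "id")
.addAttributes("span", "class", "id")
.addAttributes("p", "class", "id")
// addAttributes takes a single tag name per call, so list each heading level
.addAttributes("h1", "class", "id")
.addAttributes("h2", "class", "id")
.addAttributes("h3", "class", "id")
.addAttributes("h4", "class", "id")
.addAttributes("h5", "class", "id")
.addAttributes("h6", "class", "id")
// Attribute names are matched literally, so "data-*" is not a wildcard.
// List each data attribute you need; "data-id" is only an example name.
.addAttributes(":all", "data-id")
// Keep relative URLs as written instead of resolving them to absolute ones
.preserveRelativeLinks(true);
}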
public static String sanitizeRichText(String html) {
// First pass: use custom safelist
String cleaned = Jsoup.clean(html, createRichTextSafelist());
// Second pass: additional processing
Document doc = Jsoup.parse(cleaned);
// Ensure code blocks are properly formatted
Elements codeBlocks = doc.select("pre code");
for (Element code : codeBlocks) {
// Remove any remaining dangerous content
code.select("script").remove();
}
return doc.body().html();
}
}
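One detail worth illustrating: as far as the Safelist API goes, attribute names passed to `addAttributes` are matched literally, so a pattern such as `data-*` does not act as a wildcard. A minimal sketch, assuming your editor emits a known set of data attributes, registers each name explicitly on the special `:all` tag. The attribute names `data-id` and `data-role` and the class `DataAttributeSafelist` are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class DataAttributeSafelist {
    // Builds a relaxed safelist that also allows the given attribute names on every tag.
    static Safelist withDataAttributes(String... dataAttributeNames) {
        return Safelist.relaxed().addAttributes(":all", dataAttributeNames);
    }

    static String clean(String html) {
        // Hypothetical attribute names; replace with the ones your application uses.
        return Jsoup.clean(html, withDataAttributes("data-id", "data-role"));
    }

    public static void main(String[] args) {
        System.out.println(clean("<p data-id=\"7\" data-other=\"x\">hi</p>"));
    }
}
```

Listing names explicitly is more verbose than a wildcard, but it keeps the set of attributes reaching the browser fully enumerated.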
Sanitizing User Comments
For user-generated comments with stricter rules:
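import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;
import org.jsoup.select.Elements;
public class CommentSanitizer {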
public static Safelist createCommentSafelist() {
return new Safelist()
.addTags("p", "br")
.addTags("strong", "em", "b", "i")
.addTags("a")
.addAttributes("a", "href")
.addProtocols("a", "href", "http", "https")
.preserveRelativeLinks(false); // Disable relative links
}
public static String sanitizeComment(String comment) {
// Clean with strict safelist
String cleaned = Jsoup.clean(comment, createCommentSafelist());
// Additional validation
Document doc = Jsoup.parse(cleaned);
// Limit link count to prevent spam
Elements links = doc.select("a");
if (links.size() > 3) {
// Remove excess links
for (int i = 3; i < links.size(); i++) {
links.get(i).unwrap(); // Remove tag but keep text
}
}
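// Ensure reasonable length
String text = doc.text();
if (text.length() > 1000) {
// Truncate if too long. doc.text() is unescaped plain text, so escape it again
// before returning it as HTML; otherwise "&lt;script&gt;" would come back as "<script>"
return org.jsoup.nodes.Entities.escape(text.substring(0, 1000)) + "...";
}
return doc.body().html();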
}
}
Performance Considerations
Caching Cleaner Instances
For high-volume applications, cache Cleaner instances:
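import java.util.concurrent.ConcurrentHashMap;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;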
public class CleanerCache {
private static final ConcurrentHashMap<String, Cleaner> cleanerCache =
new ConcurrentHashMap<>();
public static Cleaner getCleaner(String type) {
return cleanerCache.computeIfAbsent(type, k -> {
switch (k) {
case "basic":
return new Cleaner(Safelist.basic());
case "relaxed":
return new Cleaner(Safelist.relaxed());
case "comment":
return new Cleaner(CommentSanitizer.createCommentSafelist());
default:
return new Cleaner(Safelist.basic());
}
});
}
public static String cleanHtml(String html, String cleanerType) {
Cleaner cleaner = getCleaner(cleanerType);
Document dirtyDoc = Jsoup.parse(html);
Document cleanDoc = cleaner.clean(dirtyDoc);
return cleanDoc.body().html();
}
}
Security Best Practices
Defense in Depth
When working with HTML sanitization, implementing multiple layers of security is crucial. While jsoup provides excellent HTML cleaning capabilities, combining it with other security measures creates a more robust defense system:
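import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;
import org.jsoup.select.Elements;
public class SecurityBestPractices {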
public static String sanitizeWithValidation(String html) {
// 1. Input validation
if (html == null || html.trim().isEmpty()) {
return "";
}
// 2. Length limitation
if (html.length() > 10000) {
throw new IllegalArgumentException("HTML content too large");
}
// 3. Basic pattern checking (optional pre-filter)
if (html.toLowerCase().contains("<script") ||
html.toLowerCase().contains("javascript:")) {
// Log suspicious content
System.out.println("Suspicious content detected");
}
// 4. jsoup cleaning
String cleaned = Jsoup.clean(html, Safelist.relaxed());
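// 5. Post-processing validation
Document doc = Jsoup.parse(cleaned);
if (doc.select("a").size() > 10) {
// Additional link validation
validateLinks(doc);
}
// Return the validated document rather than the earlier string,
// so that any href removed by validateLinks stays removed
return doc.body().html();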
}
private static void validateLinks(Document doc) {
Elements links = doc.select("a[href]");
for (Element link : links) {
String href = link.attr("href");
// Additional URL validation logic
if (!isValidUrl(href)) {
link.removeAttr("href");
}
}
}
private static boolean isValidUrl(String url) {
// Implement URL validation logic
return url.startsWith("http://") || url.startsWith("https://");
}
}
Common Pitfalls and Solutions
Handling Base64 Images
When dealing with data URLs for images, be cautious about allowing them:
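import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;
import org.jsoup.select.Elements;
public class Base64ImageHandler {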
public static Safelist createImageSafelist() {
return Safelist.basicWithImages()
.addProtocols("img", "src", "http", "https")
// Be careful with data URLs - validate them properly
.addProtocols("img", "src", "data");
}
public static String sanitizeWithImageValidation(String html) {
String cleaned = Jsoup.clean(html, createImageSafelist());
Document doc = Jsoup.parse(cleaned);
// Validate data URLs
Elements dataImages = doc.select("img[src^=data:]");
for (Element img : dataImages) {
String src = img.attr("src");
if (!isValidDataUrl(src)) {
img.remove();
}
}
return doc.body().html();
}
private static boolean isValidDataUrl(String dataUrl) {
// Only allow image data URLs with specific formats. Check the start of the URL:
// contains() would also match these strings when they appear inside the payload.
String lower = dataUrl.toLowerCase(java.util.Locale.ROOT);
return lower.startsWith("data:image/jpeg") ||
lower.startsWith("data:image/png") ||
lower.startsWith("data:image/gif");
}
}
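A stricter check can validate the whole data URL instead of only its prefix. The sketch below is one possible approach, assuming only base64-encoded PNG, JPEG, and GIF images are acceptable and that a size cap is wanted; the class name, the pattern, and the `MAX_LENGTH` value are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

class DataUrlValidator {
    // Media type, the base64 marker, then a payload of base64 characters only.
    private static final Pattern ALLOWED = Pattern.compile(
            "^data:image/(png|jpeg|gif);base64,[A-Za-z0-9+/]+={0,2}$",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical upper bound on URL length to limit oversized inline images.
    private static final int MAX_LENGTH = 100_000;

    static boolean isAllowed(String src) {
        return src != null
                && src.length() <= MAX_LENGTH
                && ALLOWED.matcher(src).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("data:image/png;base64,iVBORw0KGgo="));
        System.out.println(isAllowed("data:image/svg+xml;base64,PHN2Zz4="));
        System.out.println(isAllowed("data:text/html;base64,data:image/png"));
    }
}
```

Matching the full string rejects unexpected media types and payloads that merely embed an allowed prefix somewhere inside them.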
Integration with Web Applications
When integrating HTML sanitization into a web application, jsoup can be used alongside your existing framework. For complex web scraping scenarios where content loads dynamically after the initial page load, you may need to combine jsoup with a browser automation tool, since jsoup does not execute JavaScript.
Additionally, if you're working with applications that need to handle authentication before accessing content, ensure that your sanitization process accounts for session-based content that might contain user-specific data.
Conclusion
jsoup's HTML sanitization capabilities provide a robust foundation for cleaning untrusted HTML content. By using predefined safelists for common scenarios and creating custom safelists for specific requirements, developers can effectively prevent XSS attacks while preserving necessary HTML formatting.
Key takeaways for effective HTML sanitization:
- Choose appropriate safelists based on your content requirements
- Create custom safelists for specific use cases
- Implement defense in depth with multiple validation layers
- Cache cleaner instances for better performance
- Validate content both before and after cleaning
- Test thoroughly with various input scenarios
Remember that HTML sanitization is just one part of a comprehensive security strategy. Always combine it with proper input validation, output encoding, and other security measures appropriate for your application's threat model.