Yes, Jsoup provides a utility for cleaning HTML to prevent XSS (cross-site scripting) attacks. The Jsoup.clean() method takes untrusted HTML (a body fragment rather than a full document) and returns a sanitized copy that keeps only the tags and attributes permitted by the whitelist you pass in.
The Whitelist class in Jsoup defines several preset whitelist configurations that you can use:
- Whitelist.none(): Allows no tags or attributes (resulting in only text).
- Whitelist.simpleText(): Allows basic formatting tags like b, em, and strong.
- Whitelist.basic(): Allows a set of tags and attributes useful for simple formatting.
- Whitelist.basicWithImages(): Same as Whitelist.basic(), but also allows img tags with attributes such as src and alt.
- Whitelist.relaxed(): Allows a more extensive set of HTML tags and attributes.
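To see how the presets differ in practice, the following minimal sketch runs one fragment through each of them. The class name PresetComparisonSketch, the helper cleanWithPresets, and the sample input are illustrative assumptions, not part of Jsoup; the sketch also assumes a jsoup version that still ships org.jsoup.safety.Whitelist.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class for illustration only.
public class PresetComparisonSketch {

    // Cleans the same fragment with every preset and returns the results
    // keyed by preset name, in the order the presets are listed above.
    static Map<String, String> cleanWithPresets(String html) {
        Map<String, String> results = new LinkedHashMap<>();
        results.put("none", Jsoup.clean(html, Whitelist.none()));
        results.put("simpleText", Jsoup.clean(html, Whitelist.simpleText()));
        results.put("basic", Jsoup.clean(html, Whitelist.basic()));
        results.put("basicWithImages", Jsoup.clean(html, Whitelist.basicWithImages()));
        results.put("relaxed", Jsoup.clean(html, Whitelist.relaxed()));
        return results;
    }

    public static void main(String[] args) {
        // Sample input (an assumption): a heading, bold text, an image, and a script.
        String html = "<h1>Title</h1><p><b>Bold</b> text "
                + "<img src='http://example.com/a.png' alt='pic'></p>"
                + "<script>alert('XSS');</script>";

        cleanWithPresets(html).forEach(
                (name, cleaned) -> System.out.println(name + " -> " + cleaned));
    }
}
```

Printing the results side by side is a quick way to check which preset keeps the markup your application relies on before committing to one.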
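You can also define a custom Whitelist if the presets do not fit your needs. Note that jsoup 1.14.2 renamed this class to Safelist (org.jsoup.safety.Safelist) and deprecated Whitelist, which was later removed, so on current versions substitute Safelist for Whitelist throughout.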
Below is an example of how to use Jsoup's clean() method to sanitize HTML:
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class JsoupCleanExample {
    public static void main(String[] args) {
        // Potentially unsafe HTML: an inline event handler plus a script tag
        String unsafeHtml = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>" +
                "<script>alert('XSS');</script>";

        // Keep only what the basic() preset allows; everything else is dropped
        String safeHtml = Jsoup.clean(unsafeHtml, Whitelist.basic());

        // Output the cleaned HTML
        System.out.println(safeHtml);
    }
}
This example cleans the unsafeHtml string, removing any tags and attributes that are not allowed by the Whitelist.basic() preset. In the resulting safeHtml string, the <script> element and the onclick attribute are gone, while the paragraph and the link with its href are kept. The basic() preset also adds rel="nofollow" to links.
Remember that the effectiveness of HTML sanitization depends on the specific whitelist you use and how it matches your application's requirements. It's important to carefully consider what HTML elements and attributes you allow to ensure that your cleaning process is as secure as possible.
For web applications with stricter security requirements, it's recommended to start from Whitelist.none() and then selectively enable only the tags and attributes your application truly needs, using methods such as addTags(), addAttributes(), and addProtocols().
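As a minimal sketch of that approach, the example below starts from Whitelist.none() and enables only paragraphs, bold text, and https links. The class name StrictWhitelistSketch, the sanitize helper, and the particular choice of tags are assumptions made for illustration; adjust them to your own requirements. It also assumes a jsoup version that still ships org.jsoup.safety.Whitelist.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

// Hypothetical helper class for illustration only.
public class StrictWhitelistSketch {

    static String sanitize(String html) {
        // Start with nothing allowed, then opt in to each tag and attribute.
        Whitelist strict = Whitelist.none()
                .addTags("p", "b", "a")              // allowed elements
                .addAttributes("a", "href")          // only href on links
                .addProtocols("a", "href", "https"); // href must be an https URL
        return Jsoup.clean(html, strict);
    }

    public static void main(String[] args) {
        // Sample input (an assumption): one acceptable link, one javascript: link,
        // a tag that was not enabled, and a script element.
        String unsafeHtml = "<p><a href='https://example.com/' onclick='steal()'>ok</a> "
                + "<a href='javascript:alert(1)'>bad</a> <i>italic</i></p>"
                + "<script>alert('XSS');</script>";

        System.out.println(sanitize(unsafeHtml));
    }
}
```

Restricting the protocols on href matters because a link tag that is allowed without a protocol check could still carry a javascript: URL.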