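Yes. There are two ways to use regular expressions with jsoup: directly inside a selector, through the :matches pseudo-selector, or indirectly, by selecting elements with jsoup and then applying java.util.regex to their text or attributes.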
Built-in :matches Pseudo-Selector
jsoup provides a built-in :matches(regex) pseudo-selector that selects elements whose text, including the text of their descendants, contains a match for a regular expression (use :matchesOwn(regex) to test only an element's own text):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupRegexExample {
    public static void main(String[] args) {
        // Text blocks (""") require Java 15 or later
        String html = """
            <div>
              <p>Price: $29.99</p>
              <p>Contact: user@example.com</p>
              <p>Phone: (555) 123-4567</p>
              <p>Invalid data</p>
            </div>
            """;

        Document doc = Jsoup.parse(html);

        // :matches searches within the element text, so no surrounding .* is needed

        // Select elements containing email addresses
        Elements emails = doc.select("p:matches(\\S+@\\S+\\.\\S+)");
        System.out.println("Email elements: " + emails.size());

        // Select elements containing prices
        Elements prices = doc.select("p:matches(\\$\\d+\\.\\d{2})");
        System.out.println("Price elements: " + prices.size());

        // Select elements with phone numbers; \\( and \\) match the literal parentheses
        Elements phones = doc.select("p:matches(\\(\\d{3}\\)\\s\\d{3}-\\d{4})");
        System.out.println("Phone elements: " + phones.size());
    }
}
Important Notes for :matches
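- Escape backslashes in Java strings: the regex \d is written as "\\d" in Java source
- The regex is searched for within the element's combined text (its own and its descendants'), so a pattern does not need a leading or trailing .* to match; add ^ and $ to require the whole text to match
- Case-sensitive by default; prefix the pattern with (?i) for case-insensitive matching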
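The notes above can be tried out with plain java.util.regex, without jsoup on the classpath. This is a minimal sketch that assumes :matches applies the pattern with a find-style search, as jsoup's documented examples such as td:matches(\d+) suggest. The class and method names are illustrative and not part of jsoup.

```java
import java.util.regex.Pattern;

class MatchSemanticsSketch {
    // Find-style search: true if the pattern occurs anywhere in the text.
    static boolean foundIn(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    // Whole-string match, shown only for contrast.
    static boolean matchesWhole(String regex, String text) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    public static void main(String[] args) {
        String text = "Price: $29.99";

        // The price pattern occurs inside the text but does not span all of it
        System.out.println(foundIn("\\$\\d+\\.\\d{2}", text));      // true
        System.out.println(matchesWhole("\\$\\d+\\.\\d{2}", text)); // false

        // Matching is case-sensitive unless the (?i) flag is set
        System.out.println(foundIn("price", text));     // false
        System.out.println(foundIn("(?i)price", text)); // true

        // Anchors turn a search into a whole-text match
        System.out.println(foundIn("^\\$\\d+\\.\\d{2}$", text)); // false
    }
}
```

The same pattern strings, including (?i) and the anchors, can be placed inside :matches(...) in a selector.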
Advanced Regex Integration with Java
For more complex scenarios, combine jsoup selectors with Java's regex capabilities:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.ArrayList;
import java.util.List;

public class AdvancedJsoupRegex {
    public static void main(String[] args) {
        String html = """
            <table>
              <tr><td>Product A</td><td>$25.99</td><td>SKU-001</td></tr>
              <tr><td>Product B</td><td>$15.50</td><td>SKU-002</td></tr>
              <tr><td>Invalid</td><td>No price</td><td>INVALID</td></tr>
            </table>
            """;

        Document doc = Jsoup.parse(html);

        // Step 1: Use jsoup to narrow down selection
        Elements rows = doc.select("tr");

        // Step 2: Apply regex for validation and extraction
        Pattern pricePattern = Pattern.compile("\\$(\\d+\\.\\d{2})");
        Pattern skuPattern = Pattern.compile("SKU-(\\d{3})");

        List<Product> products = new ArrayList<>();
        for (Element row : rows) {
            Elements cells = row.select("td");
            if (cells.size() == 3) {
                String name = cells.get(0).text();
                String priceText = cells.get(1).text();
                String skuText = cells.get(2).text();

                Matcher priceMatcher = pricePattern.matcher(priceText);
                Matcher skuMatcher = skuPattern.matcher(skuText);

                // Keep the row only if both the price and the SKU have the expected format
                if (priceMatcher.find() && skuMatcher.find()) {
                    double price = Double.parseDouble(priceMatcher.group(1));
                    int sku = Integer.parseInt(skuMatcher.group(1));
                    products.add(new Product(name, price, sku));
                    System.out.println("Valid product: " + name + " - $" + price);
                }
            }
        }
    }

    static class Product {
        String name;
        double price;
        int sku;

        Product(String name, double price, int sku) {
            this.name = name;
            this.price = price;
            this.sku = sku;
        }
    }
}
Attribute Matching with Regex
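You can also apply regex to element attributes, either in Java after selecting the elements or directly in a selector with jsoup's [attr~=regex] syntax, for example img[src~=(?i)\.(png|jpe?g)]. The example below takes the first approach: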
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeRegexExample {
    public static void main(String[] args) {
        String html = """
            <div>
              <img src="image1.jpg" alt="Photo">
              <img src="icon.png" alt="Icon">
              <img src="document.pdf" alt="Document">
              <a href="https://example.com">Link</a>
              <a href="mailto:test@example.com">Email</a>
            </div>
            """;

        Document doc = Jsoup.parse(html);

        // Select all images, then keep those whose src ends in an image extension
        Elements images = doc.select("img");
        Pattern imagePattern = Pattern.compile(".*\\.(jpg|jpeg|png|gif)$", Pattern.CASE_INSENSITIVE);
        for (Element img : images) {
            String src = img.attr("src");
            if (imagePattern.matcher(src).matches()) {
                System.out.println("Valid image: " + src);
            }
        }

        // Select all links, then extract the address from mailto: links
        Elements links = doc.select("a[href]");
        Pattern emailPattern = Pattern.compile("^mailto:(.+@.+\\..+)$");
        for (Element link : links) {
            String href = link.attr("href");
            Matcher emailMatcher = emailPattern.matcher(href);
            if (emailMatcher.matches()) {
                String email = emailMatcher.group(1);
                System.out.println("Email found: " + email);
            }
        }
    }
}
Performance Tips
- Pre-compile patterns for better performance:
private static final Pattern EMAIL_PATTERN = Pattern.compile(".*@.*\\..+");
- Use jsoup selectors first to narrow down the search space before applying regex
- Consider jsoup's built-in pseudo-selectors like :contains() for simple text matching:
// Instead of regex for simple contains
Elements elements = doc.select("p:contains(error)");
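The pre-compilation tip could look like the following sketch. It uses only the JDK and assumes the input strings have already been pulled out of the document, for example with Element.text(). The class name, method name, and email pattern are illustrative choices, not a recommended validator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class PrecompiledPatternSketch {
    // Compiled once when the class loads and reused for every call
    private static final Pattern EMAIL_PATTERN = Pattern.compile("\\S+@\\S+\\.\\S+");

    // Keeps only the strings that contain something shaped like an email address
    static List<String> keepEmailLike(List<String> texts) {
        List<String> kept = new ArrayList<>();
        for (String text : texts) {
            if (EMAIL_PATTERN.matcher(text).find()) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> texts = List.of("Contact: user@example.com", "Invalid data", "Price: $29.99");
        System.out.println(keepEmailLike(texts)); // [Contact: user@example.com]
    }
}
```

A Pattern is immutable and safe to share between threads, whereas each Matcher is created per input and should not be shared.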
Common Use Cases
- Data validation: Verify formats (emails, phone numbers, prices)
- Content extraction: Extract specific patterns from text
- URL filtering: Match specific URL patterns in href attributes
- Text processing: Clean and standardize extracted content
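As an illustration of the text-processing case, the sketch below standardizes two kinds of extracted strings using only the JDK. The helper names, the US-style phone format, and the dollar price format with optional thousands separators are assumptions made for the example.

```java
import java.math.BigDecimal;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextCleanupSketch {
    private static final Pattern NON_DIGIT = Pattern.compile("\\D");
    // A dollar sign, then digits with optional commas, then optional cents
    private static final Pattern PRICE = Pattern.compile("\\$([\\d,]+(?:\\.\\d{2})?)");

    // Reduces a formatted phone number to its digits
    static String digitsOnly(String text) {
        return NON_DIGIT.matcher(text).replaceAll("");
    }

    // Extracts the first dollar amount as a BigDecimal, or empty if none is present
    static Optional<BigDecimal> parsePrice(String text) {
        Matcher m = PRICE.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new BigDecimal(m.group(1).replace(",", "")));
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("(555) 123-4567"));   // 5551234567
        System.out.println(parsePrice("Price: $1,299.50")); // Optional[1299.50]
        System.out.println(parsePrice("No price"));         // Optional.empty
    }
}
```

BigDecimal is used here instead of double so that currency values keep their exact decimal representation.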
Regular expressions with jsoup provide powerful capabilities for precise HTML parsing and data extraction when CSS selectors alone aren't sufficient.