Is there a way to use regular expressions with jsoup selectors?

Yes, there are two ways to use regular expressions with jsoup selectors:

Built-in `:matches` Pseudo-Selector

jsoup provides a built-in `:matches(regex)` pseudo-selector that selects elements whose text contains a match for a regular expression:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupRegexExample {
    public static void main(String[] args) {
        String html = """
            <div>
                <p>Price: $29.99</p>
                <p>Contact: user@example.com</p>
                <p>Phone: (555) 123-4567</p>
                <p>Invalid data</p>
            </div>
            """;

        Document doc = Jsoup.parse(html);

        // Select elements containing email addresses
        Elements emails = doc.select("p:matches(.*@.*\\..*)");
        System.out.println("Email elements: " + emails.size());

        // Select elements containing prices
        Elements prices = doc.select("p:matches(.*\\$\\d+\\.\\d{2}.*)");
        System.out.println("Price elements: " + prices.size());

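        // Select elements with phone numbers; the parentheses are escaped
        // so they match the literal "(555)" instead of forming a regex group
        Elements phones = doc.select("p:matches(.*\\(\\d{3}\\)\\s\\d{3}-\\d{4}.*)");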
        System.out.println("Phone elements: " + phones.size());
    }
}

Important Notes for `:matches`

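Escape backslashes in Java strings: the regex `\d` is written as `\\d`
The regex is tested against the element's combined text, including the text of its descendants; use `:matchesOwn(regex)` to test only the element's own text
jsoup searches for the pattern anywhere in that text (`find()` semantics), so the leading and trailing `.*` in the examples above are optional
Case-sensitive by default; prefix the pattern with `(?i)` for case-insensitive matching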
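
The escaping and matching behavior can be checked with plain `java.util.regex`, without jsoup. The following is a minimal sketch: the class `RegexNotesDemo` and its helper methods are hypothetical names introduced only for illustration. It shows how the regex `\$\d+\.\d{2}` is written as a Java string, how `find()` (which jsoup uses for `:matches`) differs from a whole-string `matches()`, and how the inline `(?i)` flag enables case-insensitive matching.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of jsoup.
final class RegexNotesDemo {

    // The regex \$\d+\.\d{2} as a Java string literal:
    // every regex backslash is doubled.
    static final Pattern PRICE = Pattern.compile("\\$\\d+\\.\\d{2}");

    // True if the pattern occurs anywhere in the text (substring search).
    static boolean foundAnywhere(Pattern pattern, String text) {
        return pattern.matcher(text).find();
    }

    // True only if the pattern covers the whole text.
    static boolean matchesWhole(Pattern pattern, String text) {
        return pattern.matcher(text).matches();
    }

    public static void main(String[] args) {
        String text = "Price: $29.99";
        System.out.println(foundAnywhere(PRICE, text)); // true
        System.out.println(matchesWhole(PRICE, text));  // false

        // (?i) makes the rest of the pattern case-insensitive.
        Pattern error = Pattern.compile("(?i)error");
        System.out.println(foundAnywhere(error, "Fatal ERROR in parser")); // true
    }
}
```

Because `find()` only needs the pattern to occur somewhere in the text, a selector pattern does not have to be wrapped in `.*` on both sides.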

Advanced Regex Integration with Java

For more complex scenarios, combine jsoup selectors with Java's regex capabilities:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.ArrayList;
import java.util.List;

public class AdvancedJsoupRegex {
    public static void main(String[] args) {
        String html = """
            <table>
                <tr><td>Product A</td><td>$25.99</td><td>SKU-001</td></tr>
                <tr><td>Product B</td><td>$15.50</td><td>SKU-002</td></tr>
                <tr><td>Invalid</td><td>No price</td><td>INVALID</td></tr>
            </table>
            """;

        Document doc = Jsoup.parse(html);

        // Step 1: Use jsoup to narrow down selection
        Elements rows = doc.select("tr");

        // Step 2: Apply regex for validation and extraction
        Pattern pricePattern = Pattern.compile("\\$(\\d+\\.\\d{2})");
        Pattern skuPattern = Pattern.compile("SKU-(\\d{3})");

        List<Product> products = new ArrayList<>();

        for (Element row : rows) {
            Elements cells = row.select("td");
            if (cells.size() == 3) {
                String name = cells.get(0).text();
                String priceText = cells.get(1).text();
                String skuText = cells.get(2).text();

                Matcher priceMatcher = pricePattern.matcher(priceText);
                Matcher skuMatcher = skuPattern.matcher(skuText);

                if (priceMatcher.find() && skuMatcher.find()) {
                    double price = Double.parseDouble(priceMatcher.group(1));
                    int sku = Integer.parseInt(skuMatcher.group(1));
                    products.add(new Product(name, price, sku));
                    System.out.println("Valid product: " + name + " - $" + price);
                }
            }
        }
    }

    static class Product {
        String name;
        double price;
        int sku;

        Product(String name, double price, int sku) {
            this.name = name;
            this.price = price;
            this.sku = sku;
        }
    }
}

Attribute Matching with Regex

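You can also apply regex to element attributes, either in Java code after selecting the elements or directly in a selector with jsoup's `[attr~=regex]` syntax, for example `img[src~=(?i)\.(png|jpe?g)]`: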

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeRegexExample {
    public static void main(String[] args) {
        String html = """
            <div>
                <img src="image1.jpg" alt="Photo">
                <img src="icon.png" alt="Icon">
                <img src="document.pdf" alt="Document">
                <a href="https://example.com">Link</a>
                <a href="mailto:test@example.com">Email</a>
            </div>
            """;

        Document doc = Jsoup.parse(html);

        // Select all images
        Elements images = doc.select("img");
        Pattern imagePattern = Pattern.compile(".*\\.(jpg|jpeg|png|gif)$", Pattern.CASE_INSENSITIVE);

        for (Element img : images) {
            String src = img.attr("src");
            if (imagePattern.matcher(src).matches()) {
                System.out.println("Valid image: " + src);
            }
        }

        // Select all links
        Elements links = doc.select("a[href]");
        Pattern emailPattern = Pattern.compile("^mailto:(.+@.+\\..+)$");

        for (Element link : links) {
            String href = link.attr("href");
            Matcher emailMatcher = emailPattern.matcher(href);
            if (emailMatcher.matches()) {
                String email = emailMatcher.group(1);
                System.out.println("Email found: " + email);
            }
        }
    }
}

Performance Tips

Pre-compile patterns for better performance:

   private static final Pattern EMAIL_PATTERN = Pattern.compile(".*@.*\\..+");

Use jsoup selectors first to narrow down the search space before applying regex
Consider jsoup's built-in pseudo-selectors like :contains() for simple text matching:

   // Instead of regex for simple contains
   Elements elements = doc.select("p:contains(error)");
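
As a rough sketch of the first tip, a pattern compiled once into a `static final` field can be reused for every string, whereas `String.matches` compiles its regex again on each call. The `EmailFilter` class and its `filterEmails` method are hypothetical names used for illustration, and the email regex is deliberately loose rather than a full address validator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; not part of jsoup.
final class EmailFilter {

    // Compiled once when the class is loaded, then reused on every call.
    private static final Pattern EMAIL_PATTERN =
            Pattern.compile("[^@\\s]+@[^@\\s]+\\.[^@\\s]+");

    // Returns the input strings that contain something shaped like an email address.
    static List<String> filterEmails(List<String> texts) {
        List<String> result = new ArrayList<>();
        for (String text : texts) {
            if (EMAIL_PATTERN.matcher(text).find()) {
                result.add(text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> texts = List.of(
                "Contact: user@example.com", "Invalid data", "Price: $29.99");
        System.out.println(filterEmails(texts)); // [Contact: user@example.com]
    }
}
```

In a scraper, the strings passed in would typically come from `Element.text()` on elements that a jsoup selector has already narrowed down.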

Common Use Cases

Data validation: Verify formats (emails, phone numbers, prices)
Content extraction: Extract specific patterns from text
URL filtering: Match specific URL patterns in href attributes
Text processing: Clean and standardize extracted content
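
For the text-processing case, a minimal sketch of cleaning strings after they have been extracted might look like the following. The `TextCleaner` class and both method names are hypothetical, and the rules (collapse whitespace, keep only digits and the decimal point in prices) are assumptions about what "standardize" means for a given data set.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of jsoup.
final class TextCleaner {

    // Runs of whitespace, including the non-breaking space U+00A0.
    private static final Pattern WHITESPACE = Pattern.compile("[\\s\\u00A0]+");

    // Every character except digits and the decimal point.
    private static final Pattern NON_NUMERIC = Pattern.compile("[^0-9.]");

    // Collapses each whitespace run to a single space and trims the ends.
    static String normalizeWhitespace(String raw) {
        return WHITESPACE.matcher(raw).replaceAll(" ").trim();
    }

    // Strips currency symbols and thousands separators from a price string.
    static String normalizePrice(String raw) {
        return NON_NUMERIC.matcher(raw).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(normalizeWhitespace("  Product\u00A0A \n  deluxe ")); // Product A deluxe
        System.out.println(normalizePrice("$1,299.50")); // 1299.50
    }
}
```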

Combining regular expressions with jsoup lets you filter elements and extract data precisely when CSS selectors alone aren't sufficient.
