Is there a way to use regular expressions with jsoup selectors?

Yes. There are two ways to use regular expressions with jsoup: directly inside selectors, or by combining selector results with Java's java.util.regex classes.

Built-in :matches Pseudo-Selector

jsoup provides a built-in :matches(regex) pseudo-selector that selects elements whose text, including the text of their descendants, contains a match for a regular expression. The related :matchesOwn(regex) checks only an element's own text:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupRegexExample {
    public static void main(String[] args) {
        String html = """
            <div>
                <p>Price: $29.99</p>
                <p>Contact: user@example.com</p>
                <p>Phone: (555) 123-4567</p>
                <p>Invalid data</p>
            </div>
            """;

        Document doc = Jsoup.parse(html);

        // Select elements containing email addresses
        Elements emails = doc.select("p:matches(.*@.*\\..*)");
        System.out.println("Email elements: " + emails.size());

        // Select elements containing prices
        Elements prices = doc.select("p:matches(.*\\$\\d+\\.\\d{2}.*)");
        System.out.println("Price elements: " + prices.size());

        // Select elements with phone numbers; \\( and \\) match the literal parentheses
        Elements phones = doc.select("p:matches(\\(\\d{3}\\)\\s\\d{3}-\\d{4})");
        System.out.println("Phone elements: " + phones.size());
    }
}

Important Notes for :matches

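  1. Escape backslashes for Java string literals: the regex \d is written "\\d", and a literal dot \. is written "\\."
  2. The pattern only needs to match somewhere in the element's text (find() semantics), so leading and trailing .* are optional; anchor with ^ and $ to require a full match
  3. Matching is case-sensitive by default; prefix the pattern with (?i) to ignore case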
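
The escaping and matching behavior can be checked with plain java.util.regex, without jsoup. This is a minimal sketch, assuming :matches applies the pattern with Matcher.find() semantics; the class and helper names are illustrative and not part of jsoup.

```java
import java.util.regex.Pattern;

public class RegexSemanticsSketch {
    // Illustrative helper (not a jsoup API): true if the pattern occurs anywhere in the text.
    static boolean foundAnywhere(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    // Illustrative helper: true only if the pattern covers the whole text.
    static boolean matchesWhole(String regex, String text) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    public static void main(String[] args) {
        String text = "Price: $29.99";
        // The regex \$\d+\.\d{2} needs doubled backslashes in a Java string literal.
        String priceRegex = "\\$\\d+\\.\\d{2}";

        System.out.println(foundAnywhere(priceRegex, text));  // true: the price occurs inside the text
        System.out.println(matchesWhole(priceRegex, text));   // false: "Price: " is not covered
        System.out.println(foundAnywhere("(?i)price", text)); // true: (?i) ignores case
    }
}
```

Under that assumption, a selector such as p:matches(\\$\\d+\\.\\d{2}) should behave like foundAnywhere here, which is why wrapping the pattern in .* adds nothing.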

Advanced Regex Integration with Java

For more complex scenarios, combine jsoup selectors with Java's regex capabilities:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.ArrayList;
import java.util.List;

public class AdvancedJsoupRegex {
    public static void main(String[] args) {
        String html = """
            <table>
                <tr><td>Product A</td><td>$25.99</td><td>SKU-001</td></tr>
                <tr><td>Product B</td><td>$15.50</td><td>SKU-002</td></tr>
                <tr><td>Invalid</td><td>No price</td><td>INVALID</td></tr>
            </table>
            """;

        Document doc = Jsoup.parse(html);

        // Step 1: Use jsoup to narrow down selection
        Elements rows = doc.select("tr");

        // Step 2: Apply regex for validation and extraction
        Pattern pricePattern = Pattern.compile("\\$(\\d+\\.\\d{2})");
        Pattern skuPattern = Pattern.compile("SKU-(\\d{3})");

        List<Product> products = new ArrayList<>();

        for (Element row : rows) {
            Elements cells = row.select("td");
            if (cells.size() == 3) {
                String name = cells.get(0).text();
                String priceText = cells.get(1).text();
                String skuText = cells.get(2).text();

                Matcher priceMatcher = pricePattern.matcher(priceText);
                Matcher skuMatcher = skuPattern.matcher(skuText);

                if (priceMatcher.find() && skuMatcher.find()) {
                    double price = Double.parseDouble(priceMatcher.group(1));
                    int sku = Integer.parseInt(skuMatcher.group(1));
                    products.add(new Product(name, price, sku));
                    System.out.println("Valid product: " + name + " - $" + price);
                }
            }
        }
    }

    static class Product {
        String name;
        double price;
        int sku;

        Product(String name, double price, int sku) {
            this.name = name;
            this.price = price;
            this.sku = sku;
        }
    }
}

Attribute Matching with Regex

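You can also apply regex to element attributes. jsoup has a built-in [attr~=regex] selector for simple cases, for example doc.select("img[src~=(?i)\\.(png|jpe?g)]"). Filtering attribute values in Java, as in this example, also gives you access to capture groups: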

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeRegexExample {
    public static void main(String[] args) {
        String html = """
            <div>
                <img src="image1.jpg" alt="Photo">
                <img src="icon.png" alt="Icon">
                <img src="document.pdf" alt="Document">
                <a href="https://example.com">Link</a>
                <a href="mailto:test@example.com">Email</a>
            </div>
            """;

        Document doc = Jsoup.parse(html);

        // Select all images
        Elements images = doc.select("img");
        Pattern imagePattern = Pattern.compile(".*\\.(jpg|jpeg|png|gif)$", Pattern.CASE_INSENSITIVE);

        for (Element img : images) {
            String src = img.attr("src");
            if (imagePattern.matcher(src).matches()) {
                System.out.println("Valid image: " + src);
            }
        }

        // Select all links
        Elements links = doc.select("a[href]");
        Pattern emailPattern = Pattern.compile("^mailto:(.+@.+\\..+)$");

        for (Element link : links) {
            String href = link.attr("href");
            Matcher emailMatcher = emailPattern.matcher(href);
            if (emailMatcher.matches()) {
                String email = emailMatcher.group(1);
                System.out.println("Email found: " + email);
            }
        }
    }
}

Performance Tips

  1. Pre-compile patterns for better performance:

   private static final Pattern EMAIL_PATTERN = Pattern.compile(".*@.*\\..+");

  2. Use jsoup selectors first to narrow down the search space before applying regex

  3. Consider jsoup's built-in pseudo-selectors like :contains() for simple text matching:

   // Instead of regex for simple contains
   Elements elements = doc.select("p:contains(error)");
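
As a rough illustration of the pre-compilation tip, the sketch below compiles a pattern once and reuses it across many strings. The strings stand in for values that would normally come from Element.text(); the class name, the helper, and the simplified email pattern are assumptions made for this example.

```java
import java.util.List;
import java.util.regex.Pattern;

public class PrecompiledPatternSketch {
    // Compiled once when the class is initialized, then reused on every call.
    // Simplified email-like pattern (an assumption, not a full validator).
    private static final Pattern EMAIL_PATTERN =
            Pattern.compile("[^@\\s]+@[^@\\s]+\\.[^@\\s]+");

    // Hypothetical helper: counts the strings that contain something email-like.
    static long countEmailLike(List<String> texts) {
        return texts.stream()
                .filter(t -> EMAIL_PATTERN.matcher(t).find())
                .count();
    }

    public static void main(String[] args) {
        List<String> texts = List.of(
                "Contact: user@example.com",
                "Phone: (555) 123-4567",
                "Invalid data");
        System.out.println(countEmailLike(texts)); // prints 1
    }
}
```

Calling Pattern.compile inside the loop would give the same result but recompile the expression for every element, which is the cost the tip is meant to avoid.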

Common Use Cases

  • Data validation: Verify formats (emails, phone numbers, prices)
  • Content extraction: Extract specific patterns from text
  • URL filtering: Match specific URL patterns in href attributes
  • Text processing: Clean and standardize extracted content
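
The extraction and text-processing cases can be sketched with java.util.regex alone. This is a minimal example, assuming the input strings were already pulled out of the document (for instance with Element.text()); the class name, helper names, and sample strings are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextCleanupSketch {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    private static final Pattern PRICE = Pattern.compile("\\$(\\d+\\.\\d{2})");
    private static final Pattern NON_DIGIT = Pattern.compile("\\D");

    // Collapses runs of whitespace into single spaces and trims both ends.
    static String normalizeWhitespace(String raw) {
        return WHITESPACE.matcher(raw).replaceAll(" ").trim();
    }

    // Returns the numeric part of the first dollar price found, if any.
    static Optional<String> extractPrice(String text) {
        Matcher m = PRICE.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Strips everything except digits, e.g. to standardize phone numbers.
    static String digitsOnly(String phone) {
        return NON_DIGIT.matcher(phone).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(normalizeWhitespace("  Price:\n   $29.99 ")); // Price: $29.99
        System.out.println(extractPrice("Price: $29.99").orElse("none")); // 29.99
        System.out.println(extractPrice("No price").orElse("none"));      // none
        System.out.println(digitsOnly("(555) 123-4567"));                 // 5551234567
    }
}
```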

Regular expressions with jsoup provide powerful capabilities for precise HTML parsing and data extraction when CSS selectors alone aren't sufficient.
