How do I extract specific text patterns from HTML using jsoup?
jsoup is a Java library for parsing and manipulating HTML documents, which makes it a strong choice for extracting specific text patterns from web pages. Whether you need to extract email addresses, phone numbers, prices, or other structured data, jsoup offers several approaches to accomplish the task efficiently.
Understanding Text Pattern Extraction in jsoup
Text pattern extraction involves identifying and retrieving specific data formats from HTML content. jsoup offers several methods to achieve this:
- CSS Selector-based extraction - Using CSS selectors to target specific elements
- Regular expression matching - Applying regex patterns to extracted text
- Attribute-based filtering - Extracting patterns from element attributes
- Combined approaches - Using multiple techniques together
Basic Setup and Dependencies
First, add jsoup to your Java project:
Maven Dependency
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>
Gradle Dependency
implementation 'org.jsoup:jsoup:1.17.2'
Extracting Text Patterns with CSS Selectors
The most straightforward approach is using CSS selectors to target specific elements containing your desired patterns.
Example 1: Extracting Email Addresses
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.ArrayList;
import java.util.List;

public class EmailExtractor {
    private static final String EMAIL_PATTERN =
        "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b";

    public static List<String> extractEmails(String html) {
        Document doc = Jsoup.parse(html);
        List<String> emails = new ArrayList<>();
        Pattern pattern = Pattern.compile(EMAIL_PATTERN);

        // Extract from all text content
        String allText = doc.text();
        Matcher matcher = pattern.matcher(allText);
        while (matcher.find()) {
            emails.add(matcher.group());
        }

        // Also check href attributes in anchor tags
        Elements links = doc.select("a[href*=mailto:]");
        for (Element link : links) {
            String href = link.attr("href");
            if (href.startsWith("mailto:")) {
                String email = href.substring(7); // Remove "mailto:"
                if (!emails.contains(email)) {
                    emails.add(email);
                }
            }
        }
        return emails;
    }
}
Example 2: Extracting Phone Numbers
import java.io.IOException;

public class PhoneExtractor {
    private static final String PHONE_PATTERN =
        "\\b(?:\\+?1[-.]?)?\\(?([0-9]{3})\\)?[-.]?([0-9]{3})[-.]?([0-9]{4})\\b";

    public static List<String> extractPhoneNumbers(String url) throws IOException {
        Document doc = Jsoup.connect(url).get();
        List<String> phoneNumbers = new ArrayList<>();
        Pattern pattern = Pattern.compile(PHONE_PATTERN);

        // Search in specific elements that commonly contain phone numbers
        Elements contactElements = doc.select(".contact, .phone, .tel, [class*=contact], [class*=phone]");
        for (Element element : contactElements) {
            String text = element.text();
            Matcher matcher = pattern.matcher(text);
            while (matcher.find()) {
                phoneNumbers.add(matcher.group().trim());
            }
        }

        // Also check tel: links
        Elements telLinks = doc.select("a[href^=tel:]");
        for (Element link : telLinks) {
            String tel = link.attr("href").substring(4); // Remove "tel:"
            phoneNumbers.add(tel);
        }
        return phoneNumbers;
    }
}
Advanced Pattern Extraction Techniques
Extracting Prices and Currency Values
public class PriceExtractor {
    // The leading "$" is optional so prices written without a symbol still match
    private static final String PRICE_PATTERN =
        "\\$?([0-9]{1,3}(?:,?[0-9]{3})*(?:\\.[0-9]{2})?)";

    public static List<String> extractPrices(Document doc) {
        List<String> prices = new ArrayList<>();
        Pattern pattern = Pattern.compile(PRICE_PATTERN);

        // Look in elements commonly containing prices
        Elements priceElements = doc.select(
            ".price, .cost, .amount, [class*=price], [class*=cost], " +
            "[data-price], .currency, .money"
        );
        for (Element element : priceElements) {
            String text = element.text();
            Matcher matcher = pattern.matcher(text);
            while (matcher.find()) {
                prices.add(matcher.group());
            }
        }
        return prices;
    }
}
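The matched strings still contain currency symbols and thousands separators. A small helper (the class name is illustrative, not part of jsoup) can normalize them into values suitable for arithmetic:

```java
import java.math.BigDecimal;

public class PriceNormalizer {
    // Convert a matched price string like "$1,299.99" into a BigDecimal
    public static BigDecimal normalize(String raw) {
        String cleaned = raw.replace("$", "").replace(",", "").trim();
        return new BigDecimal(cleaned);
    }
}
```

Using `BigDecimal` rather than `double` avoids binary floating-point rounding when the values are later summed or compared.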
Extracting URLs and Links
public class URLExtractor {
    private static final String URL_PATTERN =
        "https?://[\\w\\-]+(\\.[\\w\\-]+)+([\\w\\-\\.,@?^=%&:/~\\+#]*[\\w\\-\\@?^=%&/~\\+#])?";

    public static List<String> extractURLs(Document doc) {
        List<String> urls = new ArrayList<>();
        Pattern pattern = Pattern.compile(URL_PATTERN);

        // Extract from href attributes
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            String href = link.attr("abs:href"); // Resolve to an absolute URL
            if (!href.isEmpty()) {
                urls.add(href);
            }
        }

        // Extract URLs from text content
        String allText = doc.text();
        Matcher matcher = pattern.matcher(allText);
        while (matcher.find()) {
            String url = matcher.group();
            if (!urls.contains(url)) {
                urls.add(url);
            }
        }
        return urls;
    }
}
Working with Structured Data
Extracting Dates
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class DateExtractor {
    private static final List<String> DATE_PATTERNS = Arrays.asList(
        "\\b(\\d{1,2})/(\\d{1,2})/(\\d{4})\\b",  // MM/dd/yyyy
        "\\b(\\d{4})-(\\d{1,2})-(\\d{1,2})\\b",  // yyyy-MM-dd
        "\\b(\\d{1,2})\\s+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\\s+(\\d{4})\\b" // dd MMM yyyy
    );

    public static List<String> extractDates(Document doc) {
        List<String> dates = new ArrayList<>();
        String text = doc.text();
        for (String patternStr : DATE_PATTERNS) {
            Pattern pattern = Pattern.compile(patternStr, Pattern.CASE_INSENSITIVE);
            Matcher matcher = pattern.matcher(text);
            while (matcher.find()) {
                dates.add(matcher.group());
            }
        }

        // Also check datetime attributes
        Elements timeElements = doc.select("time[datetime]");
        for (Element timeElement : timeElements) {
            String datetime = timeElement.attr("datetime");
            if (!datetime.isEmpty()) {
                dates.add(datetime);
            }
        }
        return dates;
    }
}
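The extracted dates are still raw strings. If you need actual date values, they can be parsed with `java.time`; the sketch below assumes formats matching the three regexes above (the class and method names are illustrative):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;

public class DateParser {
    // One formatter per regex above; single-letter patterns (M, d)
    // accept both one- and two-digit fields
    private static final DateTimeFormatter[] FORMATTERS = {
        DateTimeFormatter.ofPattern("M/d/yyyy"),
        DateTimeFormatter.ofPattern("yyyy-M-d"),
        DateTimeFormatter.ofPattern("d MMM yyyy", Locale.ENGLISH)
    };

    public static Optional<LocalDate> parse(String raw) {
        for (DateTimeFormatter formatter : FORMATTERS) {
            try {
                return Optional.of(LocalDate.parse(raw.trim(), formatter));
            } catch (DateTimeParseException e) {
                // Not this format; try the next one
            }
        }
        return Optional.empty();
    }
}
```

Returning `Optional.empty()` for unparseable strings lets callers decide whether a failed parse should be skipped or logged.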
Combining jsoup with Advanced Text Processing
Custom Pattern Extractor Class
import java.io.IOException;

public class CustomPatternExtractor {
    private final Document document;

    public CustomPatternExtractor(String html) {
        this.document = Jsoup.parse(html);
    }

    public CustomPatternExtractor(String url, boolean isUrl) throws IOException {
        if (isUrl) {
            this.document = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
                .timeout(10000)
                .get();
        } else {
            this.document = Jsoup.parse(url);
        }
    }

    public List<String> extractPattern(String regex, String cssSelector) {
        List<String> results = new ArrayList<>();
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        Elements elements = cssSelector.isEmpty()
            ? document.select("body")
            : document.select(cssSelector);
        for (Element element : elements) {
            String text = element.text();
            Matcher matcher = pattern.matcher(text);
            while (matcher.find()) {
                String match = matcher.group().trim();
                if (!results.contains(match)) {
                    results.add(match);
                }
            }
        }
        return results;
    }

    public Map<String, List<String>> extractMultiplePatterns(Map<String, String> patterns) {
        Map<String, List<String>> results = new HashMap<>();
        for (Map.Entry<String, String> entry : patterns.entrySet()) {
            String name = entry.getKey();
            String regex = entry.getValue();
            results.put(name, extractPattern(regex, ""));
        }
        return results;
    }
}
Performance Optimization Tips
Efficient Element Selection
public class OptimizedExtractor {
    public static List<String> extractEmailsOptimized(Document doc) {
        List<String> emails = new ArrayList<>();
        Pattern emailPattern = Pattern.compile(
            "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b"
        );

        // Target specific elements likely to contain emails
        Elements targetElements = doc.select(
            "a[href*=@], .contact, .email, [class*=contact], " +
            "[class*=email], footer, .footer, #contact"
        );
        // If no specific elements found, fall back to the whole body
        if (targetElements.isEmpty()) {
            targetElements = doc.select("body");
        }

        for (Element element : targetElements) {
            Matcher matcher = emailPattern.matcher(element.text());
            while (matcher.find()) {
                String email = matcher.group();
                if (!emails.contains(email)) {
                    emails.add(email);
                }
            }
        }
        return emails;
    }
}
Practical Example: Complete Data Extraction
public class WebDataExtractor {
    public static void main(String[] args) {
        try {
            String url = "https://example.com/contact";
            Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; WebScraper/1.0)")
                .timeout(10000)
                .get();

            // Extract various patterns
            CustomPatternExtractor extractor = new CustomPatternExtractor(doc.html());
            Map<String, String> patterns = new HashMap<>();
            patterns.put("emails", "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b");
            patterns.put("phones", "\\b(?:\\+?1[-.]?)?\\(?([0-9]{3})\\)?[-.]?([0-9]{3})[-.]?([0-9]{4})\\b");
            patterns.put("prices", "\\$?([0-9]{1,3}(?:,?[0-9]{3})*(?:\\.[0-9]{2})?)");

            Map<String, List<String>> results = extractor.extractMultiplePatterns(patterns);

            // Print results
            for (Map.Entry<String, List<String>> entry : results.entrySet()) {
                System.out.println(entry.getKey().toUpperCase() + ":");
                for (String value : entry.getValue()) {
                    System.out.println("  - " + value);
                }
                System.out.println();
            }
        } catch (IOException e) {
            System.err.println("Error fetching page: " + e.getMessage());
        }
    }
}
Error Handling and Best Practices
Robust Pattern Extraction
public class RobustExtractor {
    private static final int MAX_RETRIES = 3;
    private static final int TIMEOUT_MS = 10000;

    public static List<String> safeExtractPattern(String url, String pattern, String selector) {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                Document doc = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0 (compatible; WebScraper/1.0)")
                    .timeout(TIMEOUT_MS)
                    .followRedirects(true)
                    .get();
                return extractPatternFromDocument(doc, pattern, selector);
            } catch (IOException e) {
                System.err.println("Attempt " + attempt + " failed: " + e.getMessage());
                if (attempt == MAX_RETRIES) {
                    System.err.println("All attempts failed for URL: " + url);
                    return new ArrayList<>();
                }
                // Wait before retrying, backing off linearly
                try {
                    Thread.sleep(1000L * attempt);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return new ArrayList<>();
                }
            }
        }
        return new ArrayList<>();
    }

    private static List<String> extractPatternFromDocument(Document doc, String regex, String selector) {
        List<String> results = new ArrayList<>();
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        Elements elements = selector.isEmpty() ? doc.select("body") : doc.select(selector);
        for (Element element : elements) {
            Matcher matcher = pattern.matcher(element.text());
            while (matcher.find()) {
                results.add(matcher.group().trim());
            }
        }
        return results;
    }
}
JavaScript Patterns in Static HTML
Sometimes you need to extract data from JavaScript variables embedded in HTML. Here's how to handle that:
public class JavaScriptPatternExtractor {
    public static String extractJavaScriptVariable(Document doc, String variableName) {
        Elements scripts = doc.select("script");
        for (Element script : scripts) {
            // data() returns the raw text of a script/style element
            String scriptContent = script.data();
            // Pattern to match: var variableName = "value" or let variableName = 'value'
            String pattern = "(?:var|let|const)\\s+" + variableName + "\\s*=\\s*['\"]([^'\"]*)['\"]";
            Pattern regex = Pattern.compile(pattern);
            Matcher matcher = regex.matcher(scriptContent);
            if (matcher.find()) {
                return matcher.group(1);
            }
        }
        return null;
    }

    public static List<String> extractJsonData(Document doc, String jsonVariableName) {
        List<String> results = new ArrayList<>();
        Elements scripts = doc.select("script");
        for (Element script : scripts) {
            String scriptContent = script.data();
            // Look for JSON objects assigned to variables; note this simple
            // pattern only handles flat objects, not nested braces
            String pattern = jsonVariableName + "\\s*=\\s*(\\{[^}]*\\})";
            Pattern regex = Pattern.compile(pattern, Pattern.DOTALL);
            Matcher matcher = regex.matcher(scriptContent);
            while (matcher.find()) {
                results.add(matcher.group(1));
            }
        }
        return results;
    }
}
Integration with Web Scraping Workflows
jsoup's pattern extraction capabilities combine well with other web scraping tools. For JavaScript-heavy sites, you can first render the page with a tool like Puppeteer to resolve dynamic content, then hand the rendered HTML to jsoup for efficient pattern extraction.
For large-scale scraping projects, implement proper error handling and rate limiting to keep data extraction workflows stable.
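Rate limiting can be as simple as enforcing a minimum interval between requests. The sketch below is a minimal, hypothetical helper (not part of jsoup; the class name and interval are illustrative):

```java
public class RateLimiter {
    private final long minIntervalMillis;
    private long lastRequestAt = 0L;

    public RateLimiter(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // How long to wait before the next request, given the current time;
    // separated from acquire() so the logic is testable without sleeping
    public synchronized long millisToWait(long nowMillis) {
        long elapsed = nowMillis - lastRequestAt;
        return Math.max(0, minIntervalMillis - elapsed);
    }

    // Block until a request is allowed, then record it
    public synchronized void acquire() throws InterruptedException {
        long wait = millisToWait(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
        lastRequestAt = System.currentTimeMillis();
    }
}
```

Calling `acquire()` before each `Jsoup.connect(url).get()` keeps requests at or below one per interval.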
Common Use Cases and Examples
Social Media Content Extraction
public class SocialMediaExtractor {
    public static Map<String, String> extractSocialMetaTags(Document doc) {
        Map<String, String> socialData = new HashMap<>();

        // Extract Open Graph tags
        Elements ogTags = doc.select("meta[property^=og:]");
        for (Element tag : ogTags) {
            String property = tag.attr("property");
            String content = tag.attr("content");
            socialData.put(property, content);
        }

        // Extract Twitter Card tags
        Elements twitterTags = doc.select("meta[name^=twitter:]");
        for (Element tag : twitterTags) {
            String name = tag.attr("name");
            String content = tag.attr("content");
            socialData.put(name, content);
        }
        return socialData;
    }
}
Product Information Extraction
public class ProductExtractor {
    public static Map<String, String> extractProductInfo(Document doc) {
        Map<String, String> productData = new HashMap<>();

        // Extract product price from the first matching selector
        List<String> priceSelectors = Arrays.asList(
            ".price", ".cost", "[class*=price]", "[data-price]", ".product-price"
        );
        for (String selector : priceSelectors) {
            Elements priceElements = doc.select(selector);
            if (!priceElements.isEmpty()) {
                productData.put("price", priceElements.first().text());
                break;
            }
        }

        // Extract product SKU/ID
        Pattern skuPattern = Pattern.compile("SKU:?\\s*([A-Z0-9\\-]+)", Pattern.CASE_INSENSITIVE);
        Matcher skuMatcher = skuPattern.matcher(doc.text());
        if (skuMatcher.find()) {
            productData.put("sku", skuMatcher.group(1));
        }

        // Extract product availability; check the negative phrases first,
        // because "unavailable" also contains the substring "available"
        Elements availabilityElements = doc.select("[class*=stock], [class*=availability]");
        for (Element element : availabilityElements) {
            String text = element.text().toLowerCase();
            if (text.contains("out of stock") || text.contains("unavailable")) {
                productData.put("availability", "out_of_stock");
                break;
            } else if (text.contains("in stock") || text.contains("available")) {
                productData.put("availability", "in_stock");
                break;
            }
        }
        return productData;
    }
}
Conclusion
jsoup provides powerful capabilities for extracting specific text patterns from HTML documents. By combining CSS selectors with regular expressions, you can efficiently target and extract structured data such as emails, phone numbers, prices, and custom patterns. The key to successful pattern extraction is understanding your target data structure and choosing the right combination of selectors and regex patterns for both performance and accuracy.
Key takeaways for effective pattern extraction with JSoup:
- Use targeted selectors: Focus on specific elements likely to contain your target data
- Combine multiple approaches: Use CSS selectors for structure and regex for patterns
- Handle edge cases: Account for different data formats and malformed HTML
- Implement error handling: Use retry logic and graceful degradation
- Optimize performance: Target specific elements rather than parsing entire documents
Remember to handle errors gracefully, implement appropriate timeouts, and respect website terms of service when building production scraping applications. jsoup's flexibility makes it an excellent choice for both simple pattern extraction tasks and complex data mining operations.