How do I extract specific text patterns from HTML using jsoup?
jsoup is a Java library for parsing and manipulating HTML documents, which makes it a strong choice for extracting specific text patterns from web pages. Whether you need to extract email addresses, phone numbers, prices, or other structured data, jsoup offers several approaches to accomplish the task efficiently.
Understanding Text Pattern Extraction in jsoup
Text pattern extraction involves identifying and retrieving specific data formats from HTML content. jsoup offers several methods to achieve this:
- CSS Selector-based extraction - Using CSS selectors to target specific elements
- Regular expression matching - Applying regex patterns to extracted text
- Attribute-based filtering - Extracting patterns from element attributes
- Combined approaches - Using multiple techniques together
Basic Setup and Dependencies
First, add jsoup to your Java project:
Maven Dependency
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>
Gradle Dependency
implementation 'org.jsoup:jsoup:1.17.2'
Extracting Text Patterns with CSS Selectors
The most straightforward approach is using CSS selectors to target specific elements containing your desired patterns.
Example 1: Extracting Email Addresses
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.ArrayList;
import java.util.List;

public class EmailExtractor {
    private static final String EMAIL_PATTERN =
        "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b";

    public static List<String> extractEmails(String html) {
        Document doc = Jsoup.parse(html);
        List<String> emails = new ArrayList<>();
        Pattern pattern = Pattern.compile(EMAIL_PATTERN);

        // Extract from all text content
        String allText = doc.text();
        Matcher matcher = pattern.matcher(allText);
        while (matcher.find()) {
            emails.add(matcher.group());
        }

        // Also check href attributes in anchor tags
        Elements links = doc.select("a[href*=mailto:]");
        for (Element link : links) {
            String href = link.attr("href");
            if (href.startsWith("mailto:")) {
                String email = href.substring(7); // Remove "mailto:"
                if (!emails.contains(email)) {
                    emails.add(email);
                }
            }
        }
        return emails;
    }
}
Example 2: Extracting Phone Numbers
import java.io.IOException;

public class PhoneExtractor {
    private static final String PHONE_PATTERN =
        "\\b(?:\\+?1[-.]?)?\\(?([0-9]{3})\\)?[-.]?([0-9]{3})[-.]?([0-9]{4})\\b";

    public static List<String> extractPhoneNumbers(String url) throws IOException {
        Document doc = Jsoup.connect(url).get();
        List<String> phoneNumbers = new ArrayList<>();
        Pattern pattern = Pattern.compile(PHONE_PATTERN);

        // Search in specific elements that commonly contain phone numbers
        Elements contactElements = doc.select(".contact, .phone, .tel, [class*=contact], [class*=phone]");
        for (Element element : contactElements) {
            String text = element.text();
            Matcher matcher = pattern.matcher(text);
            while (matcher.find()) {
                phoneNumbers.add(matcher.group().trim());
            }
        }

        // Also check tel: links
        Elements telLinks = doc.select("a[href^=tel:]");
        for (Element link : telLinks) {
            String tel = link.attr("href").substring(4); // Remove "tel:"
            phoneNumbers.add(tel);
        }
        return phoneNumbers;
    }
}
Advanced Pattern Extraction Techniques
Extracting Prices and Currency Values
public class PriceExtractor {
    // The leading "$" is optional so prices written without a symbol still match
    private static final String PRICE_PATTERN =
        "\\$?([0-9]{1,3}(?:,?[0-9]{3})*(?:\\.[0-9]{2})?)";

    public static List<String> extractPrices(Document doc) {
        List<String> prices = new ArrayList<>();
        Pattern pattern = Pattern.compile(PRICE_PATTERN);

        // Look in elements commonly containing prices
        Elements priceElements = doc.select(
            ".price, .cost, .amount, [class*=price], [class*=cost], " +
            "[data-price], .currency, .money"
        );
        for (Element element : priceElements) {
            String text = element.text();
            Matcher matcher = pattern.matcher(text);
            while (matcher.find()) {
                prices.add(matcher.group());
            }
        }
        return prices;
    }
}
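The matched strings still contain currency symbols and thousands separators. A small helper (the class name is illustrative, not part of jsoup) can normalize them into values suitable for arithmetic:

```java
import java.math.BigDecimal;

public class PriceNormalizer {
    // Convert a matched price string like "$1,299.99" into a BigDecimal
    public static BigDecimal normalize(String raw) {
        String cleaned = raw.replace("$", "").replace(",", "").trim();
        return new BigDecimal(cleaned);
    }
}
```

Using `BigDecimal` rather than `double` avoids binary floating-point rounding when the values are later summed or compared.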
Extracting URLs and Links
public class URLExtractor {
    private static final String URL_PATTERN =
        "https?://[\\w\\-]+(\\.[\\w\\-]+)+([\\w\\-\\.,@?^=%&:/~\\+#]*[\\w\\-\\@?^=%&/~\\+#])?";

    public static List<String> extractURLs(Document doc) {
        List<String> urls = new ArrayList<>();
        Pattern pattern = Pattern.compile(URL_PATTERN);

        // Extract from href attributes
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            String href = link.attr("abs:href"); // Resolve to an absolute URL
            if (!href.isEmpty()) {
                urls.add(href);
            }
        }

        // Extract URLs from text content
        String allText = doc.text();
        Matcher matcher = pattern.matcher(allText);
        while (matcher.find()) {
            String url = matcher.group();
            if (!urls.contains(url)) {
                urls.add(url);
            }
        }
        return urls;
    }
}
Working with Structured Data
Extracting Dates
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class DateExtractor {
    private static final List<String> DATE_PATTERNS = Arrays.asList(
        "\\b(\\d{1,2})/(\\d{1,2})/(\\d{4})\\b",  // MM/dd/yyyy
        "\\b(\\d{4})-(\\d{1,2})-(\\d{1,2})\\b",  // yyyy-MM-dd
        "\\b(\\d{1,2})\\s+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\\s+(\\d{4})\\b" // dd MMM yyyy
    );

    public static List<String> extractDates(Document doc) {
        List<String> dates = new ArrayList<>();
        String text = doc.text();
        for (String patternStr : DATE_PATTERNS) {
            Pattern pattern = Pattern.compile(patternStr, Pattern.CASE_INSENSITIVE);
            Matcher matcher = pattern.matcher(text);
            while (matcher.find()) {
                dates.add(matcher.group());
            }
        }

        // Also check datetime attributes
        Elements timeElements = doc.select("time[datetime]");
        for (Element timeElement : timeElements) {
            String datetime = timeElement.attr("datetime");
            if (!datetime.isEmpty()) {
                dates.add(datetime);
            }
        }
        return dates;
    }
}
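The extracted dates are still raw strings. If you need actual date values, they can be parsed with `java.time`; the sketch below assumes formats matching the three regexes above (the class and method names are illustrative):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;

public class DateParser {
    // One formatter per regex above; single-letter patterns (M, d)
    // accept both one- and two-digit fields
    private static final DateTimeFormatter[] FORMATTERS = {
        DateTimeFormatter.ofPattern("M/d/yyyy"),
        DateTimeFormatter.ofPattern("yyyy-M-d"),
        DateTimeFormatter.ofPattern("d MMM yyyy", Locale.ENGLISH)
    };

    public static Optional<LocalDate> parse(String raw) {
        for (DateTimeFormatter formatter : FORMATTERS) {
            try {
                return Optional.of(LocalDate.parse(raw.trim(), formatter));
            } catch (DateTimeParseException e) {
                // Not this format; try the next one
            }
        }
        return Optional.empty();
    }
}
```

Returning `Optional.empty()` for unparseable strings lets callers decide whether a failed parse should be skipped or logged.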
Combining jsoup with Advanced Text Processing
Custom Pattern Extractor Class
import java.io.IOException;

public class CustomPatternExtractor {
    private final Document document;

    public CustomPatternExtractor(String html) {
        this.document = Jsoup.parse(html);
    }

    public CustomPatternExtractor(String url, boolean isUrl) throws IOException {
        if (isUrl) {
            this.document = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
                .timeout(10000)
                .get();
        } else {
            this.document = Jsoup.parse(url);
        }
    }

    public List<String> extractPattern(String regex, String cssSelector) {
        List<String> results = new ArrayList<>();
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        Elements elements = cssSelector.isEmpty()
            ? document.select("body")
            : document.select(cssSelector);
        for (Element element : elements) {
            String text = element.text();
            Matcher matcher = pattern.matcher(text);
            while (matcher.find()) {
                String match = matcher.group().trim();
                if (!results.contains(match)) {
                    results.add(match);
                }
            }
        }
        return results;
    }

    public Map<String, List<String>> extractMultiplePatterns(Map<String, String> patterns) {
        Map<String, List<String>> results = new HashMap<>();
        for (Map.Entry<String, String> entry : patterns.entrySet()) {
            String name = entry.getKey();
            String regex = entry.getValue();
            results.put(name, extractPattern(regex, ""));
        }
        return results;
    }
}
Performance Optimization Tips
Efficient Element Selection
public class OptimizedExtractor {
    public static List<String> extractEmailsOptimized(Document doc) {
        List<String> emails = new ArrayList<>();
        Pattern emailPattern = Pattern.compile(
            "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b"
        );

        // Target specific elements likely to contain emails
        Elements targetElements = doc.select(
            "a[href*=@], .contact, .email, [class*=contact], " +
            "[class*=email], footer, .footer, #contact"
        );
        // If no specific elements found, fall back to the whole body
        if (targetElements.isEmpty()) {
            targetElements = doc.select("body");
        }

        for (Element element : targetElements) {
            Matcher matcher = emailPattern.matcher(element.text());
            while (matcher.find()) {
                String email = matcher.group();
                if (!emails.contains(email)) {
                    emails.add(email);
                }
            }
        }
        return emails;
    }
}
Practical Example: Complete Data Extraction
public class WebDataExtractor {
    public static void main(String[] args) {
        try {
            String url = "https://example.com/contact";
            Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; WebScraper/1.0)")
                .timeout(10000)
                .get();

            // Extract various patterns
            CustomPatternExtractor extractor = new CustomPatternExtractor(doc.html());
            Map<String, String> patterns = new HashMap<>();
            patterns.put("emails", "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b");
            patterns.put("phones", "\\b(?:\\+?1[-.]?)?\\(?([0-9]{3})\\)?[-.]?([0-9]{3})[-.]?([0-9]{4})\\b");
            patterns.put("prices", "\\$?([0-9]{1,3}(?:,?[0-9]{3})*(?:\\.[0-9]{2})?)");

            Map<String, List<String>> results = extractor.extractMultiplePatterns(patterns);

            // Print results
            for (Map.Entry<String, List<String>> entry : results.entrySet()) {
                System.out.println(entry.getKey().toUpperCase() + ":");
                for (String value : entry.getValue()) {
                    System.out.println("  - " + value);
                }
                System.out.println();
            }
        } catch (IOException e) {
            System.err.println("Error fetching page: " + e.getMessage());
        }
    }
}
Error Handling and Best Practices
Robust Pattern Extraction
public class RobustExtractor {
    private static final int MAX_RETRIES = 3;
    private static final int TIMEOUT_MS = 10000;

    public static List<String> safeExtractPattern(String url, String pattern, String selector) {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                Document doc = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0 (compatible; WebScraper/1.0)")
                    .timeout(TIMEOUT_MS)
                    .followRedirects(true)
                    .get();
                return extractPatternFromDocument(doc, pattern, selector);
            } catch (IOException e) {
                System.err.println("Attempt " + attempt + " failed: " + e.getMessage());
                if (attempt == MAX_RETRIES) {
                    System.err.println("All attempts failed for URL: " + url);
                    return new ArrayList<>();
                }
                // Wait before retrying, backing off linearly
                try {
                    Thread.sleep(1000L * attempt);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return new ArrayList<>();
                }
            }
        }
        return new ArrayList<>();
    }

    private static List<String> extractPatternFromDocument(Document doc, String regex, String selector) {
        List<String> results = new ArrayList<>();
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        Elements elements = selector.isEmpty() ? doc.select("body") : doc.select(selector);
        for (Element element : elements) {
            Matcher matcher = pattern.matcher(element.text());
            while (matcher.find()) {
                results.add(matcher.group().trim());
            }
        }
        return results;
    }
}
JavaScript Patterns in Static HTML
Sometimes you need to extract data from JavaScript variables embedded in HTML. Here's how to handle that:
public class JavaScriptPatternExtractor {
    public static String extractJavaScriptVariable(Document doc, String variableName) {
        Elements scripts = doc.select("script");
        for (Element script : scripts) {
            // data() returns the raw text of a script/style element
            String scriptContent = script.data();
            // Pattern to match: var variableName = "value" or let variableName = 'value'
            String pattern = "(?:var|let|const)\\s+" + variableName + "\\s*=\\s*['\"]([^'\"]*)['\"]";
            Pattern regex = Pattern.compile(pattern);
            Matcher matcher = regex.matcher(scriptContent);
            if (matcher.find()) {
                return matcher.group(1);
            }
        }
        return null;
    }

    public static List<String> extractJsonData(Document doc, String jsonVariableName) {
        List<String> results = new ArrayList<>();
        Elements scripts = doc.select("script");
        for (Element script : scripts) {
            String scriptContent = script.data();
            // Look for JSON objects assigned to variables; note this simple
            // pattern only handles flat objects, not nested braces
            String pattern = jsonVariableName + "\\s*=\\s*(\\{[^}]*\\})";
            Pattern regex = Pattern.compile(pattern, Pattern.DOTALL);
            Matcher matcher = regex.matcher(scriptContent);
            while (matcher.find()) {
                results.add(matcher.group(1));
            }
        }
        return results;
    }
}
Integration with Web Scraping Workflows
jsoup's pattern extraction capabilities combine well with other web scraping tools. For JavaScript-heavy sites, you can first render the page with a tool like Puppeteer to resolve dynamic content, then hand the rendered HTML to jsoup for efficient pattern extraction.
For large-scale scraping projects, implement proper error handling and rate limiting to keep data extraction workflows stable.
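Rate limiting can be as simple as enforcing a minimum interval between requests. The sketch below is a minimal, hypothetical helper (not part of jsoup; the class name and interval are illustrative):

```java
public class RateLimiter {
    private final long minIntervalMillis;
    private long lastRequestAt = 0L;

    public RateLimiter(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // How long to wait before the next request, given the current time;
    // separated from acquire() so the logic is testable without sleeping
    public synchronized long millisToWait(long nowMillis) {
        long elapsed = nowMillis - lastRequestAt;
        return Math.max(0, minIntervalMillis - elapsed);
    }

    // Block until a request is allowed, then record it
    public synchronized void acquire() throws InterruptedException {
        long wait = millisToWait(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
        lastRequestAt = System.currentTimeMillis();
    }
}
```

Calling `acquire()` before each `Jsoup.connect(url).get()` keeps requests at or below one per interval.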
Common Use Cases and Examples
Social Media Content Extraction
public class SocialMediaExtractor {
    public static Map<String, String> extractSocialMetaTags(Document doc) {
        Map<String, String> socialData = new HashMap<>();

        // Extract Open Graph tags
        Elements ogTags = doc.select("meta[property^=og:]");
        for (Element tag : ogTags) {
            String property = tag.attr("property");
            String content = tag.attr("content");
            socialData.put(property, content);
        }

        // Extract Twitter Card tags
        Elements twitterTags = doc.select("meta[name^=twitter:]");
        for (Element tag : twitterTags) {
            String name = tag.attr("name");
            String content = tag.attr("content");
            socialData.put(name, content);
        }
        return socialData;
    }
}
Product Information Extraction
public class ProductExtractor {
    public static Map<String, String> extractProductInfo(Document doc) {
        Map<String, String> productData = new HashMap<>();

        // Extract product price from the first matching selector
        List<String> priceSelectors = Arrays.asList(
            ".price", ".cost", "[class*=price]", "[data-price]", ".product-price"
        );
        for (String selector : priceSelectors) {
            Elements priceElements = doc.select(selector);
            if (!priceElements.isEmpty()) {
                productData.put("price", priceElements.first().text());
                break;
            }
        }

        // Extract product SKU/ID
        Pattern skuPattern = Pattern.compile("SKU:?\\s*([A-Z0-9\\-]+)", Pattern.CASE_INSENSITIVE);
        Matcher skuMatcher = skuPattern.matcher(doc.text());
        if (skuMatcher.find()) {
            productData.put("sku", skuMatcher.group(1));
        }

        // Extract product availability; check the negative phrases first,
        // because "unavailable" also contains the substring "available"
        Elements availabilityElements = doc.select("[class*=stock], [class*=availability]");
        for (Element element : availabilityElements) {
            String text = element.text().toLowerCase();
            if (text.contains("out of stock") || text.contains("unavailable")) {
                productData.put("availability", "out_of_stock");
                break;
            } else if (text.contains("in stock") || text.contains("available")) {
                productData.put("availability", "in_stock");
                break;
            }
        }
        return productData;
    }
}
Conclusion
jsoup provides powerful capabilities for extracting specific text patterns from HTML documents. By combining CSS selectors with regular expressions, you can efficiently target and extract structured data such as emails, phone numbers, prices, and custom patterns. The key to successful pattern extraction is understanding your target data structure and choosing the right combination of selectors and regex patterns for both performance and accuracy.
Key takeaways for effective pattern extraction with JSoup:
- Use targeted selectors: Focus on specific elements likely to contain your target data
- Combine multiple approaches: Use CSS selectors for structure and regex for patterns
- Handle edge cases: Account for different data formats and malformed HTML
- Implement error handling: Use retry logic and graceful degradation
- Optimize performance: Target specific elements rather than parsing entire documents
Remember to handle errors gracefully, implement appropriate timeouts, and respect website terms of service when building production scraping applications. jsoup's flexibility makes it an excellent choice for both simple pattern extraction tasks and complex data mining operations.