How can I use regular expressions with Nokogiri selectors?
While Nokogiri doesn't natively support regular expressions in CSS selectors, you can effectively combine regular expressions with Nokogiri's powerful selection methods to create sophisticated pattern-matching capabilities for web scraping. This guide explores various techniques to integrate regex patterns with Nokogiri selectors for advanced HTML parsing.
Understanding Nokogiri's Selector Limitations
Nokogiri primarily supports CSS selectors and XPath expressions, neither of which have built-in regex support. However, you can work around this limitation using several approaches:
- Post-selection filtering with regex
- XPath contains() functions
- Custom attribute matching
- Text content pattern matching
Method 1: Post-Selection Filtering with Regular Expressions
The most straightforward approach is to select elements using standard Nokogiri selectors and then filter the results using regular expressions:
require 'nokogiri'
require 'open-uri'
# Parse HTML document
html = <<-HTML
<div class="product-item-123">Product A</div>
<div class="product-item-456">Product B</div>
<div class="special-offer-789">Special Deal</div>
<div class="product-item-abc">Product C</div>
HTML
doc = Nokogiri::HTML(html)
# Select all divs and filter by class pattern
product_divs = doc.css('div').select do |div|
div['class'] =~ /^product-item-\d+$/
end
product_divs.each do |div|
puts "Found: #{div.text} with class: #{div['class']}"
end
Method 2: Advanced Text Content Matching
You can combine Nokogiri selectors with regex to find elements based on text content patterns:
# Find elements containing email addresses
email_pattern = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/
# Select all text-containing elements and filter by email pattern
email_elements = doc.css('*').select do |element|
element.text =~ email_pattern
end
# Extract emails from the matching elements
emails = email_elements.map do |element|
element.text.scan(email_pattern)
end.flatten
puts "Found emails: #{emails}"
Method 3: Using XPath with Pattern Matching
While XPath 1.0 (the version Nokogiri supports via libxml2) doesn't include regex functions, you can use contains(), starts-with(), and substring() for basic pattern matching:
# Find elements with IDs starting with "user-"
user_elements = doc.xpath("//div[starts-with(@id, 'user-')]")
# Find elements containing specific text patterns
price_elements = doc.xpath("//span[contains(text(), '$')]")
# Combine multiple XPath conditions
complex_selection = doc.xpath("//div[contains(@class, 'product') and starts-with(@id, 'item-')]")
Method 4: Attribute Pattern Matching
Filter elements based on attribute values using regular expressions:
# HTML with various data attributes
html = <<-HTML
<div data-product-id="PROD-2023-001">Item 1</div>
<div data-product-id="PROD-2023-002">Item 2</div>
<div data-product-id="SPECIAL-2023-001">Special Item</div>
<div data-user-id="USER-2023-001">User Info</div>
HTML
doc = Nokogiri::HTML(html)
# Find products with specific ID pattern
product_pattern = /^PROD-\d{4}-\d{3}$/
products = doc.css('div[data-product-id]').select do |div|
div['data-product-id'] =~ product_pattern
end
products.each do |product|
puts "Product: #{product.text} (ID: #{product['data-product-id']})"
end
Method 5: Complex URL and Link Pattern Matching
Extract and filter links based on URL patterns:
# Find links with specific URL patterns
url_pattern = /^https:\/\/api\.example\.com\/v\d+\//
api_links = doc.css('a[href]').select do |link|
link['href'] =~ url_pattern
end
# Extract version numbers from API URLs
version_pattern = /\/v(\d+)\//
api_versions = api_links.map do |link|
match = link['href'].match(version_pattern)
match ? match[1].to_i : nil
end.compact
puts "Found API versions: #{api_versions.uniq.sort}"
Method 6: Creating a Custom Selector Helper
Create a reusable helper method for regex-based element selection:
class NokogiriRegexHelper
def self.select_by_attribute_pattern(doc, selector, attribute, pattern)
doc.css(selector).select do |element|
element[attribute] && element[attribute] =~ pattern
end
end
def self.select_by_text_pattern(doc, selector, pattern)
doc.css(selector).select do |element|
element.text =~ pattern
end
end
def self.select_by_combined_pattern(doc, selector, conditions)
doc.css(selector).select do |element|
conditions.all? do |condition|
case condition[:type]
when :attribute
element[condition[:attribute]] =~ condition[:pattern]
when :text
element.text =~ condition[:pattern]
when :class
element['class'] =~ condition[:pattern]
end
end
end
end
end
# Usage examples
phone_pattern = /^\+?1?[-.\s]?\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}$/
phone_elements = NokogiriRegexHelper.select_by_text_pattern(doc, 'span', phone_pattern)
# Complex multi-condition selection
complex_conditions = [
{ type: :attribute, attribute: 'data-type', pattern: /^product-/ },
{ type: :text, pattern: /\$\d+\.\d{2}/ },
{ type: :class, pattern: /featured/ }
]
featured_products = NokogiriRegexHelper.select_by_combined_pattern(
doc, 'div', complex_conditions
)
Method 7: Handling Dynamic Content Patterns
When working with dynamically generated content, regex patterns become particularly useful:
# Parse HTML with dynamic class names and IDs
dynamic_html = <<-HTML
<div class="component-abc123-def456">Component A</div>
<div class="component-xyz789-uvw012">Component B</div>
<div id="widget-2023-11-15-001">Widget 1</div>
<div id="widget-2023-11-15-002">Widget 2</div>
HTML
doc = Nokogiri::HTML(dynamic_html)
# Match components with UUID-like class names
component_pattern = /^component-[a-f0-9]{6}-[a-f0-9]{6}$/
components = doc.css('div').select do |div|
div['class'] =~ component_pattern
end
# Match widgets with date-based IDs
widget_pattern = /^widget-\d{4}-\d{2}-\d{2}-\d{3}$/
widgets = doc.css('div').select do |div|
div['id'] =~ widget_pattern
end
# Extract dates from widget IDs (Date.parse needs the date library)
require 'date'
date_pattern = /widget-(\d{4}-\d{2}-\d{2})-/
widget_dates = widgets.map do |widget|
match = widget['id'].match(date_pattern)
match ? Date.parse(match[1]) : nil
end.compact
Working with JavaScript-Heavy Sites
When dealing with dynamic content that requires JavaScript execution, regular expressions with Nokogiri alone might not be sufficient. In such cases, you may need to combine Nokogiri with headless browser tools. For instance, handling authentication in Puppeteer can help you access protected content before parsing with Nokogiri.
Performance Considerations
When using regular expressions with Nokogiri selectors, keep these performance tips in mind:
# Compile regex patterns once for better performance
EMAIL_PATTERN = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/.freeze
PHONE_PATTERN = /^\+?1?[-.\s]?\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}$/.freeze
# Use specific selectors first, then filter
# Good: Narrow selection first
contact_spans = doc.css('span.contact-info').select { |span| span.text =~ EMAIL_PATTERN }
# Less efficient: Select all elements first
all_elements = doc.css('*').select { |el| el.text =~ EMAIL_PATTERN }
# Cache compiled selectors for repeated use
class SelectorCache
def initialize
@patterns = {}
end
def get_pattern(key, regex_string)
@patterns[key] ||= Regexp.new(regex_string)
end
end
cache = SelectorCache.new
pattern = cache.get_pattern(:email, EMAIL_PATTERN.source)
Real-World Example: Product Scraping
Here's a comprehensive example that demonstrates regex usage in a practical product scraping scenario:
class ProductScraper
PRODUCT_ID_PATTERN = /^PROD-\d{4}-[A-Z]{3}$/.freeze
PRICE_PATTERN = /\$(\d+(?:\.\d{2})?)/.freeze
SKU_PATTERN = /SKU[:\s]+([A-Z0-9-]+)/i.freeze
def initialize(html)
@doc = Nokogiri::HTML(html)
end
def extract_products
product_containers = @doc.css('div.product-container, article.product')
product_containers.filter_map do |container|
product_id = extract_product_id(container)
next unless product_id
{
id: product_id,
name: extract_product_name(container),
price: extract_price(container),
sku: extract_sku(container),
category: extract_category(container)
}
end
end
private
def extract_product_id(container)
id = container['data-product-id'] || container['id']
return nil unless id&.match?(PRODUCT_ID_PATTERN)
id
end
def extract_product_name(container)
container.css('.product-name, h2, h3').first&.text&.strip
end
def extract_price(container)
price_text = container.css('.price, .cost, .amount').first&.text
return nil unless price_text
match = price_text.match(PRICE_PATTERN)
match ? match[1].to_f : nil
end
def extract_sku(container)
sku_text = container.text
match = sku_text.match(SKU_PATTERN)
match ? match[1] : nil
end
def extract_category(container)
# Look for category in breadcrumbs or data attributes
category_element = container.css('.breadcrumb a, [data-category]').last
category_element&.text&.strip || container['data-category']
end
end
# Usage
html = File.read('products.html')
scraper = ProductScraper.new(html)
products = scraper.extract_products
products.each do |product|
puts "Product: #{product[:name]} (#{product[:id]}) - $#{product[:price]}"
end
Error Handling and Edge Cases
When working with regex and Nokogiri, always handle potential errors and edge cases:
def safe_regex_match(text, pattern)
return nil if text.nil? || text.empty?
begin
match = text.match(pattern)
match ? match.captures : nil
rescue RegexpError => e
puts "Regex error: #{e.message}"
nil
end
end
# Handle malformed HTML gracefully
def parse_with_error_handling(html)
begin
Nokogiri::HTML(html) do |config|
config.recover
end
rescue => e
puts "HTML parsing error: #{e.message}"
Nokogiri::HTML::Document.new
end
end
Integration with Modern Web Scraping
For modern web applications that rely heavily on JavaScript, you might need to combine Nokogiri with browser automation tools. Managing browser sessions in Puppeteer can help you capture the fully rendered HTML before applying Nokogiri with regex patterns.
Best Practices Summary
- Compile regex patterns once: Store frequently used patterns as constants
- Use specific selectors first: Narrow down elements before applying regex filters
- Handle nil values: Always check for nil attributes before applying regex
- Test patterns thoroughly: Validate regex patterns with various input formats
- Consider performance: For large documents, balance regex complexity with performance needs
- Cache compiled patterns: Use pattern caching for repeated operations
- Graceful error handling: Always handle potential regex and parsing errors
- Document your patterns: Comment complex regex patterns for maintainability
By combining Nokogiri's powerful selection capabilities with regular expressions, you can create robust and flexible web scraping solutions that handle complex HTML patterns and dynamic content structures effectively. This approach gives you the precision of regex pattern matching while leveraging Nokogiri's efficient HTML parsing and selection methods.