How can I use regular expressions with Nokogiri selectors?

While Nokogiri doesn't natively support regular expressions in CSS selectors, you can effectively combine regular expressions with Nokogiri's powerful selection methods to create sophisticated pattern-matching capabilities for web scraping. This guide explores various techniques to integrate regex patterns with Nokogiri selectors for advanced HTML parsing.

Understanding Nokogiri's Selector Limitations

Nokogiri primarily supports CSS selectors and XPath expressions, neither of which has built-in regex support. However, you can work around this limitation using several approaches:

  1. Post-selection filtering with regex
  2. XPath contains() functions
  3. Custom attribute matching
  4. Text content pattern matching

Method 1: Post-Selection Filtering with Regular Expressions

The most straightforward approach is to select elements using standard Nokogiri selectors and then filter the results using regular expressions:

require 'nokogiri'

# Parse HTML document
html = <<-HTML
<div class="product-item-123">Product A</div>
<div class="product-item-456">Product B</div>
<div class="special-offer-789">Special Deal</div>
<div class="product-item-abc">Product C</div>
HTML

doc = Nokogiri::HTML(html)

# Select all divs and filter by class pattern
product_divs = doc.css('div').select do |div|
  div['class'] =~ /^product-item-\d+$/
end

product_divs.each do |div|
  puts "Found: #{div.text} with class: #{div['class']}"
end

Method 2: Advanced Text Content Matching

You can combine Nokogiri selectors with regex to find elements based on text content patterns:

# Find elements containing email addresses
email_pattern = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/

# Select all elements and filter by email pattern.
# Note: an element's .text includes descendant text, so ancestors
# of a matching element are selected as well.
email_elements = doc.css('*').select do |element|
  element.text =~ email_pattern
end

# Extract emails from the matching elements (uniq removes the
# duplicates caused by ancestor elements matching too)
emails = email_elements.flat_map do |element|
  element.text.scan(email_pattern)
end.uniq

puts "Found emails: #{emails}"

Method 3: Using XPath with Pattern Matching

While XPath 1.0 (the version Nokogiri supports) has no regex functions, you can use contains(), starts-with(), and substring() for basic pattern matching:

# Find elements with IDs starting with "user-"
user_elements = doc.xpath("//div[starts-with(@id, 'user-')]")

# Find elements containing specific text patterns
price_elements = doc.xpath("//span[contains(text(), '$')]")

# Emulate ends-with() (missing from XPath 1.0) with substring() and string-length()
active_elements = doc.xpath("//div[substring(@id, string-length(@id) - 6) = '-active']")

# Combine multiple XPath conditions
complex_selection = doc.xpath("//div[contains(@class, 'product') and starts-with(@id, 'item-')]")

Method 4: Attribute Pattern Matching

Filter elements based on attribute values using regular expressions:

# HTML with various data attributes
html = <<-HTML
<div data-product-id="PROD-2023-001">Item 1</div>
<div data-product-id="PROD-2023-002">Item 2</div>
<div data-product-id="SPECIAL-2023-001">Special Item</div>
<div data-user-id="USER-2023-001">User Info</div>
HTML

doc = Nokogiri::HTML(html)

# Find products with specific ID pattern
product_pattern = /^PROD-\d{4}-\d{3}$/
products = doc.css('div[data-product-id]').select do |div|
  div['data-product-id'] =~ product_pattern
end

products.each do |product|
  puts "Product: #{product.text} (ID: #{product['data-product-id']})"
end

Method 5: Complex URL and Link Pattern Matching

Extract and filter links based on URL patterns:

# Find links with specific URL patterns
url_pattern = /^https:\/\/api\.example\.com\/v\d+\//

api_links = doc.css('a[href]').select do |link|
  link['href'] =~ url_pattern
end

# Extract version numbers from API URLs
version_pattern = /\/v(\d+)\//
api_versions = api_links.map do |link|
  match = link['href'].match(version_pattern)
  match ? match[1].to_i : nil
end.compact

puts "Found API versions: #{api_versions.uniq.sort}"
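Before wiring patterns like these into a scrape, it can help to sanity-check them against plain strings. A minimal check of the two patterns above (the sample URLs are made up):

```ruby
# The same patterns as above, tested against plain strings
url_pattern     = %r{\Ahttps://api\.example\.com/v\d+/}
version_pattern = %r{/v(\d+)/}

samples = [
  "https://api.example.com/v1/users",
  "https://api.example.com/v2/orders",
  "https://www.example.com/about"   # should not match
]

# Array#grep with a Regexp keeps only matching strings
api_urls = samples.grep(url_pattern)

# String#[] with a capture index returns the capture or nil
versions = api_urls.filter_map { |u| u[version_pattern, 1]&.to_i }

puts api_urls.inspect  # the two api.example.com URLs
puts versions.inspect  # [1, 2]
```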

Method 6: Creating a Custom Selector Helper

Create a reusable helper method for regex-based element selection:

class NokogiriRegexHelper
  def self.select_by_attribute_pattern(doc, selector, attribute, pattern)
    doc.css(selector).select do |element|
      element[attribute] && element[attribute] =~ pattern
    end
  end

  def self.select_by_text_pattern(doc, selector, pattern)
    doc.css(selector).select do |element|
      element.text =~ pattern
    end
  end

  def self.select_by_combined_pattern(doc, selector, conditions)
    doc.css(selector).select do |element|
      conditions.all? do |condition|
        case condition[:type]
        when :attribute
          element[condition[:attribute]] =~ condition[:pattern]
        when :text
          element.text =~ condition[:pattern]
        when :class
          element['class'] =~ condition[:pattern]
        end
      end
    end
  end
end

# Usage examples
phone_pattern = /^\+?1?[-.\s]?\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}$/
phone_elements = NokogiriRegexHelper.select_by_text_pattern(doc, 'span', phone_pattern)

# Complex multi-condition selection
complex_conditions = [
  { type: :attribute, attribute: 'data-type', pattern: /^product-/ },
  { type: :text, pattern: /\$\d+\.\d{2}/ },
  { type: :class, pattern: /featured/ }
]

featured_products = NokogiriRegexHelper.select_by_combined_pattern(
  doc, 'div', complex_conditions
)

Method 7: Handling Dynamic Content Patterns

When working with dynamically generated content, regex patterns become particularly useful:

# Parse HTML with dynamic class names and IDs
dynamic_html = <<-HTML
<div class="component-abc123-def456">Component A</div>
<div class="component-xyz789-uvw012">Component B</div>
<div id="widget-2023-11-15-001">Widget 1</div>
<div id="widget-2023-11-15-002">Widget 2</div>
HTML

doc = Nokogiri::HTML(dynamic_html)

# Match components with UUID-like class names
component_pattern = /^component-[a-f0-9]{6}-[a-f0-9]{6}$/
components = doc.css('div').select do |div|
  div['class'] =~ component_pattern
end

# Match widgets with date-based IDs
widget_pattern = /^widget-\d{4}-\d{2}-\d{2}-\d{3}$/
widgets = doc.css('div').select do |div|
  div['id'] =~ widget_pattern
end

# Extract dates from widget IDs (Date.parse lives in the date stdlib)
require 'date'
date_pattern = /widget-(\d{4}-\d{2}-\d{2})-/
widget_dates = widgets.map do |widget|
  match = widget['id'].match(date_pattern)
  match ? Date.parse(match[1]) : nil
end.compact

Working with JavaScript-Heavy Sites

When dealing with dynamic content that requires JavaScript execution, regular expressions with Nokogiri alone might not be sufficient. In such cases, you may need to combine Nokogiri with headless browser tools. For instance, handling authentication in Puppeteer can help you access protected content before parsing with Nokogiri.

Performance Considerations

When using regular expressions with Nokogiri selectors, keep these performance tips in mind:

# Compile regex patterns once for better performance
EMAIL_PATTERN = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/.freeze
PHONE_PATTERN = /^\+?1?[-.\s]?\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}$/.freeze

# Use specific selectors first, then filter
# Good: Narrow selection first
contact_spans = doc.css('span.contact-info').select { |span| span.text =~ EMAIL_PATTERN }

# Less efficient: Select all elements first
all_elements = doc.css('*').select { |el| el.text =~ EMAIL_PATTERN }

# Cache compiled selectors for repeated use
class SelectorCache
  def initialize
    @patterns = {}
  end

  def get_pattern(key, regex_string)
    @patterns[key] ||= Regexp.new(regex_string)
  end
end

cache = SelectorCache.new
pattern = cache.get_pattern(:email, EMAIL_PATTERN.source)

Real-World Example: Product Scraping

Here's a comprehensive example that demonstrates regex usage in a practical product scraping scenario:

class ProductScraper
  PRODUCT_ID_PATTERN = /^PROD-\d{4}-[A-Z]{3}$/.freeze
  PRICE_PATTERN = /\$(\d+(?:\.\d{2})?)/.freeze
  SKU_PATTERN = /SKU[:\s]+([A-Z0-9-]+)/i.freeze

  def initialize(html)
    @doc = Nokogiri::HTML(html)
  end

  def extract_products
    product_containers = @doc.css('div.product-container, article.product')

    product_containers.filter_map do |container|
      product_id = extract_product_id(container)
      next unless product_id

      {
        id: product_id,
        name: extract_product_name(container),
        price: extract_price(container),
        sku: extract_sku(container),
        category: extract_category(container)
      }
    end
  end

  private

  def extract_product_id(container)
    id = container['data-product-id'] || container['id']
    return nil unless id&.match?(PRODUCT_ID_PATTERN)
    id
  end

  def extract_product_name(container)
    container.css('.product-name, h2, h3').first&.text&.strip
  end

  def extract_price(container)
    price_text = container.css('.price, .cost, .amount').first&.text
    return nil unless price_text

    match = price_text.match(PRICE_PATTERN)
    match ? match[1].to_f : nil
  end

  def extract_sku(container)
    sku_text = container.text
    match = sku_text.match(SKU_PATTERN)
    match ? match[1] : nil
  end

  def extract_category(container)
    # Look for category in breadcrumbs or data attributes
    category_element = container.css('.breadcrumb a, [data-category]').last
    category_element&.text&.strip || container['data-category']
  end
end

# Usage
html = File.read('products.html')
scraper = ProductScraper.new(html)
products = scraper.extract_products

products.each do |product|
  puts "Product: #{product[:name]} (#{product[:id]}) - $#{product[:price]}"
end

Error Handling and Edge Cases

When working with regex and Nokogiri, always handle potential errors and edge cases:

def safe_regex_match(text, pattern)
  return nil if text.nil? || text.empty?

  begin
    match = text.match(pattern)
    match ? match.captures : nil
  rescue RegexpError => e
    puts "Regex error: #{e.message}"
    nil
  end
end

# Handle malformed HTML gracefully (Nokogiri's HTML parser
# recovers from malformed markup by default; the option is
# shown explicitly here)
def parse_with_error_handling(html)
  begin
    Nokogiri::HTML(html) do |config|
      config.recover
    end
  rescue => e
    puts "HTML parsing error: #{e.message}"
    Nokogiri::HTML::Document.new
  end
end

Integration with Modern Web Scraping

For modern web applications that rely heavily on JavaScript, you might need to combine Nokogiri with browser automation tools. Managing browser sessions in Puppeteer can help you capture the fully rendered HTML before applying Nokogiri with regex patterns.

Best Practices Summary

  1. Compile regex patterns once: Store frequently used patterns as constants
  2. Use specific selectors first: Narrow down elements before applying regex filters
  3. Handle nil values: Always check for nil attributes before applying regex
  4. Test patterns thoroughly: Validate regex patterns with various input formats
  5. Consider performance: For large documents, balance regex complexity with performance needs
  6. Cache compiled patterns: Use pattern caching for repeated operations
  7. Graceful error handling: Always handle potential regex and parsing errors
  8. Document your patterns: Comment complex regex patterns for maintainability
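As a minimal, plain-Ruby sketch of practices 1, 3, and 8 (the pattern and helper names are illustrative, not part of any API):

```ruby
# Practice 1/8: compile once as a frozen, documented constant.
# Matches prices like "$19.99" and captures the numeric part.
PRICE_PATTERN = /\$(\d+(?:\.\d{2})?)/.freeze

# Practice 3: attribute values and text pulled from Nokogiri nodes
# may be nil, so guard before matching.
def price_from(text)
  return nil if text.nil?
  text[PRICE_PATTERN, 1]&.to_f
end

puts price_from("Sale: $19.99")  # 19.99
puts price_from(nil).inspect     # nil
```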

By combining Nokogiri's powerful selection capabilities with regular expressions, you can create robust and flexible web scraping solutions that handle complex HTML patterns and dynamic content structures effectively. This approach gives you the precision of regex pattern matching while leveraging Nokogiri's efficient HTML parsing and selection methods.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
