How do I use regular expressions for data extraction in Ruby web scraping?

Regular expressions (regex) are powerful pattern-matching tools that can significantly enhance your Ruby web scraping capabilities. While HTML parsers like Nokogiri are generally preferred for structured HTML content, regex excels at extracting specific data patterns from unstructured text, JavaScript variables, embedded JSON, and complex string formats.

Understanding Ruby Regex Basics

Ruby has built-in support for regular expressions with excellent performance and intuitive syntax. Here's how to get started:

# Basic regex syntax
pattern = /hello/
text = "hello world"
match = text.match(pattern)
puts match[0] if match  # Output: "hello"

# Using regex with variables
search_term = "email"
pattern = /#{search_term}/i  # Case-insensitive
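
One caveat when interpolating variables: if the value comes from user input or scraped text, escape it first so regex metacharacters are treated literally. A small sketch (the input string is made up):

# Regexp.escape neutralizes metacharacters like "(" and ")"
user_input = "price (USD)"
safe_pattern = /#{Regexp.escape(user_input)}/i

puts "Price (USD): $10".match?(safe_pattern)  # => true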

Common Regex Patterns for Web Scraping

Email Extraction

require 'net/http'
require 'uri'

def extract_emails(html_content)
  email_pattern = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/
  emails = html_content.scan(email_pattern)
  emails.uniq
end

# Example usage
html = "<p>Contact us at info@example.com or support@company.org</p>"
emails = extract_emails(html)
p emails  # => ["info@example.com", "support@company.org"]

Phone Number Extraction

def extract_phone_numbers(text)
  # Matches various phone number formats
  phone_patterns = [
    /\(\d{3}\)\s*\d{3}-\d{4}/,           # (123) 456-7890
    /\d{3}-\d{3}-\d{4}/,                 # 123-456-7890
    /\d{3}\.\d{3}\.\d{4}/,               # 123.456.7890
    /\+1\s*\d{3}\s*\d{3}\s*\d{4}/       # +1 123 456 7890
  ]

  phone_numbers = []
  phone_patterns.each do |pattern|
    phone_numbers.concat(text.scan(pattern))
  end

  phone_numbers.uniq
end
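
Example usage (the numbers are placeholders):

text = "Call (555) 123-4567 or 555.987.6543 for support"
p extract_phone_numbers(text)
# => ["(555) 123-4567", "555.987.6543"]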

Price Extraction

def extract_prices(text)
  # Matches various price formats
  price_pattern = /\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?/
  prices = text.scan(price_pattern)

  # Convert to numeric values
  prices.map { |price| price.gsub(/[$,]/, '').to_f }
end

# Example
product_text = "Special offer: $19.99 (was $29.99)"
prices = extract_prices(product_text)
p prices  # => [19.99, 29.99]

Advanced Data Extraction Techniques

Extracting JavaScript Variables

When scraping single-page applications, you often need to extract data from JavaScript variables embedded in HTML:

require 'json'

def extract_js_variable(html, variable_name)
  # Pattern to match: var variableName = {...};
  pattern = /var\s+#{Regexp.escape(variable_name)}\s*=\s*(\{.*?\});/m
  match = html.match(pattern)

  return nil unless match

  begin
    JSON.parse(match[1])
  rescue JSON::ParserError
    match[1]  # Return the raw string if it is not valid JSON
  end
end

# Example usage
html_with_js = <<~HTML
  <script>
    var productData = {"name": "Laptop", "price": 999.99};
    var userInfo = {"id": 123, "name": "John"};
  </script>
HTML

product_data = extract_js_variable(html_with_js, "productData")
puts product_data["name"]  # "Laptop"
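
The lazy {.*?}; match handles flat objects like these, but it can cut a nested object short when the JSON contains "};" inside a string value. A minimal brace-counting sketch for those cases (it still assumes no stray braces inside string literals):

def extract_balanced_object(html, variable_name)
  start = html.index(/var\s+#{Regexp.escape(variable_name)}\s*=\s*\{/)
  return nil unless start

  idx = html.index('{', start)  # position of the opening brace
  depth = 0
  idx.upto(html.length - 1) do |i|
    case html[i]
    when '{' then depth += 1
    when '}'
      depth -= 1
      return html[idx..i] if depth.zero?
    end
  end
  nil  # unbalanced braces
end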

Extracting URLs and Links

def extract_urls(text)
  # Scheme, host, optional port, then optional path, query string and fragment
  url_pattern = %r{https?://[\w.-]+(?::\d+)?(?:/[\w/.%-]*(?:\?[\w&=%.-]*)?(?:#[\w.-]*)?)?}
  urls = text.scan(url_pattern)
  urls.uniq
end

def extract_relative_links(html)
  # Extract href attributes
  link_pattern = /href\s*=\s*["']([^"']+)["']/i
  links = html.scan(link_pattern).flatten

  # Filter for relative links
  links.select { |link| !link.start_with?('http', 'mailto:', 'tel:') }
end
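
Relative links usually need to be resolved against the page URL before they can be requested. Ruby's standard library handles this with URI.join (the base URL below is illustrative; note that URI.join raises URI::InvalidURIError on malformed hrefs):

require 'uri'

def absolutize_links(base_url, relative_links)
  relative_links.map { |link| URI.join(base_url, link).to_s }
end

p absolutize_links('https://example.com/shop/', ['item?id=1', '/about'])
# => ["https://example.com/shop/item?id=1", "https://example.com/about"]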

Social Media Handles

def extract_social_handles(text)
  social_patterns = {
    twitter: /@[A-Za-z0-9_]{1,15}/,
    instagram: /@[A-Za-z0-9_.]{1,30}/,
    linkedin: /linkedin\.com\/in\/([A-Za-z0-9\-._]+)/,
    facebook: /facebook\.com\/([A-Za-z0-9.]+)/
  }

  results = {}
  social_patterns.each do |platform, pattern|
    matches = text.scan(pattern)
    results[platform] = matches.flatten.uniq unless matches.empty?
  end

  results
end
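
Example usage (the handles are made up). Note that the @-based Twitter and Instagram patterns overlap, and both will also match the "@domain" portion of email addresses, so consider filtering results against separately extracted emails:

bio = "Follow us at @acme_dev or linkedin.com/in/acme-dev"
p extract_social_handles(bio)
# => {:twitter=>["@acme_dev"], :instagram=>["@acme_dev"], :linkedin=>["acme-dev"]}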

Combining Regex with HTTP Requests

Here's a complete example that demonstrates scraping a webpage and extracting data using regex:

require 'net/http'
require 'uri'
require 'json'

class RegexScraper
  def initialize(url)
    @url = url
    @content = fetch_content
  end

  def fetch_content
    uri = URI(@url)
    response = Net::HTTP.get_response(uri)

    if response.code == '200'
      response.body
    else
      raise "Failed to fetch content: #{response.code}"
    end
  end

  def extract_emails
    email_pattern = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/
    @content.scan(email_pattern).uniq
  end

  def extract_meta_data
    meta_data = {}

    # Extract title
    title_match = @content.match(/<title[^>]*>(.*?)<\/title>/im)
    meta_data[:title] = title_match[1].strip if title_match

    # Extract meta description
    desc_pattern = /<meta[^>]*name=["']description["'][^>]*content=["']([^"']+)["']/i
    desc_match = @content.match(desc_pattern)
    meta_data[:description] = desc_match[1] if desc_match

    # Extract structured data (JSON-LD)
    json_ld_pattern = /<script[^>]*type=["']application\/ld\+json["'][^>]*>(.*?)<\/script>/im
    json_ld_matches = @content.scan(json_ld_pattern)

    if json_ld_matches.any?
      # Parse each block individually so one malformed JSON-LD
      # script does not discard the rest
      meta_data[:structured_data] = json_ld_matches.filter_map do |match|
        JSON.parse(match[0])
      rescue JSON::ParserError
        nil
      end
    end

    meta_data
  end

  def extract_product_info
    # Extract product prices
    price_pattern = /\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?/
    prices = @content.scan(price_pattern).map { |p| p.gsub(/[$,]/, '').to_f }

    # Extract product SKUs (example pattern)
    sku_pattern = /SKU:\s*([A-Z0-9\-]+)/i
    skus = @content.scan(sku_pattern).flatten

    {
      prices: prices,
      skus: skus
    }
  end
end

# Usage example
begin
  scraper = RegexScraper.new('https://example-store.com/product/123')

  emails = scraper.extract_emails
  meta_data = scraper.extract_meta_data
  product_info = scraper.extract_product_info

  puts "Found emails: #{emails}"
  puts "Page title: #{meta_data[:title]}"
  puts "Product prices: #{product_info[:prices]}"
rescue => e
  puts "Error: #{e.message}"
end

Best Practices and Performance Tips

1. Compile Regex Patterns

For frequently used patterns, define them once as frozen constants; this keeps them in one place and guarantees a single Regexp object is reused (interpolated patterns, in particular, are otherwise rebuilt on every evaluation):

class DataExtractor
  EMAIL_PATTERN = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/.freeze
  PHONE_PATTERN = /\(\d{3}\)\s*\d{3}-\d{4}/.freeze

  def extract_contact_info(text)
    {
      emails: text.scan(EMAIL_PATTERN),
      phones: text.scan(PHONE_PATTERN)
    }
  end
end
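
The same idea applies to interpolated patterns: build the Regexp object once, outside any loop, instead of re-interpolating it on every iteration (the domain list is illustrative):

domains = %w[example.com company.org]
# Regexp.new runs once per domain; Regexp.escape keeps the dots literal
domain_patterns = domains.map { |d| Regexp.new("[\\w.+-]+@#{Regexp.escape(d)}\\b") }

text = "Write to info@example.com or sales@company.org"
domain_patterns.each { |re| p text.scan(re) }
# => ["info@example.com"]
# => ["sales@company.org"]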

2. Use Named Capture Groups

Named groups make your code more readable and maintainable:

def extract_product_details(text)
  pattern = /Product:\s*(?<name>.*?)\s*Price:\s*\$(?<price>\d+\.?\d*)/

  # to_enum(:scan, ...) exposes each MatchData via Regexp.last_match,
  # so captures can be read by name instead of by position
  text.to_enum(:scan, pattern).map do
    match = Regexp.last_match
    { name: match[:name].strip, price: match[:price].to_f }
  end
end
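
Example usage with a made-up listing:

listing = "Product: Wireless Mouse Price: $24.99 Product: USB Hub Price: $12.50"
p extract_product_details(listing)
# => [{:name=>"Wireless Mouse", :price=>24.99}, {:name=>"USB Hub", :price=>12.5}]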

3. Handle Edge Cases

def safe_regex_extract(text, pattern, default = [])
  return default if text.nil? || text.empty?

  # Regexp.new raises RegexpError when a string pattern is invalid;
  # an already-built Regexp passes through untouched
  regexp = pattern.is_a?(Regexp) ? pattern : Regexp.new(pattern)
  text.scan(regexp)
rescue RegexpError => e
  puts "Regex error: #{e.message}"
  default
end

When to Use Regex vs. HTML Parsers

While regex is powerful, it's important to understand when to use it versus dedicated HTML parsers:

Use regex when:

  • Extracting data from plain text or JavaScript
  • Working with non-HTML content
  • Extracting specific patterns (emails, phone numbers, URLs)
  • Processing large amounts of text efficiently

Use HTML parsers (like Nokogiri) when:

  • Working with structured HTML
  • Traversing DOM elements
  • Extracting data based on HTML structure
  • Handling complex authentication flows where DOM interaction is required
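
In practice the two approaches combine well: use the parser to isolate the right part of the page, then run regex on the extracted text. A minimal sketch, assuming the nokogiri gem is installed and that contact details live in a .contact block or the footer (both assumptions for illustration):

require 'nokogiri'

def emails_from_contact_section(html)
  doc = Nokogiri::HTML(html)
  # Narrow to the relevant nodes first, then pattern-match their plain text
  contact_text = doc.css('.contact, footer').map(&:text).join(' ')
  contact_text.scan(/\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/).uniq
end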

Error Handling and Validation

Always implement proper error handling when using regex for web scraping:

def robust_data_extraction(html_content)
  results = {
    emails: [],
    phones: [],
    urls: [],
    errors: []
  }

  begin
    results[:emails] = extract_emails(html_content)
  rescue => e
    results[:errors] << "Email extraction failed: #{e.message}"
  end

  begin
    results[:phones] = extract_phone_numbers(html_content)
  rescue => e
    results[:errors] << "Phone extraction failed: #{e.message}"
  end

  begin
    results[:urls] = extract_urls(html_content)
  rescue => e
    results[:errors] << "URL extraction failed: #{e.message}"
  end

  results
end

Performance Optimization

For large-scale scraping operations, consider these optimizations:

# Use StringScanner for efficient pattern matching
require 'strscan'

def efficient_pattern_extraction(text, patterns)
  scanner = StringScanner.new(text)
  results = Hash.new { |h, k| h[k] = [] }

  until scanner.eos?
    # Try each pattern at the current position; keep the first that matches
    matched = patterns.any? do |name, pattern|
      if (match = scanner.scan(pattern))
        results[name] << match
        true
      end
    end
    # Advance a single character only when nothing matched here,
    # so back-to-back matches are not skipped
    scanner.getch unless matched
  end

  results
end
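
Example usage with patterns from earlier sections:

patterns = {
  email: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/,
  price: /\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?/
}
text = "Contact sales@example.com about the $1,299.00 plan"
p efficient_pattern_extraction(text, patterns)
# => {:email=>["sales@example.com"], :price=>["$1,299.00"]}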

Conclusion

Regular expressions are invaluable tools for Ruby web scraping, especially when dealing with unstructured text data, JavaScript variables, or specific pattern extraction. While they shouldn't replace proper HTML parsers for structured content, they excel at extracting emails, phone numbers, prices, and other formatted data.

Remember to always validate your regex patterns with diverse test data, handle edge cases gracefully, and consider performance implications when processing large amounts of content. When combined with Ruby's powerful HTTP libraries and HTML parsers, regex becomes part of a comprehensive web scraping toolkit.

For complex scenarios involving browser automation and dynamic content, consider integrating regex extraction with headless browser solutions to handle JavaScript-rendered content effectively.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
