How do I use regular expressions for data extraction in Ruby web scraping?
Regular expressions (regex) are powerful pattern-matching tools that can significantly enhance your Ruby web scraping capabilities. While HTML parsers like Nokogiri are generally preferred for structured HTML content, regex excels at extracting specific data patterns from unstructured text, JavaScript variables, embedded JSON, and complex string formats.
Understanding Ruby Regex Basics
Ruby has built-in support for regular expressions with excellent performance and intuitive syntax. Here's how to get started:
# Basic regex syntax
pattern = /hello/
text = "hello world"
match = text.match(pattern)
puts match[0] if match # Output: "hello"
# Using regex with variables
search_term = "email"
pattern = /#{search_term}/i # Case-insensitive
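Beyond `String#match`, Ruby offers several matching methods worth knowing before diving into extraction; a quick sketch:

```ruby
text = "Contact: alice@example.com"

# match? returns true/false without allocating a MatchData object (Ruby 2.4+)
puts text.match?(/@/)            # true

# =~ returns the index of the first match, or nil
puts(text =~ /alice/)            # 9

# scan returns every non-overlapping match as an array
puts text.scan(/[aeiou]/).first  # "o"

# Case-insensitive matching with the i flag
puts "HELLO".match?(/hello/i)    # true
```

`match?` is the cheapest option when you only need a yes/no answer, which matters when filtering many pages.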
Common Regex Patterns for Web Scraping
Email Extraction
def extract_emails(html_content)
  email_pattern = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/
  emails = html_content.scan(email_pattern)
  emails.uniq
end

# Example usage
html = "<p>Contact us at info@example.com or support@company.org</p>"
emails = extract_emails(html)
p emails # ["info@example.com", "support@company.org"]
Phone Number Extraction
def extract_phone_numbers(text)
  # Matches various phone number formats
  phone_patterns = [
    /\(\d{3}\)\s*\d{3}-\d{4}/,      # (123) 456-7890
    /\d{3}-\d{3}-\d{4}/,            # 123-456-7890
    /\d{3}\.\d{3}\.\d{4}/,          # 123.456.7890
    /\+1\s*\d{3}\s*\d{3}\s*\d{4}/   # +1 123 456 7890
  ]
  phone_numbers = []
  phone_patterns.each do |pattern|
    phone_numbers.concat(text.scan(pattern))
  end
  phone_numbers.uniq
end
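As an aside, the four formats above can be merged into a single pattern with `Regexp.union`, so `scan` walks the text once instead of once per pattern; a sketch:

```ruby
# Regexp.union joins patterns with alternation, preserving each format
PHONE_PATTERN = Regexp.union(
  /\(\d{3}\)\s*\d{3}-\d{4}/,      # (123) 456-7890
  /\d{3}-\d{3}-\d{4}/,            # 123-456-7890
  /\d{3}\.\d{3}\.\d{4}/,          # 123.456.7890
  /\+1\s*\d{3}\s*\d{3}\s*\d{4}/   # +1 123 456 7890
)

text = "Call (555) 123-4567 or 555-987-6543"
p text.scan(PHONE_PATTERN)  # ["(555) 123-4567", "555-987-6543"]
```

This also preserves the original ordering of matches in the text, which the pattern-by-pattern loop does not.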
Price Extraction
def extract_prices(text)
  # Matches prices like $19.99 or $1,299.00
  price_pattern = /\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?/
  prices = text.scan(price_pattern)
  # Convert to numeric values
  prices.map { |price| price.gsub(/[$,]/, '').to_f }
end

# Example
product_text = "Special offer: $19.99 (was $29.99)"
prices = extract_prices(product_text)
p prices # [19.99, 29.99]
Advanced Data Extraction Techniques
Extracting JavaScript Variables
When scraping single-page applications, you often need to extract data from JavaScript variables embedded in HTML:
require 'json'

def extract_js_variable(html, variable_name)
  # Pattern to match: var variableName = {...};
  # Note: the non-greedy {.*?} stops at the first closing brace,
  # so this only handles object literals without nested braces.
  pattern = /var\s+#{Regexp.escape(variable_name)}\s*=\s*({.*?});/m
  match = html.match(pattern)
  return nil unless match

  begin
    JSON.parse(match[1])
  rescue JSON::ParserError
    match[1] # Return raw string if not valid JSON
  end
end

# Example usage
html_with_js = <<~HTML
  <script>
    var productData = {"name": "Laptop", "price": 999.99};
    var userInfo = {"id": 123, "name": "John"};
  </script>
HTML

product_data = extract_js_variable(html_with_js, "productData")
puts product_data["name"] # "Laptop"
Extracting URLs and Links
def extract_urls(text)
  url_pattern = /https?:\/\/(?:[-\w.])+(?::[0-9]+)?(?:\/(?:[\w\/_.])*(?:\?(?:[\w&=%.])*)?(?:\#(?:[\w.])*)?)?/
  text.scan(url_pattern).uniq
end

def extract_relative_links(html)
  # Extract href attributes
  link_pattern = /href\s*=\s*["']([^"']+)["']/i
  links = html.scan(link_pattern).flatten
  # Filter out absolute URLs and non-HTTP schemes
  links.reject { |link| link.start_with?('http', 'mailto:', 'tel:') }
end
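Relative links usually need to be resolved against the page's base URL before they can be fetched; one way to do that with the standard `uri` library (the base URL below is hypothetical):

```ruby
require 'uri'

# Resolve relative links against a base URL using RFC 3986 rules
def absolutize_links(relative_links, base_url)
  relative_links.map { |link| URI.join(base_url, link).to_s }
end

links = ["/about", "contact.html", "../pricing"]
p absolutize_links(links, "https://example.com/shop/")
# ["https://example.com/about", "https://example.com/shop/contact.html", "https://example.com/pricing"]
```

`URI.join` handles root-relative paths, sibling files, and `..` traversal, which naive string concatenation gets wrong.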
Social Media Handles
def extract_social_handles(text)
  social_patterns = {
    twitter: /@[A-Za-z0-9_]{1,15}/,
    instagram: /@[A-Za-z0-9_.]{1,30}/,
    linkedin: /linkedin\.com\/in\/([A-Za-z0-9\-._]+)/,
    facebook: /facebook\.com\/([A-Za-z0-9.]+)/
  }
  results = {}
  social_patterns.each do |platform, pattern|
    matches = text.scan(pattern)
    results[platform] = matches.flatten.uniq unless matches.empty?
  end
  results
end
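One caveat: the `@handle` patterns above will also match the local part of any email address in the text. A minimal workaround, shown here with the email pattern from earlier, is to strip emails before scanning for handles:

```ruby
TWITTER = /@[A-Za-z0-9_]{1,15}/
EMAIL   = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/

text = "Follow @ruby_dev or write to info@example.com"

# Without stripping, scan would also pick up "@example" from the email
p text.gsub(EMAIL, '').scan(TWITTER)  # ["@ruby_dev"]
```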
Combining Regex with HTTP Requests
Here's a complete example that demonstrates scraping a webpage and extracting data using regex:
require 'net/http'
require 'uri'
require 'json'
require 'net/http'
require 'uri'
require 'json'

class RegexScraper
  def initialize(url)
    @url = url
    @content = fetch_content
  end

  def fetch_content
    uri = URI(@url)
    response = Net::HTTP.get_response(uri)
    if response.is_a?(Net::HTTPSuccess)
      response.body
    else
      raise "Failed to fetch content: #{response.code}"
    end
  end

  def extract_emails
    email_pattern = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/
    @content.scan(email_pattern).uniq
  end

  def extract_meta_data
    meta_data = {}

    # Extract title
    title_match = @content.match(/<title[^>]*>(.*?)<\/title>/im)
    meta_data[:title] = title_match[1].strip if title_match

    # Extract meta description
    desc_pattern = /<meta\s+name=["']description["']\s+content=["']([^"']+)["']/i
    desc_match = @content.match(desc_pattern)
    meta_data[:description] = desc_match[1] if desc_match

    # Extract structured data (JSON-LD), skipping any block with malformed JSON
    json_ld_pattern = /<script[^>]*type=["']application\/ld\+json["'][^>]*>(.*?)<\/script>/im
    structured = @content.scan(json_ld_pattern).filter_map do |match|
      JSON.parse(match[0])
    rescue JSON::ParserError
      nil
    end
    meta_data[:structured_data] = structured if structured.any?

    meta_data
  end

  def extract_product_info
    # Extract product prices
    price_pattern = /\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?/
    prices = @content.scan(price_pattern).map { |p| p.gsub(/[$,]/, '').to_f }

    # Extract product SKUs (example pattern)
    sku_pattern = /SKU:\s*([A-Z0-9\-]+)/i
    skus = @content.scan(sku_pattern).flatten

    { prices: prices, skus: skus }
  end
end
# Usage example
begin
  scraper = RegexScraper.new('https://example-store.com/product/123')

  emails = scraper.extract_emails
  meta_data = scraper.extract_meta_data
  product_info = scraper.extract_product_info

  puts "Found emails: #{emails}"
  puts "Page title: #{meta_data[:title]}"
  puts "Product prices: #{product_info[:prices]}"
rescue => e
  puts "Error: #{e.message}"
end
Best Practices and Performance Tips
1. Compile Regex Patterns
For frequently used patterns, define them once as frozen constants so they are built a single time and reused throughout your scraper:
class DataExtractor
  EMAIL_PATTERN = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/.freeze
  PHONE_PATTERN = /\(\d{3}\)\s*\d{3}-\d{4}/.freeze

  def extract_contact_info(text)
    {
      emails: text.scan(EMAIL_PATTERN),
      phones: text.scan(PHONE_PATTERN)
    }
  end
end
2. Use Named Capture Groups
Named groups make your code more readable and maintainable:
def extract_product_details(text)
  pattern = /Product:\s*(?<name>.*?)\s*Price:\s*\$(?<price>\d+\.?\d*)/
  # String#scan returns captures positionally (in group order), even when named
  text.scan(pattern).map do |name, price|
    { name: name.strip, price: price.to_f }
  end
end
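Where the names really pay off is with `String#match`, since the resulting `MatchData` supports lookup by name instead of fragile positional indexes:

```ruby
pattern = /Product:\s*(?<name>.*?)\s*Price:\s*\$(?<price>\d+\.?\d*)/

m = "Product: Laptop Price: $999.99".match(pattern)
puts m[:name]   # "Laptop"
puts m[:price]  # "999.99"
```

If the pattern later gains or reorders groups, code that reads `m[:name]` keeps working unchanged.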
3. Handle Edge Cases
def safe_regex_extract(text, pattern, default = [])
  return default if text.nil? || text.empty?

  text.scan(pattern)
rescue RegexpError => e
  # RegexpError is raised when a String pattern fails to compile into a Regexp
  puts "Regex error: #{e.message}"
  default
end
When to Use Regex vs. HTML Parsers
While regex is powerful, it's important to understand when to use it versus dedicated HTML parsers:
Use Regex when:
- Extracting data from plain text or JavaScript
- Working with non-HTML content
- Extracting specific patterns (emails, phones, URLs)
- Processing large amounts of text efficiently
Use HTML parsers (like Nokogiri) when:
- Working with structured HTML
- You need to traverse DOM elements
- Extracting data based on HTML structure
- Handling complex authentication flows where DOM interaction is required
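To see why nested HTML defeats simple regex, consider nested `<div>` tags: a non-greedy match stops at the first closing tag, truncating the outer element's content:

```ruby
# The non-greedy .*? stops at the FIRST </div>, not the matching one
html = "<div>outer <div>inner</div> tail</div>"
p html[/<div>(.*?)<\/div>/m, 1]  # "outer <div>inner"
```

A greedy `.*` fails in the opposite direction (it grabs through the last `</div>` on the page), which is exactly the class of problem a real HTML parser solves.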
Error Handling and Validation
Always implement proper error handling when using regex for web scraping:
def robust_data_extraction(html_content)
  results = { emails: [], phones: [], urls: [], errors: [] }

  begin
    results[:emails] = extract_emails(html_content)
  rescue => e
    results[:errors] << "Email extraction failed: #{e.message}"
  end

  begin
    results[:phones] = extract_phone_numbers(html_content)
  rescue => e
    results[:errors] << "Phone extraction failed: #{e.message}"
  end

  begin
    results[:urls] = extract_urls(html_content)
  rescue => e
    results[:errors] << "URL extraction failed: #{e.message}"
  end

  results
end
Performance Optimization
For large-scale scraping operations, consider these optimizations:
# Use StringScanner's scan_until to jump from match to match
# instead of advancing through the text one character at a time
require 'strscan'

def efficient_pattern_extraction(text, patterns)
  results = Hash.new { |h, k| h[k] = [] }
  patterns.each do |name, pattern|
    scanner = StringScanner.new(text)
    while scanner.scan_until(pattern)
      results[name] << scanner.matched
    end
  end
  results
end
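When even building result arrays is too much, `String#scan` also accepts a block and yields each match as it is found, so very large documents can be processed without materializing every match at once; a small sketch:

```ruby
# Count email-shaped tokens without allocating an array of matches
text = "a@x.com b@y.org c@z.net"
count = 0
text.scan(/\b\w+@\w+\.\w+\b/) { |m| count += 1 }
p count  # 3
```

The block form is also a natural place to stream matches straight into a file or database rather than holding them in memory.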
Conclusion
Regular expressions are invaluable tools for Ruby web scraping, especially when dealing with unstructured text data, JavaScript variables, or specific pattern extraction. While they shouldn't replace proper HTML parsers for structured content, they excel at extracting emails, phone numbers, prices, and other formatted data.
Remember to always validate your regex patterns with diverse test data, handle edge cases gracefully, and consider performance implications when processing large amounts of content. When combined with Ruby's powerful HTTP libraries and HTML parsers, regex becomes part of a comprehensive web scraping toolkit.
For complex scenarios involving browser automation and dynamic content, consider integrating regex extraction with headless browser solutions to handle JavaScript-rendered content effectively.