What are the best practices for organizing Ruby web scraping code?
Organizing Ruby web scraping code effectively is crucial for building maintainable, scalable, and robust scraping applications. Whether you're building a simple data extraction script or a complex web scraping system, following established patterns and best practices will save you time and reduce bugs in the long run.
Project Structure and Architecture
1. Modular Design with Separation of Concerns
Organize your Ruby web scraping project using a modular approach that separates different responsibilities:
project_root/
├── lib/
│   ├── scrapers/
│   │   ├── base_scraper.rb
│   │   ├── product_scraper.rb
│   │   └── news_scraper.rb
│   ├── parsers/
│   │   ├── html_parser.rb
│   │   └── json_parser.rb
│   ├── models/
│   │   ├── product.rb
│   │   └── article.rb
│   ├── storage/
│   │   ├── database_storage.rb
│   │   ├── csv_storage.rb
│   │   └── json_storage.rb
│   └── utilities/
│       ├── http_client.rb
│       ├── rate_limiter.rb
│       └── logger.rb
├── config/
│   ├── settings.yml
│   └── database.yml
├── spec/
└── Gemfile
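The tree above assumes dependencies are managed with Bundler. A minimal Gemfile covering the examples in this guide might look like the sketch below; the gem list reflects the code shown later, and versions are deliberately left unpinned so you can add constraints that suit your project.
# Gemfile (illustrative sketch)
source 'https://rubygems.org'
gem 'nokogiri'            # HTML parsing
gem 'concurrent-ruby'     # thread pool used by ConcurrentProcessor
gem 'selenium-webdriver'  # only needed for JavaScript-heavy sites
group :test do
  gem 'rspec'
  gem 'webmock'           # stub HTTP requests in specs
end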
2. Base Scraper Class Pattern
Create a base scraper class that provides common functionality:
# lib/scrapers/base_scraper.rb
require 'net/http'
require 'nokogiri'
require 'logger'
require_relative '../utilities/http_client'
require_relative '../utilities/rate_limiter'
class BaseScraper
attr_reader :logger, :http_client, :rate_limiter
def initialize(options = {})
@logger = Logger.new(STDOUT)
@http_client = HTTPClient.new(options[:http_options] || {})
@rate_limiter = RateLimiter.new(options[:rate_limit] || 1)
@retries = options[:retries] || 3
end
def scrape(url)
raise NotImplementedError, "Subclasses must implement #scrape"
end
protected
def fetch_page(url)
  rate_limiter.wait
  @retries.times do |attempt|
    begin
      response = http_client.get(url)
      return handle_response(response) if response.code == '200'
      logger.warn "HTTP #{response.code} for #{url}, attempt #{attempt + 1}"
      sleep(2 ** attempt) # Exponential backoff
    rescue StandardError => e
      logger.error "Error fetching #{url}: #{e.message}"
      raise e if attempt == @retries - 1
      sleep(2 ** attempt) # Back off before retrying after an exception as well
    end
  end
  # Without an explicit raise, Integer#times would return the retry count instead of a document
  raise "Failed to fetch #{url} after #{@retries} attempts"
end
def handle_response(response)
Nokogiri::HTML(response.body)
end
def parse_page(document)
raise NotImplementedError, "Subclasses must implement #parse_page"
end
end
3. Specific Scraper Implementation
Implement specific scrapers that inherit from the base class:
# lib/scrapers/product_scraper.rb
require_relative 'base_scraper'
class ProductScraper < BaseScraper
def scrape(url)
document = fetch_page(url)
products = parse_page(document)
logger.info "Scraped #{products.length} products from #{url}"
products
end
private
def parse_page(document)
document.css('.product-item').map do |product_element|
{
name: extract_text(product_element, '.product-title'),
price: extract_price(product_element, '.price'),
image_url: extract_attribute(product_element, '.product-image img', 'src'),
availability: extract_availability(product_element)
}
end
end
def extract_text(element, selector)
element.at_css(selector)&.text&.strip
end
def extract_price(element, selector)
price_text = extract_text(element, selector)
return nil unless price_text
price_text.gsub(/[^\d.]/, '').to_f
end
def extract_attribute(element, selector, attribute)
element.at_css(selector)&.attribute(attribute)&.value
end
def extract_availability(element)
  availability_element = element.at_css('.availability')
  return nil unless availability_element # nil keeps the value a Boolean (or nil when unknown)
  availability_element.text.strip.downcase.include?('in stock')
end
end
Configuration Management
4. Centralized Configuration
Use a configuration management approach to handle different environments and settings:
# lib/utilities/config.rb
require 'yaml'
class Config
  # Defaults to development; SCRAPER_ENV (an arbitrary variable name chosen here)
  # lets deployments switch to the production settings without code changes.
  def self.load(environment = ENV.fetch('SCRAPER_ENV', 'development'))
    config_file = File.join(File.dirname(__FILE__), '..', '..', 'config', 'settings.yml')
    @config ||= YAML.load_file(config_file)[environment]
  end
  def self.get(key)
    load[key.to_s]
  end
end
# config/settings.yml
development:
user_agents:
- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
rate_limit: 2
timeout: 30
retries: 3
storage:
type: 'json'
path: 'data/scraped_data.json'
production:
user_agents:
- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
rate_limit: 1
timeout: 60
retries: 5
storage:
type: 'database'
connection_string: 'postgresql://localhost/scraping_db'
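With this loader in place, the rest of the code reads settings through Config.get. A quick usage sketch, assuming the settings.yml shown above:
# Usage sketch: values come from settings.yml for the active environment
require_relative 'lib/utilities/config'
Config.get('rate_limit')      # => 2 in development, 1 in production
Config.get('storage')['type'] # => 'json' or 'database'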
Error Handling and Resilience
5. Robust Error Handling
Implement comprehensive error handling with proper logging and recovery mechanisms:
# lib/utilities/error_handler.rb
require 'net/http'
require 'openssl'
require 'timeout'
require 'logger'
class ErrorHandler
  RETRYABLE_ERRORS = [
    Timeout::Error,
    Net::OpenTimeout,
    Net::ReadTimeout,
    Errno::ECONNRESET,
    OpenSSL::SSL::SSLError
  ].freeze
  def self.logger
    @logger ||= Logger.new(STDOUT)
  end
  def self.handle_with_retry(max_retries: 3, base_delay: 1)
    retries = 0
    begin
      yield
    rescue *RETRYABLE_ERRORS => e
      retries += 1
      if retries <= max_retries
        delay = base_delay * (2 ** (retries - 1))
        logger.warn "Retrying after error: #{e.message}. Attempt #{retries}/#{max_retries}. Waiting #{delay}s"
        sleep(delay)
        retry
      else
        logger.error "Max retries exceeded. Last error: #{e.message}"
        raise e
      end
    rescue StandardError => e
      logger.error "Non-retryable error: #{e.message}\n#{e.backtrace.join("\n")}"
      raise e
    end
  end
end
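The HTTP client in the next section wraps its requests in this helper, but it can also be used directly around any network call. A standalone sketch (the URL is a placeholder):
# Usage sketch: retry a one-off request (placeholder URL)
require 'net/http'
require_relative 'lib/utilities/error_handler'
body = ErrorHandler.handle_with_retry(max_retries: 3, base_delay: 1) do
  Net::HTTP.get(URI('https://example.com/products'))
end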
6. HTTP Client with Best Practices
Create a robust HTTP client with proper headers and connection management:
# lib/utilities/http_client.rb
require 'net/http'
require 'uri'
require_relative 'config'
require_relative 'error_handler'
class HTTPClient
  USER_AGENTS = Config.get('user_agents')
def initialize(options = {})
@timeout = options[:timeout] || Config.get('timeout')
@headers = default_headers.merge(options[:headers] || {})
end
def get(url, additional_headers = {})
uri = URI.parse(url)
http = create_http_connection(uri)
request = Net::HTTP::Get.new(uri.request_uri)
apply_headers(request, additional_headers)
ErrorHandler.handle_with_retry do
http.request(request)
end
end
private
def create_http_connection(uri)
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = uri.scheme == 'https'
http.read_timeout = @timeout
http.open_timeout = @timeout
http
end
def apply_headers(request, additional_headers = {})
headers = @headers.merge(additional_headers)
headers.each { |key, value| request[key] = value }
end
def default_headers
{
'User-Agent' => USER_AGENTS.sample,
'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language' => 'en-US,en;q=0.5',
# 'Accept-Encoding' is deliberately omitted: setting it by hand stops Net::HTTP
# from transparently decompressing gzip responses, which would break parsing.
'Connection' => 'keep-alive',
'Upgrade-Insecure-Requests' => '1'
}
end
end
Data Storage and Processing
7. Storage Abstraction Layer
Implement a storage abstraction that allows easy switching between different storage backends:
# lib/storage/base_storage.rb
class BaseStorage
def save(data)
raise NotImplementedError, "Subclasses must implement #save"
end
def load
raise NotImplementedError, "Subclasses must implement #load"
end
end
# lib/storage/json_storage.rb
require 'json'
require_relative 'base_storage'
class JsonStorage < BaseStorage
def initialize(file_path)
@file_path = file_path
end
def save(data)
File.write(@file_path, JSON.pretty_generate(data))
end
def load
return [] unless File.exist?(@file_path)
JSON.parse(File.read(@file_path))
end
end
# lib/storage/database_storage.rb
require_relative 'base_storage'
class DatabaseStorage < BaseStorage
def initialize(model_class)
@model_class = model_class
end
def save(data)
data.each do |record|
@model_class.create!(record)
end
end
def load
@model_class.all.map(&:attributes)
end
end
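To tie the storage layer back to the configuration from section 4, a small factory can pick the backend at runtime. The sketch below assumes the settings.yml keys shown earlier and a hypothetical ScrapedRecord ActiveRecord model for the database case:
# lib/storage/storage_factory.rb (sketch; ScrapedRecord is a hypothetical model class)
require_relative 'json_storage'
require_relative 'database_storage'
require_relative '../utilities/config'
class StorageFactory
  def self.build
    settings = Config.get('storage')
    case settings['type']
    when 'json'
      JsonStorage.new(settings['path'])
    when 'database'
      DatabaseStorage.new(ScrapedRecord) # assumes a model defined elsewhere in the project
    else
      raise ArgumentError, "Unknown storage type: #{settings['type']}"
    end
  end
end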
Rate Limiting and Respect for Servers
8. Intelligent Rate Limiting
Implement rate limiting to be respectful to target websites:
# lib/utilities/rate_limiter.rb
class RateLimiter
  def initialize(requests_per_second = 1)
    @delay = 1.0 / requests_per_second
    @last_request_time = 0.0
    @mutex = Mutex.new # lets one limiter be shared safely across threads (see ConcurrentProcessor below)
  end
  def wait
    @mutex.synchronize do
      current_time = Time.now.to_f
      time_since_last_request = current_time - @last_request_time
      if time_since_last_request < @delay
        sleep(@delay - time_since_last_request)
      end
      @last_request_time = Time.now.to_f
    end
  end
end
Testing and Quality Assurance
9. Comprehensive Testing Strategy
Write tests for your scraping components using RSpec:
# spec/scrapers/product_scraper_spec.rb
require 'spec_helper'
RSpec.describe ProductScraper do
let(:scraper) { ProductScraper.new }
let(:sample_html) { File.read('spec/fixtures/product_page.html') }
describe '#parse_page' do
let(:document) { Nokogiri::HTML(sample_html) }
it 'extracts product information correctly' do
products = scraper.send(:parse_page, document)
expect(products).to be_an(Array)
expect(products.first).to include(
name: 'Sample Product',
price: 29.99,
image_url: 'https://example.com/image.jpg'
)
end
it 'handles missing elements gracefully' do
empty_document = Nokogiri::HTML('<html><body></body></html>')
products = scraper.send(:parse_page, empty_document)
expect(products).to be_empty
end
end
end
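Unit tests like the one above exercise parsing only. To cover the full HTTP path without hitting real sites, you can stub requests with the webmock gem; a sketch, where the URL and fixture path are placeholders:
# spec/scrapers/product_scraper_http_spec.rb (sketch; requires the webmock gem)
require 'spec_helper'
require 'webmock/rspec'
RSpec.describe ProductScraper do
  it 'scrapes a stubbed page without real network traffic' do
    stub_request(:get, 'https://example.com/products')
      .to_return(status: 200, body: File.read('spec/fixtures/product_page.html'))
    products = ProductScraper.new.scrape('https://example.com/products')
    expect(products).to be_an(Array)
  end
end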
10. Monitoring and Logging
Implement comprehensive logging and monitoring:
# lib/utilities/scraping_logger.rb
require 'logger'
class ScrapingLogger
  def self.setup
    # Memoize so repeated calls reuse one logger instead of building a new one each time
    @logger ||= Logger.new(STDOUT).tap do |logger|
      logger.level = Logger::INFO
      logger.formatter = proc do |severity, datetime, progname, msg|
        "[#{datetime}] #{severity}: #{msg}\n"
      end
    end
  end
  def self.log_scraping_session(scraper_name, url, results_count)
    setup.info "#{scraper_name} completed: #{url} -> #{results_count} items"
  end
  def self.log_error(scraper_name, url, error)
    setup.error "#{scraper_name} failed: #{url} -> #{error.message}"
  end
end
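A short usage sketch showing how this hooks into a scrape run; `scraper` and `url` are assumed to be in scope:
# Usage sketch: record success and failure for each scrape
begin
  products = scraper.scrape(url)
  ScrapingLogger.log_scraping_session('ProductScraper', url, products.length)
rescue StandardError => e
  ScrapingLogger.log_error('ProductScraper', url, e)
  raise
end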
Advanced Patterns and Considerations
11. Handling JavaScript-Heavy Sites
For sites requiring JavaScript execution, integrate with headless browsers:
# lib/scrapers/js_scraper.rb
require 'selenium-webdriver'
require_relative 'base_scraper'
class JSScraper < BaseScraper
def initialize(options = {})
super
@driver = setup_driver(options[:headless] != false)
end
def scrape(url)
  @driver.navigate.to(url)
  wait_for_content_load
  document = Nokogiri::HTML(@driver.page_source)
  parse_page(document)
end
# Quit the browser explicitly once all URLs are done; quitting inside #scrape
# would kill the session after the first page.
def close
  @driver&.quit
end
private
def setup_driver(headless = true)
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless') if headless
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
Selenium::WebDriver.for(:chrome, options: options)
end
def wait_for_content_load
wait = Selenium::WebDriver::Wait.new(timeout: 10)
wait.until { @driver.find_element(css: '.content-loaded') }
rescue Selenium::WebDriver::Error::TimeoutError
logger.warn "Content load timeout, proceeding anyway"
end
end
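Like BaseScraper, JSScraper leaves parse_page to a subclass. A usage sketch, where SpaProductScraper, its selectors, and the URL are hypothetical:
# Usage sketch (hypothetical subclass and URL)
class SpaProductScraper < JSScraper
  private
  def parse_page(document)
    document.css('.product-item').map { |el| { name: el.at_css('.product-title')&.text&.strip } }
  end
end
scraper = SpaProductScraper.new(headless: true)
begin
  results = scraper.scrape('https://example.com/spa-products')
ensure
  scraper.close # release the browser process
end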
12. Concurrent Processing
Implement concurrent processing for better performance while respecting rate limits:
# lib/utilities/concurrent_processor.rb
require 'concurrent'
require 'logger'
class ConcurrentProcessor
  def initialize(max_threads: 4, rate_limiter: nil)
    @pool = Concurrent::ThreadPoolExecutor.new(
      min_threads: 1,
      max_threads: max_threads,
      max_queue: 0
    )
    @rate_limiter = rate_limiter
    @logger = Logger.new(STDOUT)
  end
def process_urls(urls, &block)
futures = urls.map do |url|
Concurrent::Future.execute(executor: @pool) do
@rate_limiter&.wait
yield(url)
rescue StandardError => e
@logger.error "Error processing #{url}: #{e.message}"
nil
end
end
futures.map(&:value).compact
ensure
@pool.shutdown
@pool.wait_for_termination(30)
end
end
# Usage example
processor = ConcurrentProcessor.new(max_threads: 3, rate_limiter: RateLimiter.new(2))
scraper = ProductScraper.new
results = processor.process_urls(urls) do |url|
scraper.scrape(url)
end
Command Line Interface
13. CLI for Easy Operation
Create a command-line interface for your scraper:
#!/usr/bin/env ruby
# bin/scraper
require_relative '../lib/scrapers/product_scraper'
require 'optparse'
require 'json'
require 'csv'
options = {}
OptionParser.new do |opts|
opts.banner = "Usage: scraper [options] URL [URL...]"
opts.on("-o", "--output FILE", "Output file path") do |file|
options[:output] = file
end
opts.on("-f", "--format FORMAT", "Output format (json, csv)") do |format|
options[:format] = format
end
opts.on("-r", "--rate RATE", "Requests per second") do |rate|
options[:rate] = rate.to_f
end
opts.on("-v", "--verbose", "Verbose logging") do
options[:verbose] = true
end
end.parse!
if ARGV.empty?
puts "Error: No URLs provided"
exit 1
end
scraper = ProductScraper.new(rate_limit: options[:rate] || 1)
results = []
ARGV.each do |url|
puts "Scraping #{url}..." if options[:verbose]
results.concat(scraper.scrape(url))
end
# Save results
if options[:output]
case options[:format]&.downcase
when 'csv'
  CSV.open(options[:output], 'w') do |csv|
    csv << results.first.keys unless results.empty? # header row from the hash keys
    results.each { |row| csv << row.values }
  end
else
File.write(options[:output], JSON.pretty_generate(results))
end
else
puts JSON.pretty_generate(results)
end
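An example invocation, with placeholder URL and output path:
# Example invocation (placeholder URL and path):
#   bin/scraper --format csv --output data/products.csv --rate 0.5 https://example.com/products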
Documentation and Maintenance
14. Comprehensive Documentation
Document your scraping project thoroughly:
# README.md structure
# Project Name
## Description
## Installation
## Configuration
## Usage Examples
## API Documentation
## Contributing Guidelines
## License
Create inline documentation for complex methods:
class ProductScraper < BaseScraper
# Extracts product information from an e-commerce page
#
# @param url [String] The URL of the product listing page
# @return [Array<Hash>] Array of product hashes with keys:
# - name [String] Product name
# - price [Float] Product price in USD
# - image_url [String] URL of product image
# - availability [Boolean, nil] Whether product is in stock (nil when unknown)
#
# @example
# scraper = ProductScraper.new
# products = scraper.scrape('https://example.com/products')
# products.first[:name] # => "Sample Product"
def scrape(url)
# Implementation
end
end
Deployment and Production Considerations
15. Environment-Specific Configuration
Organize configuration for different deployment environments:
# config/environments/production.rb
class ProductionConfig < Config
def self.settings
{
rate_limit: 0.5, # Be more conservative in production
timeout: 60,
retries: 5,
user_agents: load_user_agents_from_file,
proxy_rotation: true,
monitoring: {
enabled: true,
endpoint: ENV['MONITORING_ENDPOINT']
}
}
end
end
Conclusion
Organizing Ruby web scraping code effectively requires careful attention to modularity, error handling, configuration management, and testing. By following these best practices, you'll create maintainable and robust scraping applications that can handle the complexities of modern web scraping challenges.
Key takeaways include using inheritance patterns for scrapers, implementing proper error handling and rate limiting, abstracting storage concerns, and maintaining comprehensive test coverage. These patterns will help you build scalable web scraping solutions that are both efficient and respectful to target websites.
Remember to always respect robots.txt files, implement appropriate delays between requests, and consider the legal and ethical implications of your scraping activities. For more complex scenarios involving JavaScript-heavy sites, consider integrating headless browser automation tools or using specialized browser session management techniques to ensure reliable data extraction.