How do you optimize Mechanize scripts for better performance and reliability?
Optimizing Mechanize scripts is essential for building robust, efficient web scrapers. This guide covers practical strategies to improve performance and reliability so your Mechanize-based scrapers can handle production workloads.
Performance Optimization Strategies
1. Connection and Timeout Management
Proper connection management is fundamental for performance optimization:
require 'mechanize'

# Create an optimized agent
agent = Mechanize.new do |a|
  # Set reasonable timeouts
  a.open_timeout = 10 # Connection timeout
  a.read_timeout = 30 # Read timeout
  a.idle_timeout = 5  # Keep-alive idle timeout

  # Reuse persistent (keep-alive) connections
  a.keep_alive = true
  a.max_history = 0 # Disable page history to save memory

  # Verify SSL certificates
  a.verify_mode = OpenSSL::SSL::VERIFY_PEER
  a.ca_file = '/etc/ssl/certs/ca-certificates.crt'
end

# Configure user agent rotation
user_agents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
]
agent.user_agent = user_agents.sample
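Sampling once gives every request in the session the same fingerprint. To pick a fresh user agent per request, a small wrapper can reset it before each call (fetch_with_random_ua is a hypothetical helper, not a Mechanize method):

def fetch_with_random_ua(agent, url, user_agents)
  # Choose a new user agent string for this request only
  agent.user_agent = user_agents.sample
  agent.get(url)
end

page = fetch_with_random_ua(agent, 'https://example.com', user_agents)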
2. Memory Management
Prevent memory leaks in long-running scripts:
require 'mechanize'
require 'set'

class OptimizedScraper
  def initialize
    @agent = Mechanize.new
    @agent.max_history = 0 # Crucial for memory optimization
    @processed_urls = Set.new
  end

  def scrape_pages(urls)
    urls.each_with_index do |url, index|
      begin
        process_page(url)

        # Periodic garbage collection for large datasets
        if index % 100 == 0
          GC.start
          puts "Processed #{index} pages, memory: #{memory_usage}MB"
        end
      rescue => e
        handle_error(e, url)
      end
    end
  end

  private

  def process_page(url)
    return if @processed_urls.include?(url)

    page = @agent.get(url)
    extract_data(page)
    @processed_urls.add(url)
    # With max_history = 0 the agent keeps no reference to the page,
    # so it becomes eligible for garbage collection when this method returns
  end

  def memory_usage
    # Resident set size of the current process, in MB (Linux/macOS)
    `ps -o rss= -p #{Process.pid}`.to_i / 1024
  end
end
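extract_data and handle_error are left undefined in the class above; purely illustrative placeholders (real versions depend on the pages you scrape) might look like this:

# Inside OptimizedScraper (illustrative placeholders)
def extract_data(page)
  # Placeholder: capture the URL and page title; replace with real parsing logic
  { url: page.uri.to_s, title: page.title }
end

def handle_error(error, url)
  warn "Failed on #{url}: #{error.class}: #{error.message}"
end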
3. Concurrent Processing
Implement thread-safe concurrent processing for better throughput:
require 'concurrent' # provided by the concurrent-ruby gem
require 'mechanize'

class ConcurrentScraper
  def initialize(max_threads: 5)
    @max_threads = max_threads
    @thread_pool = Concurrent::ThreadPoolExecutor.new(
      min_threads: 1,
      max_threads: @max_threads,
      max_queue: 100,
      fallback_policy: :caller_runs # run in the submitting thread instead of rejecting when the queue is full
    )
  end

  def scrape_urls(urls)
    futures = urls.map do |url|
      Concurrent::Future.execute(executor: @thread_pool) do
        scrape_single_url(url)
      end
    end

    # Wait for all tasks to complete
    results = futures.map(&:value)

    @thread_pool.shutdown
    @thread_pool.wait_for_termination
    results.compact
  end

  private

  def scrape_single_url(url)
    # Each thread gets its own agent instance
    agent = create_agent
    begin
      page = agent.get(url)
      extract_data(page)
    rescue => e
      puts "Error scraping #{url}: #{e.message}"
      nil
    ensure
      agent&.shutdown
    end
  end

  def create_agent
    Mechanize.new do |a|
      a.open_timeout = 10
      a.read_timeout = 30
      a.max_history = 0
      a.user_agent = random_user_agent
    end
  end
end
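random_user_agent is referenced above (and in later examples) but never defined. A minimal helper, assuming you maintain your own list of user agent strings, could be mixed into any scraper class that needs it:

module UserAgentHelper
  USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
  ].freeze

  # Return a random user agent string for each new agent
  def random_user_agent
    USER_AGENTS.sample
  end
end

# Mix it into the classes that call random_user_agent, for example:
class ConcurrentScraper
  include UserAgentHelper
end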
Reliability Enhancement Techniques
1. Robust Error Handling and Retry Logic
Implement comprehensive error handling with exponential backoff:
require 'mechanize'

class ReliableScraper
  MAX_RETRIES = 3
  BASE_DELAY = 1

  attr_reader :agent

  def initialize
    @agent = Mechanize.new
  end

  def fetch_with_retry(url, retries: 0)
    agent.get(url)
  rescue Net::OpenTimeout, Net::ReadTimeout, Errno::ETIMEDOUT, Errno::ECONNRESET => e
    if retries < MAX_RETRIES
      delay = BASE_DELAY * (2 ** retries) + rand(0.1..0.5)
      puts "Retry #{retries + 1} for #{url} after #{delay.round(2)}s (#{e.class})"
      sleep(delay)
      fetch_with_retry(url, retries: retries + 1)
    else
      puts "Failed to fetch #{url} after #{MAX_RETRIES} retries"
      handle_permanent_failure(url, e)
      nil
    end
  rescue Mechanize::ResponseCodeError => e
    case e.response_code
    when '404', '410'
      puts "Page not found: #{url}"
      nil
    when '429', '503'
      # Rate limited or service unavailable
      backoff_time = extract_retry_after(e.page) || 60
      puts "Rate limited, backing off for #{backoff_time}s"
      sleep(backoff_time)
      fetch_with_retry(url, retries: retries + 1) if retries < MAX_RETRIES
    else
      raise e
    end
  end

  private

  def extract_retry_after(page)
    retry_after = page.response['Retry-After']
    retry_after&.to_i
  end

  def handle_permanent_failure(url, error)
    # Log to file, database, or monitoring system
    File.open('failed_urls.log', 'a') do |f|
      f.puts "#{Time.now}: #{url} - #{error.class}: #{error.message}"
    end
  end
end
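The extract_retry_after method above assumes the header is a plain number of seconds, but per the HTTP spec Retry-After may also be an HTTP date. An optional drop-in refinement (not part of the original snippet) that handles both forms:

# Inside ReliableScraper
require 'time' # provides Time.httpdate

def extract_retry_after(page)
  value = page && page.response['Retry-After']
  return nil unless value

  if value.match?(/\A\d+\z/)
    value.to_i                                       # delta-seconds form
  else
    [(Time.httpdate(value) - Time.now).ceil, 0].max  # HTTP-date form
  end
rescue ArgumentError
  nil # unparseable header; caller falls back to the default backoff
end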
2. Rate Limiting and Respectful Scraping
Implement adaptive rate limiting to avoid being blocked:
require 'mechanize'

class RateLimitedScraper
  def initialize
    @agent = Mechanize.new
    @window_start = Time.now
    @request_count = 0
    @rate_limit_window = 60 # seconds
    @max_requests_per_window = 60
  end

  def get_page(url)
    enforce_rate_limit

    start_time = Time.now
    page = @agent.get(url)
    response_time = Time.now - start_time

    # Adaptive delay based on response time
    adaptive_delay = calculate_adaptive_delay(response_time)
    sleep(adaptive_delay) if adaptive_delay > 0

    @request_count += 1
    page
  end

  private

  def enforce_rate_limit
    now = Time.now

    if now - @window_start >= @rate_limit_window
      # Window has elapsed; start counting again
      @window_start = now
      @request_count = 0
    elsif @request_count >= @max_requests_per_window
      sleep_time = @rate_limit_window - (now - @window_start)
      if sleep_time > 0
        puts "Rate limit reached, sleeping for #{sleep_time.round(2)}s"
        sleep(sleep_time)
      end
      @window_start = Time.now
      @request_count = 0
    end
  end

  def calculate_adaptive_delay(response_time)
    case response_time
    when 0..1
      0.5 # Fast response, minimal delay
    when 1..3
      1.0 # Normal response, standard delay
    when 3..10
      2.0 # Slow response, longer delay
    else
      5.0 # Very slow response, significant delay
    end
  end
end
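Usage is then a drop-in replacement for a plain agent.get call (the URL below is just a placeholder):

scraper = RateLimitedScraper.new
page = scraper.get_page('https://example.com/products?page=1')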
3. Session Management and Cookie Persistence
Maintain session state across requests for better reliability:
require 'mechanize'

class SessionAwareScraper
  def initialize(cookie_jar_path: 'cookies.yml')
    @cookie_jar_path = cookie_jar_path
    @agent = Mechanize.new
    load_cookies

    # Set up automatic cookie saving
    at_exit { save_cookies }
  end

  def login(username, password, login_url)
    login_page = @agent.get(login_url)
    login_form = login_page.form_with(action: /login|signin/)
    return false unless login_form

    login_form.field_with(name: /username|email/).value = username
    login_form.field_with(name: /password/).value = password

    result_page = @agent.submit(login_form)

    # Verify login success
    login_successful = !result_page.uri.to_s.include?('login') &&
                       !result_page.search('.error, .alert-danger').any?

    save_cookies if login_successful
    login_successful
  end

  def scrape_with_session(urls)
    results = []

    urls.each do |url|
      begin
        page = @agent.get(url)

        # Check if session expired
        if session_expired?(page)
          puts "Session expired, attempting re-login..."
          if re_authenticate
            page = @agent.get(url) # Retry after re-authentication
          else
            puts "Re-authentication failed"
            break
          end
        end

        results << extract_data(page)
      rescue => e
        puts "Error processing #{url}: #{e.message}"
      end
    end

    results
  end

  private

  def load_cookies
    if File.exist?(@cookie_jar_path)
      @agent.cookie_jar.load(@cookie_jar_path)
      puts "Loaded cookies from #{@cookie_jar_path}"
    end
  end

  def save_cookies
    @agent.cookie_jar.save(@cookie_jar_path)
    puts "Saved cookies to #{@cookie_jar_path}"
  end

  def session_expired?(page)
    page.uri.to_s.include?('login') ||
      page.search('.login-required, .session-expired').any?
  end
end
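re_authenticate is called in scrape_with_session but not defined above. A minimal sketch, assuming login stores its arguments for later reuse, could be added to the class:

# Inside SessionAwareScraper (sketch only; assumes login saves its arguments, e.g.
#   @credentials = { username: username, password: password, login_url: login_url })
def re_authenticate
  return false unless @credentials

  login(@credentials[:username], @credentials[:password], @credentials[:login_url])
end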
Advanced Optimization Techniques
1. Proxy Rotation and IP Management
For large-scale scraping, implement proxy rotation to distribute requests:
require 'mechanize'
require 'set'

class ProxyRotatingScraper
  def initialize(proxy_list)
    @proxy_list = proxy_list
    @proxies = proxy_list.cycle
    @current_proxy = nil
    @failed_proxies = Set.new
    @agent = nil
    setup_agent
  end

  def get_page_with_proxy_rotation(url, max_proxy_attempts: 3)
    attempts = 0
    begin
      page = @agent.get(url)
      reset_proxy_failure_count if page
      page
    rescue => e
      attempts += 1
      puts "Proxy #{@current_proxy[:host]} failed: #{e.message}"
      @failed_proxies.add(@current_proxy)

      if attempts < max_proxy_attempts
        rotate_proxy
        retry
      else
        raise "All proxy attempts failed for #{url}"
      end
    end
  end

  private

  def setup_agent
    rotate_proxy
  end

  def rotate_proxy
    loop do
      @current_proxy = @proxies.next
      break unless @failed_proxies.include?(@current_proxy)

      # If all proxies have failed, clear the failed set and try them again
      if @failed_proxies.size >= @proxy_list.size
        @failed_proxies.clear
        puts "Cleared failed proxies list, retrying all proxies"
        break
      end
    end

    create_agent_with_proxy(@current_proxy)
  end

  def create_agent_with_proxy(proxy)
    @agent = Mechanize.new do |a|
      if proxy[:type] == 'http'
        a.set_proxy(proxy[:host], proxy[:port], proxy[:user], proxy[:password])
      end
      a.open_timeout = 15
      a.read_timeout = 30
      a.user_agent = random_user_agent
    end
    puts "Using proxy: #{proxy[:host]}:#{proxy[:port]}"
  end

  def reset_proxy_failure_count
    @failed_proxies.delete(@current_proxy)
  end
end
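The constructor expects proxy_list to be an array of hashes with the keys used in create_agent_with_proxy; for example (placeholder hosts and credentials):

proxies = [
  { type: 'http', host: 'proxy1.example.com', port: 8080, user: 'scraper', password: 'secret' },
  { type: 'http', host: 'proxy2.example.com', port: 3128, user: nil, password: nil }
]

scraper = ProxyRotatingScraper.new(proxies)
page = scraper.get_page_with_proxy_rotation('https://example.com')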
2. Intelligent Content Parsing
Optimize parsing for better performance and reliability:
require 'uri'

class OptimizedParser
  def extract_data_efficiently(page)
    # Prefer concise CSS selectors where possible (easier to read and maintain than XPath)
    title = page.at_css('h1, .title, [data-title]')&.text&.strip

    # Cache frequently used selectors
    @content_selector ||= 'article, .content, .post-body, main'
    content = page.at_css(@content_selector)&.text&.strip

    # Batch process multiple elements
    links = page.css('a[href]').map do |link|
      {
        text: link.text.strip,
        href: link['href'],
        title: link['title']
      }
    end.reject { |link| link[:text].empty? }

    # Use lazy evaluation for expensive operations
    images = lazy_extract_images(page) if needs_images?

    {
      title: title,
      content: content,
      links: links,
      images: images,
      scraped_at: Time.now
    }
  end

  private

  def lazy_extract_images(page)
    page.css('img[src]').lazy.map do |img|
      src = img['src']
      next if src.nil? || src.start_with?('data:')

      {
        src: absolute_url(src, page.uri),
        alt: img['alt'],
        title: img['title']
      }
    end.reject(&:nil?).force
  end

  def absolute_url(relative_url, base_uri)
    URI.join(base_uri, relative_url).to_s
  rescue URI::InvalidURIError
    relative_url
  end
end
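needs_images? is referenced in extract_data_efficiently but not defined; a minimal placeholder, assuming image extraction is toggled via a constructor flag, could be added to OptimizedParser:

# Inside OptimizedParser (illustrative placeholder)
def initialize(extract_images: false)
  @extract_images = extract_images
end

def needs_images?
  @extract_images
end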
3. Monitoring and Logging
Implement comprehensive monitoring for production environments:
require 'mechanize'
require 'logger'

class MonitoredScraper
  def initialize
    @agent = Mechanize.new
    @stats = {
      requests: 0,
      successes: 0,
      failures: 0,
      start_time: Time.now
    }
    setup_logging
  end

  def scrape_with_monitoring(urls)
    urls.each do |url|
      begin
        start_time = Time.now
        page = @agent.get(url)
        response_time = Time.now - start_time

        log_success(url, response_time)
        @stats[:successes] += 1

        yield page if block_given?
      rescue => e
        log_error(url, e)
        @stats[:failures] += 1
      ensure
        @stats[:requests] += 1

        # Report stats periodically
        report_stats if @stats[:requests] % 100 == 0
      end
    end

    final_report
  end

  private

  def setup_logging
    @logger = Logger.new('scraper.log', 'daily')
    @logger.level = Logger::INFO
    @logger.formatter = proc do |severity, datetime, progname, msg|
      "#{datetime.strftime('%Y-%m-%d %H:%M:%S')} [#{severity}] #{msg}\n"
    end
  end

  def log_success(url, response_time)
    @logger.info("SUCCESS: #{url} (#{response_time.round(3)}s)")
  end

  def log_error(url, error)
    @logger.error("ERROR: #{url} - #{error.class}: #{error.message}")
  end

  def report_stats
    uptime = Time.now - @stats[:start_time]
    success_rate = (@stats[:successes].to_f / @stats[:requests] * 100).round(2)
    rate = (@stats[:requests] / uptime * 60).round(2)

    puts "\n--- Stats Report ---"
    puts "Requests: #{@stats[:requests]}"
    puts "Success Rate: #{success_rate}%"
    puts "Rate: #{rate} requests/minute"
    puts "Uptime: #{uptime.round(0)}s"
    puts "-------------------\n"
  end

  def final_report
    report_stats
    @logger.info("Scraping session completed: #{@stats}")
  end
end
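The block passed to scrape_with_monitoring receives each successfully fetched page, which keeps parsing separate from the monitoring code (URLs below are placeholders):

scraper = MonitoredScraper.new
scraper.scrape_with_monitoring(['https://example.com/a', 'https://example.com/b']) do |page|
  puts page.title
end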
Database Integration and Data Storage
Efficiently store scraped data to avoid bottlenecks:
require 'mechanize'
require 'sqlite3'
require 'json'

class DatabaseOptimizedScraper
  def initialize(db_path: 'scraped_data.db')
    @agent = Mechanize.new
    @agent.max_history = 0
    @db = SQLite3::Database.new(db_path)
    @batch_size = 100
    @batch_data = []
    setup_database
  end

  def scrape_and_store(urls)
    urls.each do |url|
      begin
        page = @agent.get(url)
        data = extract_data(page)

        # Batch insert for better performance
        @batch_data << data

        if @batch_data.size >= @batch_size
          insert_batch
          @batch_data.clear
        end
      rescue => e
        puts "Error processing #{url}: #{e.message}"
      end
    end

    # Insert remaining data
    insert_batch unless @batch_data.empty?
  end

  private

  def setup_database
    @db.execute <<-SQL
      CREATE TABLE IF NOT EXISTS scraped_pages (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT UNIQUE,
        title TEXT,
        content TEXT,
        metadata TEXT,
        scraped_at DATETIME DEFAULT CURRENT_TIMESTAMP
      )
    SQL

    # Create index for faster lookups
    @db.execute "CREATE INDEX IF NOT EXISTS idx_url ON scraped_pages(url)"
  end

  def insert_batch
    @db.transaction do
      stmt = @db.prepare(
        "INSERT OR REPLACE INTO scraped_pages (url, title, content, metadata)
         VALUES (?, ?, ?, ?)"
      )

      @batch_data.each do |data|
        stmt.execute(
          data[:url],
          data[:title],
          data[:content],
          data[:metadata].to_json
        )
      end

      stmt.close
    end

    puts "Inserted batch of #{@batch_data.size} records"
  end
end
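A hypothetical run, assuming extract_data returns a hash with the :url, :title, :content, and :metadata keys the INSERT expects, and urls.txt holds one URL per line:

scraper = DatabaseOptimizedScraper.new(db_path: 'scraped_data.db')
scraper.scrape_and_store(File.readlines('urls.txt', chomp: true))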
Configuration Management
Centralize configuration for easier optimization:
# config/scraper_config.yml
development:
  timeouts:
    open_timeout: 10
    read_timeout: 30
    idle_timeout: 5
  rate_limiting:
    requests_per_minute: 60
    adaptive_delay: true
    respect_retry_after: true
  concurrency:
    max_threads: 5
    thread_pool_size: 10
  memory:
    max_history: 0
    gc_frequency: 100
  reliability:
    max_retries: 3
    base_delay: 1
    exponential_backoff: true

production:
  timeouts:
    open_timeout: 15
    read_timeout: 45
    idle_timeout: 10
  rate_limiting:
    requests_per_minute: 30
    adaptive_delay: true
    respect_retry_after: true
  concurrency:
    max_threads: 10
    thread_pool_size: 20
  memory:
    max_history: 0
    gc_frequency: 50
  reliability:
    max_retries: 5
    base_delay: 2
    exponential_backoff: true
require 'yaml'
require 'mechanize'

class ConfigurableScraper
  def initialize(env: 'development')
    @config = YAML.load_file('config/scraper_config.yml')[env]
    @agent = create_configured_agent
  end

  private

  def create_configured_agent
    Mechanize.new do |a|
      # Apply timeout settings
      a.open_timeout = @config['timeouts']['open_timeout']
      a.read_timeout = @config['timeouts']['read_timeout']
      a.idle_timeout = @config['timeouts']['idle_timeout']

      # Apply memory settings
      a.max_history = @config['memory']['max_history']
      a.keep_alive = true

      # Set up user agent rotation
      a.user_agent = random_user_agent
    end
  end
end
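Selecting the environment at runtime then becomes a one-liner (SCRAPER_ENV is an assumed environment variable name, not something Mechanize reads itself):

scraper = ConfigurableScraper.new(env: ENV.fetch('SCRAPER_ENV', 'development'))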
Best Practices Summary
Performance Optimization Checklist
- Connection Management: Use keep-alive connections and appropriate timeouts
- Memory Management: Disable page history and implement periodic garbage collection
- Concurrent Processing: Use thread pools for parallel processing
- Caching: Cache frequently accessed elements and selectors
- Efficient Parsing: Prefer CSS selectors over XPath when possible
- Database Optimization: Use batch inserts and proper indexing
Reliability Enhancement Checklist
- Error Handling: Implement retry logic with exponential backoff
- Rate Limiting: Respect server resources with adaptive delays
- Session Management: Persist cookies and handle session expiration
- Monitoring: Log all activities and track performance metrics
- Proxy Rotation: Distribute requests across multiple IP addresses
- Configuration Management: Use environment-specific settings
Command Line Monitoring
Monitor your scraper's performance in real-time:
# Monitor memory usage
watch -n 5 'ps aux | grep ruby | grep -v grep'
# Monitor network connections
netstat -an | grep :80 | wc -l
# Monitor log files
tail -f scraper.log | grep ERROR
# Check system load
uptime && free -h
When building large-scale scrapers, consider complementing Mechanize with browser automation tools for JavaScript-heavy sites or implementing robust error handling patterns that can be adapted across different scraping technologies.
By implementing these optimization strategies, your Mechanize scripts will be more performant, reliable, and capable of handling production workloads while maintaining respectful scraping practices. Remember to always monitor your scrapers in production and adjust configurations based on the specific requirements of your target websites.