What are the best practices for handling errors and exceptions in Mechanize scripts?
Error handling is a crucial aspect of building reliable web scraping applications with Mechanize. Robust error handling ensures your scripts can gracefully handle network issues, server errors, parsing problems, and unexpected responses. This guide covers comprehensive strategies for implementing effective error handling in your Mechanize scripts.
Understanding Common Mechanize Exceptions
HTTP-Related Errors
Mechanize can encounter various HTTP-related errors during web scraping operations:
require 'mechanize'

agent = Mechanize.new

begin
  page = agent.get('https://example.com/page')
rescue Mechanize::ResponseCodeError => e
  puts "HTTP Error: #{e.response_code} - #{e.message}"
  case e.response_code
  when '404'
    puts "Page not found"
  when '403'
    puts "Access forbidden - check authentication"
  when '500'
    puts "Server error - try again later"
  end
rescue Net::HTTP::Persistent::Error => e
  puts "Network connection error: #{e.message}"
end
Timeout Errors
Network timeouts are common when scraping websites with slow response times:
agent = Mechanize.new
agent.open_timeout = 10 # Connection timeout in seconds
agent.read_timeout = 30 # Read timeout in seconds

begin
  page = agent.get('https://slow-website.com')
rescue Net::OpenTimeout, Net::ReadTimeout => e
  puts "Request timed out: #{e.message}"
  # Implement retry logic or fallback behavior
rescue Timeout::Error => e
  puts "Operation timed out: #{e.message}"
end

Note that Ruby raises Net::OpenTimeout and Net::ReadTimeout (there is no Net::TimeoutError class), and Net::ReadTimeout is not a subclass of Timeout::Error, so both rescue clauses are needed.
SSL Certificate Errors
SSL certificate issues can occur when scraping HTTPS websites:
agent = Mechanize.new
agent.verify_mode = OpenSSL::SSL::VERIFY_NONE # Disables verification - only for hosts you trust

begin
  page = agent.get('https://self-signed-cert.com')
rescue OpenSSL::SSL::SSLError => e
  puts "SSL Error: #{e.message}"
  # Handle certificate validation issues (e.g. supply a custom CA bundle)
end
Implementing Comprehensive Error Handling
Basic Error Handling Structure
Create a robust error handling framework for your Mechanize scripts:
class MechanizeErrorHandler
  MAX_RETRIES = 3
  RETRYABLE_CODES = %w[429 502 503 504].freeze

  # Note: Ruby's `retry` keyword must appear inside the rescue clause it
  # restarts, so the retry logic lives here rather than in a helper method.
  def self.with_error_handling
    retries = 0
    begin
      yield
    rescue Mechanize::ResponseCodeError => e
      if RETRYABLE_CODES.include?(e.response_code) && retries < MAX_RETRIES
        retries += 1
        back_off(retries, "HTTP #{e.response_code}")
        retry
      elsif e.response_code == '404'
        puts "Resource not found: #{e.page.uri}"
        nil
      else
        puts "HTTP Error #{e.response_code}: #{e.message}"
        nil
      end
    rescue Net::OpenTimeout, Net::ReadTimeout, Timeout::Error => e
      if retries < MAX_RETRIES
        retries += 1
        back_off(retries, 'Timeout')
        retry
      else
        puts "Max retries exceeded for timeout error: #{e.message}"
        nil
      end
    rescue SocketError, Errno::ECONNREFUSED => e
      if retries < MAX_RETRIES
        retries += 1
        back_off(retries, 'Network')
        retry
      else
        puts "Max retries exceeded for network error: #{e.message}"
        nil
      end
    rescue StandardError => e
      puts "Unexpected error: #{e.class} - #{e.message}"
      puts e.backtrace.first(5).join("\n")
      nil
    end
  end

  def self.back_off(attempt, error_type)
    wait_time = 2**attempt
    puts "#{error_type} error. Retrying in #{wait_time} seconds... (#{attempt}/#{MAX_RETRIES})"
    sleep(wait_time)
  end
  private_class_method :back_off
end
Usage Example
def scrape_product_data(url)
  agent = Mechanize.new
  agent.user_agent_alias = 'Windows Chrome'

  MechanizeErrorHandler.with_error_handling do
    page = agent.get(url)

    # Extract product information
    title = page.search('.product-title').text.strip
    price = page.search('.price').text.strip

    {
      title: title,
      price: price,
      url: url,
      scraped_at: Time.now
    }
  end
end
Advanced Error Handling Strategies
Exponential Backoff for Rate Limiting
Implement exponential backoff when encountering rate limiting:
class RateLimitHandler
  def self.with_rate_limit_handling(max_retries: 5)
    retries = 0
    begin
      yield
    rescue Mechanize::ResponseCodeError => e
      # Re-raise anything that is not a retryable 429, preserving the backtrace
      raise unless e.response_code == '429' && retries < max_retries

      retries += 1
      wait_time = (2**retries) + rand(1..5) # Exponential backoff plus jitter
      puts "Rate limited. Waiting #{wait_time} seconds before retry #{retries}/#{max_retries}"
      sleep(wait_time)
      retry
    end
  end
end
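The wait computation above is worth isolating so it can be unit-tested in isolation from any network calls. A minimal sketch; the helper name backoff_delay is an assumption for illustration, not Mechanize API:

```ruby
# Exponentially growing delay with random jitter, in seconds.
# attempt is the 1-based retry count; jitter spreads out retries
# so many clients do not hammer the server at the same instant.
def backoff_delay(attempt, base: 1, max_jitter: 5)
  (base * 2**attempt) + rand(1..max_jitter)
end

backoff_delay(1) # between 3 and 7 seconds
backoff_delay(3) # between 9 and 13 seconds
```

Because the delay is a pure function of its inputs plus bounded jitter, a test can assert its range without sleeping.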
Circuit Breaker Pattern
Implement a circuit breaker to avoid overwhelming failing services:
class CircuitBreaker
  def initialize(failure_threshold: 5, recovery_timeout: 60)
    @failure_threshold = failure_threshold
    @recovery_timeout = recovery_timeout
    @failure_count = 0
    @last_failure_time = nil
    @state = :closed # :closed, :open, :half_open
  end

  def call
    if @state == :open
      if Time.now - @last_failure_time > @recovery_timeout
        @state = :half_open
      else
        raise "Circuit breaker is OPEN"
      end
    end

    begin
      result = yield
      reset if @state == :half_open
      result
    rescue StandardError => e
      record_failure
      raise e
    end
  end

  private

  def record_failure
    @failure_count += 1
    @last_failure_time = Time.now
    if @failure_count >= @failure_threshold
      @state = :open
      puts "Circuit breaker OPENED after #{@failure_count} failures"
    end
  end

  def reset
    @failure_count = 0
    @state = :closed
    puts "Circuit breaker CLOSED - service recovered"
  end
end
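To make the state transitions concrete, the sketch below drives a condensed, self-contained copy of the breaker (recovery timing omitted for brevity) until it trips open:

```ruby
# Condensed circuit breaker: trips to :open after `threshold`
# consecutive failures and rejects all calls from then on.
class MiniBreaker
  attr_reader :state

  def initialize(threshold: 3)
    @threshold = threshold
    @failures = 0
    @state = :closed
  end

  def call
    # Reject immediately while open - the protected service is not called
    raise 'Circuit breaker is OPEN' if @state == :open

    begin
      result = yield
      @failures = 0 # A success resets the consecutive-failure count
      result
    rescue StandardError
      @failures += 1
      @state = :open if @failures >= @threshold
      raise
    end
  end
end

breaker = MiniBreaker.new(threshold: 3)
3.times do
  begin
    breaker.call { raise 'service down' }
  rescue RuntimeError
    nil # Swallow the failure; the breaker records it
  end
end
breaker.state # => :open
```

After the third consecutive failure the breaker stops invoking the block at all, which is the point of the pattern: the failing service gets breathing room instead of a retry storm.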
Form Handling Error Management
Safe Form Submission
Handle errors specific to form interactions:
def submit_form_safely(page, form_data)
  form = page.forms.first
  raise "No form found on page" if form.nil?

  # Populate form fields safely (field_with returns nil when absent)
  form_data.each do |field_name, value|
    field = form.field_with(name: field_name.to_s)
    if field
      field.value = value
    else
      puts "Warning: Field '#{field_name}' not found in form"
    end
  end

  # Submit the form
  result_page = form.submit

  # Validate submission success
  if result_page.search('.error-message').any?
    error_messages = result_page.search('.error-message').map(&:text)
    raise "Form submission failed: #{error_messages.join(', ')}"
  end

  result_page
rescue Mechanize::ElementNotFoundError => e
  puts "Form element not found: #{e.message}"
  nil
rescue StandardError => e
  puts "Form submission error: #{e.message}"
  nil
end
Logging and Monitoring
Comprehensive Logging Setup
Implement detailed logging for debugging and monitoring:
require 'logger'

class MechanizeScraper
  def initialize
    @agent = Mechanize.new
    @logger = Logger.new('scraper.log')
    @logger.level = Logger::INFO
    setup_mechanize_logging
  end

  def scrape_with_logging(url)
    start_time = Time.now
    begin
      @logger.info("Starting scrape of: #{url}")
      page = @agent.get(url)
      duration = Time.now - start_time
      @logger.info("Completed scrape of #{url} in #{duration.round(2)}s")
      log_request(url, success: true)
      page
    rescue StandardError => e
      duration = Time.now - start_time
      @logger.error("Failed scrape of #{url} after #{duration.round(2)}s: #{e.message}")
      log_request(url, success: false, error: e.message)
      nil
    end
  end

  private

  def setup_mechanize_logging
    @agent.log = @logger
    @agent.agent.http.debug_output = $stdout if ENV['DEBUG']
  end

  def log_request(url, success: true, error: nil)
    if success
      @logger.info("Successfully scraped: #{url}")
    else
      @logger.error("Failed to scrape #{url}: #{error}")
    end
  end
end
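Beyond writing to a file, a custom formatter makes log lines easier to grep and to ship to monitoring tools. A standard-library-only sketch; the line format itself is a suggestion, not a Mechanize convention:

```ruby
require 'logger'
require 'stringio'
require 'time'

buffer = StringIO.new # Stand-in for a log file, so the example is self-contained
logger = Logger.new(buffer)

# One grep-friendly line per event: ISO-8601 UTC timestamp, severity, message
logger.formatter = proc do |severity, time, _progname, msg|
  "#{time.utc.iso8601} [#{severity}] #{msg}\n"
end

logger.error('Failed to scrape https://example.com: HTTP 503')
puts buffer.string
```

The same formatter proc can be assigned to the Logger inside MechanizeScraper; only the output device changes.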
Error Recovery Strategies
Data Validation and Cleanup
Validate scraped data and handle parsing errors:
def validate_and_clean_data(page)
  data = {}

  # Safe text extraction with fallbacks
  data[:title] = extract_text_safely(page, '.title', 'Unknown Title')
  data[:price] = extract_price_safely(page, '.price')
  data[:description] = extract_text_safely(page, '.description', '')

  # Validate required fields (validate_required_fields and
  # DataValidationError are defined by your application)
  validate_required_fields(data)
  data
rescue DataValidationError => e
  puts "Data validation failed: #{e.message}"
  nil
end

def extract_text_safely(page, selector, default = nil)
  element = page.search(selector).first
  return default if element.nil?

  text = element.text.strip
  text.empty? ? default : text
rescue StandardError => e
  puts "Error extracting text from #{selector}: #{e.message}"
  default
end

def extract_price_safely(page, selector)
  price_text = extract_text_safely(page, selector, '0')
  # Strip currency symbols, then thousands separators - Float("1,234.56") raises
  cleaned_price = price_text.gsub(/[^\d.,]/, '').delete(',')
  Float(cleaned_price)
rescue ArgumentError
  puts "Invalid price format: #{price_text}"
  0.0
end
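DataValidationError and validate_required_fields in the example above are application-defined, not part of Mechanize. One minimal way they might look:

```ruby
# Application-level error raised when scraped data is incomplete.
class DataValidationError < StandardError; end

# Raise unless every required key is present and non-empty in the data hash.
def validate_required_fields(data, required: [:title, :price])
  missing = required.select { |key| data[key].nil? || data[key].to_s.strip.empty? }
  return if missing.empty?

  raise DataValidationError, "Missing required fields: #{missing.join(', ')}"
end

validate_required_fields({ title: 'Widget', price: '9.99' }) # passes silently
```

Listing all missing fields in one exception message (rather than failing on the first) makes scraper logs much faster to diagnose.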
Best Practices Summary
Configuration Best Practices
def configure_robust_agent
  agent = Mechanize.new

  # Set reasonable timeouts
  agent.open_timeout = 10
  agent.read_timeout = 30

  # Pick a user agent at random per agent instance
  agent.user_agent_alias = ['Windows Chrome', 'Mac Chrome', 'Linux Firefox'].sample

  # Handle redirects
  agent.redirect_ok = true
  agent.redirection_limit = 5

  # SSL configuration
  agent.verify_mode = OpenSSL::SSL::VERIFY_PEER

  # Start with a clean cookie jar
  agent.cookie_jar.clear

  agent
end
Error Handling Checklist
- Wrap all network operations in appropriate exception handlers
- Implement retry logic with exponential backoff for transient errors
- Log all errors with sufficient detail for debugging
- Validate data before processing to catch parsing issues early
- Use circuit breakers for external service dependencies
- Set appropriate timeouts to avoid hanging requests
- Handle rate limiting gracefully with proper delays
- Monitor and alert on error rates and patterns
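Several of these items (retry logic, exponential backoff, transient-error handling) can be combined into one small generic wrapper. A sketch, assuming you pass in the exception classes your application considers transient; the name with_retries is illustrative:

```ruby
# Retry the given block when one of the listed exception classes is raised,
# sleeping base * 2**attempt seconds between attempts. Any other exception,
# or exhausting max_retries, propagates to the caller.
def with_retries(exceptions, max_retries: 3, base: 1.0)
  attempt = 0
  begin
    yield
  rescue *exceptions
    attempt += 1
    raise if attempt > max_retries # Bare raise re-raises the original error
    sleep(base * 2**attempt)
    retry
  end
end

# Usage: a block that fails twice, then succeeds on the third call
calls = 0
result = with_retries([RuntimeError], base: 0.001) do
  calls += 1
  raise 'transient' if calls < 3
  :ok
end
# result is :ok after 3 calls
```

Wrapping agent.get calls this way with [Net::OpenTimeout, Net::ReadTimeout, SocketError] covers most transient failures without scattering begin/rescue blocks through the scraper.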
Alternative Approaches
While Mechanize excels at form-based scraping, JavaScript-heavy sites may call for handling errors in Puppeteer, which offers more advanced error handling capabilities. For scenarios involving complex user interactions, handling authentication in Puppeteer provides additional error handling context for authentication workflows.
Conclusion
Effective error handling in Mechanize scripts requires a multi-layered approach that addresses network issues, HTTP errors, parsing problems, and data validation. By implementing comprehensive error handling strategies including retry logic, circuit breakers, proper logging, and data validation, you can build robust web scraping applications that gracefully handle the unpredictable nature of web environments.
Remember to always test your error handling code with various failure scenarios and monitor your production scrapers to identify and address new error patterns as they emerge.