How Do You Implement Custom Error Handling for Network Timeouts in Mechanize?
Network timeouts are one of the most common challenges in web scraping, especially when dealing with slow or unreliable websites. Mechanize provides several mechanisms for handling timeouts, but implementing custom error handling ensures your scrapers are robust and can gracefully recover from network issues.
Understanding Timeout Types in Mechanize
Mechanize handles several types of timeouts that you need to consider when implementing custom error handling:
Connection Timeout
The time limit for establishing a connection to the server:
require 'mechanize'
agent = Mechanize.new
agent.open_timeout = 10 # 10 seconds to establish connection
Read Timeout
The time limit for reading data from an established connection:
agent.read_timeout = 30 # 30 seconds to read response
Idle Timeout
How long to keep connections alive for reuse:
agent.idle_timeout = 5 # Close idle connections after 5 seconds
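All three settings can also be applied together when the agent is constructed, since Mechanize.new accepts a configuration block. A minimal sketch using the same values as above:

require 'mechanize'

agent = Mechanize.new do |a|
  a.open_timeout = 10 # seconds to establish a connection
  a.read_timeout = 30 # seconds to read the response
  a.idle_timeout = 5  # close idle keep-alive connections after 5 seconds
end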
Basic Timeout Error Handling
The most straightforward approach is to wrap your Mechanize requests in rescue blocks:
require 'mechanize'
require 'timeout'

def scrape_with_basic_timeout_handling(url)
  agent = Mechanize.new
  agent.open_timeout = 10
  agent.read_timeout = 30

  begin
    agent.get(url)
  rescue Net::OpenTimeout => e
    puts "Connection timeout: #{e.message}"
    nil
  rescue Net::ReadTimeout => e
    puts "Read timeout: #{e.message}"
    nil
  rescue Timeout::Error => e
    puts "General timeout: #{e.message}"
    nil
  end
end
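A quick usage sketch (the URL is a placeholder): the method returns the page on success and nil on a timeout.

page = scrape_with_basic_timeout_handling("https://example.com")
puts page.title if page # nil means the request timed out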
Implementing Retry Logic with Exponential Backoff
For production scraping, you'll want a more sophisticated retry mechanism with exponential backoff:
class TimeoutHandler
  MAX_RETRIES = 3
  BASE_DELAY = 1 # seconds

  def self.with_retries(max_retries = MAX_RETRIES)
    attempt = 0
    begin
      attempt += 1
      yield
    rescue Net::OpenTimeout, Net::ReadTimeout, Timeout::Error => e
      if attempt <= max_retries
        delay = BASE_DELAY * (2 ** (attempt - 1)) # Exponential backoff: 1s, 2s, 4s...
        puts "Timeout on attempt #{attempt}/#{max_retries + 1}. Retrying in #{delay}s..."
        sleep(delay)
        retry
      else
        puts "Failed after #{attempt} attempts: #{e.message}"
        raise
      end
    end
  end
end
# Usage example
def scrape_with_retry(url)
  agent = Mechanize.new
  agent.open_timeout = 10
  agent.read_timeout = 30

  TimeoutHandler.with_retries do
    agent.get(url)
  end
end
Advanced Custom Error Handler Class
For complex scraping operations, create a dedicated error handler:
require 'logger'

class MechanizeTimeoutHandler
  attr_accessor :max_retries, :base_delay, :max_delay, :backoff_multiplier

  def initialize(options = {})
    @max_retries = options[:max_retries] || 3
    @base_delay = options[:base_delay] || 1
    @max_delay = options[:max_delay] || 60
    @backoff_multiplier = options[:backoff_multiplier] || 2
    @logger = options[:logger] || Logger.new(STDOUT)
  end

  def execute_with_timeout_handling(description = "Operation")
    attempt = 0
    start_time = Time.now

    begin
      attempt += 1
      @logger.info("#{description} - Attempt #{attempt}/#{@max_retries + 1}")
      result = yield
      duration = Time.now - start_time
      @logger.info("#{description} completed successfully in #{duration.round(2)}s")
      result
    rescue Net::OpenTimeout, Net::ReadTimeout, Timeout::Error => e
      # `retry` is only valid inside a rescue clause, so the retry decision
      # has to live here rather than in a separate helper method
      error_type = case e
                   when Net::OpenTimeout then "Connection timeout"
                   when Net::ReadTimeout then "Read timeout"
                   else "General timeout"
                   end

      if attempt <= @max_retries
        delay = calculate_delay(attempt)
        @logger.warn("#{description} - #{error_type} on attempt #{attempt}. Retrying in #{delay}s...")
        sleep(delay)
        retry
      else
        total_time = (Time.now - start_time).round(2)
        @logger.error("#{description} failed after #{attempt} attempts (#{total_time}s): #{e.message}")
        raise
      end
    rescue => e
      @logger.error("#{description} failed with unexpected error: #{e.message}")
      raise
    end
  end

  private

  def calculate_delay(attempt)
    delay = @base_delay * (@backoff_multiplier ** (attempt - 1))
    [delay, @max_delay].min # Cap the backoff at max_delay
  end
end
Using the Advanced Handler
# Initialize the handler with custom settings
timeout_handler = MechanizeTimeoutHandler.new(
  max_retries: 5,
  base_delay: 2,
  max_delay: 30,
  backoff_multiplier: 1.5
)

# Use it for scraping operations. The handler is passed in as an argument
# because a top-level local variable is not visible inside a method body.
def scrape_multiple_pages(urls, timeout_handler)
  agent = Mechanize.new
  agent.open_timeout = 15
  agent.read_timeout = 45

  results = []
  urls.each_with_index do |url, index|
    begin
      page = timeout_handler.execute_with_timeout_handling("Scraping page #{index + 1}") do
        agent.get(url)
      end
      results << extract_data(page) # extract_data is your own parsing method
    rescue => e
      puts "Skipping #{url} due to persistent errors: #{e.message}"
      results << nil
    end

    # Add delay between requests to be respectful
    sleep(1)
  end

  results
end
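With the handler passed in explicitly, a call looks like this (the URLs are placeholders):

urls = ["https://example.com/page1", "https://example.com/page2"]
data = scrape_multiple_pages(urls, timeout_handler)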
Handling Specific Timeout Scenarios
Slow Loading Pages
For pages that consistently load slowly, adjust timeouts dynamically:
def scrape_slow_site(url)
  agent = Mechanize.new

  # Start with generous timeouts for known slow sites
  agent.open_timeout = 30
  agent.read_timeout = 120

  begin
    agent.get(url)
  rescue Net::ReadTimeout => e
    # A read timeout means the connection was established but the server
    # is very slow - try once more with an even longer timeout
    puts "Server is very slow, extending timeout..."
    agent.read_timeout = 300 # 5 minutes

    begin
      agent.get(url)
    rescue Net::ReadTimeout
      puts "Server too slow even with extended timeout"
      raise e
    end
  end
end
JavaScript-Heavy Sites
When dealing with sites that require JavaScript execution, you may want to fall back to a browser automation tool, which manages its own timeouts:
def scrape_with_fallback_to_browser(url)
  # First try with Mechanize (faster)
  begin
    agent = Mechanize.new
    agent.open_timeout = 10
    agent.read_timeout = 30
    return agent.get(url)
  rescue Net::OpenTimeout, Net::ReadTimeout
    puts "Mechanize failed, falling back to browser automation..."
    # scrape_with_browser is a placeholder for your own integration
    # with a browser automation tool such as Puppeteer or Selenium
    return scrape_with_browser(url)
  end
end
Monitoring and Alerting
Implement monitoring to track timeout patterns:
class TimeoutMonitor
  def initialize
    @timeout_stats = Hash.new(0)
    @total_requests = 0
  end

  # Call this once per request, successful or not, so the timeout
  # rate is measured against all traffic rather than only failures
  def record_request
    @total_requests += 1
  end

  def record_timeout(url, error_type)
    @timeout_stats["#{url}_#{error_type}"] += 1

    # Alert if timeout rate is too high
    alert_high_timeout_rate if timeout_rate > 0.1 # 10% threshold
  end

  def timeout_rate
    return 0 if @total_requests == 0
    @timeout_stats.values.sum.to_f / @total_requests
  end

  def alert_high_timeout_rate
    puts "WARNING: High timeout rate detected (#{(timeout_rate * 100).round(1)}%)"
    # Implement your alerting logic here
  end

  def report
    puts "Timeout Statistics:"
    puts "Total requests: #{@total_requests}"
    puts "Timeout rate: #{(timeout_rate * 100).round(2)}%"
    puts "Breakdown:"
    @timeout_stats.each do |key, count|
      puts "  #{key}: #{count}"
    end
  end
end
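A minimal sketch of wiring the monitor into a scraping loop, assuming urls is your list of targets:

monitor = TimeoutMonitor.new
agent = Mechanize.new
agent.open_timeout = 10
agent.read_timeout = 30

urls.each do |url|
  monitor.record_request
  begin
    agent.get(url)
  rescue Net::OpenTimeout
    monitor.record_timeout(url, "open_timeout")
  rescue Net::ReadTimeout
    monitor.record_timeout(url, "read_timeout")
  end
end

monitor.report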
Best Practices for Timeout Handling
1. Set Appropriate Timeouts
- Connection timeout: 10-15 seconds for most sites
- Read timeout: 30-60 seconds depending on expected response size
- Longer timeouts: for APIs or sites known to be slow (a baseline sketch follows below)
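As a baseline, these ranges translate into something like the following sketch (tune the values per site):

# Typical site
agent = Mechanize.new
agent.open_timeout = 12 # connection: 10-15 seconds
agent.read_timeout = 45 # read: 30-60 seconds

# Known-slow API or site
slow_agent = Mechanize.new
slow_agent.open_timeout = 15
slow_agent.read_timeout = 120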
2. Implement Circuit Breaker Pattern
class CircuitBreaker
  def initialize(failure_threshold = 5, timeout_period = 60)
    @failure_threshold = failure_threshold
    @timeout_period = timeout_period # seconds to wait before probing again
    @failure_count = 0
    @last_failure_time = nil
    @state = :closed # :closed, :open, :half_open
  end

  def call
    case @state
    when :open
      if Time.now - @last_failure_time > @timeout_period
        @state = :half_open # Allow a single probe request through
      else
        raise "Circuit breaker is OPEN"
      end
    end

    begin
      result = yield
      @failure_count = 0
      @state = :closed
      result
    rescue Net::OpenTimeout, Net::ReadTimeout => e
      @failure_count += 1
      @last_failure_time = Time.now
      @state = :open if @failure_count >= @failure_threshold
      raise e
    end
  end
end
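A sketch of wrapping Mechanize calls in the breaker; since the open-circuit error above is a plain RuntimeError, it can be rescued separately from the timeouts:

breaker = CircuitBreaker.new(5, 60)
agent = Mechanize.new

begin
  page = breaker.call { agent.get("https://example.com") }
rescue Net::OpenTimeout, Net::ReadTimeout
  puts "Request timed out"
rescue RuntimeError => e
  puts e.message # "Circuit breaker is OPEN" - skip this host for now
end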
3. Graceful Degradation
Always have fallback strategies when timeouts occur persistently.
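For example, a minimal sketch that serves a previously cached copy when live fetches keep timing out; cache_path and the reuse of the earlier TimeoutHandler are assumptions for illustration:

def fetch_with_degradation(agent, url, cache_path)
  page = TimeoutHandler.with_retries { agent.get(url) }
  File.write(cache_path, page.body) # Refresh the cache on success
  page.body
rescue Net::OpenTimeout, Net::ReadTimeout, Timeout::Error
  # Degrade gracefully: serve stale data rather than failing outright
  File.exist?(cache_path) ? File.read(cache_path) : nil
end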
4. Log Detailed Information
Include URL, timeout type, attempt number, and timing information in your logs.
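A sketch of a log helper carrying those fields; the argument names are placeholders for values coming from your retry loop:

require 'logger'

LOGGER = Logger.new(STDOUT)

def log_timeout(url, error, attempt, max_attempts, elapsed)
  LOGGER.warn(
    "timeout url=#{url} type=#{error.class} " \
    "attempt=#{attempt}/#{max_attempts} elapsed=#{elapsed.round(2)}s"
  )
end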
Testing Your Timeout Handling
Create tests to verify your timeout handling works correctly:
require 'webmock/rspec' # Enables WebMock and stub_request in specs automatically

RSpec.describe "Timeout Handling" do
  it "handles connection timeouts with retries" do
    stub_request(:get, "http://slow-site.com")
      .to_timeout
      .then
      .to_return(status: 200, body: "Success")

    result = scrape_with_retry("http://slow-site.com")
    expect(result).not_to be_nil
  end

  it "gives up after max retries" do
    stub_request(:get, "http://failing-site.com")
      .to_timeout.times(4) # Fail the initial attempt and all 3 retries

    expect {
      scrape_with_retry("http://failing-site.com")
    }.to raise_error(Net::OpenTimeout)
  end
end
JavaScript Implementation for Comparison
While Mechanize is Ruby-specific, here's how similar timeout handling looks in JavaScript with axios:
const axios = require('axios');

class TimeoutHandler {
  constructor(maxRetries = 3, baseDelay = 1000) {
    this.maxRetries = maxRetries;
    this.baseDelay = baseDelay;
  }

  async executeWithRetry(operation, description = "Operation") {
    let attempt = 0;
    while (attempt < this.maxRetries) {
      try {
        attempt++;
        console.log(`${description} - Attempt ${attempt}/${this.maxRetries}`);
        return await operation();
      } catch (error) {
        if (this.isTimeoutError(error) && attempt < this.maxRetries) {
          const delay = this.baseDelay * Math.pow(2, attempt - 1);
          console.log(`Timeout on attempt ${attempt}. Retrying in ${delay}ms...`);
          await this.sleep(delay);
        } else {
          throw error;
        }
      }
    }
  }

  isTimeoutError(error) {
    return error.code === 'ECONNABORTED' ||
           error.code === 'ETIMEDOUT' ||
           error.message.includes('timeout');
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Usage
const timeoutHandler = new TimeoutHandler(3, 1000);

async function scrapeWithTimeout(url) {
  return timeoutHandler.executeWithRetry(async () => {
    // axios applies a single `timeout` to the whole request; there is no
    // separate connect timeout option (use a custom http agent for that)
    return axios.get(url, { timeout: 30000 }); // 30 seconds
  }, `Scraping ${url}`);
}
Conclusion
Implementing robust timeout handling in Mechanize requires a multi-layered approach combining proper timeout configuration, retry logic with exponential backoff, monitoring, and graceful error recovery. The examples provided here give you a solid foundation for building reliable web scrapers that can handle network instability and server slowness effectively.
Remember to be respectful of target servers: add appropriate delays between requests and avoid overwhelming slow servers with overly aggressive retries. When handling errors in browser automation tools, similar principles apply, with additional considerations for browser-specific timeouts.