How do you implement retry logic for failed requests in Mechanize?
When scraping with Mechanize, network failures, server errors, and temporary unavailability can disrupt your operations at any time. Robust retry logic is essential for building reliable scrapers that handle these transient issues gracefully. This guide covers several approaches to retry mechanisms in Mechanize, from simple retry loops to exponential backoff, conditional retries, and circuit breakers.
Understanding Common Failure Scenarios
Before implementing retry logic, it's important to understand the types of failures you might encounter (the sketch after this list maps them to the Ruby exception classes you would typically rescue):
- Network timeouts: Connection or read timeouts due to slow networks
- HTTP errors: 5xx server errors, 429 rate limiting, temporary 503 unavailability
- Connection errors: DNS resolution failures, connection refused
- SSL/TLS errors: Certificate issues or handshake failures
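As a rough reference, and assuming the exception classes raised by Ruby's net/http stack and by Mechanize itself, these failure types surface as the following exceptions. The constant groupings are purely illustrative, not part of Mechanize's API:

require 'mechanize'
require 'openssl'

# Illustrative grouping of the exceptions each failure type surfaces as.
TIMEOUT_ERRORS    = [Net::OpenTimeout, Net::ReadTimeout]                   # connection/read timeouts
HTTP_ERRORS       = [Mechanize::ResponseCodeError]                         # 4xx/5xx responses; inspect #response_code
CONNECTION_ERRORS = [SocketError, Errno::ECONNREFUSED, Errno::ECONNRESET]  # DNS failures, refused or reset connections
SSL_ERRORS        = [OpenSSL::SSL::SSLError]                               # certificate or handshake failures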
Basic Retry Implementation
The simplest approach to retry logic involves wrapping your Mechanize requests in a retry loop:
require 'mechanize'

def fetch_with_retry(url, max_retries = 3)
  agent = Mechanize.new
  retries = 0

  begin
    page = agent.get(url)
    return page
  rescue Mechanize::ResponseCodeError, Net::OpenTimeout, Net::ReadTimeout, SocketError => e
    retries += 1
    if retries <= max_retries
      puts "Request failed (#{e.class}), retrying #{retries}/#{max_retries}..."
      sleep(1)
      retry
    else
      puts "Max retries exceeded, giving up"
      raise e
    end
  end
end
# Usage
begin
  page = fetch_with_retry('https://example.com', 3)
  puts "Successfully fetched: #{page.title}"
rescue => e
  puts "Failed to fetch page: #{e.message}"
end
Advanced Retry with Exponential Backoff
For production environments, exponential backoff helps reduce server load and improves success rates:
require 'mechanize'

class MechanizeRetryHandler
  def initialize(agent = nil)
    @agent = agent || Mechanize.new
    setup_agent
  end

  def fetch_with_backoff(url, max_retries: 5, base_delay: 1, max_delay: 60)
    retries = 0

    begin
      @agent.get(url)
    rescue => e
      retries += 1
      if retries <= max_retries && retryable_error?(e)
        delay = calculate_delay(retries, base_delay, max_delay)
        puts "Attempt #{retries} failed: #{e.message}"
        puts "Retrying in #{delay} seconds..."
        sleep(delay)
        retry
      else
        raise e
      end
    end
  end

  private

  def setup_agent
    @agent.user_agent_alias = 'Windows Chrome'
    @agent.open_timeout = 10
    @agent.read_timeout = 30
    @agent.follow_meta_refresh = true
  end

  def retryable_error?(error)
    case error
    when Mechanize::ResponseCodeError
      # Retry on server errors and rate limiting
      [429, 500, 502, 503, 504].include?(error.response_code.to_i)
    when Net::OpenTimeout, Net::ReadTimeout, SocketError, Errno::ECONNRESET, Errno::ECONNREFUSED
      true
    when OpenSSL::SSL::SSLError
      # Retry SSL errors that might be temporary
      true
    else
      false
    end
  end

  def calculate_delay(attempt, base_delay, max_delay)
    # Exponential backoff with jitter
    delay = base_delay * (2 ** (attempt - 1))
    jitter = rand(0.1..0.5) * delay
    [delay + jitter, max_delay].min
  end
end
# Usage
handler = MechanizeRetryHandler.new

begin
  page = handler.fetch_with_backoff('https://api.example.com/data')
  puts "Success: #{page.body.length} bytes received"
rescue => e
  puts "Failed after all retries: #{e.message}"
end
Conditional Retry Logic
Sometimes you need different retry strategies based on the specific error or response:
class ConditionalRetryHandler
  def initialize
    @agent = Mechanize.new
    setup_agent_settings
  end

  def smart_fetch(url, options = {})
    max_retries = options[:max_retries] || 3
    rate_limit_retries = options[:rate_limit_retries] || 10
    retries = 0
    rate_limit_retries_count = 0

    begin
      response = @agent.get(url)
      return response
    rescue Mechanize::ResponseCodeError => e
      case e.response_code.to_i
      when 429 # Rate limited
        rate_limit_retries_count += 1
        if rate_limit_retries_count <= rate_limit_retries
          # Extract Retry-After header if available
          retry_after = e.page.response['retry-after']&.to_i || 60
          puts "Rate limited, waiting #{retry_after} seconds..."
          sleep(retry_after)
          retry
        else
          raise "Rate limit exceeded maximum retries"
        end
      when 503, 502, 500 # Server errors
        retries += 1
        if retries <= max_retries
          delay = 2 ** retries + rand(1..5)
          puts "Server error #{e.response_code}, retrying in #{delay}s..."
          sleep(delay)
          retry
        else
          raise e
        end
      when 404, 403, 401 # Client errors - don't retry
        raise e
      else
        retries += 1
        if retries <= max_retries
          sleep(retries * 2)
          retry
        else
          raise e
        end
      end
    rescue Net::OpenTimeout, Net::ReadTimeout => e
      retries += 1
      if retries <= max_retries
        puts "Timeout error, increasing timeouts and retrying..."
        # Increase timeouts progressively
        @agent.open_timeout = 10 + (retries * 5)
        @agent.read_timeout = 30 + (retries * 10)
        sleep(retries)
        retry
      else
        raise e
      end
    rescue SocketError, Errno::ECONNRESET => e
      retries += 1
      if retries <= max_retries
        puts "Connection error, retrying with fresh agent..."
        # Create new agent for connection issues
        @agent = Mechanize.new
        setup_agent_settings
        sleep(retries * 2)
        retry
      else
        raise e
      end
    end
  end

  private

  def setup_agent_settings
    @agent.user_agent_alias = 'Mac Safari'
    @agent.open_timeout = 10
    @agent.read_timeout = 30
    @agent.gzip_enabled = true
    @agent.follow_meta_refresh = true
  end
end
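A minimal usage sketch for the handler above; the URL and option values are placeholders:

# Usage (hypothetical URL)
handler = ConditionalRetryHandler.new

begin
  page = handler.smart_fetch('https://example.com/products', max_retries: 3, rate_limit_retries: 5)
  puts "Fetched #{page.uri} (#{page.body.length} bytes)"
rescue => e
  puts "Giving up: #{e.class} - #{e.message}"
end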
Circuit Breaker Pattern
For high-volume scraping, implement a circuit breaker to prevent cascading failures:
class CircuitBreakerMechanize
  def initialize
    @agent = Mechanize.new
    @failure_count = 0
    @last_failure_time = nil
    @circuit_open = false
    @failure_threshold = 5
    @timeout_duration = 300 # 5 minutes
  end

  def fetch_with_circuit_breaker(url)
    if circuit_open?
      raise "Circuit breaker is open, service unavailable"
    end

    begin
      response = @agent.get(url)
      on_success
      return response
    rescue => e
      on_failure(e)
      raise e
    end
  end

  private

  def circuit_open?
    @circuit_open &&
      @last_failure_time &&
      (Time.now - @last_failure_time) < @timeout_duration
  end

  def on_success
    @failure_count = 0
    @circuit_open = false
  end

  def on_failure(error)
    @failure_count += 1
    @last_failure_time = Time.now

    if @failure_count >= @failure_threshold
      @circuit_open = true
      puts "Circuit breaker opened due to repeated failures"
    end
  end
end
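A brief usage sketch for the class above; the URLs are placeholders. Once the breaker opens, subsequent calls fail fast instead of hammering the remote host:

# Usage sketch (hypothetical URLs)
breaker = CircuitBreakerMechanize.new

['https://example.com/a', 'https://example.com/b'].each do |url|
  begin
    page = breaker.fetch_with_circuit_breaker(url)
    puts "OK: #{url} (#{page.body.length} bytes)"
  rescue => e
    puts "Skipped or failed: #{url} (#{e.message})"
  end
end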
Implementing Retry with Error Classification
A more sophisticated approach involves classifying errors and applying different strategies:
require 'mechanize'

module RetryStrategies
  TRANSIENT_ERRORS = [
    Net::OpenTimeout,
    Net::ReadTimeout,
    SocketError,
    Errno::ECONNRESET,
    Errno::ECONNREFUSED,
    OpenSSL::SSL::SSLError
  ].freeze

  SERVER_ERROR_CODES = [500, 502, 503, 504].freeze
  RATE_LIMIT_CODES = [429].freeze
  CLIENT_ERROR_CODES = [400, 401, 403, 404].freeze
end

class IntelligentRetryHandler
  include RetryStrategies

  def initialize(config = {})
    @agent = Mechanize.new
    @config = default_config.merge(config)
    setup_agent
  end

  def fetch_with_intelligent_retry(url)
    attempts = 0
    last_error = nil

    loop do
      attempts += 1
      begin
        return @agent.get(url)
      rescue => error
        last_error = error
        break unless should_retry?(error, attempts)

        delay = calculate_delay(error, attempts)
        log_retry_attempt(error, attempts, delay)
        handle_special_errors(error)
        sleep(delay)
      end
    end

    raise last_error
  end

  private

  def default_config
    {
      max_retries: 5,
      base_delay: 1,
      max_delay: 300,
      backoff_multiplier: 2,
      jitter: true,
      rate_limit_patience: 10
    }
  end

  def setup_agent
    @agent.user_agent_alias = 'Mac Safari'
    @agent.open_timeout = 15
    @agent.read_timeout = 60
    @agent.gzip_enabled = true
    @agent.follow_meta_refresh = true
  end

  def should_retry?(error, attempts)
    return false if attempts > @config[:max_retries]

    case error
    when Mechanize::ResponseCodeError
      code = error.response_code.to_i
      SERVER_ERROR_CODES.include?(code) || RATE_LIMIT_CODES.include?(code)
    when *TRANSIENT_ERRORS
      true
    else
      false
    end
  end

  def calculate_delay(error, attempts)
    case error
    when Mechanize::ResponseCodeError
      if error.response_code.to_i == 429
        # Respect Retry-After header for rate limiting
        retry_after = error.page&.response&.[]('retry-after')&.to_i
        return retry_after if retry_after && retry_after > 0
      end
    end

    # Standard exponential backoff
    delay = @config[:base_delay] * (@config[:backoff_multiplier] ** (attempts - 1))

    # Add jitter if enabled
    if @config[:jitter]
      jitter = delay * (0.1 + rand * 0.1) # 10-20% jitter
      delay += jitter
    end

    [delay, @config[:max_delay]].min
  end

  def handle_special_errors(error)
    case error
    when Net::OpenTimeout, Net::ReadTimeout
      # Increase timeouts for subsequent requests
      @agent.open_timeout = [@agent.open_timeout * 1.5, 60].min
      @agent.read_timeout = [@agent.read_timeout * 1.5, 120].min
    when SocketError, Errno::ECONNRESET
      # Recreate agent for connection issues
      old_config = {
        user_agent: @agent.user_agent,
        open_timeout: @agent.open_timeout,
        read_timeout: @agent.read_timeout
      }
      @agent = Mechanize.new
      @agent.user_agent = old_config[:user_agent]
      @agent.open_timeout = old_config[:open_timeout]
      @agent.read_timeout = old_config[:read_timeout]
    end
  end

  def log_retry_attempt(error, attempts, delay)
    puts "[Retry #{attempts}/#{@config[:max_retries]}] #{error.class}: #{error.message}"
    puts "Waiting #{delay.round(2)} seconds before retry..."
  end
end
# Usage example
handler = IntelligentRetryHandler.new(
  max_retries: 7,
  base_delay: 2,
  max_delay: 120
)

begin
  page = handler.fetch_with_intelligent_retry('https://api.example.com/data')
  puts "Success: Retrieved #{page.body.length} bytes"
rescue => e
  puts "Failed after all retries: #{e.message}"
end
Combining Retry Logic with Async Processing
For large-scale scraping operations, combine retry logic with concurrent processing:
require 'mechanize'
require 'concurrent'

class ConcurrentRetryHandler
  def initialize(pool_size: 10)
    @pool = Concurrent::ThreadPoolExecutor.new(
      min_threads: 2,
      max_threads: pool_size,
      max_queue: pool_size * 2
    )
    @agents = Concurrent::Array.new
    pool_size.times { @agents << create_agent }
  end

  def fetch_multiple_urls(urls)
    futures = urls.map do |url|
      Concurrent::Future.execute(executor: @pool) do
        fetch_with_retry(url)
      end
    end

    # Wait for all requests to complete
    results = futures.map(&:value)
    errors = futures.select(&:rejected?).map(&:reason)

    { successes: results.compact, errors: errors }
  end

  private

  def create_agent
    agent = Mechanize.new
    agent.user_agent_alias = 'Mac Safari'
    agent.open_timeout = 15
    agent.read_timeout = 45
    agent.gzip_enabled = true
    agent
  end

  def get_agent
    @agents.sample || create_agent
  end

  def fetch_with_retry(url, max_retries: 3)
    retries = 0

    begin
      agent = get_agent
      agent.get(url)
    rescue => e
      retries += 1
      if retries <= max_retries && retryable_error?(e)
        delay = 2 ** retries + rand(1..3)
        sleep(delay)
        retry
      else
        raise e
      end
    end
  end

  def retryable_error?(error)
    case error
    when Mechanize::ResponseCodeError
      [429, 500, 502, 503, 504].include?(error.response_code.to_i)
    when Net::OpenTimeout, Net::ReadTimeout, SocketError, Errno::ECONNRESET
      true
    else
      false
    end
  end
end
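A short usage sketch for the concurrent handler above; the pool size and URLs are placeholders:

# Usage sketch (hypothetical URLs)
handler = ConcurrentRetryHandler.new(pool_size: 5)

result = handler.fetch_multiple_urls([
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
])

puts "Fetched #{result[:successes].size} pages, #{result[:errors].size} failures"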
Best Practices for Mechanize Retry Logic
When implementing retry logic for Mechanize, consider these best practices:
1. Error Classification
Always differentiate between retryable and non-retryable errors:
def retryable_error?(error)
  case error
  when Mechanize::ResponseCodeError
    # Server errors and rate limiting are retryable
    code = error.response_code.to_i
    [429, 500, 502, 503, 504].include?(code)
  when Net::OpenTimeout, Net::ReadTimeout, SocketError, Errno::ECONNRESET, Errno::ECONNREFUSED
    true # Network-level errors are typically retryable
  when OpenSSL::SSL::SSLError
    # Some SSL errors might be temporary
    error.message.include?('timeout') || error.message.include?('reset')
  else
    false # Unknown errors shouldn't be retried
  end
end
2. Respect Server Signals
Always check for and respect Retry-After headers:
require 'time' # Time.parse lives in the standard 'time' library

def extract_retry_after(response)
  retry_after = response.response['retry-after']
  return nil unless retry_after

  # Can be either seconds or an HTTP date
  if retry_after.match?(/^\d+$/)
    retry_after.to_i
  else
    Time.parse(retry_after) - Time.now
  end
rescue
  nil
end
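Here is one way this helper might be wired into a Mechanize request. The method name fetch_respecting_rate_limits and the max_waits cap are illustrative, not part of Mechanize:

# Sketch: honoring Retry-After when a request is rate limited (hypothetical helper)
def fetch_respecting_rate_limits(agent, url, max_waits = 5)
  attempts = 0
  begin
    agent.get(url)
  rescue Mechanize::ResponseCodeError => e
    raise unless e.response_code.to_i == 429
    attempts += 1
    raise if attempts > max_waits
    wait = extract_retry_after(e.page) || 60 # fall back to 60s when the header is missing
    puts "Rate limited, waiting #{wait.round} seconds..."
    sleep(wait)
    retry
  end
end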
3. Implement Circuit Breakers
For high-volume scraping, use circuit breakers to prevent system overload:
class SimpleCircuitBreaker
  def initialize(failure_threshold: 5, timeout: 60)
    @failure_threshold = failure_threshold
    @timeout = timeout
    @failure_count = 0
    @last_failure_time = nil
    @state = :closed # :closed, :open, :half_open
  end

  def call
    case @state
    when :closed
      execute_with_failure_tracking { yield }
    when :open
      if Time.now - @last_failure_time > @timeout
        @state = :half_open
        execute_with_failure_tracking { yield }
      else
        raise "Circuit breaker is open"
      end
    when :half_open
      execute_with_failure_tracking { yield }
    end
  end

  private

  def execute_with_failure_tracking
    result = yield
    @failure_count = 0
    @state = :closed
    result
  rescue => e
    @failure_count += 1
    @last_failure_time = Time.now

    if @failure_count >= @failure_threshold
      @state = :open
    end

    raise e
  end
end
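A minimal sketch of wrapping a Mechanize call in the breaker above; the threshold, timeout, and URL are placeholder values:

# Usage sketch
agent = Mechanize.new
breaker = SimpleCircuitBreaker.new(failure_threshold: 3, timeout: 30)

begin
  page = breaker.call { agent.get('https://example.com') }
  puts "Fetched: #{page.title}"
rescue => e
  puts "Request blocked or failed: #{e.message}"
end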
Similar to how you handle timeouts in Puppeteer, implementing proper retry mechanisms in Mechanize ensures your web scraping operations remain robust and reliable even when facing network instability or server issues.
Monitoring and Observability
Track retry patterns to optimize your scraping strategy:
class RetryMetrics
  def initialize
    @metrics = {
      total_requests: 0,
      successful_requests: 0,
      failed_requests: 0,
      retry_counts: Hash.new(0),
      error_types: Hash.new(0),
      response_times: []
    }
  end

  def track_request(url)
    start_time = Time.now
    retries = 0

    begin
      yield
      @metrics[:successful_requests] += 1
      @metrics[:retry_counts][retries] += 1
    rescue => e
      retries += 1
      @metrics[:error_types][e.class.name] += 1

      if retries <= 3 # Assuming max 3 retries
        @metrics[:retry_counts][retries] += 1
        retry
      else
        @metrics[:failed_requests] += 1
        raise e
      end
    ensure
      @metrics[:total_requests] += 1
      @metrics[:response_times] << (Time.now - start_time)
    end
  end

  def summary
    success_rate = (@metrics[:successful_requests].to_f / @metrics[:total_requests] * 100).round(2)
    avg_response_time = (@metrics[:response_times].sum / @metrics[:response_times].length).round(3)

    puts "=== Scraping Metrics ==="
    puts "Total requests: #{@metrics[:total_requests]}"
    puts "Success rate: #{success_rate}%"
    puts "Average response time: #{avg_response_time}s"
    puts "Retry distribution: #{@metrics[:retry_counts]}"
    puts "Error types: #{@metrics[:error_types]}"
  end
end
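A brief sketch of collecting metrics around a scrape run with the class above; the URLs are placeholders:

# Usage sketch (hypothetical URLs)
metrics = RetryMetrics.new
agent = Mechanize.new

['https://example.com/a', 'https://example.com/b'].each do |url|
  begin
    metrics.track_request(url) { agent.get(url) }
  rescue => e
    puts "#{url} failed: #{e.message}"
  end
end

metrics.summary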
When dealing with complex web applications that require robust error handling, these retry patterns become even more critical. Just as handling errors in Puppeteer requires careful consideration of different failure modes, Mechanize retry logic should be tailored to your specific scraping requirements and the characteristics of the target websites.
Testing Retry Logic
Always test your retry mechanisms to ensure they work as expected:
require 'rspec'

describe 'Mechanize Retry Logic' do
  let(:handler) { MechanizeRetryHandler.new }

  before do
    # Skip the real backoff delays so the examples run quickly
    allow(handler).to receive(:sleep)
  end

  it 'retries on server errors' do
    # Fail twice with a 500, then succeed on the third attempt
    calls = 0
    allow_any_instance_of(Mechanize).to receive(:get) do
      calls += 1
      raise Mechanize::ResponseCodeError.new(double(code: '500')) if calls < 3
      double(body: 'success')
    end

    expect { handler.fetch_with_backoff('http://example.com') }.not_to raise_error
  end

  it 'gives up after max retries' do
    allow_any_instance_of(Mechanize).to receive(:get)
      .and_raise(Net::ReadTimeout)

    expect { handler.fetch_with_backoff('http://example.com', max_retries: 2) }
      .to raise_error(Net::ReadTimeout)
  end

  it 'does not retry client errors' do
    allow_any_instance_of(Mechanize).to receive(:get)
      .and_raise(Mechanize::ResponseCodeError.new(double(code: '404')))

    expect { handler.fetch_with_backoff('http://example.com') }
      .to raise_error(Mechanize::ResponseCodeError)
  end
end
Conclusion
Implementing effective retry logic in Mechanize is crucial for building reliable web scrapers. By combining basic retry mechanisms with exponential backoff, conditional logic, and proper error handling, you can create robust scraping solutions that gracefully handle network issues and server errors.
Key takeaways for implementing Mechanize retry logic:
- Classify errors appropriately - not all errors should trigger retries
- Use exponential backoff with jitter to avoid overwhelming servers
- Respect server signals like Retry-After headers
- Implement circuit breakers for high-volume operations
- Monitor and track metrics to optimize your strategies
- Test your retry logic thoroughly
Remember to balance persistence with respect for server resources, ensuring your scrapers can recover from temporary failures while avoiding aggressive behavior that might lead to IP blocking or other defensive measures from target websites. The goal is to create resilient scraping operations that can handle the unpredictable nature of web environments while maintaining good citizenship on the internet.