How do you implement retry logic for failed requests in Mechanize?
When scraping with Mechanize, network failures, server errors, and temporary unavailability can disrupt your operations at any time. Robust retry logic is essential for building reliable scrapers that handle these transient issues gracefully. This guide covers several approaches to retry mechanisms in Mechanize, from simple retry loops to exponential backoff, conditional retries, and circuit breakers.
Understanding Common Failure Scenarios
Before implementing retry logic, it's important to understand the types of failures you might encounter (the sketch after this list maps them to the Ruby exception classes you would typically rescue):
- Network timeouts: Connection or read timeouts due to slow networks
- HTTP errors: 5xx server errors, 429 rate limiting, temporary 503 unavailability
- Connection errors: DNS resolution failures, connection refused
- SSL/TLS errors: Certificate issues or handshake failures
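As a rough reference, and assuming the exception classes raised by Ruby's net/http stack and by Mechanize itself, these failure types surface as the following exceptions. The constant groupings are purely illustrative, not part of Mechanize's API:

require 'mechanize'
require 'openssl'

# Illustrative grouping of the exceptions each failure type surfaces as.
TIMEOUT_ERRORS    = [Net::OpenTimeout, Net::ReadTimeout]                   # connection/read timeouts
HTTP_ERRORS       = [Mechanize::ResponseCodeError]                         # 4xx/5xx responses; inspect #response_code
CONNECTION_ERRORS = [SocketError, Errno::ECONNREFUSED, Errno::ECONNRESET]  # DNS failures, refused or reset connections
SSL_ERRORS        = [OpenSSL::SSL::SSLError]                               # certificate or handshake failures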
Basic Retry Implementation
The simplest approach to retry logic involves wrapping your Mechanize requests in a retry loop:
require 'mechanize'

def fetch_with_retry(url, max_retries = 3)
  agent = Mechanize.new
  retries = 0

  begin
    page = agent.get(url)
    return page
  rescue Mechanize::ResponseCodeError, Net::OpenTimeout, Net::ReadTimeout, SocketError => e
    retries += 1
    if retries <= max_retries
      puts "Request failed (#{e.class}), retrying #{retries}/#{max_retries}..."
      sleep(1)
      retry
    else
      puts "Max retries exceeded, giving up"
      raise e
    end
  end
end
# Usage
begin
  page = fetch_with_retry('https://example.com', 3)
  puts "Successfully fetched: #{page.title}"
rescue => e
  puts "Failed to fetch page: #{e.message}"
end
Advanced Retry with Exponential Backoff
For production environments, exponential backoff helps reduce server load and improves success rates:
require 'mechanize'

class MechanizeRetryHandler
  def initialize(agent = nil)
    @agent = agent || Mechanize.new
    setup_agent
  end

  def fetch_with_backoff(url, max_retries: 5, base_delay: 1, max_delay: 60)
    retries = 0

    begin
      @agent.get(url)
    rescue => e
      retries += 1
      if retries <= max_retries && retryable_error?(e)
        delay = calculate_delay(retries, base_delay, max_delay)
        puts "Attempt #{retries} failed: #{e.message}"
        puts "Retrying in #{delay} seconds..."
        sleep(delay)
        retry
      else
        raise e
      end
    end
  end

  private

  def setup_agent
    @agent.user_agent_alias = 'Windows Chrome'
    @agent.open_timeout = 10
    @agent.read_timeout = 30
    @agent.follow_meta_refresh = true
  end

  def retryable_error?(error)
    case error
    when Mechanize::ResponseCodeError
      # Retry on server errors and rate limiting
      [429, 500, 502, 503, 504].include?(error.response_code.to_i)
    when Net::OpenTimeout, Net::ReadTimeout, SocketError, Errno::ECONNRESET, Errno::ECONNREFUSED
      true
    when OpenSSL::SSL::SSLError
      # Retry SSL errors that might be temporary
      true
    else
      false
    end
  end

  def calculate_delay(attempt, base_delay, max_delay)
    # Exponential backoff with jitter
    delay = base_delay * (2 ** (attempt - 1))
    jitter = rand(0.1..0.5) * delay
    [delay + jitter, max_delay].min
  end
end
# Usage
handler = MechanizeRetryHandler.new

begin
  page = handler.fetch_with_backoff('https://api.example.com/data')
  puts "Success: #{page.body.length} bytes received"
rescue => e
  puts "Failed after all retries: #{e.message}"
end
Conditional Retry Logic
Sometimes you need different retry strategies based on the specific error or response:
class ConditionalRetryHandler
  def initialize
    @agent = Mechanize.new
    setup_agent_settings
  end

  def smart_fetch(url, options = {})
    max_retries = options[:max_retries] || 3
    rate_limit_retries = options[:rate_limit_retries] || 10
    retries = 0
    rate_limit_retries_count = 0

    begin
      response = @agent.get(url)
      return response
    rescue Mechanize::ResponseCodeError => e
      case e.response_code.to_i
      when 429 # Rate limited
        rate_limit_retries_count += 1
        if rate_limit_retries_count <= rate_limit_retries
          # Extract Retry-After header if available
          retry_after = e.page.response['retry-after']&.to_i || 60
          puts "Rate limited, waiting #{retry_after} seconds..."
          sleep(retry_after)
          retry
        else
          raise "Rate limit exceeded maximum retries"
        end
      when 503, 502, 500 # Server errors
        retries += 1
        if retries <= max_retries
          delay = 2 ** retries + rand(1..5)
          puts "Server error #{e.response_code}, retrying in #{delay}s..."
          sleep(delay)
          retry
        else
          raise e
        end
      when 404, 403, 401 # Client errors - don't retry
        raise e
      else
        retries += 1
        if retries <= max_retries
          sleep(retries * 2)
          retry
        else
          raise e
        end
      end
    rescue Net::OpenTimeout, Net::ReadTimeout => e
      retries += 1
      if retries <= max_retries
        puts "Timeout error, increasing timeouts and retrying..."
        # Increase timeouts progressively
        @agent.open_timeout = 10 + (retries * 5)
        @agent.read_timeout = 30 + (retries * 10)
        sleep(retries)
        retry
      else
        raise e
      end
    rescue SocketError, Errno::ECONNRESET => e
      retries += 1
      if retries <= max_retries
        puts "Connection error, retrying with fresh agent..."
        # Create new agent for connection issues
        @agent = Mechanize.new
        setup_agent_settings
        sleep(retries * 2)
        retry
      else
        raise e
      end
    end
  end

  private

  def setup_agent_settings
    @agent.user_agent_alias = 'Mac Safari'
    @agent.open_timeout = 10
    @agent.read_timeout = 30
    @agent.gzip_enabled = true
    @agent.follow_meta_refresh = true
  end
end
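A minimal usage sketch for the handler above; the URL and option values are placeholders:

# Usage (hypothetical URL)
handler = ConditionalRetryHandler.new

begin
  page = handler.smart_fetch('https://example.com/products', max_retries: 3, rate_limit_retries: 5)
  puts "Fetched #{page.uri} (#{page.body.length} bytes)"
rescue => e
  puts "Giving up: #{e.class} - #{e.message}"
end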
Circuit Breaker Pattern
For high-volume scraping, implement a circuit breaker to prevent cascading failures:
class CircuitBreakerMechanize
  def initialize
    @agent = Mechanize.new
    @failure_count = 0
    @last_failure_time = nil
    @circuit_open = false
    @failure_threshold = 5
    @timeout_duration = 300 # 5 minutes
  end

  def fetch_with_circuit_breaker(url)
    if circuit_open?
      raise "Circuit breaker is open, service unavailable"
    end

    begin
      response = @agent.get(url)
      on_success
      return response
    rescue => e
      on_failure(e)
      raise e
    end
  end

  private

  def circuit_open?
    @circuit_open &&
      @last_failure_time &&
      (Time.now - @last_failure_time) < @timeout_duration
  end

  def on_success
    @failure_count = 0
    @circuit_open = false
  end

  def on_failure(error)
    @failure_count += 1
    @last_failure_time = Time.now

    if @failure_count >= @failure_threshold
      @circuit_open = true
      puts "Circuit breaker opened due to repeated failures"
    end
  end
end
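A brief usage sketch for the class above; the URLs are placeholders. Once the breaker opens, subsequent calls fail fast instead of hammering the remote host:

# Usage sketch (hypothetical URLs)
breaker = CircuitBreakerMechanize.new

['https://example.com/a', 'https://example.com/b'].each do |url|
  begin
    page = breaker.fetch_with_circuit_breaker(url)
    puts "OK: #{url} (#{page.body.length} bytes)"
  rescue => e
    puts "Skipped or failed: #{url} (#{e.message})"
  end
end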
Implementing Retry with Error Classification
A more sophisticated approach involves classifying errors and applying different strategies:
require 'mechanize'

module RetryStrategies
  TRANSIENT_ERRORS = [
    Net::OpenTimeout,
    Net::ReadTimeout,
    SocketError,
    Errno::ECONNRESET,
    Errno::ECONNREFUSED,
    OpenSSL::SSL::SSLError
  ].freeze

  SERVER_ERROR_CODES = [500, 502, 503, 504].freeze
  RATE_LIMIT_CODES = [429].freeze
  CLIENT_ERROR_CODES = [400, 401, 403, 404].freeze
end

class IntelligentRetryHandler
  include RetryStrategies

  def initialize(config = {})
    @agent = Mechanize.new
    @config = default_config.merge(config)
    setup_agent
  end

  def fetch_with_intelligent_retry(url)
    attempts = 0
    last_error = nil

    loop do
      attempts += 1
      begin
        return @agent.get(url)
      rescue => error
        last_error = error
        break unless should_retry?(error, attempts)

        delay = calculate_delay(error, attempts)
        log_retry_attempt(error, attempts, delay)
        handle_special_errors(error)
        sleep(delay)
      end
    end

    raise last_error
  end

  private

  def default_config
    {
      max_retries: 5,
      base_delay: 1,
      max_delay: 300,
      backoff_multiplier: 2,
      jitter: true,
      rate_limit_patience: 10
    }
  end

  def setup_agent
    @agent.user_agent_alias = 'Mac Safari'
    @agent.open_timeout = 15
    @agent.read_timeout = 60
    @agent.gzip_enabled = true
    @agent.follow_meta_refresh = true
  end

  def should_retry?(error, attempts)
    return false if attempts > @config[:max_retries]

    case error
    when Mechanize::ResponseCodeError
      code = error.response_code.to_i
      SERVER_ERROR_CODES.include?(code) || RATE_LIMIT_CODES.include?(code)
    when *TRANSIENT_ERRORS
      true
    else
      false
    end
  end

  def calculate_delay(error, attempts)
    case error
    when Mechanize::ResponseCodeError
      if error.response_code.to_i == 429
        # Respect Retry-After header for rate limiting
        retry_after = error.page&.response&.[]('retry-after')&.to_i
        return retry_after if retry_after && retry_after > 0
      end
    end

    # Standard exponential backoff
    delay = @config[:base_delay] * (@config[:backoff_multiplier] ** (attempts - 1))

    # Add jitter if enabled
    if @config[:jitter]
      jitter = delay * (0.1 + rand * 0.1) # 10-20% jitter
      delay += jitter
    end

    [delay, @config[:max_delay]].min
  end

  def handle_special_errors(error)
    case error
    when Net::OpenTimeout, Net::ReadTimeout
      # Increase timeouts for subsequent requests
      @agent.open_timeout = [@agent.open_timeout * 1.5, 60].min
      @agent.read_timeout = [@agent.read_timeout * 1.5, 120].min
    when SocketError, Errno::ECONNRESET
      # Recreate agent for connection issues
      old_config = {
        user_agent: @agent.user_agent,
        open_timeout: @agent.open_timeout,
        read_timeout: @agent.read_timeout
      }
      @agent = Mechanize.new
      @agent.user_agent = old_config[:user_agent]
      @agent.open_timeout = old_config[:open_timeout]
      @agent.read_timeout = old_config[:read_timeout]
    end
  end

  def log_retry_attempt(error, attempts, delay)
    puts "[Retry #{attempts}/#{@config[:max_retries]}] #{error.class}: #{error.message}"
    puts "Waiting #{delay.round(2)} seconds before retry..."
  end
end
# Usage example
handler = IntelligentRetryHandler.new(
  max_retries: 7,
  base_delay: 2,
  max_delay: 120
)

begin
  page = handler.fetch_with_intelligent_retry('https://api.example.com/data')
  puts "Success: Retrieved #{page.body.length} bytes"
rescue => e
  puts "Failed after all retries: #{e.message}"
end
Combining Retry Logic with Async Processing
For large-scale scraping operations, combine retry logic with concurrent processing:
require 'mechanize'
require 'concurrent'

class ConcurrentRetryHandler
  def initialize(pool_size: 10)
    @pool = Concurrent::ThreadPoolExecutor.new(
      min_threads: 2,
      max_threads: pool_size,
      max_queue: pool_size * 2
    )
    @agents = Concurrent::Array.new
    pool_size.times { @agents << create_agent }
  end

  def fetch_multiple_urls(urls)
    futures = urls.map do |url|
      Concurrent::Future.execute(executor: @pool) do
        fetch_with_retry(url)
      end
    end

    # Wait for all requests to complete
    results = futures.map(&:value)
    errors = futures.select(&:rejected?).map(&:reason)

    { successes: results.compact, errors: errors }
  end

  private

  def create_agent
    agent = Mechanize.new
    agent.user_agent_alias = 'Mac Safari'
    agent.open_timeout = 15
    agent.read_timeout = 45
    agent.gzip_enabled = true
    agent
  end

  def get_agent
    @agents.sample || create_agent
  end

  def fetch_with_retry(url, max_retries: 3)
    retries = 0

    begin
      agent = get_agent
      agent.get(url)
    rescue => e
      retries += 1
      if retries <= max_retries && retryable_error?(e)
        delay = 2 ** retries + rand(1..3)
        sleep(delay)
        retry
      else
        raise e
      end
    end
  end

  def retryable_error?(error)
    case error
    when Mechanize::ResponseCodeError
      [429, 500, 502, 503, 504].include?(error.response_code.to_i)
    when Net::OpenTimeout, Net::ReadTimeout, SocketError, Errno::ECONNRESET
      true
    else
      false
    end
  end
end
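A short usage sketch for the concurrent handler above; the pool size and URLs are placeholders:

# Usage sketch (hypothetical URLs)
handler = ConcurrentRetryHandler.new(pool_size: 5)

result = handler.fetch_multiple_urls([
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
])

puts "Fetched #{result[:successes].size} pages, #{result[:errors].size} failures"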
Best Practices for Mechanize Retry Logic
When implementing retry logic for Mechanize, consider these best practices:
1. Error Classification
Always differentiate between retryable and non-retryable errors:
def retryable_error?(error)
  case error
  when Mechanize::ResponseCodeError
    # Server errors and rate limiting are retryable
    code = error.response_code.to_i
    [429, 500, 502, 503, 504].include?(code)
  when Net::OpenTimeout, Net::ReadTimeout, SocketError, Errno::ECONNRESET, Errno::ECONNREFUSED
    true # Network-level errors are typically retryable
  when OpenSSL::SSL::SSLError
    # Some SSL errors might be temporary
    error.message.include?('timeout') || error.message.include?('reset')
  else
    false # Unknown errors shouldn't be retried
  end
end
2. Respect Server Signals
Always check for and respect Retry-After headers:
require 'time' # Time.parse lives in the standard 'time' library

def extract_retry_after(response)
  retry_after = response.response['retry-after']
  return nil unless retry_after

  # Can be either seconds or an HTTP date
  if retry_after.match?(/^\d+$/)
    retry_after.to_i
  else
    Time.parse(retry_after) - Time.now
  end
rescue
  nil
end
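Here is one way this helper might be wired into a Mechanize request. The method name fetch_respecting_rate_limits and the max_waits cap are illustrative, not part of Mechanize:

# Sketch: honoring Retry-After when a request is rate limited (hypothetical helper)
def fetch_respecting_rate_limits(agent, url, max_waits = 5)
  attempts = 0
  begin
    agent.get(url)
  rescue Mechanize::ResponseCodeError => e
    raise unless e.response_code.to_i == 429
    attempts += 1
    raise if attempts > max_waits
    wait = extract_retry_after(e.page) || 60 # fall back to 60s when the header is missing
    puts "Rate limited, waiting #{wait.round} seconds..."
    sleep(wait)
    retry
  end
end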
3. Implement Circuit Breakers
For high-volume scraping, use circuit breakers to prevent system overload:
class SimpleCircuitBreaker
  def initialize(failure_threshold: 5, timeout: 60)
    @failure_threshold = failure_threshold
    @timeout = timeout
    @failure_count = 0
    @last_failure_time = nil
    @state = :closed # :closed, :open, :half_open
  end

  def call
    case @state
    when :closed
      execute_with_failure_tracking { yield }
    when :open
      if Time.now - @last_failure_time > @timeout
        @state = :half_open
        execute_with_failure_tracking { yield }
      else
        raise "Circuit breaker is open"
      end
    when :half_open
      execute_with_failure_tracking { yield }
    end
  end

  private

  def execute_with_failure_tracking
    result = yield
    @failure_count = 0
    @state = :closed
    result
  rescue => e
    @failure_count += 1
    @last_failure_time = Time.now

    if @failure_count >= @failure_threshold
      @state = :open
    end

    raise e
  end
end
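A minimal sketch of wrapping a Mechanize call in the breaker above; the threshold, timeout, and URL are placeholder values:

# Usage sketch
agent = Mechanize.new
breaker = SimpleCircuitBreaker.new(failure_threshold: 3, timeout: 30)

begin
  page = breaker.call { agent.get('https://example.com') }
  puts "Fetched: #{page.title}"
rescue => e
  puts "Request blocked or failed: #{e.message}"
end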
Similar to how you handle timeouts in Puppeteer, implementing proper retry mechanisms in Mechanize ensures your web scraping operations remain robust and reliable even when facing network instability or server issues.
Monitoring and Observability
Track retry patterns to optimize your scraping strategy:
class RetryMetrics
  def initialize
    @metrics = {
      total_requests: 0,
      successful_requests: 0,
      failed_requests: 0,
      retry_counts: Hash.new(0),
      error_types: Hash.new(0),
      response_times: []
    }
  end

  def track_request(url)
    start_time = Time.now
    retries = 0

    begin
      yield
      @metrics[:successful_requests] += 1
      @metrics[:retry_counts][retries] += 1
    rescue => e
      retries += 1
      @metrics[:error_types][e.class.name] += 1

      if retries <= 3 # Assuming max 3 retries
        @metrics[:retry_counts][retries] += 1
        retry
      else
        @metrics[:failed_requests] += 1
        raise e
      end
    ensure
      @metrics[:total_requests] += 1
      @metrics[:response_times] << (Time.now - start_time)
    end
  end

  def summary
    success_rate = (@metrics[:successful_requests].to_f / @metrics[:total_requests] * 100).round(2)
    avg_response_time = (@metrics[:response_times].sum / @metrics[:response_times].length).round(3)

    puts "=== Scraping Metrics ==="
    puts "Total requests: #{@metrics[:total_requests]}"
    puts "Success rate: #{success_rate}%"
    puts "Average response time: #{avg_response_time}s"
    puts "Retry distribution: #{@metrics[:retry_counts]}"
    puts "Error types: #{@metrics[:error_types]}"
  end
end
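A brief sketch of collecting metrics around a scrape run with the class above; the URLs are placeholders:

# Usage sketch (hypothetical URLs)
metrics = RetryMetrics.new
agent = Mechanize.new

['https://example.com/a', 'https://example.com/b'].each do |url|
  begin
    metrics.track_request(url) { agent.get(url) }
  rescue => e
    puts "#{url} failed: #{e.message}"
  end
end

metrics.summary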
When dealing with complex web applications that require robust error handling, these retry patterns become even more critical. Just as handling errors in Puppeteer requires careful consideration of different failure modes, Mechanize retry logic should be tailored to your specific scraping requirements and the characteristics of the target websites.
Testing Retry Logic
Always test your retry mechanisms to ensure they work as expected:
require 'rspec'

describe 'Mechanize Retry Logic' do
  let(:handler) { MechanizeRetryHandler.new }

  before do
    # Skip the real backoff delays so the examples run quickly
    allow(handler).to receive(:sleep)
  end

  it 'retries on server errors' do
    # Fail twice with a 500, then succeed on the third attempt
    calls = 0
    allow_any_instance_of(Mechanize).to receive(:get) do
      calls += 1
      raise Mechanize::ResponseCodeError.new(double(code: '500')) if calls < 3
      double(body: 'success')
    end

    expect { handler.fetch_with_backoff('http://example.com') }.not_to raise_error
  end

  it 'gives up after max retries' do
    allow_any_instance_of(Mechanize).to receive(:get)
      .and_raise(Net::ReadTimeout)

    expect { handler.fetch_with_backoff('http://example.com', max_retries: 2) }
      .to raise_error(Net::ReadTimeout)
  end

  it 'does not retry client errors' do
    allow_any_instance_of(Mechanize).to receive(:get)
      .and_raise(Mechanize::ResponseCodeError.new(double(code: '404')))

    expect { handler.fetch_with_backoff('http://example.com') }
      .to raise_error(Mechanize::ResponseCodeError)
  end
end
Conclusion
Implementing effective retry logic in Mechanize is crucial for building reliable web scrapers. By combining basic retry mechanisms with exponential backoff, conditional logic, and proper error handling, you can create robust scraping solutions that gracefully handle network issues and server errors.
Key takeaways for implementing Mechanize retry logic:
- Classify errors appropriately - not all errors should trigger retries
- Use exponential backoff with jitter to avoid overwhelming servers
- Respect server signals like Retry-After headers
- Implement circuit breakers for high-volume operations
- Monitor and track metrics to optimize your strategies
- Test your retry logic thoroughly
Remember to balance persistence with respect for server resources, ensuring your scrapers can recover from temporary failures while avoiding aggressive behavior that might lead to IP blocking or other defensive measures from target websites. The goal is to create resilient scraping operations that can handle the unpredictable nature of web environments while maintaining good citizenship on the internet.