How Do You Implement Custom Error Handling for Network Timeouts in Mechanize?
Network timeouts are one of the most common challenges in web scraping, especially when dealing with slow or unreliable websites. Mechanize provides several mechanisms for handling timeouts, but implementing custom error handling ensures your scrapers are robust and can gracefully recover from network issues.
Understanding Timeout Types in Mechanize
Mechanize handles several types of timeouts that you need to consider when implementing custom error handling:
Connection Timeout
The time limit for establishing a connection to the server:
require 'mechanize'
agent = Mechanize.new
agent.open_timeout = 10 # 10 seconds to establish connection
Read Timeout
The time limit for reading data from an established connection:
agent.read_timeout = 30 # 30 seconds to read response
Idle Timeout
How long to keep connections alive for reuse:
agent.idle_timeout = 5 # Close idle connections after 5 seconds
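All three settings can also be applied together when the agent is constructed, since Mechanize.new accepts a configuration block. A minimal sketch using the same values as above:

require 'mechanize'

agent = Mechanize.new do |a|
  a.open_timeout = 10 # seconds to establish a connection
  a.read_timeout = 30 # seconds to read the response
  a.idle_timeout = 5  # close idle keep-alive connections after 5 seconds
end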
Basic Timeout Error Handling
The most straightforward approach is to wrap your Mechanize requests in rescue blocks:
require 'mechanize'
require 'timeout'

def scrape_with_basic_timeout_handling(url)
  agent = Mechanize.new
  agent.open_timeout = 10
  agent.read_timeout = 30

  begin
    agent.get(url)
  rescue Net::OpenTimeout => e
    puts "Connection timeout: #{e.message}"
    nil
  rescue Net::ReadTimeout => e
    puts "Read timeout: #{e.message}"
    nil
  rescue Timeout::Error => e
    puts "General timeout: #{e.message}"
    nil
  end
end
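A quick usage sketch (the URL is a placeholder): the method returns the page on success and nil on a timeout.

page = scrape_with_basic_timeout_handling("https://example.com")
puts page.title if page # nil means the request timed out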
Implementing Retry Logic with Exponential Backoff
For production scraping, you'll want a more sophisticated retry mechanism with exponential backoff:
class TimeoutHandler
  MAX_RETRIES = 3
  BASE_DELAY = 1 # seconds

  def self.with_retries(max_retries = MAX_RETRIES)
    attempt = 0
    begin
      attempt += 1
      yield
    rescue Net::OpenTimeout, Net::ReadTimeout, Timeout::Error => e
      if attempt <= max_retries
        delay = BASE_DELAY * (2 ** (attempt - 1)) # Exponential backoff: 1s, 2s, 4s...
        puts "Timeout on attempt #{attempt}/#{max_retries + 1}. Retrying in #{delay}s..."
        sleep(delay)
        retry
      else
        puts "Failed after #{attempt} attempts: #{e.message}"
        raise
      end
    end
  end
end
# Usage example
def scrape_with_retry(url)
  agent = Mechanize.new
  agent.open_timeout = 10
  agent.read_timeout = 30

  TimeoutHandler.with_retries do
    agent.get(url)
  end
end
Advanced Custom Error Handler Class
For complex scraping operations, create a dedicated error handler:
require 'logger'

class MechanizeTimeoutHandler
  attr_accessor :max_retries, :base_delay, :max_delay, :backoff_multiplier

  def initialize(options = {})
    @max_retries = options[:max_retries] || 3
    @base_delay = options[:base_delay] || 1
    @max_delay = options[:max_delay] || 60
    @backoff_multiplier = options[:backoff_multiplier] || 2
    @logger = options[:logger] || Logger.new(STDOUT)
  end

  def execute_with_timeout_handling(description = "Operation")
    attempt = 0
    start_time = Time.now

    begin
      attempt += 1
      @logger.info("#{description} - Attempt #{attempt}/#{@max_retries + 1}")
      result = yield
      duration = Time.now - start_time
      @logger.info("#{description} completed successfully in #{duration.round(2)}s")
      result
    rescue Net::OpenTimeout, Net::ReadTimeout, Timeout::Error => e
      # `retry` is only valid inside a rescue clause, so the retry decision
      # has to live here rather than in a separate helper method
      error_type = case e
                   when Net::OpenTimeout then "Connection timeout"
                   when Net::ReadTimeout then "Read timeout"
                   else "General timeout"
                   end

      if attempt <= @max_retries
        delay = calculate_delay(attempt)
        @logger.warn("#{description} - #{error_type} on attempt #{attempt}. Retrying in #{delay}s...")
        sleep(delay)
        retry
      else
        total_time = (Time.now - start_time).round(2)
        @logger.error("#{description} failed after #{attempt} attempts (#{total_time}s): #{e.message}")
        raise
      end
    rescue => e
      @logger.error("#{description} failed with unexpected error: #{e.message}")
      raise
    end
  end

  private

  def calculate_delay(attempt)
    delay = @base_delay * (@backoff_multiplier ** (attempt - 1))
    [delay, @max_delay].min # Cap the backoff at max_delay
  end
end
Using the Advanced Handler
# Initialize the handler with custom settings
timeout_handler = MechanizeTimeoutHandler.new(
  max_retries: 5,
  base_delay: 2,
  max_delay: 30,
  backoff_multiplier: 1.5
)

# Use it for scraping operations. The handler is passed in as an argument
# because a top-level local variable is not visible inside a method body.
def scrape_multiple_pages(urls, timeout_handler)
  agent = Mechanize.new
  agent.open_timeout = 15
  agent.read_timeout = 45

  results = []
  urls.each_with_index do |url, index|
    begin
      page = timeout_handler.execute_with_timeout_handling("Scraping page #{index + 1}") do
        agent.get(url)
      end
      results << extract_data(page) # extract_data is your own parsing method
    rescue => e
      puts "Skipping #{url} due to persistent errors: #{e.message}"
      results << nil
    end

    # Add delay between requests to be respectful
    sleep(1)
  end

  results
end
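With the handler passed in explicitly, a call looks like this (the URLs are placeholders):

urls = ["https://example.com/page1", "https://example.com/page2"]
data = scrape_multiple_pages(urls, timeout_handler)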
Handling Specific Timeout Scenarios
Slow Loading Pages
For pages that consistently load slowly, adjust timeouts dynamically:
def scrape_slow_site(url)
  agent = Mechanize.new

  # Start with generous timeouts for known slow sites
  agent.open_timeout = 30
  agent.read_timeout = 120

  begin
    agent.get(url)
  rescue Net::ReadTimeout => e
    # A read timeout means the connection was established but the server
    # is very slow - try once more with an even longer timeout
    puts "Server is very slow, extending timeout..."
    agent.read_timeout = 300 # 5 minutes

    begin
      agent.get(url)
    rescue Net::ReadTimeout
      puts "Server too slow even with extended timeout"
      raise e
    end
  end
end
JavaScript-Heavy Sites
When dealing with sites that require JavaScript execution, you may want to fall back to a browser automation tool, which manages its own timeouts:
def scrape_with_fallback_to_browser(url)
  # First try with Mechanize (faster)
  begin
    agent = Mechanize.new
    agent.open_timeout = 10
    agent.read_timeout = 30
    return agent.get(url)
  rescue Net::OpenTimeout, Net::ReadTimeout
    puts "Mechanize failed, falling back to browser automation..."
    # scrape_with_browser is a placeholder for your own integration
    # with a browser automation tool such as Puppeteer or Selenium
    return scrape_with_browser(url)
  end
end
Monitoring and Alerting
Implement monitoring to track timeout patterns:
class TimeoutMonitor
  def initialize
    @timeout_stats = Hash.new(0)
    @total_requests = 0
  end

  # Call this once per request, successful or not, so the timeout
  # rate is measured against all traffic rather than only failures
  def record_request
    @total_requests += 1
  end

  def record_timeout(url, error_type)
    @timeout_stats["#{url}_#{error_type}"] += 1

    # Alert if timeout rate is too high
    alert_high_timeout_rate if timeout_rate > 0.1 # 10% threshold
  end

  def timeout_rate
    return 0 if @total_requests == 0
    @timeout_stats.values.sum.to_f / @total_requests
  end

  def alert_high_timeout_rate
    puts "WARNING: High timeout rate detected (#{(timeout_rate * 100).round(1)}%)"
    # Implement your alerting logic here
  end

  def report
    puts "Timeout Statistics:"
    puts "Total requests: #{@total_requests}"
    puts "Timeout rate: #{(timeout_rate * 100).round(2)}%"
    puts "Breakdown:"
    @timeout_stats.each do |key, count|
      puts "  #{key}: #{count}"
    end
  end
end
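A minimal sketch of wiring the monitor into a scraping loop, assuming urls is your list of targets:

monitor = TimeoutMonitor.new
agent = Mechanize.new
agent.open_timeout = 10
agent.read_timeout = 30

urls.each do |url|
  monitor.record_request
  begin
    agent.get(url)
  rescue Net::OpenTimeout
    monitor.record_timeout(url, "open_timeout")
  rescue Net::ReadTimeout
    monitor.record_timeout(url, "read_timeout")
  end
end

monitor.report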
Best Practices for Timeout Handling
1. Set Appropriate Timeouts
- Connection timeout: 10-15 seconds for most sites
- Read timeout: 30-60 seconds depending on expected response size
- Longer timeouts: for APIs or sites known to be slow (a baseline sketch follows below)
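As a baseline, these ranges translate into something like the following sketch (tune the values per site):

# Typical site
agent = Mechanize.new
agent.open_timeout = 12 # connection: 10-15 seconds
agent.read_timeout = 45 # read: 30-60 seconds

# Known-slow API or site
slow_agent = Mechanize.new
slow_agent.open_timeout = 15
slow_agent.read_timeout = 120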
2. Implement Circuit Breaker Pattern
class CircuitBreaker
  def initialize(failure_threshold = 5, timeout_period = 60)
    @failure_threshold = failure_threshold
    @timeout_period = timeout_period # seconds to wait before probing again
    @failure_count = 0
    @last_failure_time = nil
    @state = :closed # :closed, :open, :half_open
  end

  def call
    case @state
    when :open
      if Time.now - @last_failure_time > @timeout_period
        @state = :half_open # Allow a single probe request through
      else
        raise "Circuit breaker is OPEN"
      end
    end

    begin
      result = yield
      @failure_count = 0
      @state = :closed
      result
    rescue Net::OpenTimeout, Net::ReadTimeout => e
      @failure_count += 1
      @last_failure_time = Time.now
      @state = :open if @failure_count >= @failure_threshold
      raise e
    end
  end
end
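A sketch of wrapping Mechanize calls in the breaker; since the open-circuit error above is a plain RuntimeError, it can be rescued separately from the timeouts:

breaker = CircuitBreaker.new(5, 60)
agent = Mechanize.new

begin
  page = breaker.call { agent.get("https://example.com") }
rescue Net::OpenTimeout, Net::ReadTimeout
  puts "Request timed out"
rescue RuntimeError => e
  puts e.message # "Circuit breaker is OPEN" - skip this host for now
end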
3. Graceful Degradation
Always have fallback strategies when timeouts occur persistently.
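For example, a minimal sketch that serves a previously cached copy when live fetches keep timing out; cache_path and the reuse of the earlier TimeoutHandler are assumptions for illustration:

def fetch_with_degradation(agent, url, cache_path)
  page = TimeoutHandler.with_retries { agent.get(url) }
  File.write(cache_path, page.body) # Refresh the cache on success
  page.body
rescue Net::OpenTimeout, Net::ReadTimeout, Timeout::Error
  # Degrade gracefully: serve stale data rather than failing outright
  File.exist?(cache_path) ? File.read(cache_path) : nil
end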
4. Log Detailed Information
Include URL, timeout type, attempt number, and timing information in your logs.
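A sketch of a log helper carrying those fields; the argument names are placeholders for values coming from your retry loop:

require 'logger'

LOGGER = Logger.new(STDOUT)

def log_timeout(url, error, attempt, max_attempts, elapsed)
  LOGGER.warn(
    "timeout url=#{url} type=#{error.class} " \
    "attempt=#{attempt}/#{max_attempts} elapsed=#{elapsed.round(2)}s"
  )
end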
Testing Your Timeout Handling
Create tests to verify your timeout handling works correctly:
require 'webmock/rspec' # Enables WebMock and stub_request in specs automatically

RSpec.describe "Timeout Handling" do
  it "handles connection timeouts with retries" do
    stub_request(:get, "http://slow-site.com")
      .to_timeout
      .then
      .to_return(status: 200, body: "Success")

    result = scrape_with_retry("http://slow-site.com")
    expect(result).not_to be_nil
  end

  it "gives up after max retries" do
    stub_request(:get, "http://failing-site.com")
      .to_timeout.times(4) # Fail the initial attempt and all 3 retries

    expect {
      scrape_with_retry("http://failing-site.com")
    }.to raise_error(Net::OpenTimeout)
  end
end
JavaScript Implementation for Comparison
While Mechanize is Ruby-specific, here's how similar timeout handling looks in JavaScript with axios:
const axios = require('axios');

class TimeoutHandler {
  constructor(maxRetries = 3, baseDelay = 1000) {
    this.maxRetries = maxRetries;
    this.baseDelay = baseDelay;
  }

  async executeWithRetry(operation, description = "Operation") {
    let attempt = 0;
    while (attempt < this.maxRetries) {
      try {
        attempt++;
        console.log(`${description} - Attempt ${attempt}/${this.maxRetries}`);
        return await operation();
      } catch (error) {
        if (this.isTimeoutError(error) && attempt < this.maxRetries) {
          const delay = this.baseDelay * Math.pow(2, attempt - 1);
          console.log(`Timeout on attempt ${attempt}. Retrying in ${delay}ms...`);
          await this.sleep(delay);
        } else {
          throw error;
        }
      }
    }
  }

  isTimeoutError(error) {
    return error.code === 'ECONNABORTED' ||
           error.code === 'ETIMEDOUT' ||
           error.message.includes('timeout');
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Usage
const timeoutHandler = new TimeoutHandler(3, 1000);

async function scrapeWithTimeout(url) {
  return timeoutHandler.executeWithRetry(async () => {
    // axios applies a single `timeout` to the whole request; there is no
    // separate connect timeout option (use a custom http agent for that)
    return axios.get(url, { timeout: 30000 }); // 30 seconds
  }, `Scraping ${url}`);
}
Conclusion
Implementing robust timeout handling in Mechanize requires a multi-layered approach combining proper timeout configuration, retry logic with exponential backoff, monitoring, and graceful error recovery. The examples provided here give you a solid foundation for building reliable web scrapers that can handle network instability and server slowness effectively.
Remember to be respectful of target servers: add appropriate delays between requests and avoid overwhelming slow servers with overly aggressive retries. When handling errors in browser automation tools, similar principles apply, with additional considerations for browser-specific timeouts.