What are the best practices for handling errors and exceptions in Mechanize scripts?
Error handling is a crucial aspect of building reliable web scraping applications with Mechanize. Robust error handling ensures your scripts can gracefully handle network issues, server errors, parsing problems, and unexpected responses. This guide covers comprehensive strategies for implementing effective error handling in your Mechanize scripts.
Understanding Common Mechanize Exceptions
HTTP-Related Errors
Mechanize can encounter various HTTP-related errors during web scraping operations:
require 'mechanize'

agent = Mechanize.new

begin
  page = agent.get('https://example.com/page')
rescue Mechanize::ResponseCodeError => e
  puts "HTTP Error: #{e.response_code} - #{e.message}"
  case e.response_code
  when '404'
    puts "Page not found"
  when '403'
    puts "Access forbidden - check authentication"
  when '500'
    puts "Server error - try again later"
  end
rescue Net::HTTP::Persistent::Error => e
  puts "Network connection error: #{e.message}"
end
Timeout Errors
Network timeouts are common when scraping websites with slow response times:
agent = Mechanize.new
agent.open_timeout = 10 # Connection timeout in seconds
agent.read_timeout = 30 # Read timeout in seconds

begin
  page = agent.get('https://slow-website.com')
rescue Net::OpenTimeout, Net::ReadTimeout => e
  puts "Request timed out: #{e.message}"
  # Implement retry logic or fallback behavior
rescue Timeout::Error => e
  puts "Operation timed out: #{e.message}"
end

Note that Ruby raises Net::OpenTimeout and Net::ReadTimeout (there is no Net::TimeoutError class), and Net::ReadTimeout is not a subclass of Timeout::Error, so both rescue clauses are needed.
SSL Certificate Errors
SSL certificate issues can occur when scraping HTTPS websites:
agent = Mechanize.new
agent.verify_mode = OpenSSL::SSL::VERIFY_NONE # Disables verification - only for hosts you trust

begin
  page = agent.get('https://self-signed-cert.com')
rescue OpenSSL::SSL::SSLError => e
  puts "SSL Error: #{e.message}"
  # Handle certificate validation issues (e.g. supply a custom CA bundle)
end
Implementing Comprehensive Error Handling
Basic Error Handling Structure
Create a robust error handling framework for your Mechanize scripts:
class MechanizeErrorHandler
  MAX_RETRIES = 3
  RETRYABLE_CODES = %w[429 502 503 504].freeze

  # Note: Ruby's `retry` keyword must appear inside the rescue clause it
  # restarts, so the retry logic lives here rather than in a helper method.
  def self.with_error_handling
    retries = 0
    begin
      yield
    rescue Mechanize::ResponseCodeError => e
      if RETRYABLE_CODES.include?(e.response_code) && retries < MAX_RETRIES
        retries += 1
        back_off(retries, "HTTP #{e.response_code}")
        retry
      elsif e.response_code == '404'
        puts "Resource not found: #{e.page.uri}"
        nil
      else
        puts "HTTP Error #{e.response_code}: #{e.message}"
        nil
      end
    rescue Net::OpenTimeout, Net::ReadTimeout, Timeout::Error => e
      if retries < MAX_RETRIES
        retries += 1
        back_off(retries, 'Timeout')
        retry
      else
        puts "Max retries exceeded for timeout error: #{e.message}"
        nil
      end
    rescue SocketError, Errno::ECONNREFUSED => e
      if retries < MAX_RETRIES
        retries += 1
        back_off(retries, 'Network')
        retry
      else
        puts "Max retries exceeded for network error: #{e.message}"
        nil
      end
    rescue StandardError => e
      puts "Unexpected error: #{e.class} - #{e.message}"
      puts e.backtrace.first(5).join("\n")
      nil
    end
  end

  def self.back_off(attempt, error_type)
    wait_time = 2**attempt
    puts "#{error_type} error. Retrying in #{wait_time} seconds... (#{attempt}/#{MAX_RETRIES})"
    sleep(wait_time)
  end
  private_class_method :back_off
end
Usage Example
def scrape_product_data(url)
  agent = Mechanize.new
  agent.user_agent_alias = 'Windows Chrome'

  MechanizeErrorHandler.with_error_handling do
    page = agent.get(url)

    # Extract product information
    title = page.search('.product-title').text.strip
    price = page.search('.price').text.strip

    {
      title: title,
      price: price,
      url: url,
      scraped_at: Time.now
    }
  end
end
Advanced Error Handling Strategies
Exponential Backoff for Rate Limiting
Implement exponential backoff when encountering rate limiting:
class RateLimitHandler
  def self.with_rate_limit_handling(max_retries: 5)
    retries = 0
    begin
      yield
    rescue Mechanize::ResponseCodeError => e
      # Re-raise anything that is not a retryable 429, preserving the backtrace
      raise unless e.response_code == '429' && retries < max_retries

      retries += 1
      wait_time = (2**retries) + rand(1..5) # Exponential backoff plus jitter
      puts "Rate limited. Waiting #{wait_time} seconds before retry #{retries}/#{max_retries}"
      sleep(wait_time)
      retry
    end
  end
end
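The wait computation above is worth isolating so it can be unit-tested in isolation from any network calls. A minimal sketch; the helper name backoff_delay is an assumption for illustration, not Mechanize API:

```ruby
# Exponentially growing delay with random jitter, in seconds.
# attempt is the 1-based retry count; jitter spreads out retries
# so many clients do not hammer the server at the same instant.
def backoff_delay(attempt, base: 1, max_jitter: 5)
  (base * 2**attempt) + rand(1..max_jitter)
end

backoff_delay(1) # between 3 and 7 seconds
backoff_delay(3) # between 9 and 13 seconds
```

Because the delay is a pure function of its inputs plus bounded jitter, a test can assert its range without sleeping.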
Circuit Breaker Pattern
Implement a circuit breaker to avoid overwhelming failing services:
class CircuitBreaker
  def initialize(failure_threshold: 5, recovery_timeout: 60)
    @failure_threshold = failure_threshold
    @recovery_timeout = recovery_timeout
    @failure_count = 0
    @last_failure_time = nil
    @state = :closed # :closed, :open, :half_open
  end

  def call
    if @state == :open
      if Time.now - @last_failure_time > @recovery_timeout
        @state = :half_open
      else
        raise "Circuit breaker is OPEN"
      end
    end

    begin
      result = yield
      reset if @state == :half_open
      result
    rescue StandardError => e
      record_failure
      raise e
    end
  end

  private

  def record_failure
    @failure_count += 1
    @last_failure_time = Time.now
    if @failure_count >= @failure_threshold
      @state = :open
      puts "Circuit breaker OPENED after #{@failure_count} failures"
    end
  end

  def reset
    @failure_count = 0
    @state = :closed
    puts "Circuit breaker CLOSED - service recovered"
  end
end
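To make the state transitions concrete, the sketch below drives a condensed, self-contained copy of the breaker (recovery timing omitted for brevity) until it trips open:

```ruby
# Condensed circuit breaker: trips to :open after `threshold`
# consecutive failures and rejects all calls from then on.
class MiniBreaker
  attr_reader :state

  def initialize(threshold: 3)
    @threshold = threshold
    @failures = 0
    @state = :closed
  end

  def call
    # Reject immediately while open - the protected service is not called
    raise 'Circuit breaker is OPEN' if @state == :open

    begin
      result = yield
      @failures = 0 # A success resets the consecutive-failure count
      result
    rescue StandardError
      @failures += 1
      @state = :open if @failures >= @threshold
      raise
    end
  end
end

breaker = MiniBreaker.new(threshold: 3)
3.times do
  begin
    breaker.call { raise 'service down' }
  rescue RuntimeError
    nil # Swallow the failure; the breaker records it
  end
end
breaker.state # => :open
```

After the third consecutive failure the breaker stops invoking the block at all, which is the point of the pattern: the failing service gets breathing room instead of a retry storm.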
Form Handling Error Management
Safe Form Submission
Handle errors specific to form interactions:
def submit_form_safely(page, form_data)
  form = page.forms.first
  raise "No form found on page" if form.nil?

  # Populate form fields safely (field_with returns nil when absent)
  form_data.each do |field_name, value|
    field = form.field_with(name: field_name.to_s)
    if field
      field.value = value
    else
      puts "Warning: Field '#{field_name}' not found in form"
    end
  end

  # Submit the form
  result_page = form.submit

  # Validate submission success
  if result_page.search('.error-message').any?
    error_messages = result_page.search('.error-message').map(&:text)
    raise "Form submission failed: #{error_messages.join(', ')}"
  end

  result_page
rescue Mechanize::ElementNotFoundError => e
  puts "Form element not found: #{e.message}"
  nil
rescue StandardError => e
  puts "Form submission error: #{e.message}"
  nil
end
Logging and Monitoring
Comprehensive Logging Setup
Implement detailed logging for debugging and monitoring:
require 'logger'

class MechanizeScraper
  def initialize
    @agent = Mechanize.new
    @logger = Logger.new('scraper.log')
    @logger.level = Logger::INFO
    setup_mechanize_logging
  end

  def scrape_with_logging(url)
    start_time = Time.now
    begin
      @logger.info("Starting scrape of: #{url}")
      page = @agent.get(url)
      duration = Time.now - start_time
      @logger.info("Completed scrape of #{url} in #{duration.round(2)}s")
      log_request(url, success: true)
      page
    rescue StandardError => e
      duration = Time.now - start_time
      @logger.error("Failed scrape of #{url} after #{duration.round(2)}s: #{e.message}")
      log_request(url, success: false, error: e.message)
      nil
    end
  end

  private

  def setup_mechanize_logging
    @agent.log = @logger
    @agent.agent.http.debug_output = $stdout if ENV['DEBUG']
  end

  def log_request(url, success: true, error: nil)
    if success
      @logger.info("Successfully scraped: #{url}")
    else
      @logger.error("Failed to scrape #{url}: #{error}")
    end
  end
end
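Beyond writing to a file, a custom formatter makes log lines easier to grep and to ship to monitoring tools. A standard-library-only sketch; the line format itself is a suggestion, not a Mechanize convention:

```ruby
require 'logger'
require 'stringio'
require 'time'

buffer = StringIO.new # Stand-in for a log file, so the example is self-contained
logger = Logger.new(buffer)

# One grep-friendly line per event: ISO-8601 UTC timestamp, severity, message
logger.formatter = proc do |severity, time, _progname, msg|
  "#{time.utc.iso8601} [#{severity}] #{msg}\n"
end

logger.error('Failed to scrape https://example.com: HTTP 503')
puts buffer.string
```

The same formatter proc can be assigned to the Logger inside MechanizeScraper; only the output device changes.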
Error Recovery Strategies
Data Validation and Cleanup
Validate scraped data and handle parsing errors:
def validate_and_clean_data(page)
  data = {}

  # Safe text extraction with fallbacks
  data[:title] = extract_text_safely(page, '.title', 'Unknown Title')
  data[:price] = extract_price_safely(page, '.price')
  data[:description] = extract_text_safely(page, '.description', '')

  # Validate required fields (validate_required_fields and
  # DataValidationError are defined by your application)
  validate_required_fields(data)
  data
rescue DataValidationError => e
  puts "Data validation failed: #{e.message}"
  nil
end

def extract_text_safely(page, selector, default = nil)
  element = page.search(selector).first
  return default if element.nil?

  text = element.text.strip
  text.empty? ? default : text
rescue StandardError => e
  puts "Error extracting text from #{selector}: #{e.message}"
  default
end

def extract_price_safely(page, selector)
  price_text = extract_text_safely(page, selector, '0')
  # Strip currency symbols, then thousands separators - Float("1,234.56") raises
  cleaned_price = price_text.gsub(/[^\d.,]/, '').delete(',')
  Float(cleaned_price)
rescue ArgumentError
  puts "Invalid price format: #{price_text}"
  0.0
end
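DataValidationError and validate_required_fields in the example above are application-defined, not part of Mechanize. One minimal way they might look:

```ruby
# Application-level error raised when scraped data is incomplete.
class DataValidationError < StandardError; end

# Raise unless every required key is present and non-empty in the data hash.
def validate_required_fields(data, required: [:title, :price])
  missing = required.select { |key| data[key].nil? || data[key].to_s.strip.empty? }
  return if missing.empty?

  raise DataValidationError, "Missing required fields: #{missing.join(', ')}"
end

validate_required_fields({ title: 'Widget', price: '9.99' }) # passes silently
```

Listing all missing fields in one exception message (rather than failing on the first) makes scraper logs much faster to diagnose.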
Best Practices Summary
Configuration Best Practices
def configure_robust_agent
  agent = Mechanize.new

  # Set reasonable timeouts
  agent.open_timeout = 10
  agent.read_timeout = 30

  # Pick a user agent at random per agent instance
  agent.user_agent_alias = ['Windows Chrome', 'Mac Chrome', 'Linux Firefox'].sample

  # Handle redirects
  agent.redirect_ok = true
  agent.redirection_limit = 5

  # SSL configuration
  agent.verify_mode = OpenSSL::SSL::VERIFY_PEER

  # Start with a clean cookie jar
  agent.cookie_jar.clear

  agent
end
Error Handling Checklist
- Wrap all network operations in appropriate exception handlers
- Implement retry logic with exponential backoff for transient errors
- Log all errors with sufficient detail for debugging
- Validate data before processing to catch parsing issues early
- Use circuit breakers for external service dependencies
- Set appropriate timeouts to avoid hanging requests
- Handle rate limiting gracefully with proper delays
- Monitor and alert on error rates and patterns
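Several of these items (retry logic, exponential backoff, transient-error handling) can be combined into one small generic wrapper. A sketch, assuming you pass in the exception classes your application considers transient; the name with_retries is illustrative:

```ruby
# Retry the given block when one of the listed exception classes is raised,
# sleeping base * 2**attempt seconds between attempts. Any other exception,
# or exhausting max_retries, propagates to the caller.
def with_retries(exceptions, max_retries: 3, base: 1.0)
  attempt = 0
  begin
    yield
  rescue *exceptions
    attempt += 1
    raise if attempt > max_retries # Bare raise re-raises the original error
    sleep(base * 2**attempt)
    retry
  end
end

# Usage: a block that fails twice, then succeeds on the third call
calls = 0
result = with_retries([RuntimeError], base: 0.001) do
  calls += 1
  raise 'transient' if calls < 3
  :ok
end
# result is :ok after 3 calls
```

Wrapping agent.get calls this way with [Net::OpenTimeout, Net::ReadTimeout, SocketError] covers most transient failures without scattering begin/rescue blocks through the scraper.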
Alternative Approaches
While Mechanize excels at form-based scraping, JavaScript-heavy sites may call for handling errors in Puppeteer, which offers more advanced error handling capabilities. For scenarios involving complex user interactions, handling authentication in Puppeteer provides additional error handling context for authentication workflows.
Conclusion
Effective error handling in Mechanize scripts requires a multi-layered approach that addresses network issues, HTTP errors, parsing problems, and data validation. By implementing comprehensive error handling strategies including retry logic, circuit breakers, proper logging, and data validation, you can build robust web scraping applications that gracefully handle the unpredictable nature of web environments.
Remember to always test your error handling code with various failure scenarios and monitor your production scrapers to identify and address new error patterns as they emerge.