How do you optimize Mechanize scripts for better performance and reliability?
Optimizing Mechanize scripts is essential for building robust, efficient web scrapers. This guide covers practical strategies to improve performance and reliability so your Mechanize-based scrapers can handle production workloads.
Performance Optimization Strategies
1. Connection and Timeout Management
Proper connection management is fundamental for performance optimization:
require 'mechanize'

# Create an optimized agent
agent = Mechanize.new do |a|
  # Set reasonable timeouts
  a.open_timeout = 10 # Connection timeout
  a.read_timeout = 30 # Read timeout
  a.idle_timeout = 5  # Keep-alive idle timeout

  # Reuse persistent (keep-alive) connections
  a.keep_alive = true
  a.max_history = 0 # Disable page history to save memory

  # Verify SSL certificates
  a.verify_mode = OpenSSL::SSL::VERIFY_PEER
  a.ca_file = '/etc/ssl/certs/ca-certificates.crt'
end

# Configure user agent rotation
user_agents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
]
agent.user_agent = user_agents.sample
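Sampling once gives every request in the session the same fingerprint. To pick a fresh user agent per request, a small wrapper can reset it before each call (fetch_with_random_ua is a hypothetical helper, not a Mechanize method):

def fetch_with_random_ua(agent, url, user_agents)
  # Choose a new user agent string for this request only
  agent.user_agent = user_agents.sample
  agent.get(url)
end

page = fetch_with_random_ua(agent, 'https://example.com', user_agents)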
2. Memory Management
Prevent memory leaks in long-running scripts:
require 'mechanize'
require 'set'

class OptimizedScraper
  def initialize
    @agent = Mechanize.new
    @agent.max_history = 0 # Crucial for memory optimization
    @processed_urls = Set.new
  end

  def scrape_pages(urls)
    urls.each_with_index do |url, index|
      begin
        process_page(url)

        # Periodic garbage collection for large datasets
        if index % 100 == 0
          GC.start
          puts "Processed #{index} pages, memory: #{memory_usage}MB"
        end
      rescue => e
        handle_error(e, url)
      end
    end
  end

  private

  def process_page(url)
    return if @processed_urls.include?(url)

    page = @agent.get(url)
    extract_data(page)
    @processed_urls.add(url)
    # With max_history = 0 the agent keeps no reference to the page,
    # so it becomes eligible for garbage collection when this method returns
  end

  def memory_usage
    # Resident set size of the current process, in MB (Linux/macOS)
    `ps -o rss= -p #{Process.pid}`.to_i / 1024
  end
end
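extract_data and handle_error are left undefined in the class above; purely illustrative placeholders (real versions depend on the pages you scrape) might look like this:

# Inside OptimizedScraper (illustrative placeholders)
def extract_data(page)
  # Placeholder: capture the URL and page title; replace with real parsing logic
  { url: page.uri.to_s, title: page.title }
end

def handle_error(error, url)
  warn "Failed on #{url}: #{error.class}: #{error.message}"
end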
3. Concurrent Processing
Implement thread-safe concurrent processing for better throughput:
require 'concurrent' # provided by the concurrent-ruby gem
require 'mechanize'

class ConcurrentScraper
  def initialize(max_threads: 5)
    @max_threads = max_threads
    @thread_pool = Concurrent::ThreadPoolExecutor.new(
      min_threads: 1,
      max_threads: @max_threads,
      max_queue: 100,
      fallback_policy: :caller_runs # run in the submitting thread instead of rejecting when the queue is full
    )
  end

  def scrape_urls(urls)
    futures = urls.map do |url|
      Concurrent::Future.execute(executor: @thread_pool) do
        scrape_single_url(url)
      end
    end

    # Wait for all tasks to complete
    results = futures.map(&:value)

    @thread_pool.shutdown
    @thread_pool.wait_for_termination
    results.compact
  end

  private

  def scrape_single_url(url)
    # Each thread gets its own agent instance
    agent = create_agent
    begin
      page = agent.get(url)
      extract_data(page)
    rescue => e
      puts "Error scraping #{url}: #{e.message}"
      nil
    ensure
      agent&.shutdown
    end
  end

  def create_agent
    Mechanize.new do |a|
      a.open_timeout = 10
      a.read_timeout = 30
      a.max_history = 0
      a.user_agent = random_user_agent
    end
  end
end
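random_user_agent is referenced above (and in later examples) but never defined. A minimal helper, assuming you maintain your own list of user agent strings, could be mixed into any scraper class that needs it:

module UserAgentHelper
  USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
  ].freeze

  # Return a random user agent string for each new agent
  def random_user_agent
    USER_AGENTS.sample
  end
end

# Mix it into the classes that call random_user_agent, for example:
class ConcurrentScraper
  include UserAgentHelper
end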
Reliability Enhancement Techniques
1. Robust Error Handling and Retry Logic
Implement comprehensive error handling with exponential backoff:
require 'mechanize'

class ReliableScraper
  MAX_RETRIES = 3
  BASE_DELAY = 1

  attr_reader :agent

  def initialize
    @agent = Mechanize.new
  end

  def fetch_with_retry(url, retries: 0)
    agent.get(url)
  rescue Net::OpenTimeout, Net::ReadTimeout, Errno::ETIMEDOUT, Errno::ECONNRESET => e
    if retries < MAX_RETRIES
      delay = BASE_DELAY * (2 ** retries) + rand(0.1..0.5)
      puts "Retry #{retries + 1} for #{url} after #{delay.round(2)}s (#{e.class})"
      sleep(delay)
      fetch_with_retry(url, retries: retries + 1)
    else
      puts "Failed to fetch #{url} after #{MAX_RETRIES} retries"
      handle_permanent_failure(url, e)
      nil
    end
  rescue Mechanize::ResponseCodeError => e
    case e.response_code
    when '404', '410'
      puts "Page not found: #{url}"
      nil
    when '429', '503'
      # Rate limited or service unavailable
      backoff_time = extract_retry_after(e.page) || 60
      puts "Rate limited, backing off for #{backoff_time}s"
      sleep(backoff_time)
      fetch_with_retry(url, retries: retries + 1) if retries < MAX_RETRIES
    else
      raise e
    end
  end

  private

  def extract_retry_after(page)
    retry_after = page.response['Retry-After']
    retry_after&.to_i
  end

  def handle_permanent_failure(url, error)
    # Log to file, database, or monitoring system
    File.open('failed_urls.log', 'a') do |f|
      f.puts "#{Time.now}: #{url} - #{error.class}: #{error.message}"
    end
  end
end
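The extract_retry_after method above assumes the header is a plain number of seconds, but per the HTTP spec Retry-After may also be an HTTP date. An optional drop-in refinement (not part of the original snippet) that handles both forms:

# Inside ReliableScraper
require 'time' # provides Time.httpdate

def extract_retry_after(page)
  value = page && page.response['Retry-After']
  return nil unless value

  if value.match?(/\A\d+\z/)
    value.to_i                                       # delta-seconds form
  else
    [(Time.httpdate(value) - Time.now).ceil, 0].max  # HTTP-date form
  end
rescue ArgumentError
  nil # unparseable header; caller falls back to the default backoff
end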
2. Rate Limiting and Respectful Scraping
Implement adaptive rate limiting to avoid being blocked:
require 'mechanize'

class RateLimitedScraper
  def initialize
    @agent = Mechanize.new
    @window_start = Time.now
    @request_count = 0
    @rate_limit_window = 60 # seconds
    @max_requests_per_window = 60
  end

  def get_page(url)
    enforce_rate_limit

    start_time = Time.now
    page = @agent.get(url)
    response_time = Time.now - start_time

    # Adaptive delay based on response time
    adaptive_delay = calculate_adaptive_delay(response_time)
    sleep(adaptive_delay) if adaptive_delay > 0

    @request_count += 1
    page
  end

  private

  def enforce_rate_limit
    now = Time.now

    if now - @window_start >= @rate_limit_window
      # Window has elapsed; start counting again
      @window_start = now
      @request_count = 0
    elsif @request_count >= @max_requests_per_window
      sleep_time = @rate_limit_window - (now - @window_start)
      if sleep_time > 0
        puts "Rate limit reached, sleeping for #{sleep_time.round(2)}s"
        sleep(sleep_time)
      end
      @window_start = Time.now
      @request_count = 0
    end
  end

  def calculate_adaptive_delay(response_time)
    case response_time
    when 0..1
      0.5 # Fast response, minimal delay
    when 1..3
      1.0 # Normal response, standard delay
    when 3..10
      2.0 # Slow response, longer delay
    else
      5.0 # Very slow response, significant delay
    end
  end
end
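Usage is then a drop-in replacement for a plain agent.get call (the URL below is just a placeholder):

scraper = RateLimitedScraper.new
page = scraper.get_page('https://example.com/products?page=1')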
3. Session Management and Cookie Persistence
Maintain session state across requests for better reliability:
require 'mechanize'

class SessionAwareScraper
  def initialize(cookie_jar_path: 'cookies.yml')
    @cookie_jar_path = cookie_jar_path
    @agent = Mechanize.new
    load_cookies

    # Set up automatic cookie saving
    at_exit { save_cookies }
  end

  def login(username, password, login_url)
    login_page = @agent.get(login_url)
    login_form = login_page.form_with(action: /login|signin/)
    return false unless login_form

    login_form.field_with(name: /username|email/).value = username
    login_form.field_with(name: /password/).value = password

    result_page = @agent.submit(login_form)

    # Verify login success
    login_successful = !result_page.uri.to_s.include?('login') &&
                       !result_page.search('.error, .alert-danger').any?

    save_cookies if login_successful
    login_successful
  end

  def scrape_with_session(urls)
    results = []

    urls.each do |url|
      begin
        page = @agent.get(url)

        # Check if session expired
        if session_expired?(page)
          puts "Session expired, attempting re-login..."
          if re_authenticate
            page = @agent.get(url) # Retry after re-authentication
          else
            puts "Re-authentication failed"
            break
          end
        end

        results << extract_data(page)
      rescue => e
        puts "Error processing #{url}: #{e.message}"
      end
    end

    results
  end

  private

  def load_cookies
    if File.exist?(@cookie_jar_path)
      @agent.cookie_jar.load(@cookie_jar_path)
      puts "Loaded cookies from #{@cookie_jar_path}"
    end
  end

  def save_cookies
    @agent.cookie_jar.save(@cookie_jar_path)
    puts "Saved cookies to #{@cookie_jar_path}"
  end

  def session_expired?(page)
    page.uri.to_s.include?('login') ||
      page.search('.login-required, .session-expired').any?
  end
end
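re_authenticate is called in scrape_with_session but not defined above. A minimal sketch, assuming login stores its arguments for later reuse, could be added to the class:

# Inside SessionAwareScraper (sketch only; assumes login saves its arguments, e.g.
#   @credentials = { username: username, password: password, login_url: login_url })
def re_authenticate
  return false unless @credentials

  login(@credentials[:username], @credentials[:password], @credentials[:login_url])
end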
Advanced Optimization Techniques
1. Proxy Rotation and IP Management
For large-scale scraping, implement proxy rotation to distribute requests:
require 'mechanize'
require 'set'

class ProxyRotatingScraper
  def initialize(proxy_list)
    @proxy_list = proxy_list
    @proxies = proxy_list.cycle
    @current_proxy = nil
    @failed_proxies = Set.new
    @agent = nil
    setup_agent
  end

  def get_page_with_proxy_rotation(url, max_proxy_attempts: 3)
    attempts = 0
    begin
      page = @agent.get(url)
      reset_proxy_failure_count if page
      page
    rescue => e
      attempts += 1
      puts "Proxy #{@current_proxy[:host]} failed: #{e.message}"
      @failed_proxies.add(@current_proxy)

      if attempts < max_proxy_attempts
        rotate_proxy
        retry
      else
        raise "All proxy attempts failed for #{url}"
      end
    end
  end

  private

  def setup_agent
    rotate_proxy
  end

  def rotate_proxy
    loop do
      @current_proxy = @proxies.next
      break unless @failed_proxies.include?(@current_proxy)

      # If all proxies have failed, clear the failed set and try them again
      if @failed_proxies.size >= @proxy_list.size
        @failed_proxies.clear
        puts "Cleared failed proxies list, retrying all proxies"
        break
      end
    end

    create_agent_with_proxy(@current_proxy)
  end

  def create_agent_with_proxy(proxy)
    @agent = Mechanize.new do |a|
      if proxy[:type] == 'http'
        a.set_proxy(proxy[:host], proxy[:port], proxy[:user], proxy[:password])
      end
      a.open_timeout = 15
      a.read_timeout = 30
      a.user_agent = random_user_agent
    end
    puts "Using proxy: #{proxy[:host]}:#{proxy[:port]}"
  end

  def reset_proxy_failure_count
    @failed_proxies.delete(@current_proxy)
  end
end
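The constructor expects proxy_list to be an array of hashes with the keys used in create_agent_with_proxy; for example (placeholder hosts and credentials):

proxies = [
  { type: 'http', host: 'proxy1.example.com', port: 8080, user: 'scraper', password: 'secret' },
  { type: 'http', host: 'proxy2.example.com', port: 3128, user: nil, password: nil }
]

scraper = ProxyRotatingScraper.new(proxies)
page = scraper.get_page_with_proxy_rotation('https://example.com')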
2. Intelligent Content Parsing
Optimize parsing for better performance and reliability:
require 'uri'

class OptimizedParser
  def extract_data_efficiently(page)
    # Prefer concise CSS selectors where possible (easier to read and maintain than XPath)
    title = page.at_css('h1, .title, [data-title]')&.text&.strip

    # Cache frequently used selectors
    @content_selector ||= 'article, .content, .post-body, main'
    content = page.at_css(@content_selector)&.text&.strip

    # Batch process multiple elements
    links = page.css('a[href]').map do |link|
      {
        text: link.text.strip,
        href: link['href'],
        title: link['title']
      }
    end.reject { |link| link[:text].empty? }

    # Use lazy evaluation for expensive operations
    images = lazy_extract_images(page) if needs_images?

    {
      title: title,
      content: content,
      links: links,
      images: images,
      scraped_at: Time.now
    }
  end

  private

  def lazy_extract_images(page)
    page.css('img[src]').lazy.map do |img|
      src = img['src']
      next if src.nil? || src.start_with?('data:')

      {
        src: absolute_url(src, page.uri),
        alt: img['alt'],
        title: img['title']
      }
    end.reject(&:nil?).force
  end

  def absolute_url(relative_url, base_uri)
    URI.join(base_uri, relative_url).to_s
  rescue URI::InvalidURIError
    relative_url
  end
end
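needs_images? is referenced in extract_data_efficiently but not defined; a minimal placeholder, assuming image extraction is toggled via a constructor flag, could be added to OptimizedParser:

# Inside OptimizedParser (illustrative placeholder)
def initialize(extract_images: false)
  @extract_images = extract_images
end

def needs_images?
  @extract_images
end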
3. Monitoring and Logging
Implement comprehensive monitoring for production environments:
require 'mechanize'
require 'logger'

class MonitoredScraper
  def initialize
    @agent = Mechanize.new
    @stats = {
      requests: 0,
      successes: 0,
      failures: 0,
      start_time: Time.now
    }
    setup_logging
  end

  def scrape_with_monitoring(urls)
    urls.each do |url|
      begin
        start_time = Time.now
        page = @agent.get(url)
        response_time = Time.now - start_time

        log_success(url, response_time)
        @stats[:successes] += 1

        yield page if block_given?
      rescue => e
        log_error(url, e)
        @stats[:failures] += 1
      ensure
        @stats[:requests] += 1

        # Report stats periodically
        report_stats if @stats[:requests] % 100 == 0
      end
    end

    final_report
  end

  private

  def setup_logging
    @logger = Logger.new('scraper.log', 'daily')
    @logger.level = Logger::INFO
    @logger.formatter = proc do |severity, datetime, progname, msg|
      "#{datetime.strftime('%Y-%m-%d %H:%M:%S')} [#{severity}] #{msg}\n"
    end
  end

  def log_success(url, response_time)
    @logger.info("SUCCESS: #{url} (#{response_time.round(3)}s)")
  end

  def log_error(url, error)
    @logger.error("ERROR: #{url} - #{error.class}: #{error.message}")
  end

  def report_stats
    uptime = Time.now - @stats[:start_time]
    success_rate = (@stats[:successes].to_f / @stats[:requests] * 100).round(2)
    rate = (@stats[:requests] / uptime * 60).round(2)

    puts "\n--- Stats Report ---"
    puts "Requests: #{@stats[:requests]}"
    puts "Success Rate: #{success_rate}%"
    puts "Rate: #{rate} requests/minute"
    puts "Uptime: #{uptime.round(0)}s"
    puts "-------------------\n"
  end

  def final_report
    report_stats
    @logger.info("Scraping session completed: #{@stats}")
  end
end
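The block passed to scrape_with_monitoring receives each successfully fetched page, which keeps parsing separate from the monitoring code (URLs below are placeholders):

scraper = MonitoredScraper.new
scraper.scrape_with_monitoring(['https://example.com/a', 'https://example.com/b']) do |page|
  puts page.title
end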
Database Integration and Data Storage
Efficiently store scraped data to avoid bottlenecks:
require 'mechanize'
require 'sqlite3'
require 'json'

class DatabaseOptimizedScraper
  def initialize(db_path: 'scraped_data.db')
    @agent = Mechanize.new
    @agent.max_history = 0
    @db = SQLite3::Database.new(db_path)
    @batch_size = 100
    @batch_data = []
    setup_database
  end

  def scrape_and_store(urls)
    urls.each do |url|
      begin
        page = @agent.get(url)
        data = extract_data(page)

        # Batch insert for better performance
        @batch_data << data

        if @batch_data.size >= @batch_size
          insert_batch
          @batch_data.clear
        end
      rescue => e
        puts "Error processing #{url}: #{e.message}"
      end
    end

    # Insert remaining data
    insert_batch unless @batch_data.empty?
  end

  private

  def setup_database
    @db.execute <<-SQL
      CREATE TABLE IF NOT EXISTS scraped_pages (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT UNIQUE,
        title TEXT,
        content TEXT,
        metadata TEXT,
        scraped_at DATETIME DEFAULT CURRENT_TIMESTAMP
      )
    SQL

    # Create index for faster lookups
    @db.execute "CREATE INDEX IF NOT EXISTS idx_url ON scraped_pages(url)"
  end

  def insert_batch
    @db.transaction do
      stmt = @db.prepare(
        "INSERT OR REPLACE INTO scraped_pages (url, title, content, metadata)
         VALUES (?, ?, ?, ?)"
      )

      @batch_data.each do |data|
        stmt.execute(
          data[:url],
          data[:title],
          data[:content],
          data[:metadata].to_json
        )
      end

      stmt.close
    end

    puts "Inserted batch of #{@batch_data.size} records"
  end
end
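A hypothetical run, assuming extract_data returns a hash with the :url, :title, :content, and :metadata keys the INSERT expects, and urls.txt holds one URL per line:

scraper = DatabaseOptimizedScraper.new(db_path: 'scraped_data.db')
scraper.scrape_and_store(File.readlines('urls.txt', chomp: true))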
Configuration Management
Centralize configuration for easier optimization:
# config/scraper_config.yml
development:
  timeouts:
    open_timeout: 10
    read_timeout: 30
    idle_timeout: 5
  rate_limiting:
    requests_per_minute: 60
    adaptive_delay: true
    respect_retry_after: true
  concurrency:
    max_threads: 5
    thread_pool_size: 10
  memory:
    max_history: 0
    gc_frequency: 100
  reliability:
    max_retries: 3
    base_delay: 1
    exponential_backoff: true

production:
  timeouts:
    open_timeout: 15
    read_timeout: 45
    idle_timeout: 10
  rate_limiting:
    requests_per_minute: 30
    adaptive_delay: true
    respect_retry_after: true
  concurrency:
    max_threads: 10
    thread_pool_size: 20
  memory:
    max_history: 0
    gc_frequency: 50
  reliability:
    max_retries: 5
    base_delay: 2
    exponential_backoff: true
require 'yaml'
require 'mechanize'

class ConfigurableScraper
  def initialize(env: 'development')
    @config = YAML.load_file('config/scraper_config.yml')[env]
    @agent = create_configured_agent
  end

  private

  def create_configured_agent
    Mechanize.new do |a|
      # Apply timeout settings
      a.open_timeout = @config['timeouts']['open_timeout']
      a.read_timeout = @config['timeouts']['read_timeout']
      a.idle_timeout = @config['timeouts']['idle_timeout']

      # Apply memory settings
      a.max_history = @config['memory']['max_history']
      a.keep_alive = true

      # Set up user agent rotation
      a.user_agent = random_user_agent
    end
  end
end
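Selecting the environment at runtime then becomes a one-liner (SCRAPER_ENV is an assumed environment variable name, not something Mechanize reads itself):

scraper = ConfigurableScraper.new(env: ENV.fetch('SCRAPER_ENV', 'development'))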
Best Practices Summary
Performance Optimization Checklist
- Connection Management: Use keep-alive connections and appropriate timeouts
- Memory Management: Disable page history and implement periodic garbage collection
- Concurrent Processing: Use thread pools for parallel processing
- Caching: Cache frequently accessed elements and selectors
- Efficient Parsing: Prefer CSS selectors over XPath when possible
- Database Optimization: Use batch inserts and proper indexing
Reliability Enhancement Checklist
- Error Handling: Implement retry logic with exponential backoff
- Rate Limiting: Respect server resources with adaptive delays
- Session Management: Persist cookies and handle session expiration
- Monitoring: Log all activities and track performance metrics
- Proxy Rotation: Distribute requests across multiple IP addresses
- Configuration Management: Use environment-specific settings
Command Line Monitoring
Monitor your scraper's performance in real-time:
# Monitor memory usage
watch -n 5 'ps aux | grep ruby | grep -v grep'
# Monitor network connections
netstat -an | grep :80 | wc -l
# Monitor log files
tail -f scraper.log | grep ERROR
# Check system load
uptime && free -h
When building large-scale scrapers, consider complementing Mechanize with browser automation tools for JavaScript-heavy sites or implementing robust error handling patterns that can be adapted across different scraping technologies.
By implementing these optimization strategies, your Mechanize scripts will be more performant, reliable, and capable of handling production workloads while maintaining respectful scraping practices. Remember to always monitor your scrapers in production and adjust configurations based on the specific requirements of your target websites.