What Logging and Monitoring Options Are Available for Mechanize Applications?

When building web scraping applications with Ruby's Mechanize library, proper logging and monitoring are essential for debugging, performance optimization, and maintaining reliable scraping operations. Mechanize provides several built-in logging capabilities, and you can extend these with custom monitoring solutions to create robust, production-ready scraping applications.

Built-in Mechanize Logging

Basic HTTP Request Logging

Mechanize includes built-in logging capabilities that can track HTTP requests, responses, and various internal operations. The primary logging mechanism uses Ruby's standard Logger class:

require 'mechanize'
require 'logger'

# Create a logger instance
logger = Logger.new(STDOUT)
logger.level = Logger::DEBUG

# Create Mechanize agent with logging
agent = Mechanize.new do |a|
  a.log = logger
end

# Your scraping operations will now be logged
page = agent.get('https://example.com')
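
Depending on your Mechanize version, a logger can also be set at the class level so that every agent created afterwards shares it. Treat this as a sketch and verify that Mechanize.log= is available in the version you use:

require 'mechanize'
require 'logger'

# Class-level logger (availability may vary by Mechanize version);
# agents that don't set their own logger fall back to it
Mechanize.log = Logger.new('mechanize_global.log')

agent_one = Mechanize.new
agent_two = Mechanize.new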

Detailed Request/Response Logging

For more control over the log format and destination, subclass Logger with a custom formatter and a dedicated log file, and enable Net::HTTP debug output when you need to inspect raw request and response headers:

require 'mechanize'
require 'logger'

class DetailedLogger < Logger
  def initialize(logdev)
    super(logdev)
    self.level = DEBUG
    self.formatter = proc do |severity, datetime, progname, msg|
      "[#{datetime.strftime('%Y-%m-%d %H:%M:%S')}] #{severity}: #{msg}\n"
    end
  end
end

agent = Mechanize.new do |a|
  a.log = DetailedLogger.new('mechanize.log')
  a.user_agent = 'My Scraper 1.0'
end

# Enable verbose logging for debugging
agent.agent.http.debug_output = $stderr if ENV['DEBUG']

Log Levels and Filtering

Configure different log levels to control the verbosity of your logging:

require 'mechanize'
require 'logger'

logger = Logger.new('scraper.log', 'daily')
logger.level = case ENV['LOG_LEVEL']
               when 'DEBUG' then Logger::DEBUG
               when 'INFO' then Logger::INFO
               when 'WARN' then Logger::WARN
               when 'ERROR' then Logger::ERROR
               else Logger::INFO
               end

agent = Mechanize.new { |a| a.log = logger }

# Log custom messages at different levels
logger.info "Starting scraping session"
logger.debug "Processing page: #{url}"
logger.warn "Rate limit detected, waiting..."
logger.error "Failed to parse response"
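
When a log message is expensive to build (for example, serializing a large page), prefer Ruby Logger's block form: the block is only evaluated when the configured level actually allows the message. A small sketch, assuming page is a previously fetched Mechanize page:

# The block is skipped entirely when the level is above DEBUG,
# so the interpolation below costs nothing in production
logger.debug { "Fetched #{page.uri}: #{page.body.bytesize} bytes" }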

Custom Logging and Monitoring

Request/Response Middleware

Create custom middleware to log specific aspects of your scraping operations:

require 'mechanize'
require 'logger'

class ScrapingMonitor
  attr_reader :request_count, :error_count, :start_time

  def initialize
    @request_count = 0
    @error_count = 0
    @start_time = Time.now
    @logger = Logger.new('scraping_monitor.log')
  end

  def log_request(url, method = 'GET')
    @request_count += 1
    @logger.info "Request ##{@request_count}: #{method} #{url}"
  end

  def log_response(response, duration)
    @logger.info "Response: #{response.code} (#{duration.round(3)}s)"
  end

  def log_error(error, url)
    @error_count += 1
    @logger.error "Error ##{@error_count} at #{url}: #{error.message}"
  end

  def stats
    runtime = Time.now - @start_time
    {
      requests: @request_count,
      errors: @error_count,
      runtime: runtime,
      requests_per_second: @request_count / runtime
    }
  end
end

# Usage with Mechanize
monitor = ScrapingMonitor.new
agent = Mechanize.new

begin
  start_time = Time.now
  monitor.log_request(url)

  page = agent.get(url)

  duration = Time.now - start_time
  monitor.log_response(page, duration)
rescue => error
  monitor.log_error(error, url)
end

puts monitor.stats
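
Rather than calling the monitor by hand around every request, you can attach it to Mechanize's connection hooks so each request and response is logged automatically. The sketch below uses pre_connect_hooks and post_connect_hooks; the exact hook arguments can differ between Mechanize versions, so verify against the version you run (the closure-based timing shown here is also not thread-safe):

monitor = ScrapingMonitor.new
agent = Mechanize.new
started_at = nil

# Called before each request with the HTTP agent and the Net::HTTP request object
agent.pre_connect_hooks << lambda do |_http_agent, request|
  started_at = Time.now
  monitor.log_request(request.path, request.method)
end

# Called after each response with the agent, the URI, the Net::HTTPResponse and the body
agent.post_connect_hooks << lambda do |_http_agent, _uri, response, _body|
  monitor.log_response(response, Time.now - started_at)
end

agent.get('https://example.com')
puts monitor.stats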

Performance Monitoring

Track performance metrics for your scraping operations:

require 'logger'

class PerformanceTracker
  def initialize
    @metrics = {}
    @logger = Logger.new('performance.log')
  end

  def track(operation_name)
    start_time = Time.now
    result = yield
    duration = Time.now - start_time

    @metrics[operation_name] ||= []
    @metrics[operation_name] << duration

    @logger.info "#{operation_name}: #{duration.round(3)}s"

    result
  end

  def report
    @metrics.each do |operation, times|
      avg_time = times.sum / times.length
      max_time = times.max
      min_time = times.min

      @logger.info "#{operation} - Avg: #{avg_time.round(3)}s, " \
                   "Max: #{max_time.round(3)}s, Min: #{min_time.round(3)}s"
    end
  end
end

# Usage
tracker = PerformanceTracker.new
agent = Mechanize.new

result = tracker.track('page_fetch') do
  agent.get('https://example.com')
end

parsed_data = tracker.track('data_extraction') do
  # Your data extraction logic here
  result.search('.content').map(&:text)
end

tracker.report

Error Monitoring and Alerting

Comprehensive Error Handling

Implement robust error handling with detailed logging:

require 'json'
require 'logger'

class ScrapingErrorHandler
  def initialize(logger = nil)
    @logger = logger || Logger.new(STDERR)
    @error_counts = Hash.new(0)
  end

  def handle_error(error, context = {})
    error_type = error.class.name
    @error_counts[error_type] += 1

    error_details = {
      type: error_type,
      message: error.message,
      backtrace: error.backtrace.first(5),
      context: context,
      timestamp: Time.now,
      count: @error_counts[error_type]
    }

    @logger.error error_details.to_json

    # Alert if error count exceeds threshold
    alert_on_error_threshold(error_type) if @error_counts[error_type] >= 5

    error_details
  end

  private

  def alert_on_error_threshold(error_type)
    @logger.fatal "ALERT: #{error_type} occurred #{@error_counts[error_type]} times"
    # Add your alerting logic here (email, Slack, etc.)
  end
end

# Usage
error_handler = ScrapingErrorHandler.new
agent = Mechanize.new

begin
  page = agent.get(url)
rescue Mechanize::ResponseCodeError => e
  error_handler.handle_error(e, { url: url, retry_count: 0 })
rescue Mechanize::Error => e
  error_handler.handle_error(e, { url: url, user_agent: agent.user_agent })
rescue => e
  error_handler.handle_error(e, { url: url, unexpected: true })
end
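
Error handling pairs naturally with retries. The helper below is a simple pattern (not a Mechanize feature): it retries transient failures with exponential backoff and reports every attempt through the error handler defined above:

def fetch_with_retries(agent, url, error_handler, max_attempts: 3)
  attempt = 0
  begin
    attempt += 1
    agent.get(url)
  rescue Mechanize::ResponseCodeError, Net::OpenTimeout, Net::ReadTimeout => e
    error_handler.handle_error(e, { url: url, retry_count: attempt })
    raise if attempt >= max_attempts
    sleep(2**attempt) # exponential backoff: 2s, 4s, 8s...
    retry
  end
end

page = fetch_with_retries(agent, url, error_handler)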

Integration with External Monitoring Services

Connect your Mechanize application with external monitoring services:

require 'net/http'
require 'json'
require 'logger'

class ExternalMonitor
  def initialize(webhook_url, logger = nil)
    @webhook_url = webhook_url
    @logger = logger || Logger.new(STDOUT)
  end

  def send_metric(metric_name, value, tags = {})
    payload = {
      metric: metric_name,
      value: value,
      tags: tags,
      timestamp: Time.now.to_i
    }

    begin
      uri = URI(@webhook_url)
      http = Net::HTTP.new(uri.host, uri.port)
      http.use_ssl = true if uri.scheme == 'https'

      request = Net::HTTP::Post.new(uri)
      request['Content-Type'] = 'application/json'
      request.body = payload.to_json

      response = http.request(request)
      @logger.debug "Metric sent: #{response.code}"
    rescue => e
      @logger.error "Failed to send metric: #{e.message}"
    end
  end

  def send_alert(message, severity = 'warning')
    alert_payload = {
      message: message,
      severity: severity,
      service: 'mechanize-scraper',
      timestamp: Time.now.to_i
    }

    # Send to your monitoring service
    send_metric('alert', 1, alert_payload)
  end
end

# Usage with Mechanize
monitor = ExternalMonitor.new(ENV['MONITORING_WEBHOOK_URL'])
agent = Mechanize.new

begin
  start_time = Time.now
  page = agent.get(url)
  duration = Time.now - start_time

  # Send performance metrics
  monitor.send_metric('request_duration', duration, { url: url })
  monitor.send_metric('request_success', 1, { url: url })

rescue => error
  monitor.send_metric('request_error', 1, { 
    url: url, 
    error_type: error.class.name 
  })
  monitor.send_alert("Scraping failed for #{url}: #{error.message}", 'error')
end

Debugging and Development Tools

Request/Response Inspection

For debugging purposes, implement detailed request and response inspection:

require 'logger'

class RequestInspector
  def initialize(agent)
    @agent = agent
    @logger = Logger.new('debug.log')
  end

  def inspect_request(url, method = :get)
    @logger.debug "=== REQUEST DETAILS ==="
    @logger.debug "URL: #{url}"
    @logger.debug "Method: #{method.upcase}"
    @logger.debug "User-Agent: #{@agent.user_agent}"
    @logger.debug "Cookies: #{@agent.cookies.map(&:to_s).join('; ')}"

    start_time = Time.now

    begin
      response = @agent.send(method, url)
      duration = Time.now - start_time

      @logger.debug "=== RESPONSE DETAILS ==="
      @logger.debug "Status: #{response.code}"
      @logger.debug "Content-Type: #{response.content_type}"
      @logger.debug "Content-Length: #{response.body.length}"
      @logger.debug "Duration: #{duration.round(3)}s"
      @logger.debug "Final URL: #{response.uri}"

      response
    rescue => error
      duration = Time.now - start_time
      @logger.error "=== ERROR DETAILS ==="
      @logger.error "Error: #{error.class.name}: #{error.message}"
      @logger.error "Duration: #{duration.round(3)}s"
      raise
    end
  end
end

# Usage
agent = Mechanize.new
inspector = RequestInspector.new(agent)

page = inspector.inspect_request('https://example.com')

Memory and Resource Monitoring

Monitor memory usage and resource consumption:

require 'logger'
require 'objspace'

class ResourceMonitor
  def initialize
    @logger = Logger.new('resources.log')
    @initial_memory = current_memory_usage
  end

  def current_memory_usage
    `ps -o rss= -p #{Process.pid}`.to_i * 1024 # Convert KB to bytes
  end

  def log_memory_usage(context = '')
    current_memory = current_memory_usage
    memory_increase = current_memory - @initial_memory

    @logger.info "Memory usage#{context.empty? ? '' : " (#{context})"}: " \
                 "#{(current_memory / 1024.0 / 1024.0).round(2)} MB " \
                 "(+#{(memory_increase / 1024.0 / 1024.0).round(2)} MB)"
  end

  def log_object_counts
    object_counts = ObjectSpace.count_objects
    @logger.debug "Object counts: #{object_counts}"
  end
end

# Usage
resource_monitor = ResourceMonitor.new
agent = Mechanize.new

resource_monitor.log_memory_usage('start')

# Your scraping operations
pages = []
100.times do |i|
  pages << agent.get("https://example.com/page/#{i}")

  if (i + 1) % 10 == 0
    resource_monitor.log_memory_usage("after #{i + 1} pages")
    resource_monitor.log_object_counts
  end
end

resource_monitor.log_memory_usage('end')
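
A common source of memory growth in long-running Mechanize scrapers is the agent's page history, which keeps every visited page by default. Capping or clearing it is a simple mitigation; verify max_history against the Mechanize version you use:

agent = Mechanize.new
agent.max_history = 10 # keep only the 10 most recent pages in memory

# Or clear the history periodically during long runs
agent.history.clear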

Production Monitoring Best Practices

Structured Logging

Implement structured logging for better analysis and monitoring:

require 'json'
require 'logger'
require 'time' # for Time#iso8601 used in the formatter

class StructuredLogger
  def initialize(logdev)
    @logger = Logger.new(logdev)
    @logger.formatter = proc do |severity, datetime, progname, msg|
      {
        timestamp: datetime.iso8601,
        level: severity,
        message: msg.is_a?(Hash) ? msg : { text: msg },
        service: 'mechanize-scraper'
      }.to_json + "\n"
    end
  end

  def log(level, data)
    @logger.send(level.downcase, data)
  end

  def info(data); log(:info, data); end
  def debug(data); log(:debug, data); end
  def warn(data); log(:warn, data); end
  def error(data); log(:error, data); end
end

# Usage
logger = StructuredLogger.new('app.log')

logger.info({
  event: 'scraping_started',
  url: 'https://example.com',
  user_agent: agent.user_agent
})

logger.error({
  event: 'scraping_failed',
  url: 'https://example.com',
  error_class: 'Net::ReadTimeout',
  error_message: 'execution expired'
})

Advanced Monitoring Techniques

When building production-grade scraping applications, many monitoring ideas carry over from browser-based scraping: the approaches covered in how to monitor network requests in Puppeteer and how to handle errors in Puppeteer apply, with small adaptations, to any web scraping framework, including Mechanize.

Health Checks and Status Endpoints

Implement health checks for your scraping services:

require 'mechanize'
require 'sinatra'
require 'json'

class ScrapingHealthCheck
  def initialize(agent)
    @agent = agent
    @last_successful_request = nil
    @total_requests = 0
    @failed_requests = 0
    @start_time = Time.now
  end

  def record_success
    @total_requests += 1
    @last_successful_request = Time.now
  end

  def record_failure
    @total_requests += 1
    @failed_requests += 1
  end

  def health_status
    {
      status: overall_status,
      last_successful_request: @last_successful_request,
      total_requests: @total_requests,
      failed_requests: @failed_requests,
      success_rate: success_rate,
      uptime: uptime
    }
  end

  private

  def overall_status
    return 'unhealthy' if @last_successful_request.nil?
    return 'degraded' if success_rate < 0.8
    'healthy'
  end

  def success_rate
    return 0 if @total_requests == 0
    (@total_requests - @failed_requests).to_f / @total_requests
  end

  def uptime
    Time.now - @start_time
  end
end

# Sinatra endpoint for health checks
agent = Mechanize.new
health_check = ScrapingHealthCheck.new(agent)

get '/health' do
  content_type :json
  health_check.health_status.to_json
end
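
To make the endpoint report real numbers, record the outcome of each request from your scraping code. A minimal sketch, assuming url is defined elsewhere:

begin
  page = agent.get(url)
  health_check.record_success
rescue Mechanize::Error
  health_check.record_failure
  raise
end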

Command-Line Monitoring Tools

Create command-line tools for monitoring your Mechanize applications:

#!/bin/bash
# monitor_scraper.sh - Simple monitoring script

LOG_FILE="mechanize.log"
ERROR_THRESHOLD=10

# Count errors logged during the current hour
recent_errors=$(grep "ERROR" "$LOG_FILE" | grep "$(date '+%Y-%m-%d %H')" | wc -l)

if [ "$recent_errors" -gt "$ERROR_THRESHOLD" ]; then
    echo "ALERT: $recent_errors errors detected this hour"
    # Send alert notification
    curl -X POST "$WEBHOOK_URL" \
         -H "Content-Type: application/json" \
         -d "{\"text\": \"Mechanize scraper: $recent_errors errors this hour\"}"
fi

# Check that the scraper process is running and report its memory usage (RSS, in KB)
scraper_pid=$(pgrep -f "mechanize" | head -1)
if [ -n "$scraper_pid" ]; then
    memory_usage=$(ps -o rss= -p "$scraper_pid")
    echo "Memory usage: ${memory_usage}KB"
else
    echo "WARNING: Mechanize process not found"
fi

Metrics Collection and Visualization

Integrate with time-series databases for long-term monitoring:

require 'influxdb'
require 'logger'

class MetricsCollector
  def initialize
    @influxdb = InfluxDB::Client.new(
      host: ENV['INFLUXDB_HOST'] || 'localhost',
      port: ENV['INFLUXDB_PORT'] || 8086,
      database: 'mechanize_metrics'
    )
    @logger = Logger.new('metrics.log')
  end

  def record_request(url, duration, status_code)
    data = {
      values: {
        duration: duration,
        status_code: status_code.to_i
      },
      tags: {
        url: url,
        service: 'mechanize-scraper'
      }
    }

    begin
      # influxdb-ruby's write_point takes the series name and a hash of values/tags
      @influxdb.write_point('http_requests', data)
      @logger.debug "Metrics recorded for #{url}"
    rescue => e
      @logger.error "Failed to record metrics: #{e.message}"
    end
  end

  def record_error(error_type, url)
    data = {
      values: { count: 1 },
      tags: {
        error_type: error_type,
        url: url,
        service: 'mechanize-scraper'
      }
    }

    @influxdb.write_point('scraping_errors', data)
  end
end

# Usage
metrics = MetricsCollector.new
agent = Mechanize.new

begin
  start_time = Time.now
  page = agent.get(url)
  duration = Time.now - start_time

  metrics.record_request(url, duration, page.code)
rescue => error
  metrics.record_error(error.class.name, url)
end

By implementing comprehensive logging and monitoring for your Mechanize applications, you can ensure reliable operation, quick problem identification, and optimal performance. These monitoring capabilities become especially important when scaling your scraping operations or running them in production environments where visibility into application behavior is crucial for maintaining service quality.

Remember to balance logging verbosity with performance considerations, and always ensure that sensitive information is not logged inadvertently. Regular monitoring and alerting will help you maintain robust and reliable web scraping applications.
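
One simple way to keep secrets out of your logs is to scrub known-sensitive values before they are written. The sketch below redacts query-string credentials with a custom formatter; the parameter names are illustrative assumptions, so adjust them to match your application:

require 'logger'

# Hypothetical parameter names to redact; adapt them to your own URLs
SENSITIVE_PARAMS = %w[api_key token password].freeze

logger = Logger.new('scraper.log')
logger.formatter = proc do |severity, datetime, _progname, msg|
  redacted = msg.to_s.gsub(/(#{SENSITIVE_PARAMS.join('|')})=[^&\s]+/, '\1=[REDACTED]')
  "[#{datetime.strftime('%Y-%m-%d %H:%M:%S')}] #{severity}: #{redacted}\n"
end

# "...?q=term&api_key=secret123" is logged as "...?q=term&api_key=[REDACTED]"
logger.info "GET https://example.com/search?q=term&api_key=secret123"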

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
