What Logging and Monitoring Options Are Available for Mechanize Applications?
When building web scraping applications with Ruby's Mechanize library, proper logging and monitoring are essential for debugging, performance optimization, and maintaining reliable scraping operations. Mechanize provides several built-in logging capabilities, and you can extend these with custom monitoring solutions to create robust, production-ready scraping applications.
Built-in Mechanize Logging
Basic HTTP Request Logging
Mechanize includes built-in logging capabilities that can track HTTP requests, responses, and various internal operations. The primary logging mechanism uses Ruby's standard Logger class:
require 'mechanize'
require 'logger'
# Create a logger instance
logger = Logger.new(STDOUT)
logger.level = Logger::DEBUG
# Create Mechanize agent with logging
agent = Mechanize.new do |a|
a.log = logger
end
# Your scraping operations will now be logged
page = agent.get('https://example.com')
Detailed Request/Response Logging
For more detailed logging that includes request headers, response codes, and timing information:
require 'mechanize'
require 'logger'
class DetailedLogger < Logger
def initialize(logdev)
super(logdev)
self.level = DEBUG
self.formatter = proc do |severity, datetime, progname, msg|
"[#{datetime.strftime('%Y-%m-%d %H:%M:%S')}] #{severity}: #{msg}\n"
end
end
end
agent = Mechanize.new do |a|
a.log = DetailedLogger.new('mechanize.log')
a.user_agent = 'My Scraper 1.0'
end
# Enable verbose logging for debugging
agent.agent.http.debug_output = $stderr if ENV['DEBUG']
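Mechanize also exposes pre_connect_hooks and post_connect_hooks on the agent, which let you log request headers, response codes, and timing without wrapping every call. The sketch below assumes the hook signatures used by recent Mechanize versions (a pre-connect hook receives the agent and the request; a post-connect hook receives the agent, URI, response, and body):
require 'mechanize'
require 'logger'
logger = Logger.new('http_trace.log')
agent = Mechanize.new
# Log the outgoing request line and headers before each connection
log_request_hook = lambda do |_agent, request|
  logger.debug "Request: #{request.method} #{request.path}"
  request.each_header { |name, value| logger.debug "  #{name}: #{value}" }
end
# Log the response status after each connection completes
log_response_hook = lambda do |_agent, uri, response, _body|
  logger.debug "Response: #{response.code} for #{uri}"
end
agent.pre_connect_hooks << log_request_hook
agent.post_connect_hooks << log_response_hook
page = agent.get('https://example.com')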
Log Levels and Filtering
Configure different log levels to control the verbosity of your logging:
require 'mechanize'
require 'logger'
logger = Logger.new('scraper.log', 'daily')
logger.level = case ENV['LOG_LEVEL']
when 'DEBUG' then Logger::DEBUG
when 'INFO' then Logger::INFO
when 'WARN' then Logger::WARN
when 'ERROR' then Logger::ERROR
else Logger::INFO
end
agent = Mechanize.new { |a| a.log = logger }
# Log custom messages at different levels
logger.info "Starting scraping session"
logger.debug "Processing page: #{url}"
logger.warn "Rate limit detected, waiting..."
logger.error "Failed to parse response"
Custom Logging and Monitoring
Request/Response Middleware
Create custom middleware to log specific aspects of your scraping operations:
class ScrapingMonitor
attr_reader :request_count, :error_count, :start_time
def initialize
@request_count = 0
@error_count = 0
@start_time = Time.now
@logger = Logger.new('scraping_monitor.log')
end
def log_request(url, method = 'GET')
@request_count += 1
@logger.info "Request ##{@request_count}: #{method} #{url}"
end
def log_response(response, duration)
@logger.info "Response: #{response.code} (#{duration.round(3)}s)"
end
def log_error(error, url)
@error_count += 1
@logger.error "Error ##{@error_count} at #{url}: #{error.message}"
end
def stats
runtime = Time.now - @start_time
{
requests: @request_count,
errors: @error_count,
runtime: runtime,
requests_per_second: @request_count / runtime
}
end
end
# Usage with Mechanize
monitor = ScrapingMonitor.new
agent = Mechanize.new
begin
start_time = Time.now
monitor.log_request(url)
page = agent.get(url)
duration = Time.now - start_time
monitor.log_response(page, duration)
rescue => error
monitor.log_error(error, url)
end
puts monitor.stats
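To avoid repeating this begin/rescue bookkeeping for every request, you can wrap it in a small helper. A minimal sketch (fetch_with_monitoring is an illustrative name, not part of Mechanize):
def fetch_with_monitoring(agent, monitor, url)
  monitor.log_request(url)
  start_time = Time.now
  page = agent.get(url)
  monitor.log_response(page, Time.now - start_time)
  page
rescue => error
  monitor.log_error(error, url)
  nil # let the caller decide how to handle a failed fetch
end
page = fetch_with_monitoring(agent, monitor, 'https://example.com')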
Performance Monitoring
Track performance metrics for your scraping operations:
class PerformanceTracker
def initialize
@metrics = {}
@logger = Logger.new('performance.log')
end
def track(operation_name)
start_time = Time.now
result = yield
duration = Time.now - start_time
@metrics[operation_name] ||= []
@metrics[operation_name] << duration
@logger.info "#{operation_name}: #{duration.round(3)}s"
result
end
def report
@metrics.each do |operation, times|
avg_time = times.sum / times.length
max_time = times.max
min_time = times.min
@logger.info "#{operation} - Avg: #{avg_time.round(3)}s, " \
"Max: #{max_time.round(3)}s, Min: #{min_time.round(3)}s"
end
end
end
# Usage
tracker = PerformanceTracker.new
agent = Mechanize.new
result = tracker.track('page_fetch') do
agent.get('https://example.com')
end
parsed_data = tracker.track('data_extraction') do
# Your data extraction logic here
result.search('.content').map(&:text)
end
tracker.report
Error Monitoring and Alerting
Comprehensive Error Handling
Implement robust error handling with detailed logging:
require 'json'
require 'logger'
class ScrapingErrorHandler
def initialize(logger = nil)
@logger = logger || Logger.new(STDERR)
@error_counts = Hash.new(0)
end
def handle_error(error, context = {})
error_type = error.class.name
@error_counts[error_type] += 1
error_details = {
type: error_type,
message: error.message,
backtrace: error.backtrace.first(5),
context: context,
timestamp: Time.now,
count: @error_counts[error_type]
}
@logger.error error_details.to_json
# Alert if error count exceeds threshold
alert_on_error_threshold(error_type) if @error_counts[error_type] >= 5
error_details
end
private
def alert_on_error_threshold(error_type)
@logger.fatal "ALERT: #{error_type} occurred #{@error_counts[error_type]} times"
# Add your alerting logic here (email, Slack, etc.)
end
end
# Usage
error_handler = ScrapingErrorHandler.new
agent = Mechanize.new
begin
page = agent.get(url)
rescue Mechanize::ResponseCodeError => e
error_handler.handle_error(e, { url: url, retry_count: 0 })
rescue Mechanize::Error => e
error_handler.handle_error(e, { url: url, user_agent: agent.user_agent })
rescue => e
error_handler.handle_error(e, { url: url, unexpected: true })
end
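Building on the handler above, you can combine it with a retry-and-backoff loop so that transient failures are logged without aborting the whole run. A minimal sketch (the retry limit and sleep times are illustrative):
MAX_RETRIES = 3
def fetch_with_retries(agent, error_handler, url)
  attempts = 0
  begin
    agent.get(url)
  rescue Mechanize::Error, Net::OpenTimeout, Net::ReadTimeout => e
    attempts += 1
    error_handler.handle_error(e, { url: url, retry_count: attempts })
    if attempts < MAX_RETRIES
      sleep(2**attempts) # exponential backoff: 2s, 4s, 8s
      retry
    end
    nil # give up after MAX_RETRIES attempts
  end
end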
Integration with External Monitoring Services
Connect your Mechanize application with external monitoring services:
require 'net/http'
require 'json'
require 'logger'
class ExternalMonitor
def initialize(webhook_url, logger = nil)
@webhook_url = webhook_url
@logger = logger || Logger.new(STDOUT)
end
def send_metric(metric_name, value, tags = {})
payload = {
metric: metric_name,
value: value,
tags: tags,
timestamp: Time.now.to_i
}
begin
uri = URI(@webhook_url)
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true if uri.scheme == 'https'
request = Net::HTTP::Post.new(uri)
request['Content-Type'] = 'application/json'
request.body = payload.to_json
response = http.request(request)
@logger.debug "Metric sent: #{response.code}"
rescue => e
@logger.error "Failed to send metric: #{e.message}"
end
end
def send_alert(message, severity = 'warning')
alert_payload = {
message: message,
severity: severity,
service: 'mechanize-scraper',
timestamp: Time.now.to_i
}
# Send to your monitoring service
send_metric('alert', 1, alert_payload)
end
end
# Usage with Mechanize
monitor = ExternalMonitor.new(ENV['MONITORING_WEBHOOK_URL'])
agent = Mechanize.new
begin
start_time = Time.now
page = agent.get(url)
duration = Time.now - start_time
# Send performance metrics
monitor.send_metric('request_duration', duration, { url: url })
monitor.send_metric('request_success', 1, { url: url })
rescue => error
monitor.send_metric('request_error', 1, {
url: url,
error_type: error.class.name
})
monitor.send_alert("Scraping failed for #{url}: #{error.message}", 'error')
end
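Because each webhook call blocks the scraping loop, you may want to deliver metrics from a background thread. One possible sketch using Ruby's built-in Queue (this queueing layer is an addition, not part of the ExternalMonitor class above):
# Enqueue metrics and deliver them asynchronously so slow webhook
# calls do not stall the scraper itself
metric_queue = Queue.new
delivery_thread = Thread.new do
  while (job = metric_queue.pop)
    monitor.send_metric(job[:name], job[:value], job[:tags])
  end
end
# Producers enqueue instead of calling send_metric directly
metric_queue << { name: 'request_success', value: 1, tags: { url: 'https://example.com' } }
# On shutdown, push a nil sentinel to stop the loop and wait for it to drain
metric_queue << nil
delivery_thread.join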
Debugging and Development Tools
Request/Response Inspection
For debugging purposes, implement detailed request and response inspection:
class RequestInspector
def initialize(agent)
@agent = agent
@logger = Logger.new('debug.log')
end
def inspect_request(url, method = :get)
@logger.debug "=== REQUEST DETAILS ==="
@logger.debug "URL: #{url}"
@logger.debug "Method: #{method.upcase}"
@logger.debug "User-Agent: #{@agent.user_agent}"
@logger.debug "Cookies: #{@agent.cookies.map(&:to_s).join('; ')}"
start_time = Time.now
begin
response = @agent.send(method, url)
duration = Time.now - start_time
@logger.debug "=== RESPONSE DETAILS ==="
@logger.debug "Status: #{response.code}"
@logger.debug "Content-Type: #{response.content_type}"
@logger.debug "Content-Length: #{response.body.length}"
@logger.debug "Duration: #{duration.round(3)}s"
@logger.debug "Final URL: #{response.uri}"
response
rescue => error
duration = Time.now - start_time
@logger.error "=== ERROR DETAILS ==="
@logger.error "Error: #{error.class.name}: #{error.message}"
@logger.error "Duration: #{duration.round(3)}s"
raise
end
end
end
# Usage
agent = Mechanize.new
inspector = RequestInspector.new(agent)
page = inspector.inspect_request('https://example.com')
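If you need to see the exact response headers Mechanize received, the page object exposes them as a hash-like structure; saving the raw body is also handy when debugging parsing issues:
# page.response behaves like a Hash of response headers in Mechanize
page.response.each do |name, value|
  puts "#{name}: #{value}"
end
# Keep the raw HTML for offline inspection
File.write('last_response.html', page.body)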
Memory and Resource Monitoring
Monitor memory usage and resource consumption:
require 'objspace'
require 'logger'
class ResourceMonitor
def initialize
@logger = Logger.new('resources.log')
@initial_memory = current_memory_usage
end
def current_memory_usage
`ps -o rss= -p #{Process.pid}`.to_i * 1024 # Convert KB to bytes
end
def log_memory_usage(context = '')
current_memory = current_memory_usage
memory_increase = current_memory - @initial_memory
@logger.info "Memory usage#{context.empty? ? '' : " (#{context})"}: " \
"#{(current_memory / 1024.0 / 1024.0).round(2)} MB " \
"(+#{(memory_increase / 1024.0 / 1024.0).round(2)} MB)"
end
def log_object_counts
object_counts = ObjectSpace.count_objects
@logger.debug "Object counts: #{object_counts}"
end
end
# Usage
resource_monitor = ResourceMonitor.new
agent = Mechanize.new
resource_monitor.log_memory_usage('start')
# Your scraping operations
pages = []
100.times do |i|
pages << agent.get("https://example.com/page/#{i}")
if (i + 1) % 10 == 0
resource_monitor.log_memory_usage("after #{i + 1} pages")
resource_monitor.log_object_counts
end
end
resource_monitor.log_memory_usage('end')
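A common source of memory growth in long-running Mechanize scripts is the agent's page history, which keeps previously fetched pages in memory. Capping it via Mechanize's max_history setting and periodically logging GC.stat (built into Ruby) can help isolate leaks; a minimal sketch:
require 'logger'
agent = Mechanize.new
agent.max_history = 10 # keep only the last 10 pages in memory
def log_gc_stats(logger)
  stats = GC.stat
  logger.info "GC: live_slots=#{stats[:heap_live_slots]} " \
              "allocated=#{stats[:total_allocated_objects]} runs=#{stats[:count]}"
end
log_gc_stats(Logger.new('resources.log'))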
Production Monitoring Best Practices
Structured Logging
Implement structured logging for better analysis and monitoring:
require 'json'
require 'logger'
require 'time' # Time#iso8601 used in the formatter
class StructuredLogger
def initialize(logdev)
@logger = Logger.new(logdev)
@logger.formatter = proc do |severity, datetime, progname, msg|
{
timestamp: datetime.iso8601,
level: severity,
message: msg.is_a?(Hash) ? msg : { text: msg },
service: 'mechanize-scraper'
}.to_json + "\n"
end
end
def log(level, data)
@logger.send(level.downcase, data)
end
def info(data); log(:info, data); end
def debug(data); log(:debug, data); end
def warn(data); log(:warn, data); end
def error(data); log(:error, data); end
end
# Usage
logger = StructuredLogger.new('app.log')
logger.info({
event: 'scraping_started',
url: 'https://example.com',
user_agent: agent.user_agent
})
logger.error({
event: 'scraping_failed',
url: 'https://example.com',
error_class: 'Net::ReadTimeout',
error_message: 'execution expired'
})
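Because each entry is a single JSON object per line, the log can be analysed with a few lines of Ruby (or with tools such as jq). For example, counting entries by level:
require 'json'
counts = Hash.new(0)
File.foreach('app.log') do |line|
  entry = JSON.parse(line) rescue nil
  counts[entry['level']] += 1 if entry
end
puts counts # e.g. {"INFO"=>120, "ERROR"=>3}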
Advanced Monitoring Techniques
When building production-grade scraping applications, consider implementing advanced monitoring similar to the techniques used to monitor network requests in Puppeteer for browser-based scraping. Likewise, Puppeteer's error handling strategies offer insights that apply to any web scraping framework, including Mechanize.
Health Checks and Status Endpoints
Implement health checks for your scraping services:
require 'sinatra'
require 'json'
class ScrapingHealthCheck
def initialize(agent)
@agent = agent
@last_successful_request = nil
@total_requests = 0
@failed_requests = 0
@start_time = Time.now
end
def record_success
@total_requests += 1
@last_successful_request = Time.now
end
def record_failure
@total_requests += 1
@failed_requests += 1
end
def health_status
{
status: overall_status,
last_successful_request: @last_successful_request,
total_requests: @total_requests,
failed_requests: @failed_requests,
success_rate: success_rate,
uptime: uptime
}
end
private
def overall_status
return 'unhealthy' if @last_successful_request.nil?
return 'degraded' if success_rate < 0.8
'healthy'
end
def success_rate
return 0 if @total_requests == 0
(@total_requests - @failed_requests).to_f / @total_requests
end
def uptime
Time.now - @start_time
end
end
# Sinatra endpoint for health checks
health_check = ScrapingHealthCheck.new(agent)
get '/health' do
content_type :json
health_check.health_status.to_json
end
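The health endpoint only reflects reality if the scraping code feeds it. A minimal sketch of wiring record_success and record_failure into the fetch loop (urls is assumed to be your list of targets, and the Sinatra endpoint runs in the same process):
urls.each do |url|
  begin
    agent.get(url)
    health_check.record_success
  rescue => e
    health_check.record_failure
    warn "Failed to fetch #{url}: #{e.message}" # log and continue
  end
end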
Command-Line Monitoring Tools
Create command-line tools for monitoring your Mechanize applications:
#!/bin/bash
# monitor_scraper.sh - Simple monitoring script
LOG_FILE="mechanize.log"
ERROR_THRESHOLD=10
# Count errors in the last hour
recent_errors=$(grep "ERROR" "$LOG_FILE" | grep "$(date '+%Y-%m-%d %H')" | wc -l)
if [ "$recent_errors" -gt "$ERROR_THRESHOLD" ]; then
echo "ALERT: $recent_errors errors detected in the last hour"
# Send alert notification
curl -X POST "$WEBHOOK_URL" \
-H "Content-Type: application/json" \
-d "{\"text\": \"Mechanize scraper: $recent_errors errors in last hour\"}"
fi
# Monitor memory usage
memory_usage=$(ps aux | grep ruby | grep mechanize | awk '{print $6}')
echo "Memory usage: ${memory_usage}KB"
# Check if process is running
if ! pgrep -f "mechanize" > /dev/null; then
echo "WARNING: Mechanize process not found"
fi
Metrics Collection and Visualization
Integrate with time-series databases for long-term monitoring:
require 'influxdb'
require 'logger'
class MetricsCollector
def initialize
@influxdb = InfluxDB::Client.new(
host: ENV['INFLUXDB_HOST'] || 'localhost',
port: ENV['INFLUXDB_PORT'] || 8086,
database: 'mechanize_metrics'
)
@logger = Logger.new('metrics.log')
end
def record_request(url, duration, status_code)
data = {
series: 'http_requests',
values: {
duration: duration,
status_code: status_code
},
tags: {
url: url,
service: 'mechanize-scraper'
}
}
begin
@influxdb.write_point(data[:series], values: data[:values], tags: data[:tags])
@logger.debug "Metrics recorded for #{url}"
rescue => e
@logger.error "Failed to record metrics: #{e.message}"
end
end
def record_error(error_type, url)
data = {
series: 'scraping_errors',
values: { count: 1 },
tags: {
error_type: error_type,
url: url,
service: 'mechanize-scraper'
}
}
@influxdb.write_point(data[:series], values: data[:values], tags: data[:tags])
end
end
# Usage
metrics = MetricsCollector.new
agent = Mechanize.new
begin
start_time = Time.now
page = agent.get(url)
duration = Time.now - start_time
metrics.record_request(url, duration, page.code)
rescue => error
metrics.record_error(error.class.name, url)
end
By implementing comprehensive logging and monitoring for your Mechanize applications, you can ensure reliable operation, quick problem identification, and optimal performance. These monitoring capabilities become especially important when scaling your scraping operations or running them in production environments where visibility into application behavior is crucial for maintaining service quality.
Remember to balance logging verbosity with performance considerations, and always ensure that sensitive information is not logged inadvertently. Regular monitoring and alerting will help you maintain robust and reliable web scraping applications.