How do I implement proper logging for Ruby web scraping projects?
Proper logging is essential for Ruby web scraping projects to monitor performance, debug issues, track success rates, and maintain reliable scraping operations. This guide covers comprehensive logging strategies, from basic setup to advanced structured logging patterns.
Why Logging Matters in Web Scraping
Web scraping involves numerous potential failure points: network timeouts, rate limiting, HTML structure changes, and anti-bot measures. Effective logging helps you:
- Debug scraping failures and understand why requests fail
- Monitor scraping performance and identify bottlenecks
- Track success rates and data quality metrics
- Spot rate-limit responses early so you can back off before getting blocked
- Maintain audit trails for compliance and debugging
Basic Logging Setup with Ruby's Logger
Ruby's built-in Logger class provides a solid foundation for web scraping projects:
require 'logger'
require 'net/http'
require 'nokogiri'

class WebScraper
  def initialize
    @logger = Logger.new(STDOUT)
    @logger.level = Logger::INFO
    @logger.formatter = proc do |severity, datetime, progname, msg|
      "[#{datetime}] #{severity}: #{msg}\n"
    end
  end

  def scrape_page(url)
    @logger.info "Starting scrape for #{url}"

    begin
      response = fetch_page(url)
      @logger.info "Successfully fetched #{url} (#{response.code})"

      doc = Nokogiri::HTML(response.body)
      data = extract_data(doc)

      @logger.info "Extracted #{data.length} items from #{url}"
      data
    rescue => e
      @logger.error "Failed to scrape #{url}: #{e.message}"
      @logger.debug e.backtrace.join("\n")
      []
    end
  end

  private

  def fetch_page(url)
    uri = URI(url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true if uri.scheme == 'https'

    request = Net::HTTP::Get.new(uri)
    request['User-Agent'] = 'Mozilla/5.0 (compatible; WebScraper)'

    @logger.debug "Sending request to #{url}"
    response = http.request(request)

    unless response.is_a?(Net::HTTPSuccess)
      @logger.warn "Non-success response: #{response.code} for #{url}"
    end

    response
  end

  def extract_data(doc)
    # Your extraction logic here
    []
  end
end
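A quick usage sketch (the URL is a placeholder, and extract_data still needs your own selectors):

scraper = WebScraper.new
items = scraper.scrape_page('https://example.com/products') # placeholder URL
puts "Scraped #{items.length} items"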
Advanced Logging with Multiple Outputs
For production scraping projects, you'll want to log to multiple destinations and use different log levels:
require 'logger'
require 'json'      # needed for context.to_json in format_message
require 'fileutils'

class AdvancedScraper
  def initialize
    setup_logging
  end

  private

  def setup_logging
    FileUtils.mkdir_p('logs') # Logger does not create missing directories

    # Console logger for development
    @console_logger = Logger.new(STDOUT)
    @console_logger.level = Logger::INFO

    # File logger for persistent storage
    @file_logger = Logger.new('logs/scraper.log', 'daily')
    @file_logger.level = Logger::DEBUG

    # Error-specific logger
    @error_logger = Logger.new('logs/errors.log')
    @error_logger.level = Logger::ERROR

    # Custom formatter
    formatter = proc do |severity, datetime, progname, msg|
      "[#{datetime.strftime('%Y-%m-%d %H:%M:%S')}] #{severity.ljust(5)} #{progname}: #{msg}\n"
    end

    [@console_logger, @file_logger, @error_logger].each do |logger|
      logger.formatter = formatter
    end
  end

  def log_info(message, context = {})
    formatted_msg = format_message(message, context)
    @console_logger.info(formatted_msg)
    @file_logger.info(formatted_msg)
  end

  def log_error(message, error = nil, context = {})
    formatted_msg = format_message(message, context)
    formatted_msg += "\nError: #{error.message}\n#{error.backtrace.join("\n")}" if error

    @console_logger.error(formatted_msg)
    @file_logger.error(formatted_msg)
    @error_logger.error(formatted_msg)
  end

  def format_message(message, context = {})
    context_str = context.empty? ? '' : " | Context: #{context.to_json}"
    "#{message}#{context_str}"
  end
end
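Calling three loggers by hand gets repetitive. One option, sketched below, is a small fan-out wrapper (not part of the standard library) that forwards each call to every destination; Rails' ActiveSupport ships a BroadcastLogger built on the same idea if you are already in that ecosystem.

require 'logger'
require 'fileutils'

# Minimal fan-out wrapper: forwards each log call to every underlying logger.
class MultiLogger
  def initialize(*loggers)
    @loggers = loggers
  end

  %i[debug info warn error fatal].each do |level|
    define_method(level) do |message|
      @loggers.each { |logger| logger.send(level, message) }
    end
  end
end

FileUtils.mkdir_p('logs')
logger = MultiLogger.new(Logger.new(STDOUT), Logger.new('logs/scraper.log'))
logger.info('One call, two destinations')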
Structured Logging with JSON
Structured logging makes it easier to parse and analyze logs programmatically. Here's how to implement JSON logging:
require 'json'
require 'logger'
require 'time' # for Time#iso8601

class StructuredLogger
  def initialize(output = STDOUT)
    @logger = Logger.new(output)
    @logger.formatter = proc do |severity, datetime, progname, msg|
      log_entry = {
        timestamp: datetime.iso8601,
        level: severity,
        message: msg.is_a?(String) ? msg : msg[:message],
        **extract_context(msg)
      }
      "#{log_entry.to_json}\n"
    end
  end

  def info(message, **context)
    @logger.info(message: message, **context)
  end

  def error(message, error: nil, **context)
    error_context = error ? {
      error_class: error.class.name,
      error_message: error.message,
      backtrace: error.backtrace&.first(5)
    } : {}

    @logger.error(message: message, **context, **error_context)
  end

  def warn(message, **context)
    @logger.warn(message: message, **context)
  end

  def debug(message, **context)
    @logger.debug(message: message, **context)
  end

  private

  def extract_context(msg)
    return {} unless msg.is_a?(Hash)
    msg.except(:message) # Hash#except requires Ruby 3.0+
  end
end
# Usage example
class ScraperWithStructuredLogging
  def initialize
    @logger = StructuredLogger.new(File.open('logs/scraper.json', 'a'))
  end

  def scrape_with_context(url, user_id: nil)
    start_time = Time.now

    @logger.info("Starting scrape",
      url: url,
      user_id: user_id,
      scraper_version: "1.0.0"
    )

    begin
      response = fetch_page(url)
      duration = Time.now - start_time

      @logger.info("Scrape completed successfully",
        url: url,
        response_code: response.code,
        duration_seconds: duration.round(3),
        response_size_bytes: response.body.length
      )
    rescue Net::OpenTimeout, Net::ReadTimeout => e
      @logger.error("Scrape failed due to timeout",
        error: e,
        url: url,
        duration_seconds: (Time.now - start_time).round(3)
      )
    rescue => e
      @logger.error("Scrape failed with unexpected error",
        error: e,
        url: url,
        duration_seconds: (Time.now - start_time).round(3)
      )
    end
  end
end
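Because every log line is a self-contained JSON object, downstream analysis stays simple. A small sketch that scans the log file produced above for error entries:

require 'json'

# Stream the newline-delimited JSON log and print error messages
File.foreach('logs/scraper.json') do |line|
  entry = JSON.parse(line)
  puts "#{entry['timestamp']} #{entry['message']}" if entry['level'] == 'ERROR'
end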
Request and Response Logging
Detailed HTTP logging is crucial for debugging scraping issues:
require 'net/http'

class HTTPLogger
  def initialize(logger)
    @logger = logger
  end

  def log_request(request, uri)
    @logger.debug("HTTP Request",
      method: request.method,
      url: uri.to_s,
      headers: sanitize_headers(request.to_hash),
      body_size: request.body&.length || 0
    )
  end

  def log_response(response, uri, duration)
    @logger.info("HTTP Response",
      url: uri.to_s,
      status_code: response.code,
      status_message: response.message,
      headers: response.to_hash,
      body_size: response.body&.length || 0,
      duration_ms: (duration * 1000).round(2)
    )
  end

  def log_request_failure(uri, error, duration)
    @logger.error("HTTP Request Failed",
      url: uri.to_s,
      error_class: error.class.name,
      error_message: error.message,
      duration_ms: (duration * 1000).round(2)
    )
  end

  private

  def sanitize_headers(headers)
    # Remove sensitive headers
    headers.except('authorization', 'cookie', 'x-api-key')
  end
end
# Enhanced scraper with HTTP logging
class ScraperWithHTTPLogging
  def initialize
    @logger = StructuredLogger.new
    @http_logger = HTTPLogger.new(@logger)
  end

  def fetch_page(url)
    uri = URI(url)
    request = Net::HTTP::Get.new(uri)
    request['User-Agent'] = 'WebScraper/1.0'

    @http_logger.log_request(request, uri)

    start_time = Time.now
    begin
      response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
        http.request(request)
      end

      duration = Time.now - start_time
      @http_logger.log_response(response, uri, duration)
      response
    rescue => e
      duration = Time.now - start_time
      @http_logger.log_request_failure(uri, e, duration)
      raise
    end
  end
end
Performance and Metrics Logging
Track scraping performance to identify optimization opportunities:
class PerformanceLogger
  def initialize(logger)
    @logger = logger
    @stats = {
      requests_count: 0,
      successful_requests: 0,
      failed_requests: 0,
      total_response_time: 0,
      items_scraped: 0
    }
  end

  def log_scraping_session_start(urls_count)
    @session_start = Time.now
    @logger.info("Scraping session started",
      urls_to_scrape: urls_count,
      session_id: generate_session_id
    )
  end

  def log_scraping_session_end
    duration = Time.now - @session_start

    @logger.info("Scraping session completed",
      total_duration_seconds: duration.round(2),
      requests_made: @stats[:requests_count],
      success_rate: calculate_success_rate,
      average_response_time_ms: calculate_average_response_time,
      items_per_second: (@stats[:items_scraped] / duration).round(2),
      total_items_scraped: @stats[:items_scraped]
    )
  end

  def log_page_scraped(url, success, response_time, items_count = 0)
    @stats[:requests_count] += 1
    @stats[:total_response_time] += response_time
    @stats[:items_scraped] += items_count

    if success
      @stats[:successful_requests] += 1
    else
      @stats[:failed_requests] += 1
    end

    @logger.debug("Page scraping completed",
      url: url,
      success: success,
      response_time_ms: (response_time * 1000).round(2),
      items_extracted: items_count,
      running_success_rate: calculate_success_rate
    )
  end

  private

  def calculate_success_rate
    return 0 if @stats[:requests_count] == 0
    (@stats[:successful_requests].to_f / @stats[:requests_count] * 100).round(2)
  end

  def calculate_average_response_time
    return 0 if @stats[:requests_count] == 0
    (@stats[:total_response_time] / @stats[:requests_count] * 1000).round(2)
  end

  def generate_session_id
    Time.now.strftime('%Y%m%d_%H%M%S') + '_' + rand(1000).to_s.rjust(3, '0')
  end
end
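A minimal wiring sketch; scrape_page stands in for whatever fetch-and-extract method your scraper exposes, and the URLs are placeholders:

logger = StructuredLogger.new
perf = PerformanceLogger.new(logger)
urls = ['https://example.com/page1', 'https://example.com/page2'] # placeholder URLs

perf.log_scraping_session_start(urls.length)

urls.each do |url|
  started = Time.now
  begin
    items = scrape_page(url) # your own scraping method
    perf.log_page_scraped(url, true, Time.now - started, items.length)
  rescue
    perf.log_page_scraped(url, false, Time.now - started)
  end
end

perf.log_scraping_session_end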
Rate Limiting and Retry Logging
Log rate-limiting events and retry attempts so you can tune request pacing and backoff without getting blocked:
class RateLimitedScraper
  def initialize
    @logger = StructuredLogger.new
    @performance_logger = PerformanceLogger.new(@logger)
  end

  def scrape_with_retries(url, max_retries: 3)
    attempt = 1

    begin
      @logger.debug("Attempting to scrape",
        url: url,
        attempt: attempt,
        max_retries: max_retries
      )

      response = fetch_with_rate_limiting(url)
      @logger.info("Scrape successful", url: url, final_attempt: attempt)
      response
    rescue => e
      # Net::HTTPTooManyRequests is a response class, not an exception, so it cannot
      # be rescued directly. This assumes fetch_with_rate_limiting calls
      # response.error! on non-success responses, raising Net::HTTPClientException.
      rate_limited = e.is_a?(Net::HTTPClientException) &&
                     e.response.is_a?(Net::HTTPTooManyRequests)

      if rate_limited && attempt <= max_retries
        wait_time = calculate_backoff_time(attempt)
        @logger.warn("Rate limited, retrying",
          url: url,
          attempt: attempt,
          retry_after_seconds: wait_time,
          rate_limit_headers: extract_rate_limit_headers(e.response)
        )
        sleep(wait_time)
        attempt += 1
        retry
      elsif rate_limited
        @logger.error("Max retries exceeded due to rate limiting",
          url: url,
          total_attempts: attempt
        )
        raise
      else
        @logger.error("Scrape failed permanently",
          error: e,
          url: url,
          total_attempts: attempt
        )
        raise
      end
    end
  end

  private

  def calculate_backoff_time(attempt)
    # Exponential backoff: 2^attempt seconds
    2 ** attempt
  end

  def extract_rate_limit_headers(response)
    {
      retry_after: response['Retry-After'],
      rate_limit_remaining: response['X-RateLimit-Remaining'],
      rate_limit_reset: response['X-RateLimit-Reset']
    }.compact
  end
end
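The class above leans on a fetch_with_rate_limiting helper that is not shown; the key assumption is that it calls response.error! so a 429 surfaces as an exception. A minimal sketch (the one-second pause is an arbitrary politeness delay) could sit in the class's private section:

def fetch_with_rate_limiting(url)
  sleep(1) # crude politeness delay between requests; tune per target site

  uri = URI(url)
  response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(Net::HTTP::Get.new(uri))
  end

  # error! raises Net::HTTPClientException for 4xx responses (including 429),
  # which scrape_with_retries rescues and inspects
  response.error! unless response.is_a?(Net::HTTPSuccess)
  response
end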
Configuration and Best Practices
Create a configurable logging system for different environments:
class LoggerConfig
  def self.create_logger(environment = 'development')
    case environment
    when 'development'
      create_development_logger
    when 'production'
      create_production_logger
    when 'test'
      create_test_logger
    else
      create_development_logger # sensible fallback for unknown environments
    end
  end

  def self.create_development_logger
    logger = Logger.new(STDOUT)
    logger.level = Logger::DEBUG
    logger
  end

  def self.create_production_logger
    # Log to file with rotation: keep 10 files of roughly 1 MB each
    logger = Logger.new('logs/production.log', 10, 1024000)
    logger.level = Logger::INFO
    logger
  end

  def self.create_test_logger
    # Silent logger for tests (File::NULL is portable, unlike '/dev/null')
    logger = Logger.new(File::NULL)
    logger.level = Logger::FATAL
    logger
  end

  # `private` does not apply to class methods; hide the factories explicitly
  private_class_method :create_development_logger, :create_production_logger, :create_test_logger
end

# Environment-specific configuration
ENV_LOGGER = LoggerConfig.create_logger(ENV['RAILS_ENV'] || 'development')
Integration with External Services
For production applications, consider integrating with external logging services:
# Example integration with external logging service
require 'net/http'
require 'json'
require 'logger'
require 'time' # for Time#iso8601

class ExternalLoggerAdapter
  def initialize(api_key, endpoint)
    @api_key = api_key
    @endpoint = endpoint
    @local_logger = Logger.new(STDOUT)
  end

  def log(level, message, context = {})
    # Log locally first
    @local_logger.send(level, message)

    # Send to external service asynchronously so logging never blocks scraping
    Thread.new do
      send_to_external_service(level, message, context)
    end
  end

  private

  def send_to_external_service(level, message, context)
    payload = {
      timestamp: Time.now.iso8601,
      level: level.to_s.upcase,
      message: message,
      context: context,
      service: 'web-scraper'
    }

    begin
      uri = URI(@endpoint)
      http = Net::HTTP.new(uri.host, uri.port)
      http.use_ssl = true

      request = Net::HTTP::Post.new(uri)
      request['Authorization'] = "Bearer #{@api_key}"
      request['Content-Type'] = 'application/json'
      request.body = payload.to_json

      http.request(request)
    rescue => e
      @local_logger.error("Failed to send log to external service: #{e.message}")
    end
  end
end
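A usage sketch; the endpoint URL and environment variable name are placeholders for whatever your logging provider expects:

external_logger = ExternalLoggerAdapter.new(
  ENV['LOG_SERVICE_API_KEY'],        # placeholder credential
  'https://logs.example.com/ingest'  # placeholder endpoint
)

external_logger.log(:info, 'Scraping session finished', urls_scraped: 120, failures: 3)

Since each call spawns a short-lived thread, very high log volume can pile up threads; at scale, a bounded queue drained by a single sender thread is a safer design.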
Summary
Implementing proper logging in Ruby web scraping projects involves:
- Multi-level logging with appropriate log levels for different scenarios
- Structured logging using JSON format for easier analysis
- Performance tracking to monitor scraping efficiency
- Error logging with full context and stack traces
- HTTP request/response logging for debugging network issues
- Rate limiting awareness with retry and backoff logging
Effective logging transforms web scraping from a black-box operation into a transparent, debuggable, and maintainable process, helping you identify and resolve issues before they impact your scraping operations.
Remember to regularly review your logs, set up alerting for critical errors, and adjust log levels based on your monitoring needs. When dealing with large-scale scraping operations, consider implementing log aggregation and analysis tools to gain insights into your scraping performance and reliability patterns.