How do I monitor and maintain Ruby web scraping applications in production?
Monitoring and maintaining Ruby web scraping applications in production requires a comprehensive approach that covers performance monitoring, error tracking, resource management, and proactive maintenance strategies. This guide provides essential practices and tools to ensure your Ruby scrapers run reliably and efficiently in production environments.
Core Monitoring Components
Application Performance Monitoring (APM)
Implementing robust APM is crucial for Ruby web scraping applications. Popular solutions include:
New Relic Integration:
# Gemfile
gem 'newrelic_rpm'

# config/newrelic.yml
production:
  license_key: <%= ENV["NEW_RELIC_LICENSE_KEY"] %>
  app_name: "Web Scraper Production"
  monitor_mode: true
  developer_mode: false
Custom Performance Tracking:
class ScrapingMonitor
  def self.track_performance(scraper_name)
    start_time = Time.current
    result = yield
    duration = Time.current - start_time

    Rails.logger.info "Scraper: #{scraper_name}, Duration: #{duration}s"

    # Send metrics to monitoring service
    StatsD.increment("scraper.#{scraper_name}.completed")
    StatsD.timing("scraper.#{scraper_name}.duration", duration * 1000)

    result
  rescue => e
    StatsD.increment("scraper.#{scraper_name}.failed")
    raise e
  end
end

# Usage in scraper
ScrapingMonitor.track_performance("product_scraper") do
  scrape_products
end
Health Check Endpoints
Create comprehensive health checks to monitor application status:
# config/routes.rb
Rails.application.routes.draw do
  get '/health', to: 'health#show'
  get '/health/detailed', to: 'health#detailed'
end

# app/controllers/health_controller.rb
class HealthController < ApplicationController
  def show
    render json: { status: 'ok', timestamp: Time.current }
  end

  def detailed
    checks = {
      database: database_healthy?,
      redis: redis_healthy?,
      sidekiq: sidekiq_healthy?,
      external_apis: external_apis_healthy?
    }

    status = checks.values.all? ? 'healthy' : 'unhealthy'

    render json: {
      status: status,
      checks: checks,
      timestamp: Time.current
    }, status: status == 'healthy' ? 200 : 503
  end

  private

  def database_healthy?
    ActiveRecord::Base.connection.execute('SELECT 1')
    true
  rescue
    false
  end

  def redis_healthy?
    Redis.current.ping == 'PONG'
  rescue
    false
  end

  def sidekiq_healthy?
    Sidekiq.redis { |conn| conn.ping } == 'PONG'
  rescue
    false
  end

  def external_apis_healthy?
    # Check critical external services
    response = Net::HTTP.get_response(URI('https://api.example.com/health'))
    response.code == '200'
  rescue
    false
  end
end
Error Tracking and Alerting
Comprehensive Error Handling
Implement structured error handling with proper logging and notifications:
class ScrapingService
  include Sidekiq::Worker

  def perform(url, options = {})
    @url = url
    @options = options

    validate_inputs!
    scrape_with_retries
  rescue ScrapingError => e
    handle_scraping_error(e)
  rescue StandardError => e
    handle_unexpected_error(e)
  end

  private

  def scrape_with_retries
    retries = 0
    max_retries = @options.fetch(:max_retries, 3)

    begin
      perform_scraping
    rescue Net::ReadTimeout, Net::OpenTimeout => e
      retries += 1
      if retries <= max_retries
        delay = exponential_backoff(retries)
        Rails.logger.warn "Retrying #{@url} in #{delay}s (attempt #{retries}/#{max_retries})"
        sleep(delay)
        retry
      else
        raise ScrapingError.new("Max retries exceeded for #{@url}", original_error: e)
      end
    end
  end

  def handle_scraping_error(error)
    Rails.logger.error "Scraping failed: #{error.message}"

    # Send to error tracking service
    Sentry.capture_exception(error, extra: {
      url: @url,
      options: @options,
      worker_class: self.class.name
    })

    # Update failure metrics
    StatsD.increment('scraper.failures')

    # Notify if critical
    notify_on_critical_failure(error) if critical_url?(@url)
  end

  def exponential_backoff(attempt)
    [2 ** attempt, 60].min # Cap at 60 seconds
  end
end
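The ScrapingError class raised above is not defined in the snippet; a minimal definition that carries the original exception for logging and error-tracking context might look like this:

# Minimal custom error class assumed by ScrapingService above
class ScrapingError < StandardError
  attr_reader :original_error

  def initialize(message, original_error: nil)
    super(message)
    @original_error = original_error
  end
end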
Real-time Alerting System
Set up intelligent alerting for various failure scenarios:
class AlertingService
  ALERT_THRESHOLDS = {
    error_rate: 0.05,   # 5% error rate
    response_time: 30,  # 30 seconds
    queue_size: 1000,   # 1000 pending jobs
    memory_usage: 0.85  # 85% memory usage
  }.freeze

  def self.check_error_rates
    recent_errors = ScrapingJob.where(created_at: 10.minutes.ago..Time.current)
                               .where(status: 'failed').count
    total_jobs = ScrapingJob.where(created_at: 10.minutes.ago..Time.current).count

    if total_jobs > 0
      error_rate = recent_errors.to_f / total_jobs

      if error_rate > ALERT_THRESHOLDS[:error_rate]
        send_alert(
          severity: 'warning',
          message: "High error rate detected: #{(error_rate * 100).round(2)}%",
          details: {
            errors: recent_errors,
            total: total_jobs,
            period: '10 minutes'
          }
        )
      end
    end
  end

  def self.check_queue_health
    queue_sizes = Sidekiq::Queue.all.map { |q| [q.name, q.size] }.to_h

    queue_sizes.each do |queue_name, size|
      if size > ALERT_THRESHOLDS[:queue_size]
        send_alert(
          severity: 'critical',
          message: "Queue #{queue_name} is backed up",
          details: { queue_size: size, threshold: ALERT_THRESHOLDS[:queue_size] }
        )
      end
    end
  end

  # Kept public so other monitors (e.g. ResourceMonitor below) can reuse it
  def self.send_alert(severity:, message:, details: {})
    # Send to Slack, PagerDuty, email, etc.
    SlackNotifier.ping(
      text: "[#{severity.upcase}] #{message}",
      attachments: [{ fields: details.map { |k, v| { title: k, value: v } } }]
    )
  end
end
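The SlackNotifier constant used in send_alert is assumed to be a client configured once at boot, for example with the slack-notifier gem (the channel name and environment variable below are illustrative):

# config/initializers/slack_notifier.rb
# Assumes the slack-notifier gem; the webhook URL comes from the environment
SlackNotifier = Slack::Notifier.new(
  ENV['SLACK_WEBHOOK_URL'],
  channel: '#scraper-alerts',
  username: 'scraper-monitor'
)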
Resource Management and Optimization
Memory and CPU Monitoring
Implement resource monitoring to prevent system overload:
class ResourceMonitor
  def self.monitor_system_resources
    memory_usage = get_memory_usage
    cpu_usage = get_cpu_usage

    Rails.logger.info "System Resources - Memory: #{memory_usage}%, CPU: #{cpu_usage}%"

    # Send metrics to monitoring system
    StatsD.gauge('system.memory_usage', memory_usage)
    StatsD.gauge('system.cpu_usage', cpu_usage)

    # Alert if thresholds exceeded
    if memory_usage > 85
      AlertingService.send_alert(
        severity: 'warning',
        message: "High memory usage: #{memory_usage}%"
      )
    end

    if cpu_usage > 90
      AlertingService.send_alert(
        severity: 'critical',
        message: "High CPU usage: #{cpu_usage}%"
      )
    end
  end

  private

  def self.get_memory_usage
    # Linux-specific, adjust for your OS
    total_mem = `grep MemTotal /proc/meminfo`.split[1].to_i
    available_mem = `grep MemAvailable /proc/meminfo`.split[1].to_i

    ((total_mem - available_mem).to_f / total_mem * 100).round(2)
  rescue
    0
  end

  def self.get_cpu_usage
    # Rough approximation: /proc/stat counters are cumulative since boot,
    # so this is an average since boot rather than an instantaneous reading
    cpu_stats = File.read('/proc/stat').lines.first.split[1..4].map(&:to_i)
    idle = cpu_stats[3]
    total = cpu_stats.sum

    ((total - idle).to_f / total * 100).round(2)
  rescue
    0
  end
end
Connection Pool Management
Reuse persistent HTTP connections so repeated requests avoid new TCP and TLS handshakes. The sketch below uses the net-http-persistent gem, which maintains a per-host connection pool (pool size and timeouts are illustrative):
require 'net/http/persistent'

class HttpClientManager
  CONNECTION_POOL_SIZE = 10
  KEEP_ALIVE_TIMEOUT = 30

  def self.http_client
    # Shared persistent client: connections are pooled per host and reused
    # until they have been idle for KEEP_ALIVE_TIMEOUT seconds
    @http_client ||= Net::HTTP::Persistent.new(
      name: 'scraper',
      pool_size: CONNECTION_POOL_SIZE
    ).tap do |client|
      client.idle_timeout = KEEP_ALIVE_TIMEOUT
      client.read_timeout = 30
      client.open_timeout = 10
    end
  end

  def self.monitor_connections
    # Monitor connection pool usage
    active_connections = count_active_connections

    StatsD.gauge('http.active_connections', active_connections)

    if active_connections > CONNECTION_POOL_SIZE * 0.8
      Rails.logger.warn "High connection pool usage: #{active_connections}/#{CONNECTION_POOL_SIZE}"
    end
  end

  private

  def self.count_active_connections
    # Implementation depends on HTTP library used
    # This is a placeholder for actual connection counting
    0
  end
end
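Requests then go through the shared client so open connections are reused (the URL below is illustrative):

# GET through the pooled client; subsequent requests to the same host reuse
# the open connection instead of opening a new one
response = HttpClientManager.http_client.request(URI('https://example.com/products'))
puts response.code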
Data Quality and Validation
Automated Data Quality Checks
Implement checks to ensure scraped data quality:
class DataQualityMonitor
  QUALITY_THRESHOLDS = {
    completeness: 0.95,    # 95% of expected fields present
    freshness: 1.hour,     # Data should be less than 1 hour old
    volume_variance: 0.20  # ±20% volume variance allowed
  }.freeze

  def self.validate_scraped_data(dataset_name, data)
    results = {
      completeness: check_completeness(data),
      freshness: check_freshness(data),
      volume: check_volume_variance(dataset_name, data),
      duplicates: check_duplicates(data)
    }

    log_quality_metrics(dataset_name, results)
    alert_on_quality_issues(dataset_name, results)

    results
  end

  private

  def self.check_completeness(data)
    return 0 if data.empty?

    required_fields = %w[title price description url]
    complete_records = data.count do |record|
      required_fields.all? { |field| record[field].present? }
    end

    complete_records.to_f / data.length
  end

  def self.check_volume_variance(dataset_name, data)
    historical_avg = get_historical_average(dataset_name)
    return true if historical_avg.zero?

    current_volume = data.length
    variance = (current_volume - historical_avg).abs.to_f / historical_avg

    variance <= QUALITY_THRESHOLDS[:volume_variance]
  end

  def self.get_historical_average(dataset_name)
    # Calculate 7-day average volume
    ScrapingResult.where(dataset: dataset_name)
                  .where(created_at: 7.days.ago..1.day.ago)
                  .average(:record_count) || 0
  end
end
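The check_freshness and check_duplicates methods referenced above are not shown; minimal sketches, assuming each record is a hash with scraped_at and url keys:

class DataQualityMonitor
  # Assumes each record carries a 'scraped_at' timestamp; the whole batch is
  # considered fresh only if its oldest record is within the threshold
  def self.check_freshness(data)
    oldest = data.map { |record| record['scraped_at'] }.compact.min
    oldest.present? && (Time.current - Time.zone.parse(oldest.to_s)) <= QUALITY_THRESHOLDS[:freshness]
  end

  # Assumes 'url' uniquely identifies a record; returns true when no
  # duplicate URLs are present
  def self.check_duplicates(data)
    urls = data.map { |record| record['url'] }.compact
    urls.length == urls.uniq.length
  end
end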
Proactive Maintenance Strategies
Automated Health Checks and Maintenance
class MaintenanceScheduler
  def self.daily_maintenance
    Rails.logger.info "Starting daily maintenance tasks"

    cleanup_old_logs
    optimize_database
    validate_external_dependencies
    update_scraping_targets_health
    generate_daily_report

    Rails.logger.info "Daily maintenance completed"
  end

  private

  def self.cleanup_old_logs
    # Remove logs older than 30 days
    old_logs = ScrapingLog.where('created_at < ?', 30.days.ago)
    deleted_count = old_logs.count
    old_logs.delete_all

    Rails.logger.info "Cleaned up #{deleted_count} old log entries"
  end

  def self.validate_external_dependencies
    dependencies = %w[
      https://api.example.com/health
      https://proxy-service.com/status
    ]

    dependencies.each do |url|
      begin
        response = Net::HTTP.get_response(URI(url))
        status = response.code == '200' ? 'healthy' : 'unhealthy'

        Rails.logger.info "Dependency #{url}: #{status}"
        StatsD.gauge("dependencies.#{extract_service_name(url)}.status",
                     status == 'healthy' ? 1 : 0)
      rescue => e
        Rails.logger.error "Dependency check failed for #{url}: #{e.message}"
        StatsD.gauge("dependencies.#{extract_service_name(url)}.status", 0)
      end
    end
  end

  def self.extract_service_name(url)
    URI(url).host.gsub(/[^a-zA-Z0-9]/, '_')
  end
end
# Schedule with whenever gem or cron
# config/schedule.rb
every 1.day, at: '2:00 am' do
  runner "MaintenanceScheduler.daily_maintenance"
end

every 5.minutes do
  runner "AlertingService.check_error_rates"
  runner "AlertingService.check_queue_health"
end
end
Deployment and Rollback Strategies
Implement safe deployment practices with monitoring:
class DeploymentMonitor
  def self.post_deployment_checks
    Rails.logger.info "Running post-deployment health checks"

    checks = {
      database_migrations: check_pending_migrations,
      critical_scrapers: test_critical_scrapers,
      external_apis: test_external_api_connectivity,
      background_jobs: check_background_job_processing
    }

    if checks.values.all?
      Rails.logger.info "All post-deployment checks passed"
      StatsD.increment('deployment.success')
    else
      Rails.logger.error "Post-deployment checks failed: #{checks}"
      StatsD.increment('deployment.failure')

      # Consider automated rollback
      trigger_rollback_alert(checks)
    end

    checks
  end

  private

  def self.test_critical_scrapers
    critical_scrapers = %w[ProductScraper UserScraper]

    critical_scrapers.all? do |scraper_class|
      begin
        # Run a lightweight test scrape
        scraper_class.constantize.new.test_scrape
        true
      rescue => e
        Rails.logger.error "Critical scraper #{scraper_class} failed: #{e.message}"
        false
      end
    end
  end
end
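The trigger_rollback_alert helper referenced above is left to the application; a minimal sketch that reuses AlertingService from earlier and leaves the rollback decision to a human:

class DeploymentMonitor
  # Minimal sketch: page the on-call channel with the failed checks so a
  # person can decide whether to roll back
  def self.trigger_rollback_alert(checks)
    failed = checks.reject { |_name, passed| passed }.keys

    AlertingService.send_alert(
      severity: 'critical',
      message: 'Post-deployment checks failed - consider rolling back',
      details: { failed_checks: failed.join(', ') }
    )
  end
end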
Command Line Monitoring Tools
Essential Commands for Production Monitoring
Monitor your Ruby application's health using these essential commands:
# Check overall Sidekiq stats (processed, failed, enqueued)
bundle exec rails runner "require 'sidekiq/api'; s = Sidekiq::Stats.new; p processed: s.processed, failed: s.failed, enqueued: s.enqueued"

# Check Rails application logs
tail -f log/production.log

# Monitor system resources
htop
iostat -x 1

# Check database connections
bundle exec rails runner "puts ActiveRecord::Base.connection_pool.stat"

# Monitor memory usage of Ruby processes
ps aux | grep ruby | awk '{print $6/1024 " MB " $11}'

# Check Sidekiq queue sizes
bundle exec rails runner "
  require 'sidekiq/api'
  Sidekiq::Queue.all.each { |q| puts \"#{q.name}: #{q.size}\" }
"
Best Practices Summary
- Implement comprehensive monitoring covering performance, errors, and resource usage
- Set up intelligent alerting with appropriate thresholds and escalation paths
- Monitor data quality to ensure scraped data meets business requirements
- Use connection pooling and resource management for optimal performance
- Implement automated maintenance tasks for proactive system health
- Plan for graceful degradation when external services are unavailable (see the circuit-breaker sketch after this list)
- Regularly test critical scraping workflows in production-like environments
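A common way to plan for graceful degradation is a small circuit breaker around each external dependency. A minimal sketch (the class name, thresholds, and usage URL are illustrative, not from a specific gem):

# Minimal circuit breaker: after `threshold` consecutive failures the breaker
# opens and calls are skipped until `cool_off` seconds have elapsed
class CircuitBreaker
  def initialize(threshold: 5, cool_off: 60)
    @threshold = threshold
    @cool_off = cool_off
    @failures = 0
    @opened_at = nil
  end

  def call
    raise 'circuit open' if open?

    result = yield
    @failures = 0
    result
  rescue => e
    @failures += 1
    @opened_at = Time.now if @failures >= @threshold
    raise e
  end

  private

  def open?
    return false unless @opened_at
    return true if Time.now - @opened_at < @cool_off

    # Cool-off elapsed: allow a trial request through
    @opened_at = nil
    @failures = 0
    false
  end
end

# Usage: back off from a flaky proxy provider instead of hammering it
# breaker = CircuitBreaker.new(threshold: 3, cool_off: 120)
# breaker.call { Net::HTTP.get_response(URI('https://proxy-service.com/status')) }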
For more advanced monitoring techniques, consider exploring how to handle timeouts in Puppeteer for JavaScript-based scrapers or how to handle errors in Puppeteer for comprehensive error handling strategies.
By implementing these monitoring and maintenance practices, your Ruby web scraping applications will run more reliably in production, with faster incident detection and resolution times.