How do I monitor and maintain Ruby web scraping applications in production?
Monitoring and maintaining Ruby web scraping applications in production requires a comprehensive approach that covers performance monitoring, error tracking, resource management, and proactive maintenance strategies. This guide provides essential practices and tools to ensure your Ruby scrapers run reliably and efficiently in production environments.
Core Monitoring Components
Application Performance Monitoring (APM)
Implementing robust APM is crucial for Ruby web scraping applications. Popular solutions include:
New Relic Integration:
# Gemfile
gem 'newrelic_rpm'

# config/newrelic.yml
production:
  license_key: <%= ENV["NEW_RELIC_LICENSE_KEY"] %>
  app_name: "Web Scraper Production"
  monitor_mode: true
  developer_mode: false
Custom Performance Tracking:
class ScrapingMonitor
  def self.track_performance(scraper_name)
    start_time = Time.current
    result = yield
    duration = Time.current - start_time

    Rails.logger.info "Scraper: #{scraper_name}, Duration: #{duration}s"

    # Send metrics to monitoring service
    StatsD.increment("scraper.#{scraper_name}.completed")
    StatsD.timing("scraper.#{scraper_name}.duration", duration * 1000)

    result
  rescue => e
    StatsD.increment("scraper.#{scraper_name}.failed")
    raise e
  end
end

# Usage in scraper
ScrapingMonitor.track_performance("product_scraper") do
  scrape_products
end
Health Check Endpoints
Create comprehensive health checks to monitor application status:
# config/routes.rb
Rails.application.routes.draw do
  get '/health', to: 'health#show'
  get '/health/detailed', to: 'health#detailed'
end

# app/controllers/health_controller.rb
class HealthController < ApplicationController
  def show
    render json: { status: 'ok', timestamp: Time.current }
  end

  def detailed
    checks = {
      database: database_healthy?,
      redis: redis_healthy?,
      sidekiq: sidekiq_healthy?,
      external_apis: external_apis_healthy?
    }

    status = checks.values.all? ? 'healthy' : 'unhealthy'

    render json: {
      status: status,
      checks: checks,
      timestamp: Time.current
    }, status: status == 'healthy' ? 200 : 503
  end

  private

  def database_healthy?
    ActiveRecord::Base.connection.execute('SELECT 1')
    true
  rescue
    false
  end

  def redis_healthy?
    Redis.current.ping == 'PONG'
  rescue
    false
  end

  def sidekiq_healthy?
    Sidekiq.redis { |conn| conn.ping } == 'PONG'
  rescue
    false
  end

  def external_apis_healthy?
    # Check critical external services
    response = Net::HTTP.get_response(URI('https://api.example.com/health'))
    response.code == '200'
  rescue
    false
  end
end
Error Tracking and Alerting
Comprehensive Error Handling
Implement structured error handling with proper logging and notifications:
class ScrapingService
  include Sidekiq::Worker

  def perform(url, options = {})
    @url = url
    @options = options

    validate_inputs!
    scrape_with_retries
  rescue ScrapingError => e
    handle_scraping_error(e)
  rescue StandardError => e
    handle_unexpected_error(e)
  end

  private

  def scrape_with_retries
    retries = 0
    max_retries = @options.fetch(:max_retries, 3)

    begin
      perform_scraping
    rescue Net::ReadTimeout, Net::OpenTimeout => e
      retries += 1
      if retries <= max_retries
        delay = exponential_backoff(retries)
        Rails.logger.warn "Retrying #{@url} in #{delay}s (attempt #{retries}/#{max_retries})"
        sleep(delay)
        retry
      else
        raise ScrapingError.new("Max retries exceeded for #{@url}", original_error: e)
      end
    end
  end

  def handle_scraping_error(error)
    Rails.logger.error "Scraping failed: #{error.message}"

    # Send to error tracking service
    Sentry.capture_exception(error, extra: {
      url: @url,
      options: @options,
      worker_class: self.class.name
    })

    # Update failure metrics
    StatsD.increment('scraper.failures')

    # Notify if critical
    notify_on_critical_failure(error) if critical_url?(@url)
  end

  def exponential_backoff(attempt)
    [2 ** attempt, 60].min # Cap at 60 seconds
  end
end
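The ScrapingError class raised above is not defined in the snippet; a minimal definition that carries the original exception for logging and error-tracking context might look like this:

# Minimal custom error class assumed by ScrapingService above
class ScrapingError < StandardError
  attr_reader :original_error

  def initialize(message, original_error: nil)
    super(message)
    @original_error = original_error
  end
end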
Real-time Alerting System
Set up intelligent alerting for various failure scenarios:
class AlertingService
  ALERT_THRESHOLDS = {
    error_rate: 0.05,   # 5% error rate
    response_time: 30,  # 30 seconds
    queue_size: 1000,   # 1000 pending jobs
    memory_usage: 0.85  # 85% memory usage
  }.freeze

  def self.check_error_rates
    recent_errors = ScrapingJob.where(created_at: 10.minutes.ago..Time.current)
                               .where(status: 'failed').count
    total_jobs = ScrapingJob.where(created_at: 10.minutes.ago..Time.current).count

    if total_jobs > 0
      error_rate = recent_errors.to_f / total_jobs

      if error_rate > ALERT_THRESHOLDS[:error_rate]
        send_alert(
          severity: 'warning',
          message: "High error rate detected: #{(error_rate * 100).round(2)}%",
          details: {
            errors: recent_errors,
            total: total_jobs,
            period: '10 minutes'
          }
        )
      end
    end
  end

  def self.check_queue_health
    queue_sizes = Sidekiq::Queue.all.map { |q| [q.name, q.size] }.to_h

    queue_sizes.each do |queue_name, size|
      if size > ALERT_THRESHOLDS[:queue_size]
        send_alert(
          severity: 'critical',
          message: "Queue #{queue_name} is backed up",
          details: { queue_size: size, threshold: ALERT_THRESHOLDS[:queue_size] }
        )
      end
    end
  end

  # Kept public so other monitors (e.g. ResourceMonitor below) can reuse it
  def self.send_alert(severity:, message:, details: {})
    # Send to Slack, PagerDuty, email, etc.
    SlackNotifier.ping(
      text: "[#{severity.upcase}] #{message}",
      attachments: [{ fields: details.map { |k, v| { title: k, value: v } } }]
    )
  end
end
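The SlackNotifier constant used in send_alert is assumed to be a client configured once at boot, for example with the slack-notifier gem (the channel name and environment variable below are illustrative):

# config/initializers/slack_notifier.rb
# Assumes the slack-notifier gem; the webhook URL comes from the environment
SlackNotifier = Slack::Notifier.new(
  ENV['SLACK_WEBHOOK_URL'],
  channel: '#scraper-alerts',
  username: 'scraper-monitor'
)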
Resource Management and Optimization
Memory and CPU Monitoring
Implement resource monitoring to prevent system overload:
class ResourceMonitor
  def self.monitor_system_resources
    memory_usage = get_memory_usage
    cpu_usage = get_cpu_usage

    Rails.logger.info "System Resources - Memory: #{memory_usage}%, CPU: #{cpu_usage}%"

    # Send metrics to monitoring system
    StatsD.gauge('system.memory_usage', memory_usage)
    StatsD.gauge('system.cpu_usage', cpu_usage)

    # Alert if thresholds exceeded
    if memory_usage > 85
      AlertingService.send_alert(
        severity: 'warning',
        message: "High memory usage: #{memory_usage}%"
      )
    end

    if cpu_usage > 90
      AlertingService.send_alert(
        severity: 'critical',
        message: "High CPU usage: #{cpu_usage}%"
      )
    end
  end

  private

  def self.get_memory_usage
    # Linux-specific, adjust for your OS
    total_mem = `grep MemTotal /proc/meminfo`.split[1].to_i
    available_mem = `grep MemAvailable /proc/meminfo`.split[1].to_i

    ((total_mem - available_mem).to_f / total_mem * 100).round(2)
  rescue
    0
  end

  def self.get_cpu_usage
    # Rough approximation: /proc/stat counters are cumulative since boot,
    # so this is an average since boot rather than an instantaneous reading
    cpu_stats = File.read('/proc/stat').lines.first.split[1..4].map(&:to_i)
    idle = cpu_stats[3]
    total = cpu_stats.sum

    ((total - idle).to_f / total * 100).round(2)
  rescue
    0
  end
end
Connection Pool Management
Reuse persistent HTTP connections so repeated requests avoid new TCP and TLS handshakes. The sketch below uses the net-http-persistent gem, which maintains a per-host connection pool (pool size and timeouts are illustrative):
require 'net/http/persistent'

class HttpClientManager
  CONNECTION_POOL_SIZE = 10
  KEEP_ALIVE_TIMEOUT = 30

  def self.http_client
    # Shared persistent client: connections are pooled per host and reused
    # until they have been idle for KEEP_ALIVE_TIMEOUT seconds
    @http_client ||= Net::HTTP::Persistent.new(
      name: 'scraper',
      pool_size: CONNECTION_POOL_SIZE
    ).tap do |client|
      client.idle_timeout = KEEP_ALIVE_TIMEOUT
      client.read_timeout = 30
      client.open_timeout = 10
    end
  end

  def self.monitor_connections
    # Monitor connection pool usage
    active_connections = count_active_connections

    StatsD.gauge('http.active_connections', active_connections)

    if active_connections > CONNECTION_POOL_SIZE * 0.8
      Rails.logger.warn "High connection pool usage: #{active_connections}/#{CONNECTION_POOL_SIZE}"
    end
  end

  private

  def self.count_active_connections
    # Implementation depends on HTTP library used
    # This is a placeholder for actual connection counting
    0
  end
end
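Requests then go through the shared client so open connections are reused (the URL below is illustrative):

# GET through the pooled client; subsequent requests to the same host reuse
# the open connection instead of opening a new one
response = HttpClientManager.http_client.request(URI('https://example.com/products'))
puts response.code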
Data Quality and Validation
Automated Data Quality Checks
Implement checks to ensure scraped data quality:
class DataQualityMonitor
  QUALITY_THRESHOLDS = {
    completeness: 0.95,    # 95% of expected fields present
    freshness: 1.hour,     # Data should be less than 1 hour old
    volume_variance: 0.20  # ±20% volume variance allowed
  }.freeze

  def self.validate_scraped_data(dataset_name, data)
    results = {
      completeness: check_completeness(data),
      freshness: check_freshness(data),
      volume: check_volume_variance(dataset_name, data),
      duplicates: check_duplicates(data)
    }

    log_quality_metrics(dataset_name, results)
    alert_on_quality_issues(dataset_name, results)

    results
  end

  private

  def self.check_completeness(data)
    return 0 if data.empty?

    required_fields = %w[title price description url]
    complete_records = data.count do |record|
      required_fields.all? { |field| record[field].present? }
    end

    complete_records.to_f / data.length
  end

  def self.check_volume_variance(dataset_name, data)
    historical_avg = get_historical_average(dataset_name)
    return true if historical_avg.zero?

    current_volume = data.length
    variance = (current_volume - historical_avg).abs.to_f / historical_avg

    variance <= QUALITY_THRESHOLDS[:volume_variance]
  end

  def self.get_historical_average(dataset_name)
    # Calculate 7-day average volume
    ScrapingResult.where(dataset: dataset_name)
                  .where(created_at: 7.days.ago..1.day.ago)
                  .average(:record_count) || 0
  end
end
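The check_freshness and check_duplicates methods referenced above are not shown; minimal sketches, assuming each record is a hash with scraped_at and url keys:

class DataQualityMonitor
  # Assumes each record carries a 'scraped_at' timestamp; the whole batch is
  # considered fresh only if its oldest record is within the threshold
  def self.check_freshness(data)
    oldest = data.map { |record| record['scraped_at'] }.compact.min
    oldest.present? && (Time.current - Time.zone.parse(oldest.to_s)) <= QUALITY_THRESHOLDS[:freshness]
  end

  # Assumes 'url' uniquely identifies a record; returns true when no
  # duplicate URLs are present
  def self.check_duplicates(data)
    urls = data.map { |record| record['url'] }.compact
    urls.length == urls.uniq.length
  end
end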
Proactive Maintenance Strategies
Automated Health Checks and Maintenance
class MaintenanceScheduler
  def self.daily_maintenance
    Rails.logger.info "Starting daily maintenance tasks"

    cleanup_old_logs
    optimize_database
    validate_external_dependencies
    update_scraping_targets_health
    generate_daily_report

    Rails.logger.info "Daily maintenance completed"
  end

  private

  def self.cleanup_old_logs
    # Remove logs older than 30 days
    old_logs = ScrapingLog.where('created_at < ?', 30.days.ago)
    deleted_count = old_logs.count
    old_logs.delete_all

    Rails.logger.info "Cleaned up #{deleted_count} old log entries"
  end

  def self.validate_external_dependencies
    dependencies = %w[
      https://api.example.com/health
      https://proxy-service.com/status
    ]

    dependencies.each do |url|
      begin
        response = Net::HTTP.get_response(URI(url))
        status = response.code == '200' ? 'healthy' : 'unhealthy'

        Rails.logger.info "Dependency #{url}: #{status}"
        StatsD.gauge("dependencies.#{extract_service_name(url)}.status",
                     status == 'healthy' ? 1 : 0)
      rescue => e
        Rails.logger.error "Dependency check failed for #{url}: #{e.message}"
        StatsD.gauge("dependencies.#{extract_service_name(url)}.status", 0)
      end
    end
  end

  def self.extract_service_name(url)
    URI(url).host.gsub(/[^a-zA-Z0-9]/, '_')
  end
end
# Schedule with whenever gem or cron
# config/schedule.rb
every 1.day, at: '2:00 am' do
  runner "MaintenanceScheduler.daily_maintenance"
end

every 5.minutes do
  runner "AlertingService.check_error_rates"
  runner "AlertingService.check_queue_health"
end
end
Deployment and Rollback Strategies
Implement safe deployment practices with monitoring:
class DeploymentMonitor
  def self.post_deployment_checks
    Rails.logger.info "Running post-deployment health checks"

    checks = {
      database_migrations: check_pending_migrations,
      critical_scrapers: test_critical_scrapers,
      external_apis: test_external_api_connectivity,
      background_jobs: check_background_job_processing
    }

    if checks.values.all?
      Rails.logger.info "All post-deployment checks passed"
      StatsD.increment('deployment.success')
    else
      Rails.logger.error "Post-deployment checks failed: #{checks}"
      StatsD.increment('deployment.failure')

      # Consider automated rollback
      trigger_rollback_alert(checks)
    end

    checks
  end

  private

  def self.test_critical_scrapers
    critical_scrapers = %w[ProductScraper UserScraper]

    critical_scrapers.all? do |scraper_class|
      begin
        # Run a lightweight test scrape
        scraper_class.constantize.new.test_scrape
        true
      rescue => e
        Rails.logger.error "Critical scraper #{scraper_class} failed: #{e.message}"
        false
      end
    end
  end
end
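The trigger_rollback_alert helper referenced above is left to the application; a minimal sketch that reuses AlertingService from earlier and leaves the rollback decision to a human:

class DeploymentMonitor
  # Minimal sketch: page the on-call channel with the failed checks so a
  # person can decide whether to roll back
  def self.trigger_rollback_alert(checks)
    failed = checks.reject { |_name, passed| passed }.keys

    AlertingService.send_alert(
      severity: 'critical',
      message: 'Post-deployment checks failed - consider rolling back',
      details: { failed_checks: failed.join(', ') }
    )
  end
end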
Command Line Monitoring Tools
Essential Commands for Production Monitoring
Monitor your Ruby application's health using these essential commands:
# Check overall Sidekiq stats (processed, failed, enqueued)
bundle exec rails runner "require 'sidekiq/api'; s = Sidekiq::Stats.new; p processed: s.processed, failed: s.failed, enqueued: s.enqueued"

# Check Rails application logs
tail -f log/production.log

# Monitor system resources
htop
iostat -x 1

# Check database connections
bundle exec rails runner "puts ActiveRecord::Base.connection_pool.stat"

# Monitor memory usage of Ruby processes
ps aux | grep ruby | awk '{print $6/1024 " MB " $11}'

# Check Sidekiq queue sizes
bundle exec rails runner "
  require 'sidekiq/api'
  Sidekiq::Queue.all.each { |q| puts \"#{q.name}: #{q.size}\" }
"
Best Practices Summary
- Implement comprehensive monitoring covering performance, errors, and resource usage
- Set up intelligent alerting with appropriate thresholds and escalation paths
- Monitor data quality to ensure scraped data meets business requirements
- Use connection pooling and resource management for optimal performance
- Implement automated maintenance tasks for proactive system health
- Plan for graceful degradation when external services are unavailable (see the circuit-breaker sketch after this list)
- Regularly test critical scraping workflows in production-like environments
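A common way to plan for graceful degradation is a small circuit breaker around each external dependency. A minimal sketch (the class name, thresholds, and usage URL are illustrative, not from a specific gem):

# Minimal circuit breaker: after `threshold` consecutive failures the breaker
# opens and calls are skipped until `cool_off` seconds have elapsed
class CircuitBreaker
  def initialize(threshold: 5, cool_off: 60)
    @threshold = threshold
    @cool_off = cool_off
    @failures = 0
    @opened_at = nil
  end

  def call
    raise 'circuit open' if open?

    result = yield
    @failures = 0
    result
  rescue => e
    @failures += 1
    @opened_at = Time.now if @failures >= @threshold
    raise e
  end

  private

  def open?
    return false unless @opened_at
    return true if Time.now - @opened_at < @cool_off

    # Cool-off elapsed: allow a trial request through
    @opened_at = nil
    @failures = 0
    false
  end
end

# Usage: back off from a flaky proxy provider instead of hammering it
# breaker = CircuitBreaker.new(threshold: 3, cool_off: 120)
# breaker.call { Net::HTTP.get_response(URI('https://proxy-service.com/status')) }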
For more advanced monitoring techniques, consider exploring how to handle timeouts in Puppeteer for JavaScript-based scrapers or how to handle errors in Puppeteer for comprehensive error handling strategies.
By implementing these monitoring and maintenance practices, your Ruby web scraping applications will run more reliably in production, with faster incident detection and resolution times.