How do I implement proper logging for Ruby web scraping projects?

Proper logging is essential for Ruby web scraping projects to monitor performance, debug issues, track success rates, and maintain reliable scraping operations. This guide covers comprehensive logging strategies, from basic setup to advanced structured logging patterns.

Why Logging Matters in Web Scraping

Web scraping involves numerous potential failure points: network timeouts, rate limiting, HTML structure changes, and anti-bot measures. Effective logging helps you:

  • Debug scraping failures and understand why requests fail
  • Monitor scraping performance and identify bottlenecks
  • Track success rates and data quality metrics
  • Comply with rate limits and avoid getting blocked
  • Maintain audit trails for compliance and debugging

Basic Logging Setup with Ruby's Logger

Ruby's built-in Logger class provides a solid foundation for web scraping projects:

require 'logger'
require 'net/http'
require 'nokogiri'

class WebScraper
  def initialize
    @logger = Logger.new(STDOUT)
    @logger.level = Logger::INFO
    @logger.formatter = proc do |severity, datetime, progname, msg|
      "[#{datetime}] #{severity}: #{msg}\n"
    end
  end

  def scrape_page(url)
    @logger.info "Starting scrape for #{url}"

    begin
      response = fetch_page(url)
      @logger.info "Successfully fetched #{url} (#{response.code})"

      doc = Nokogiri::HTML(response.body)
      data = extract_data(doc)

      @logger.info "Extracted #{data.length} items from #{url}"
      data
    rescue => e
      @logger.error "Failed to scrape #{url}: #{e.message}"
      @logger.debug e.backtrace.join("\n")
      []
    end
  end

  private

  def fetch_page(url)
    uri = URI(url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true if uri.scheme == 'https'

    request = Net::HTTP::Get.new(uri)
    request['User-Agent'] = 'Mozilla/5.0 (compatible; WebScraper)'

    @logger.debug "Sending request to #{url}"
    response = http.request(request)

    unless response.is_a?(Net::HTTPSuccess)
      @logger.warn "Non-success response: #{response.code} for #{url}"
    end

    response
  end

  def extract_data(doc)
    # Your extraction logic here
    []
  end
end
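
To see the log levels in action, here is a quick usage sketch; the URL is just a placeholder, and extract_data still needs your own parsing logic:

scraper = WebScraper.new
items = scraper.scrape_page('https://example.com/products') # hypothetical target URL
puts "Scraped #{items.length} items"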

Advanced Logging with Multiple Outputs

For production scraping projects, you'll want to log to multiple destinations and use different log levels:

require 'logger'
require 'json' # used by format_message's context.to_json

class AdvancedScraper
  def initialize
    setup_logging
  end

  private

  def setup_logging
    # Console logger for development
    @console_logger = Logger.new(STDOUT)
    @console_logger.level = Logger::INFO

    # File logger for persistent storage
    @file_logger = Logger.new('logs/scraper.log', 'daily')
    @file_logger.level = Logger::DEBUG

    # Error-specific logger
    @error_logger = Logger.new('logs/errors.log')
    @error_logger.level = Logger::ERROR

    # Custom formatter
    formatter = proc do |severity, datetime, progname, msg|
      "[#{datetime.strftime('%Y-%m-%d %H:%M:%S')}] #{severity.ljust(5)} #{progname}: #{msg}\n"
    end

    [@console_logger, @file_logger, @error_logger].each do |logger|
      logger.formatter = formatter
    end
  end

  def log_info(message, context = {})
    formatted_msg = format_message(message, context)
    @console_logger.info(formatted_msg)
    @file_logger.info(formatted_msg)
  end

  def log_error(message, error = nil, context = {})
    formatted_msg = format_message(message, context)
    formatted_msg += "\nError: #{error.message}\n#{error.backtrace.join("\n")}" if error

    @console_logger.error(formatted_msg)
    @file_logger.error(formatted_msg)
    @error_logger.error(formatted_msg)
  end

  def format_message(message, context = {})
    context_str = context.empty? ? '' : " | Context: #{context.to_json}"
    "#{message}#{context_str}"
  end
end
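
Because log_info and log_error are private, they are meant to be called from the scraper's own public methods. A hypothetical scrape_page method (the fetching and parsing are left as placeholders) could use them like this:

class AdvancedScraper
  def scrape_page(url)
    log_info("Starting scrape", url: url)
    # ... fetch and parse the page here ...
  rescue => e
    log_error("Scrape failed", e, url: url)
    []
  end
end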

Structured Logging with JSON

Structured logging makes it easier to parse and analyze logs programmatically. Here's how to implement JSON logging:

require 'json'
require 'logger'
require 'time' # needed for Time#iso8601 in the formatter

class StructuredLogger
  def initialize(output = STDOUT)
    @logger = Logger.new(output)
    @logger.formatter = proc do |severity, datetime, progname, msg|
      log_entry = {
        timestamp: datetime.iso8601,
        level: severity,
        message: msg.is_a?(String) ? msg : msg[:message],
        **extract_context(msg)
      }
      "#{log_entry.to_json}\n"
    end
  end

  def info(message, **context)
    @logger.info(message: message, **context)
  end

  def error(message, error: nil, **context)
    error_context = error ? {
      error_class: error.class.name,
      error_message: error.message,
      backtrace: error.backtrace&.first(5)
    } : {}

    @logger.error(message: message, **context, **error_context)
  end

  def warn(message, **context)
    @logger.warn(message: message, **context)
  end

  def debug(message, **context)
    @logger.debug(message: message, **context)
  end

  private

  def extract_context(msg)
    return {} unless msg.is_a?(Hash)
    msg.except(:message) # Hash#except requires Ruby 3.0+
  end
end

# Usage example
class ScraperWithStructuredLogging
  def initialize
    @logger = StructuredLogger.new(File.open('logs/scraper.json', 'a'))
  end

  def scrape_with_context(url, user_id: nil)
    start_time = Time.now

    @logger.info("Starting scrape",
      url: url,
      user_id: user_id,
      scraper_version: "1.0.0"
    )

    begin
      response = fetch_page(url)
      duration = Time.now - start_time

      @logger.info("Scrape completed successfully",
        url: url,
        response_code: response.code,
        duration_seconds: duration.round(3),
        response_size_bytes: response.body.length
      )

    rescue Net::OpenTimeout, Net::ReadTimeout => e # Net::HTTP raises these, not a generic Net::TimeoutError
      @logger.error("Scrape failed due to timeout",
        error: e,
        url: url,
        duration_seconds: (Time.now - start_time).round(3)
      )
    rescue => e
      @logger.error("Scrape failed with unexpected error",
        error: e,
        url: url,
        duration_seconds: (Time.now - start_time).round(3)
      )
    end
  end
end
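
Each entry comes out as a single JSON object per line, which makes the file easy to feed into log aggregation tools. Assuming a fetch_page helper like the one from the basic example, a call and a roughly illustrative output line look like this:

scraper = ScraperWithStructuredLogging.new
scraper.scrape_with_context('https://example.com', user_id: 42) # placeholder values

# Illustrative line appended to logs/scraper.json (values will differ):
# {"timestamp":"2024-05-01T12:00:00+00:00","level":"INFO","message":"Starting scrape","url":"https://example.com","user_id":42,"scraper_version":"1.0.0"}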

Request and Response Logging

Detailed HTTP logging is crucial for debugging scraping issues:

require 'net/http'

class HTTPLogger
  def initialize(logger)
    @logger = logger
  end

  def log_request(request, uri)
    @logger.debug("HTTP Request",
      method: request.method,
      url: uri.to_s,
      headers: sanitize_headers(request.to_hash),
      body_size: request.body&.length || 0
    )
  end

  def log_response(response, uri, duration)
    @logger.info("HTTP Response",
      url: uri.to_s,
      status_code: response.code,
      status_message: response.message,
      headers: response.to_hash,
      body_size: response.body&.length || 0,
      duration_ms: (duration * 1000).round(2)
    )
  end

  def log_request_failure(uri, error, duration)
    @logger.error("HTTP Request Failed",
      url: uri.to_s,
      error_class: error.class.name,
      error_message: error.message,
      duration_ms: (duration * 1000).round(2)
    )
  end

  private

  def sanitize_headers(headers)
    # Remove sensitive headers
    headers.except('authorization', 'cookie', 'x-api-key')
  end
end

# Enhanced scraper with HTTP logging
class ScraperWithHTTPLogging
  def initialize
    @logger = StructuredLogger.new
    @http_logger = HTTPLogger.new(@logger)
  end

  def fetch_page(url)
    uri = URI(url)
    request = Net::HTTP::Get.new(uri)
    request['User-Agent'] = 'WebScraper/1.0'

    @http_logger.log_request(request, uri)

    start_time = Time.now
    begin
      response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
        http.request(request)
      end

      duration = Time.now - start_time
      @http_logger.log_response(response, uri, duration)
      response

    rescue => e
      duration = Time.now - start_time
      @http_logger.log_request_failure(uri, e, duration)
      raise
    end
  end
end

Performance and Metrics Logging

Track scraping performance to identify optimization opportunities:

class PerformanceLogger
  def initialize(logger)
    @logger = logger
    @stats = {
      requests_count: 0,
      successful_requests: 0,
      failed_requests: 0,
      total_response_time: 0,
      items_scraped: 0
    }
  end

  def log_scraping_session_start(urls_count)
    @session_start = Time.now
    @logger.info("Scraping session started",
      urls_to_scrape: urls_count,
      session_id: generate_session_id
    )
  end

  def log_scraping_session_end
    duration = Time.now - @session_start

    @logger.info("Scraping session completed",
      total_duration_seconds: duration.round(2),
      requests_made: @stats[:requests_count],
      success_rate: calculate_success_rate,
      average_response_time_ms: calculate_average_response_time,
      items_per_second: (@stats[:items_scraped] / duration).round(2),
      total_items_scraped: @stats[:items_scraped]
    )
  end

  def log_page_scraped(url, success, response_time, items_count = 0)
    @stats[:requests_count] += 1
    @stats[:total_response_time] += response_time
    @stats[:items_scraped] += items_count

    if success
      @stats[:successful_requests] += 1
    else
      @stats[:failed_requests] += 1
    end

    @logger.debug("Page scraping completed",
      url: url,
      success: success,
      response_time_ms: (response_time * 1000).round(2),
      items_extracted: items_count,
      running_success_rate: calculate_success_rate
    )
  end

  private

  def calculate_success_rate
    return 0 if @stats[:requests_count] == 0
    (@stats[:successful_requests].to_f / @stats[:requests_count] * 100).round(2)
  end

  def calculate_average_response_time
    return 0 if @stats[:requests_count] == 0
    (@stats[:total_response_time] / @stats[:requests_count] * 1000).round(2)
  end

  def generate_session_id
    Time.now.strftime('%Y%m%d_%H%M%S') + '_' + rand(1000).to_s.rjust(3, '0')
  end
end
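
A rough sketch of wiring the performance logger into a scraping loop follows; fetch_and_extract is a hypothetical helper that returns a success flag and the number of items extracted:

logger = StructuredLogger.new
perf = PerformanceLogger.new(logger)

urls = ['https://example.com/page1', 'https://example.com/page2'] # placeholder URLs
perf.log_scraping_session_start(urls.length)

urls.each do |url|
  started = Time.now
  success, items_count = fetch_and_extract(url) # hypothetical helper, e.g. [true, 12]
  perf.log_page_scraped(url, success, Time.now - started, items_count)
end

perf.log_scraping_session_end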

Rate Limiting and Retry Logging

Log rate limiting events and retry attempts to optimize scraping speed:

class RateLimitedScraper
  def initialize
    @logger = StructuredLogger.new
    @performance_logger = PerformanceLogger.new(@logger)
  end

  def scrape_with_retries(url, max_retries: 3)
    attempt = 1

    begin
      @logger.debug("Attempting to scrape",
        url: url,
        attempt: attempt,
        max_retries: max_retries
      )

      response = fetch_with_rate_limiting(url)
      @logger.info("Scrape successful", url: url, final_attempt: attempt)
      response

    rescue Net::HTTPClientException => e
      # Net::HTTPTooManyRequests is a response class, not an exception, so it
      # cannot be rescued directly; rescue the exception raised by
      # Net::HTTPResponse#value and inspect the wrapped response instead.
      raise unless e.response.is_a?(Net::HTTPTooManyRequests)

      if attempt <= max_retries
        wait_time = calculate_backoff_time(attempt)
        @logger.warn("Rate limited, retrying",
          url: url,
          attempt: attempt,
          retry_after_seconds: wait_time,
          rate_limit_headers: extract_rate_limit_headers(e.response)
        )

        sleep(wait_time)
        attempt += 1
        retry
      else
        @logger.error("Max retries exceeded due to rate limiting",
          url: url,
          total_attempts: attempt
        )
        raise
      end
    rescue => e
      @logger.error("Scrape failed permanently",
        error: e,
        url: url,
        total_attempts: attempt
      )
      raise
    end
  end

  private

  def calculate_backoff_time(attempt)
    # Exponential backoff: 2^attempt seconds
    2 ** attempt
  end

  def extract_rate_limit_headers(response)
    {
      retry_after: response['Retry-After'],
      rate_limit_remaining: response['X-RateLimit-Remaining'],
      rate_limit_reset: response['X-RateLimit-Reset']
    }.compact
  end
end
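
The fetch_with_rate_limiting call above is not defined in the snippet. For the Net::HTTPClientException rescue to fire, it has to raise on error responses, for example by calling Net::HTTPResponse#value, which raises for non-2xx statuses. A minimal sketch, without any additional throttling logic:

require 'net/http'

class RateLimitedScraper
  private

  def fetch_with_rate_limiting(url)
    uri = URI(url)
    response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
      http.request(Net::HTTP::Get.new(uri))
    end

    # #value raises Net::HTTPClientException for 4xx responses (including 429),
    # which is what scrape_with_retries rescues and inspects.
    response.value
    response
  end
end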

Configuration and Best Practices

Create a configurable logging system for different environments:

class LoggerConfig
  def self.create_logger(environment = 'development')
    case environment
    when 'development'
      create_development_logger
    when 'production'
      create_production_logger
    when 'test'
      create_test_logger
    else
      create_development_logger # fall back to the development logger for unknown environments
    end
  end

  # NOTE: `private` alone has no effect on methods defined with `def self.`;
  # private_class_method (below) is what actually makes them private.

  def self.create_development_logger
    logger = Logger.new(STDOUT)
    logger.level = Logger::DEBUG
    logger
  end

  def self.create_production_logger
    # Log to file with rotation
    logger = Logger.new('logs/production.log', 10, 1024000) # 10 files, 1MB each
    logger.level = Logger::INFO
    logger
  end

  def self.create_test_logger
    # Silent logger for tests
    logger = Logger.new(File::NULL) # portable null device ('/dev/null' on Unix, 'NUL' on Windows)
    logger.level = Logger::FATAL
    logger
  end

  private_class_method :create_development_logger, :create_production_logger,
                       :create_test_logger
end

# Environment-specific configuration
ENV_LOGGER = LoggerConfig.create_logger(ENV['RAILS_ENV'] || 'development')
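
The configured logger can then be injected wherever a scraper needs it. The constructor parameter below is a hypothetical pattern, not part of the earlier classes:

class ConfigurableScraper
  def initialize(logger: ENV_LOGGER)
    @logger = logger
  end

  def scrape(url)
    @logger.info("Scraping #{url}")
    # ... fetch and parse here ...
  end
end

ConfigurableScraper.new.scrape('https://example.com') # placeholder URL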

Integration with External Services

For production applications, consider integrating with external logging services:

# Example integration with external logging service
require 'net/http'
require 'json'
require 'logger'

class ExternalLoggerAdapter
  def initialize(api_key, endpoint)
    @api_key = api_key
    @endpoint = endpoint
    @local_logger = Logger.new(STDOUT)
  end

  def log(level, message, context = {})
    # Log locally first
    @local_logger.send(level, message)

    # Send to external service
    Thread.new do
      send_to_external_service(level, message, context)
    end
  end

  private

  def send_to_external_service(level, message, context)
    payload = {
      timestamp: Time.now.iso8601,
      level: level.to_s.upcase,
      message: message,
      context: context,
      service: 'web-scraper'
    }

    begin
      uri = URI(@endpoint)
      http = Net::HTTP.new(uri.host, uri.port)
      http.use_ssl = true

      request = Net::HTTP::Post.new(uri)
      request['Authorization'] = "Bearer #{@api_key}"
      request['Content-Type'] = 'application/json'
      request.body = payload.to_json

      http.request(request)
    rescue => e
      @local_logger.error("Failed to send log to external service: #{e.message}")
    end
  end
end
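
Usage mirrors a plain logger; the endpoint, API key, and metric values below are placeholders:

external_logger = ExternalLoggerAdapter.new(
  ENV['LOG_SERVICE_API_KEY'],            # placeholder credential
  'https://logs.example.com/api/ingest'  # placeholder endpoint
)

external_logger.log(:info, 'Scrape batch finished', urls_processed: 120, failures: 3)

Keep in mind that spawning a thread per log call is fire-and-forget: entries can be lost if the process exits before the thread completes. For high-volume scrapers, a bounded queue drained by a single background worker is usually a safer design.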

Summary

Implementing proper logging in Ruby web scraping projects involves:

  1. Multi-level logging with appropriate log levels for different scenarios
  2. Structured logging using JSON format for easier analysis
  3. Performance tracking to monitor scraping efficiency
  4. Error logging with full context and stack traces
  5. HTTP request/response logging for debugging network issues
  6. Rate limiting awareness with retry and backoff logging

Effective logging transforms web scraping from a black-box operation into a transparent, debuggable, and maintainable process. Much like handling errors in Puppeteer, proper error logging in Ruby helps you identify and resolve issues quickly, before they impact your scraping operations.

Remember to regularly review your logs, set up alerting for critical errors, and adjust log levels based on your monitoring needs. When dealing with large-scale scraping operations, consider implementing log aggregation and analysis tools to gain insights into your scraping performance and reliability patterns.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
