What is the Best Way to Handle Errors and Exceptions in Ruby Web Scraping?

Error handling is a critical aspect of robust Ruby web scraping applications. Without proper exception management, your scrapers can fail silently, crash unexpectedly, or provide unreliable results. This guide covers comprehensive error handling strategies for Ruby web scraping using popular libraries like HTTParty, Net::HTTP, and Nokogiri.

Common Types of Errors in Ruby Web Scraping

Network-Related Errors

Network issues are the most common problems in web scraping:

require 'httparty'

begin
  response = HTTParty.get('https://example.com')
# Rescue the specific timeout classes before the generic Timeout::Error
rescue Net::OpenTimeout => e
  puts "Connection timeout: #{e.message}"
rescue Net::ReadTimeout => e
  puts "Read timeout: #{e.message}"
rescue Timeout::Error => e
  puts "Request timed out: #{e.message}"
rescue SocketError => e
  puts "Network error (DNS or socket failure): #{e.message}"
rescue Errno::ECONNREFUSED => e
  puts "Connection refused: #{e.message}"
end

HTTP Status Code Errors

Handle various HTTP response codes appropriately:

require 'httparty'

class WebScraper
  def self.fetch_page(url)
    response = HTTParty.get(url)

    case response.code
    when 200
      response.body
    when 404
      raise StandardError, "Page not found: #{url}"
    when 403
      raise StandardError, "Access forbidden: #{url}"
    when 429
      raise StandardError, "Rate limited. Please retry later."
    when 500..599
      raise StandardError, "Server error (#{response.code}): #{url}"
    else
      raise StandardError, "Unexpected response code: #{response.code}"
    end
  rescue HTTParty::Error => e
    raise StandardError, "HTTParty error: #{e.message}"
  end
end
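
A 429 response usually means the server wants you to slow down, and it often includes a Retry-After header telling you how long to wait. A minimal sketch of honoring that header before retrying (the method name and the 60-second fallback are illustrative assumptions, and only the integer-seconds form of the header is handled):

require 'httparty'

# Sketch: wait for the Retry-After interval on 429 responses, then retry once.
# Falls back to 60 seconds if the header is missing; the HTTP-date form of
# Retry-After is not handled here.
def fetch_respecting_rate_limit(url)
  response = HTTParty.get(url)

  if response.code == 429
    wait = (response.headers['Retry-After'] || 60).to_i
    puts "Rate limited. Waiting #{wait}s before retrying..."
    sleep(wait)
    response = HTTParty.get(url)
  end

  response
end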

Parsing and Data Extraction Errors

Handle errors when parsing HTML or extracting data:

require 'nokogiri'

def safe_parse_html(html_content)
  begin
    doc = Nokogiri::HTML(html_content)

    # Safe element selection with error handling
    title = doc.css('title').first&.text || 'No title found'

    # Handle missing elements gracefully
    price = doc.css('.price').first&.text&.strip
    if price.nil? || price.empty?
      puts "Warning: Price not found on page"
      price = "N/A"
    end

    {
      title: title,
      price: price
    }
  rescue Nokogiri::XML::SyntaxError => e
    puts "HTML parsing error: #{e.message}"
    return nil
  rescue StandardError => e
    puts "Unexpected parsing error: #{e.message}"
    return nil
  end
end
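
Scraped pages sometimes arrive with invalid or mislabeled byte sequences, which can surface as encoding errors during extraction. A small sketch of defensively normalizing the input before parsing (assuming UTF-8 is the intended encoding; the empty replacement string is an arbitrary choice):

require 'nokogiri'

# Sketch: strip invalid byte sequences before handing the HTML to Nokogiri.
# force_encoding tags the string as UTF-8 and scrub replaces any bytes that
# are not valid in that encoding.
def parse_with_encoding_fallback(html_content)
  cleaned = html_content.dup.force_encoding('UTF-8').scrub('')
  Nokogiri::HTML(cleaned)
rescue StandardError => e
  puts "Encoding/parsing error: #{e.message}"
  nil
end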

Implementing Retry Logic

Basic Retry Mechanism

Implement exponential backoff for transient failures:

def fetch_with_retry(url, max_retries: 3, initial_delay: 1)
  retries = 0

  begin
    response = HTTParty.get(url, timeout: 30)

    if response.success?
      return response.body
    else
      raise StandardError, "HTTP #{response.code}"
    end

  rescue Net::OpenTimeout, Net::ReadTimeout, Timeout::Error,
         SocketError, Errno::ECONNREFUSED => e
    retries += 1

    if retries <= max_retries
      delay = initial_delay * (2 ** (retries - 1))
      puts "Retry #{retries}/#{max_retries} after #{delay}s delay. Error: #{e.message}"
      sleep(delay)
      retry
    else
      puts "Max retries exceeded. Final error: #{e.message}"
      raise e
    end
  end
end
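
Pure exponential backoff can cause many workers to retry in lockstep after a shared outage. Adding random jitter to the delay spreads retries out; a minimal variation on the delay calculation above (the 50% jitter range is an arbitrary assumption):

# Sketch: exponential backoff with random jitter.
# attempt is 1-based, matching the retries counter in fetch_with_retry.
def backoff_with_jitter(attempt, initial_delay: 1)
  base = initial_delay * (2 ** (attempt - 1))
  base + rand(0.0..(base * 0.5)) # up to 50% extra delay, chosen arbitrarily
end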

Advanced Retry with Different Strategies

class RetryHandler
  RETRYABLE_ERRORS = [
    Timeout::Error,
    Net::OpenTimeout,
    Net::ReadTimeout,
    SocketError,
    Errno::ECONNREFUSED,
    HTTParty::Error
  ].freeze

  def self.with_retry(max_retries: 3, backoff: :exponential)
    retries = 0

    begin
      yield
    rescue *RETRYABLE_ERRORS => e
      retries += 1

      if retries <= max_retries
        delay = calculate_delay(retries, backoff)
        puts "Attempt #{retries}/#{max_retries} failed: #{e.message}"
        puts "Retrying in #{delay} seconds..."
        sleep(delay)
        retry
      else
        puts "All retry attempts exhausted"
        raise e
      end
    end
  end


  def self.calculate_delay(attempt, strategy)
    case strategy
    when :exponential
      2 ** attempt
    when :linear
      attempt * 2
    when :constant
      3
    else
      1
    end
  end

  private_class_method :calculate_delay
end

# Usage
begin
  data = RetryHandler.with_retry(max_retries: 5, backoff: :exponential) do
    fetch_and_parse_page('https://example.com') # your own fetch-and-parse method
  end
rescue StandardError => e
  puts "Failed to fetch data after all retries: #{e.message}"
end

Comprehensive Error Handling Class

Here's a complete example of a robust web scraper with comprehensive error handling:

require 'httparty'
require 'nokogiri'
require 'logger'

class RobustWebScraper
  include HTTParty

  def initialize
    @logger = Logger.new(STDOUT)
    @logger.level = Logger::INFO

    # Configure HTTParty
    self.class.default_timeout 30
    self.class.follow_redirects true
    self.class.headers 'User-Agent' => 'Mozilla/5.0 (compatible; RubyBot/1.0)'
  end

  def scrape_page(url)
    @logger.info("Starting to scrape: #{url}")

    begin
      validate_url(url)
      html_content = fetch_page_with_retry(url)
      parsed_data = parse_content(html_content)

      @logger.info("Successfully scraped: #{url}")
      parsed_data

    rescue ValidationError => e
      @logger.error("Validation error: #{e.message}")
      nil
    rescue NetworkError => e
      @logger.error("Network error: #{e.message}")
      nil
    rescue ParseError => e
      @logger.error("Parse error: #{e.message}")
      nil
    rescue StandardError => e
      @logger.error("Unexpected error: #{e.message}")
      @logger.error(e.backtrace.join("\n"))
      nil
    end
  end

  private

  def validate_url(url)
    raise ValidationError, "URL cannot be nil or empty" if url.nil? || url.strip.empty?

    begin
      uri = URI.parse(url)
      raise ValidationError, "Invalid URL format" unless uri.is_a?(URI::HTTP) || uri.is_a?(URI::HTTPS)
    rescue URI::InvalidURIError
      raise ValidationError, "Malformed URL: #{url}"
    end
  end

  def fetch_page_with_retry(url, max_retries: 3)
    retries = 0

    begin
      response = self.class.get(url)

      case response.code
      when 200
        response.body
      when 404
        raise NetworkError, "Page not found: #{url}"
      when 403
        raise NetworkError, "Access forbidden: #{url}"
      when 429
        raise NetworkError, "Rate limited. Consider using delays between requests."
      when 500..599
        raise NetworkError, "Server error (#{response.code}): #{url}"
      else
        raise NetworkError, "Unexpected response code: #{response.code}"
      end

    rescue Net::OpenTimeout, Net::ReadTimeout, Timeout::Error => e
      retries += 1

      if retries <= max_retries
        delay = 2 ** retries
        @logger.warn("Timeout error, retrying in #{delay}s... (#{retries}/#{max_retries})")
        sleep(delay)
        retry
      else
        raise NetworkError, "Timeout after #{max_retries} retries: #{e.message}"
      end

    rescue SocketError, Errno::ECONNREFUSED => e
      raise NetworkError, "Connection error: #{e.message}"
    rescue HTTParty::Error => e
      raise NetworkError, "HTTParty error: #{e.message}"
    end
  end

  def parse_content(html_content)
    begin
      doc = Nokogiri::HTML(html_content)

      # Extract data with safe navigation
      extracted_data = {
        title: safe_extract_text(doc, 'title'),
        description: safe_extract_text(doc, 'meta[name="description"]', 'content'),
        headings: safe_extract_multiple(doc, 'h1, h2, h3'),
        links: safe_extract_links(doc)
      }

      validate_extracted_data(extracted_data)
      extracted_data

    rescue Nokogiri::XML::SyntaxError => e
      raise ParseError, "HTML parsing failed: #{e.message}"
    rescue StandardError => e
      raise ParseError, "Data extraction failed: #{e.message}"
    end
  end

  def safe_extract_text(doc, selector, attribute = nil)
    element = doc.css(selector).first
    return nil unless element

    if attribute
      element[attribute]
    else
      element.text.strip
    end
  rescue StandardError => e
    @logger.warn("Failed to extract text from '#{selector}': #{e.message}")
    nil
  end

  def safe_extract_multiple(doc, selector)
    doc.css(selector).map { |el| el.text.strip }.reject(&:empty?)
  rescue StandardError => e
    @logger.warn("Failed to extract multiple elements '#{selector}': #{e.message}")
    []
  end

  def safe_extract_links(doc)
    doc.css('a[href]').map { |link| link['href'] }.compact.uniq
  rescue StandardError => e
    @logger.warn("Failed to extract links: #{e.message}")
    []
  end

  def validate_extracted_data(data)
    if data[:title].nil? || data[:title].empty?
      @logger.warn("No title found on page")
    end

    if data[:links].empty?
      @logger.warn("No links found on page")
    end
  end
end

# Custom exception classes
class ValidationError < StandardError; end
class NetworkError < StandardError; end
class ParseError < StandardError; end

# Usage example
scraper = RobustWebScraper.new
result = scraper.scrape_page('https://example.com')

if result
  puts "Scraping successful!"
  puts "Title: #{result[:title]}"
  puts "Found #{result[:links].length} links"
else
  puts "Scraping failed. Check logs for details."
end

Rate Limiting and Respectful Scraping

Implement delays between requests and respect robots.txt to avoid getting blocked. The class below handles the delays; a simplified robots.txt check is sketched after it:

class RespectfulScraper
  def initialize(delay: 1)
    @delay = delay
    @last_request_time = nil
  end

  def fetch_with_delay(url)
    enforce_delay

    begin
      response = HTTParty.get(url)
      @last_request_time = Time.now
      response
    rescue StandardError => e
      puts "Error fetching #{url}: #{e.message}"
      raise e
    end
  end

  private

  def enforce_delay
    return unless @last_request_time

    elapsed = Time.now - @last_request_time
    if elapsed < @delay
      sleep_time = @delay - elapsed
      puts "Sleeping for #{sleep_time.round(2)} seconds..."
      sleep(sleep_time)
    end
  end
end
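
The robots.txt side is not covered by the class above. A minimal, deliberately simplified sketch of checking Disallow rules before fetching (real robots.txt parsing also involves user-agent groups, Allow rules, and wildcards, which are ignored here; the fail-open behavior is an assumption you may want to change):

require 'httparty'
require 'uri'

# Sketch: very simplified robots.txt check.
# Only looks at "Disallow:" lines and does plain prefix matching.
def allowed_by_robots?(url)
  uri = URI.parse(url)
  robots_url = "#{uri.scheme}://#{uri.host}/robots.txt"

  response = HTTParty.get(robots_url)
  return true unless response.code == 200 # no robots.txt: assume allowed

  disallowed = response.body.each_line
                       .map(&:strip)
                       .select { |line| line.downcase.start_with?('disallow:') }
                       .map { |line| line.split(':', 2).last.strip }
                       .reject(&:empty?)

  disallowed.none? { |rule| uri.path.start_with?(rule) }
rescue StandardError => e
  puts "Could not check robots.txt: #{e.message}"
  true # fail open; adjust to your policy
end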

Monitoring and Alerting

Set up proper logging and monitoring for production scrapers:

require 'logger'

class ProductionScraper
  def initialize
    @logger = setup_logger
    @error_count = 0
    @success_count = 0
  end

  def scrape_with_monitoring(urls)
    urls.each do |url|
      begin
        result = scrape_page(url) # your own scraping method
        @success_count += 1
        @logger.info("Success: #{url}")
      rescue StandardError => e
        @error_count += 1
        @logger.error("Failed: #{url} - #{e.message}")

        # Alert if error rate is too high
        check_error_rate
      end
    end

    log_summary
  end

  private

  def setup_logger
    logger = Logger.new('scraper.log')
    logger.level = Logger::INFO
    logger.formatter = proc do |severity, datetime, progname, msg|
      "#{datetime.strftime('%Y-%m-%d %H:%M:%S')} [#{severity}] #{msg}\n"
    end
    logger
  end

  def check_error_rate
    total_requests = @success_count + @error_count
    error_rate = @error_count.to_f / total_requests

    if error_rate > 0.5 && total_requests > 10
      @logger.error("HIGH ERROR RATE ALERT: #{(error_rate * 100).round(1)}%")
      # In production, you might send an email or Slack notification here
    end
  end

  def log_summary
    total = @success_count + @error_count
    @logger.info("Scraping completed. Success: #{@success_count}, Errors: #{@error_count}, Total: #{total}")
  end
end
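
For the alerting hook mentioned in check_error_rate, here is a minimal sketch of posting to a Slack incoming webhook (the SLACK_WEBHOOK_URL environment variable is a placeholder you would supply; any other notification channel would work the same way):

require 'net/http'
require 'json'
require 'uri'

# Sketch: post an alert message to a Slack incoming webhook.
def send_alert(message, webhook_url: ENV['SLACK_WEBHOOK_URL'])
  return unless webhook_url

  uri = URI.parse(webhook_url)
  request = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
  request.body = { text: message }.to_json

  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
rescue StandardError => e
  # Never let alerting failures crash the scraper itself.
  puts "Failed to send alert: #{e.message}"
end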

Best Practices Summary

  1. Always use specific exception handling rather than catching all StandardError
  2. Implement retry logic with exponential backoff for transient failures
  3. Add proper logging to track successes, failures, and performance
  4. Validate inputs and outputs to catch issues early
  5. Respect rate limits and implement delays between requests
  6. Monitor error rates and set up alerts for production systems
  7. Use safe navigation (&.) when extracting data from parsed HTML
  8. Handle different HTTP status codes appropriately
  9. Implement circuit breaker patterns for unreliable services (a minimal sketch follows this list)
  10. Test error scenarios in your development environment
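
A minimal in-memory circuit breaker sketch: after a configurable number of consecutive failures it stops calling the target until a cooldown has passed, then allows a trial request (the thresholds and the CircuitOpenError name are illustrative choices, not a standard API):

# Sketch: simple in-memory circuit breaker.
# Opens after `failure_threshold` consecutive failures and refuses calls
# until `cooldown` seconds have passed; a successful call resets it.
class CircuitOpenError < StandardError; end

class CircuitBreaker
  def initialize(failure_threshold: 5, cooldown: 60)
    @failure_threshold = failure_threshold
    @cooldown = cooldown
    @failures = 0
    @opened_at = nil
  end

  def call
    raise CircuitOpenError, "Circuit open, wait #{@cooldown}s" if open?

    result = yield
    @failures = 0 # success closes the circuit
    result
  rescue CircuitOpenError
    raise
  rescue StandardError => e
    @failures += 1
    @opened_at = Time.now if @failures >= @failure_threshold
    raise e
  end

  private

  def open?
    return false unless @opened_at

    if Time.now - @opened_at < @cooldown
      true
    else
      # Cooldown elapsed: allow a single trial request (half-open state).
      @opened_at = nil
      false
    end
  end
end

# Usage
breaker = CircuitBreaker.new(failure_threshold: 3, cooldown: 30)
breaker.call { HTTParty.get('https://example.com') }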

Understanding how to handle errors in Puppeteer can provide additional insights for browser-based scraping scenarios, while handling timeouts in Puppeteer offers complementary timeout management strategies.

By implementing these comprehensive error handling strategies, your Ruby web scraping applications will be more robust, reliable, and maintainable in production environments.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
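
From Ruby, the same question endpoint can be called with HTTParty. A minimal sketch mirroring the first curl example above (YOUR_API_KEY is a placeholder for your actual key):

require 'httparty'

# Sketch: Ruby equivalent of the /ai/question curl example above.
response = HTTParty.get(
  'https://api.webscraping.ai/ai/question',
  query: {
    url: 'https://example.com',
    question: 'What is the main topic?',
    api_key: 'YOUR_API_KEY'
  }
)

puts response.body if response.success?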
