What are the best practices for error handling in HTTParty when scraping websites?

Best Practices for Error Handling in HTTParty Web Scraping

When web scraping with HTTParty in Ruby, robust error handling is crucial for building reliable scrapers that can gracefully handle network issues, rate limits, server errors, and unexpected responses. Here are the essential best practices:

1. Handle Network and HTTP Exceptions

Requests made through HTTParty can raise exceptions from several layers (HTTParty itself, Net::HTTP timeouts, DNS and socket failures, SSL), so rescue each explicitly:

require 'httparty'

begin
  response = HTTParty.get('https://example.com')
rescue HTTParty::Error => e
  puts "HTTParty error: #{e.message}"
rescue Net::OpenTimeout, Net::ReadTimeout => e
  puts "Request timed out: #{e.message}"
rescue SocketError => e
  puts "Network error: #{e.message}"
rescue OpenSSL::SSL::SSLError => e
  puts "SSL error: #{e.message}"
rescue StandardError => e
  puts "Unexpected error: #{e.message}"
end
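
If you prefer exceptions over manual status checks, recent HTTParty versions (0.16+) also offer a class-level raise_on list that raises HTTParty::ResponseError for the status codes you name. A minimal sketch (the StrictClient class name and URL are illustrative):

require 'httparty'

class StrictClient
  include HTTParty
  # Raise instead of returning the response for these status codes
  raise_on [404, 429, 500, 502, 503]
end

begin
  response = StrictClient.get('https://example.com/missing-page')
  puts response.code
rescue HTTParty::ResponseError => e
  # e.response is the underlying Net::HTTPResponse
  puts "Request failed with status #{e.response.code}"
end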

2. Validate HTTP Status Codes

Always check response codes before processing content:

response = HTTParty.get('https://example.com')

case response.code
when 200
  # Success - process the response
  process_content(response.body)
when 301, 302, 303, 307, 308
  # Redirects (HTTParty follows these automatically by default, so these codes only appear when redirect following is disabled)
  puts "Redirect received"
when 404
  puts "Page not found"
when 429
  handle_rate_limit(response)
when 500..599
  puts "Server error: #{response.code}"
else
  puts "Unexpected status: #{response.code}"
end
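
Because HTTParty follows redirects automatically, the 3xx branch above rarely fires. If you need to inspect redirects yourself (for example, to log where a page has moved), disable automatic following; a short sketch with a placeholder URL:

response = HTTParty.get('https://example.com/old-page', follow_redirects: false)

if response.code.between?(300, 399)
  # With follow_redirects disabled, the Location header shows the redirect target
  puts "Redirected to: #{response.headers['location']}"
end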

3. Configure Appropriate Timeouts

Set reasonable timeout values to prevent hanging requests:

class WebScraper
  include HTTParty

  # Set global timeout
  default_timeout 30

  # Or set per-request timeouts (open_timeout/read_timeout take precedence over timeout)
  def fetch_page(url)
    HTTParty.get(url, {
      timeout: 15,
      open_timeout: 10,
      read_timeout: 20
    })
  end
end
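
When these limits are exceeded, the request raises Net::OpenTimeout (while connecting) or Net::ReadTimeout (while waiting for data), so pair the configuration with a rescue. A brief sketch using the fetch_page method above and a placeholder URL:

scraper = WebScraper.new

begin
  response = scraper.fetch_page('https://example.com/slow-page')
  puts response.code
rescue Net::OpenTimeout
  puts "Could not open a connection within the open_timeout"
rescue Net::ReadTimeout
  puts "Server accepted the connection but was too slow to respond"
end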

4. Implement Rate Limit Handling

Respect 429 Too Many Requests responses and Retry-After headers:

def handle_rate_limit(response)
  retry_after = response.headers['Retry-After']

  if retry_after
    wait_time = retry_after.to_i
    puts "Rate limited. Waiting #{wait_time} seconds..."
    sleep(wait_time)
  else
    # Default backoff if no Retry-After header
    sleep(60)
  end
end
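
Note that Retry-After may also arrive as an HTTP date rather than a number of seconds. A slightly more defensive variant (a sketch, not part of the helper above) handles both forms:

require 'time'

def retry_after_seconds(response, default: 60)
  header = response.headers['retry-after']
  return default unless header

  if header.match?(/\A\d+\z/)
    header.to_i                                       # delta-seconds form, e.g. "120"
  else
    [(Time.httpdate(header) - Time.now).ceil, 0].max  # HTTP-date form
  end
rescue ArgumentError
  default # unparseable header value
end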

5. Use Exponential Backoff for Retries

Implement progressive retry delays for failed requests:

def fetch_with_retry(url, max_retries = 3)
  retries = 0

  begin
    response = HTTParty.get(url, timeout: 30)

    if response.success?
      return response
    elsif response.code == 429
      handle_rate_limit(response)
      raise "Rate limited" # Trigger retry
    else
      raise "HTTP #{response.code}"
    end

  rescue StandardError => e
    retries += 1

    if retries <= max_retries
      wait_time = [2 ** retries, 60].min # Cap at 60 seconds
      puts "Retry #{retries}/#{max_retries} after #{wait_time}s: #{e.message}"
      sleep(wait_time)
      retry
    else
      raise "Max retries exceeded: #{e.message}"
    end
  end
end
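
A common refinement, not shown above but worth considering, is adding random jitter to the backoff so that many scraper processes do not retry in lockstep:

# Exponential backoff with "full jitter": sleep a random amount up to the capped delay
def backoff_with_jitter(attempt, cap: 60)
  base = [2 ** attempt, cap].min
  sleep(rand(0.0..base.to_f))
end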

6. Comprehensive Error Logging

Log detailed error information for debugging:

require 'logger'
require 'json' # for Hash#to_json
require 'time' # for Time#iso8601

class ScrapingLogger
  def self.logger
    @logger ||= Logger.new('scraper.log')
  end

  def self.log_error(error, url, context = {})
    logger.error({
      timestamp: Time.now.iso8601,
      error_class: error.class.name,
      error_message: error.message,
      url: url,
      backtrace: error.backtrace&.first(5),
      context: context
    }.to_json)
  end
end
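
The logger can then be called from any rescue block; a short usage sketch with a placeholder URL:

url = 'https://example.com/products' # placeholder URL

begin
  response = HTTParty.get(url)
  raise "HTTP #{response.code}" unless response.success?
rescue StandardError => e
  ScrapingLogger.log_error(e, url, { attempt: 1 })
end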

7. Complete Error-Resistant Scraper Example

Here's a scraper class that pulls the techniques above together into one production-oriented example:

require 'httparty'
require 'nokogiri'
require 'logger'
require 'json'
require 'time'

class RobustScraper
  include HTTParty

  base_uri 'https://example.com'
  default_timeout 30

  def initialize
    @logger = Logger.new('scraper.log')
    @max_retries = 3
  end

  def scrape(path)
    url = "#{self.class.base_uri}#{path}"

    response = fetch_with_retry(url)
    return nil unless response

    process_response(response)

  rescue StandardError => e
    log_error(e, url, { path: path })
    nil
  end

  private

  def fetch_with_retry(url)
    retries = 0

    begin
      options = {
        headers: {
          'User-Agent' => 'Mozilla/5.0 (compatible; WebScraper/1.0)',
          'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
        },
        follow_redirects: true,
        limit: 5 # Max redirect follows
      }

      response = self.class.get(url, options)

      case response.code
      when 200
        return response
      when 404
        @logger.warn("Page not found: #{url}")
        return nil
      when 429
        handle_rate_limit(response)
        raise "Rate limited"
      when 500..599
        raise "Server error: #{response.code}"
      else
        raise "Unexpected status: #{response.code}"
      end

    # RuntimeError also catches the status-based errors raised above, so they get retried
    rescue Net::OpenTimeout, Net::ReadTimeout, SocketError, HTTParty::Error, RuntimeError => e
      retries += 1

      if retries <= @max_retries
        wait_time = [2 ** retries, 60].min
        @logger.info("Retrying #{url} (#{retries}/#{@max_retries}) after #{wait_time}s")
        sleep(wait_time)
        retry
      else
        raise "Max retries exceeded: #{e.message}"
      end
    end
  end

  def handle_rate_limit(response)
    retry_after = response.headers['retry-after'].to_i
    retry_after = 60 if retry_after <= 0 # Fall back when the header is missing or not a number of seconds
    @logger.info("Rate limited. Waiting #{retry_after} seconds...")
    sleep(retry_after)
  end

  def process_response(response)
    # Validate content type
    content_type = response.headers['content-type']
    unless content_type&.include?('text/html')
      @logger.warn("Unexpected content type: #{content_type}")
      return nil
    end

    # Process the HTML content
    doc = Nokogiri::HTML(response.body)

    # Extract data here
    {
      title: doc.css('title').text.strip,
      body_length: response.body.length,
      scraped_at: Time.now
    }
  end

  def log_error(error, url, context = {})
    @logger.error({
      timestamp: Time.now.iso8601,
      error: error.class.name,
      message: error.message,
      url: url,
      context: context
    }.to_json)
  end
end

# Usage
scraper = RobustScraper.new
result = scraper.scrape('/some-page')
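
The same scraper instance can be reused across many paths; adding a small delay between requests keeps the load on the target site reasonable (the paths below are placeholders):

paths = ['/page-1', '/page-2', '/page-3'] # placeholder paths

results = paths.map do |path|
  data = scraper.scrape(path)
  sleep(1) # simple politeness delay between requests
  data
end.compact

puts "Scraped #{results.size} of #{paths.size} pages"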

8. Additional Best Practices

Handle SSL Certificate Issues

# For development/testing only - don't use in production
HTTParty.get(url, verify: false)

# Better: Configure proper SSL verification
HTTParty.get(url, {
  ssl_ca_file: '/path/to/ca-bundle.crt', # point at a trusted CA bundle
  verify: true # keep certificate verification on (this is HTTParty's default)
})

Manage Cookies and Sessions

# Maintain cookies across requests
options = {
  headers: { 'Cookie' => 'session_id=abc123' },
  follow_redirects: true
}
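
HTTParty does not persist cookies between calls on its own, so to carry a session you can capture Set-Cookie from one response and send it back on later requests. A minimal sketch (the login path, credentials, and URLs are placeholders):

login_response = HTTParty.post('https://example.com/login',
                               body: { user: 'name', password: 'secret' })

# Collect the cookie name=value pairs from any Set-Cookie headers
cookies = Array(login_response.headers.get_fields('set-cookie'))
              .map { |c| c.split(';').first }
              .join('; ')

# Reuse them on subsequent requests
page = HTTParty.get('https://example.com/account', headers: { 'Cookie' => cookies })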

Set Proper Headers

headers = {
  'User-Agent' => 'Mozilla/5.0 (compatible; YourBot/1.0)',
  'Accept' => 'text/html,application/xhtml+xml',
  'Accept-Language' => 'en-US,en;q=0.9',
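  # Caution: setting Accept-Encoding yourself disables Net::HTTP's automatic gzip decompression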
  'Accept-Encoding' => 'gzip, deflate',
  'Connection' => 'keep-alive'
}
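
Pass the hash with the headers option on each request (or set it once at the class level with HTTParty's headers class method):

response = HTTParty.get('https://example.com', headers: headers)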

By implementing these error handling best practices, your HTTParty web scrapers will be more reliable, maintainable, and respectful of target websites' resources.
