Best Practices for Error Handling in HTTParty Web Scraping
When web scraping with HTTParty in Ruby, robust error handling is crucial for building reliable scrapers that can gracefully handle network issues, rate limits, server errors, and unexpected responses. Here are the essential best practices:
1. Handle Network and HTTP Exceptions
HTTParty requests can fail with exceptions raised by HTTParty itself as well as by Ruby's underlying networking and SSL libraries; rescue the ones you expect:
require 'httparty'

begin
  response = HTTParty.get('https://example.com')
rescue HTTParty::Error => e
  puts "HTTParty error: #{e.message}"
rescue Net::OpenTimeout, Net::ReadTimeout => e
  puts "Request timed out: #{e.message}"
rescue SocketError => e
  puts "Network error: #{e.message}"
rescue OpenSSL::SSL::SSLError => e
  puts "SSL error: #{e.message}"
rescue StandardError => e
  puts "Unexpected error: #{e.message}"
end
2. Validate HTTP Status Codes
Always check response codes before processing content:
response = HTTParty.get('https://example.com')

case response.code
when 200
  # Success - process the response (process_content is your own parsing method)
  process_content(response.body)
when 301, 302, 303, 307, 308
  # Redirects - only reached with follow_redirects: false, since HTTParty follows them by default
  puts "Redirect received"
when 404
  puts "Page not found"
when 429
  handle_rate_limit(response) # defined in section 4 below
when 500..599
  puts "Server error: #{response.code}"
else
  puts "Unexpected status: #{response.code}"
end
3. Configure Appropriate Timeouts
Set reasonable timeout values to prevent hanging requests:
class WebScraper
  include HTTParty

  # Set a global timeout (in seconds) for all requests made through this class
  default_timeout 30

  # Or set per-request timeouts
  def fetch_page(url)
    HTTParty.get(url, {
      timeout: 15,      # applies to both connect and read unless overridden below
      open_timeout: 10, # time allowed to establish the connection
      read_timeout: 20  # time allowed to wait for response data
    })
  end
end
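When these limits are exceeded, the underlying Net::HTTP library raises Net::OpenTimeout or Net::ReadTimeout, which HTTParty lets propagate. A minimal sketch pairing the fetch_page helper above with a matching rescue (the URL is illustrative):

scraper = WebScraper.new

begin
  page = scraper.fetch_page('https://example.com/slow-page')
rescue Net::OpenTimeout
  puts 'Could not open a connection in time'
rescue Net::ReadTimeout
  puts 'Connected, but the server was too slow to respond'
end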
4. Implement Rate Limit Handling
Respect 429 Too Many Requests responses and the Retry-After header:
def handle_rate_limit(response)
  retry_after = response.headers['Retry-After']

  if retry_after
    wait_time = retry_after.to_i
    puts "Rate limited. Waiting #{wait_time} seconds..."
    sleep(wait_time)
  else
    # Default backoff if no Retry-After header
    sleep(60)
  end
end
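Note that Retry-After may also be sent as an HTTP date rather than a number of seconds. A hedged variant of the wait calculation that accepts both forms (wait_time_from is a hypothetical helper, not part of HTTParty):

require 'time' # for Time.httpdate

# Hypothetical helper: convert a Retry-After header value into seconds to wait
def wait_time_from(retry_after)
  return 60 if retry_after.nil? # no header: fall back to a default

  if retry_after.match?(/\A\d+\z/)
    retry_after.to_i # delta-seconds form, e.g. "120"
  else
    [(Time.httpdate(retry_after) - Time.now).ceil, 0].max # HTTP-date form
  end
rescue ArgumentError
  60 # unparseable header: fall back to the default
end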
5. Use Exponential Backoff for Retries
Implement progressive retry delays for failed requests:
def fetch_with_retry(url, max_retries = 3)
  retries = 0

  begin
    response = HTTParty.get(url, timeout: 30)

    if response.success?
      return response
    elsif response.code == 429
      handle_rate_limit(response)
      raise "Rate limited" # Trigger retry
    else
      raise "HTTP #{response.code}"
    end
  rescue StandardError => e
    retries += 1
    if retries <= max_retries
      wait_time = [2 ** retries, 60].min # Cap at 60 seconds
      puts "Retry #{retries}/#{max_retries} after #{wait_time}s: #{e.message}"
      sleep(wait_time)
      retry
    else
      raise "Max retries exceeded: #{e.message}"
    end
  end
end
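Because the method re-raises once retries are exhausted, callers still need a final rescue. A brief usage sketch (the URL is illustrative):

begin
  response = fetch_with_retry('https://example.com/products')
  puts "Fetched #{response.body.length} bytes"
rescue RuntimeError => e
  puts "Giving up: #{e.message}"
end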
6. Comprehensive Error Logging
Log detailed error information for debugging:
require 'logger'
require 'json' # for Hash#to_json
require 'time' # for Time#iso8601

class ScrapingLogger
  def self.logger
    @logger ||= Logger.new('scraper.log')
  end

  def self.log_error(error, url, context = {})
    logger.error({
      timestamp: Time.now.iso8601,
      error_class: error.class.name,
      error_message: error.message,
      url: url,
      backtrace: error.backtrace&.first(5),
      context: context
    }.to_json)
  end
end
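In practice this gets wired into the rescue blocks shown earlier, so every failure is recorded with the URL that caused it (the URL and context keys here are illustrative):

url = 'https://example.com/products'

begin
  response = HTTParty.get(url, timeout: 15)
rescue StandardError => e
  ScrapingLogger.log_error(e, url, { stage: 'listing_page', attempt: 1 })
end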
7. Complete Error-Resistant Scraper Example
Here's a production-ready scraper with comprehensive error handling:
require 'httparty'
require 'logger'
require 'json'     # for Hash#to_json in log_error
require 'time'     # for Time#iso8601 in log_error
require 'nokogiri' # for HTML parsing in process_response

class RobustScraper
  include HTTParty

  base_uri 'https://example.com'
  default_timeout 30

  def initialize
    @logger = Logger.new('scraper.log')
    @max_retries = 3
  end
  def scrape(path)
    url = "#{self.class.base_uri}#{path}"
    response = fetch_with_retry(url)
    return nil unless response

    process_response(response)
  rescue StandardError => e
    log_error(e, url, { path: path })
    nil
  end

  private
  def fetch_with_retry(url)
    retries = 0

    begin
      options = {
        headers: {
          'User-Agent' => 'Mozilla/5.0 (compatible; WebScraper/1.0)',
          'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
        },
        follow_redirects: true,
        limit: 5 # Max redirect follows
      }

      response = self.class.get(url, options)

      case response.code
      when 200
        return response
      when 404
        @logger.warn("Page not found: #{url}")
        return nil
      when 429
        handle_rate_limit(response)
        raise "Rate limited"
      when 500..599
        raise "Server error: #{response.code}"
      else
        raise "Unexpected status: #{response.code}"
      end
    rescue Net::OpenTimeout, Net::ReadTimeout, SocketError,
           HTTParty::Error, RuntimeError => e
      # RuntimeError covers the status-code raises above so they are retried too
      retries += 1
      if retries <= @max_retries
        wait_time = [2 ** retries, 60].min
        @logger.info("Retrying #{url} (#{retries}/#{@max_retries}) after #{wait_time}s")
        sleep(wait_time)
        retry
      else
        raise "Max retries exceeded: #{e.message}"
      end
    end
  end
  def handle_rate_limit(response)
    retry_after = response.headers['retry-after']&.to_i || 60
    @logger.info("Rate limited. Waiting #{retry_after} seconds...")
    sleep(retry_after)
  end

  def process_response(response)
    # Validate content type
    content_type = response.headers['content-type']
    unless content_type&.include?('text/html')
      @logger.warn("Unexpected content type: #{content_type}")
      return nil
    end

    # Process the HTML content
    doc = Nokogiri::HTML(response.body)

    # Extract data here
    {
      title: doc.css('title').text.strip,
      body_length: response.body.length,
      scraped_at: Time.now
    }
  end

  def log_error(error, url, context = {})
    @logger.error({
      timestamp: Time.now.iso8601,
      error: error.class.name,
      message: error.message,
      url: url,
      context: context
    }.to_json)
  end
end
# Usage
scraper = RobustScraper.new
result = scraper.scrape('/some-page')
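Since scrape returns nil for any handled failure, check the result before using it:

if result
  puts "Title: #{result[:title]}"
else
  puts 'Scrape failed - see scraper.log for details'
end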
8. Additional Best Practices
Handle SSL Certificate Issues
# For development/testing only - don't use in production
HTTParty.get(url, verify: false)

# Better: keep verification on and point HTTParty at a trusted CA bundle
HTTParty.get(url, {
  verify: true,
  ssl_ca_file: '/path/to/ca-bundle.crt'
})
Manage Cookies and Sessions
# Maintain cookies across requests
options = {
  headers: { 'Cookie' => 'session_id=abc123' },
  follow_redirects: true
}
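If the session cookie comes from an earlier request rather than being hard-coded, one rough approach is to read the Set-Cookie response header and send it back on subsequent requests. A sketch assuming a hypothetical /login endpoint that sets a single session cookie:

# Hypothetical login request that responds with a Set-Cookie header
login = HTTParty.post('https://example.com/login',
                      body: { user: 'name', password: 'secret' })

# Keep only the name=value pair from the Set-Cookie value
session_cookie = login.headers['set-cookie']&.split(';')&.first

HTTParty.get('https://example.com/account',
             headers: { 'Cookie' => session_cookie.to_s })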
Set Proper Headers
headers = {
  'User-Agent' => 'Mozilla/5.0 (compatible; YourBot/1.0)',
  'Accept' => 'text/html,application/xhtml+xml',
  'Accept-Language' => 'en-US,en;q=0.9',
  'Accept-Encoding' => 'gzip, deflate',
  'Connection' => 'keep-alive'
}
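Pass the hash to each request through the headers option:

response = HTTParty.get('https://example.com', headers: headers)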
By implementing these error handling best practices, your HTTParty web scrapers will be more reliable, maintainable, and respectful of target websites' resources.