# What is the Best Way to Handle Errors and Exceptions in Ruby Web Scraping?
Error handling is a critical aspect of robust Ruby web scraping applications. Without proper exception management, your scrapers can fail silently, crash unexpectedly, or provide unreliable results. This guide covers comprehensive error handling strategies for Ruby web scraping using popular libraries like HTTParty, Net::HTTP, and Nokogiri.

## Common Types of Errors in Ruby Web Scraping

### Network-Related Errors
Network issues are among the most common problems in web scraping. Note that Ruby's standard library raises `Net::OpenTimeout` and `Net::ReadTimeout` (both subclasses of `Timeout::Error`); there is no `Net::TimeoutError` class:

```ruby
require 'httparty'

begin
  response = HTTParty.get('https://example.com')
rescue Net::OpenTimeout => e
  puts "Connection timeout: #{e.message}"
rescue Net::ReadTimeout => e
  puts "Read timeout: #{e.message}"
rescue Timeout::Error => e
  puts "Request timed out: #{e.message}"
rescue SocketError => e
  puts "Network error (e.g. DNS failure): #{e.message}"
rescue Errno::ECONNREFUSED => e
  puts "Connection refused: #{e.message}"
end
```

### HTTP Status Code Errors

Handle the different HTTP response codes appropriately:
```ruby
require 'httparty'

class WebScraper
  def self.fetch_page(url)
    response = HTTParty.get(url)

    case response.code
    when 200
      response.body
    when 404
      raise StandardError, "Page not found: #{url}"
    when 403
      raise StandardError, "Access forbidden: #{url}"
    when 429
      raise StandardError, "Rate limited. Please retry later."
    when 500..599
      raise StandardError, "Server error (#{response.code}): #{url}"
    else
      raise StandardError, "Unexpected response code: #{response.code}"
    end
  rescue HTTParty::Error => e
    raise StandardError, "HTTParty error: #{e.message}"
  end
end
```

### Parsing and Data Extraction Errors

Handle errors when parsing HTML or extracting data:
```ruby
require 'nokogiri'

def safe_parse_html(html_content)
  doc = Nokogiri::HTML(html_content)

  # Safe element selection with error handling
  title = doc.css('title').first&.text || 'No title found'

  # Handle missing elements gracefully
  price = doc.css('.price').first&.text&.strip
  if price.nil? || price.empty?
    puts "Warning: Price not found on page"
    price = "N/A"
  end

  { title: title, price: price }
rescue Nokogiri::XML::SyntaxError => e
  puts "HTML parsing error: #{e.message}"
  nil
rescue StandardError => e
  puts "Unexpected parsing error: #{e.message}"
  nil
end
```

## Implementing Retry Logic

### Basic Retry Mechanism

Implement exponential backoff for transient failures:
```ruby
require 'httparty'

def fetch_with_retry(url, max_retries: 3, initial_delay: 1)
  retries = 0

  begin
    response = HTTParty.get(url, timeout: 30)

    if response.success?
      return response.body
    else
      raise StandardError, "HTTP #{response.code}"
    end
  rescue Net::OpenTimeout, Net::ReadTimeout, Timeout::Error,
         SocketError, Errno::ECONNREFUSED => e
    retries += 1
    if retries <= max_retries
      delay = initial_delay * (2 ** (retries - 1))
      puts "Retry #{retries}/#{max_retries} after #{delay}s delay. Error: #{e.message}"
      sleep(delay)
      retry
    else
      puts "Max retries exceeded. Final error: #{e.message}"
      raise
    end
  end
end
```

### Advanced Retry with Different Strategies
```ruby
require 'httparty'

class RetryHandler
  # Net::OpenTimeout and Net::ReadTimeout are subclasses of Timeout::Error;
  # there is no Net::TimeoutError class in the standard library.
  RETRYABLE_ERRORS = [
    Net::OpenTimeout,
    Net::ReadTimeout,
    Timeout::Error,
    SocketError,
    Errno::ECONNREFUSED,
    HTTParty::Error
  ].freeze

  def self.with_retry(max_retries: 3, backoff: :exponential)
    retries = 0

    begin
      yield
    rescue *RETRYABLE_ERRORS => e
      retries += 1
      if retries <= max_retries
        delay = calculate_delay(retries, backoff)
        puts "Attempt #{retries}/#{max_retries} failed: #{e.message}"
        puts "Retrying in #{delay} seconds..."
        sleep(delay)
        retry
      else
        puts "All retry attempts exhausted"
        raise
      end
    end
  end

  def self.calculate_delay(attempt, strategy)
    case strategy
    when :exponential
      2 ** attempt
    when :linear
      attempt * 2
    when :constant
      3
    else
      1
    end
  end
  # `private` does not apply to class methods, so mark it explicitly
  private_class_method :calculate_delay
end

# Usage (fetch_and_parse_page is a placeholder for your own fetch + parse method)
begin
  data = RetryHandler.with_retry(max_retries: 5, backoff: :exponential) do
    fetch_and_parse_page('https://example.com')
  end
rescue StandardError => e
  puts "Failed to fetch data after all retries: #{e.message}"
end
```

## Comprehensive Error Handling Class

Here's a complete example of a robust web scraper with comprehensive error handling:
```ruby
require 'httparty'
require 'nokogiri'
require 'logger'
require 'uri'

# Custom exception classes
class ValidationError < StandardError; end
class NetworkError < StandardError; end
class ParseError < StandardError; end

class RobustWebScraper
  include HTTParty

  # HTTParty class-level configuration (the class DSL method is
  # default_timeout; timeout: is a per-request option)
  default_timeout 30
  follow_redirects true
  headers 'User-Agent' => 'Mozilla/5.0 (compatible; RubyBot/1.0)'

  def initialize
    @logger = Logger.new($stdout)
    @logger.level = Logger::INFO
  end

  def scrape_page(url)
    @logger.info("Starting to scrape: #{url}")

    validate_url(url)
    html_content = fetch_page_with_retry(url)
    parsed_data = parse_content(html_content)

    @logger.info("Successfully scraped: #{url}")
    parsed_data
  rescue ValidationError => e
    @logger.error("Validation error: #{e.message}")
    nil
  rescue NetworkError => e
    @logger.error("Network error: #{e.message}")
    nil
  rescue ParseError => e
    @logger.error("Parse error: #{e.message}")
    nil
  rescue StandardError => e
    @logger.error("Unexpected error: #{e.message}")
    @logger.error(e.backtrace.join("\n"))
    nil
  end

  private

  def validate_url(url)
    raise ValidationError, "URL cannot be nil or empty" if url.nil? || url.strip.empty?

    uri = URI.parse(url)
    # URI::HTTPS is a subclass of URI::HTTP, so this covers both schemes
    raise ValidationError, "Invalid URL format" unless uri.is_a?(URI::HTTP)
  rescue URI::InvalidURIError
    raise ValidationError, "Malformed URL: #{url}"
  end

  def fetch_page_with_retry(url, max_retries: 3)
    retries = 0

    begin
      response = self.class.get(url)

      case response.code
      when 200
        response.body
      when 404
        raise NetworkError, "Page not found: #{url}"
      when 403
        raise NetworkError, "Access forbidden: #{url}"
      when 429
        raise NetworkError, "Rate limited. Consider using delays between requests."
      when 500..599
        raise NetworkError, "Server error (#{response.code}): #{url}"
      else
        raise NetworkError, "Unexpected response code: #{response.code}"
      end
    rescue Net::OpenTimeout, Net::ReadTimeout, Timeout::Error => e
      retries += 1
      if retries <= max_retries
        delay = 2 ** retries
        @logger.warn("Timeout error, retrying in #{delay}s... (#{retries}/#{max_retries})")
        sleep(delay)
        retry
      else
        raise NetworkError, "Timeout after #{max_retries} retries: #{e.message}"
      end
    rescue SocketError, Errno::ECONNREFUSED => e
      raise NetworkError, "Connection error: #{e.message}"
    rescue HTTParty::Error => e
      raise NetworkError, "HTTParty error: #{e.message}"
    end
  end

  def parse_content(html_content)
    doc = Nokogiri::HTML(html_content)

    # Extract data with safe navigation
    extracted_data = {
      title: safe_extract_text(doc, 'title'),
      description: safe_extract_text(doc, 'meta[name="description"]', 'content'),
      headings: safe_extract_multiple(doc, 'h1, h2, h3'),
      links: safe_extract_links(doc)
    }

    validate_extracted_data(extracted_data)
    extracted_data
  rescue Nokogiri::XML::SyntaxError => e
    raise ParseError, "HTML parsing failed: #{e.message}"
  rescue StandardError => e
    raise ParseError, "Data extraction failed: #{e.message}"
  end

  def safe_extract_text(doc, selector, attribute = nil)
    element = doc.css(selector).first
    return nil unless element

    attribute ? element[attribute] : element.text.strip
  rescue StandardError => e
    @logger.warn("Failed to extract text from '#{selector}': #{e.message}")
    nil
  end

  def safe_extract_multiple(doc, selector)
    doc.css(selector).map { |el| el.text.strip }.reject(&:empty?)
  rescue StandardError => e
    @logger.warn("Failed to extract multiple elements '#{selector}': #{e.message}")
    []
  end

  def safe_extract_links(doc)
    doc.css('a[href]').map { |link| link['href'] }.compact.uniq
  rescue StandardError => e
    @logger.warn("Failed to extract links: #{e.message}")
    []
  end

  def validate_extracted_data(data)
    @logger.warn("No title found on page") if data[:title].nil? || data[:title].empty?
    @logger.warn("No links found on page") if data[:links].empty?
  end
end

# Usage example
scraper = RobustWebScraper.new
result = scraper.scrape_page('https://example.com')

if result
  puts "Scraping successful!"
  puts "Title: #{result[:title]}"
  puts "Found #{result[:links].length} links"
else
  puts "Scraping failed. Check logs for details."
end
```

## Rate Limiting and Respectful Scraping

Implement delays between requests to avoid getting blocked, and check robots.txt before crawling (see the sketch after this example):
```ruby
require 'httparty'

class RespectfulScraper
  def initialize(delay: 1)
    @delay = delay
    @last_request_time = nil
  end

  def fetch_with_delay(url)
    enforce_delay

    response = HTTParty.get(url)
    @last_request_time = Time.now
    response
  rescue StandardError => e
    puts "Error fetching #{url}: #{e.message}"
    raise
  end

  private

  def enforce_delay
    return unless @last_request_time

    elapsed = Time.now - @last_request_time
    if elapsed < @delay
      sleep_time = @delay - elapsed
      puts "Sleeping for #{sleep_time.round(2)} seconds..."
      sleep(sleep_time)
    end
  end
end
```
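
Ruby's standard library has no robots.txt parser, so here is a minimal, naive sketch of the robots.txt check mentioned above: it fetches `/robots.txt` with Net::HTTP and does simple prefix matching against `Disallow` rules for the wildcard user-agent. The `RobotsChecker` class and its `allowed?` method are illustrative names, not part of any library; for production you would likely use a dedicated gem and also honor `Crawl-delay`.

```ruby
require 'net/http'
require 'uri'

# Naive robots.txt check (illustrative sketch, not a full parser):
# fetches /robots.txt once per host and applies simple prefix matching
# for "User-agent: *" Disallow rules.
class RobotsChecker
  def initialize
    @rules = {} # host => array of disallowed path prefixes
  end

  def allowed?(url)
    uri = URI.parse(url)
    disallowed = @rules[uri.host] ||= fetch_disallow_rules(uri)
    path = uri.path.empty? ? '/' : uri.path
    disallowed.none? { |prefix| path.start_with?(prefix) }
  end

  private

  def fetch_disallow_rules(uri)
    robots_uri = URI("#{uri.scheme}://#{uri.host}/robots.txt")
    parse_disallow_rules(Net::HTTP.get(robots_uri))
  rescue StandardError
    [] # if robots.txt can't be fetched, assume everything is allowed
  end

  def parse_disallow_rules(body)
    rules = []
    applies = false
    body.each_line do |line|
      line = line.split('#').first.to_s.strip
      if line =~ /\AUser-agent:\s*(.+)\z/i
        applies = (Regexp.last_match(1).strip == '*')
      elsif applies && line =~ /\ADisallow:\s*(\S+)\z/i
        rules << Regexp.last_match(1)
      end
    end
    rules
  end
end

# Usage
checker = RobotsChecker.new
puts checker.allowed?('https://example.com/some/page')
```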

## Monitoring and Alerting

Set up proper logging and monitoring for production scrapers:
```ruby
require 'logger'

class ProductionScraper
  def initialize
    @logger = setup_logger
    @error_count = 0
    @success_count = 0
  end

  # Assumes a scrape_page(url) method is defined elsewhere
  # (e.g. delegating to the RobustWebScraper above).
  def scrape_with_monitoring(urls)
    urls.each do |url|
      begin
        scrape_page(url)
        @success_count += 1
        @logger.info("Success: #{url}")
      rescue StandardError => e
        @error_count += 1
        @logger.error("Failed: #{url} - #{e.message}")

        # Alert if error rate is too high
        check_error_rate
      end
    end

    log_summary
  end

  private

  def setup_logger
    logger = Logger.new('scraper.log')
    logger.level = Logger::INFO
    logger.formatter = proc do |severity, datetime, _progname, msg|
      "#{datetime.strftime('%Y-%m-%d %H:%M:%S')} [#{severity}] #{msg}\n"
    end
    logger
  end

  def check_error_rate
    total_requests = @success_count + @error_count
    error_rate = @error_count.to_f / total_requests

    if error_rate > 0.5 && total_requests > 10
      @logger.error("HIGH ERROR RATE ALERT: #{(error_rate * 100).round(1)}%")
      # In production, you might send an email or Slack notification here
      # (see the webhook sketch below)
    end
  end

  def log_summary
    total = @success_count + @error_count
    @logger.info("Scraping completed. Success: #{@success_count}, Errors: #{@error_count}, Total: #{total}")
  end
end
```
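
The comment in `check_error_rate` mentions sending a notification. As one possible approach (a minimal sketch, assuming a Slack-style incoming webhook whose URL is exposed via a hypothetical `ALERT_WEBHOOK_URL` environment variable; the `send_alert` helper name is illustrative), you could post the alert as JSON with Net::HTTP:

```ruby
require 'net/http'
require 'uri'
require 'json'

# Hypothetical helper: POST an alert message to a Slack-style incoming
# webhook. The ALERT_WEBHOOK_URL environment variable is an assumption.
def send_alert(message)
  webhook_url = ENV['ALERT_WEBHOOK_URL']
  return if webhook_url.nil? || webhook_url.empty?

  uri = URI.parse(webhook_url)
  Net::HTTP.post(uri, { text: message }.to_json, 'Content-Type' => 'application/json')
rescue StandardError => e
  # Never let alerting failures crash the scraper itself
  warn "Failed to send alert: #{e.message}"
end

# Example: call it from check_error_rate
# send_alert("HIGH ERROR RATE ALERT: #{(error_rate * 100).round(1)}%")
```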

## Best Practices Summary

- Always use specific exception handling rather than catching all StandardError
- Implement retry logic with exponential backoff for transient failures
- Add proper logging to track successes, failures, and performance
- Validate inputs and outputs to catch issues early
- Respect rate limits and implement delays between requests
- Monitor error rates and set up alerts for production systems
- Use safe navigation (`&.`) when extracting data from parsed HTML
- Handle different HTTP status codes appropriately
- Implement circuit breaker patterns for unreliable services (see the sketch after this list)
- Test error scenarios in your development environment
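
The circuit breaker bullet deserves a concrete illustration. Here is a minimal sketch (the `CircuitBreaker` class, its thresholds, and `CircuitOpenError` are illustrative, not from any particular gem): after a configurable number of consecutive failures the breaker "opens" and short-circuits further calls until a cool-down period has passed.

```ruby
require 'httparty'

# Minimal circuit breaker sketch (illustrative; class names and thresholds
# are assumptions, not from any particular gem).
class CircuitOpenError < StandardError; end

class CircuitBreaker
  def initialize(failure_threshold: 5, reset_timeout: 60)
    @failure_threshold = failure_threshold
    @reset_timeout = reset_timeout
    @failure_count = 0
    @opened_at = nil
  end

  def call
    raise CircuitOpenError, "Circuit open, retry after #{@reset_timeout}s cool-down" if open?

    result = yield
    @failure_count = 0 # a success closes the circuit again
    result
  rescue CircuitOpenError
    raise
  rescue StandardError => e
    @failure_count += 1
    @opened_at = Time.now if @failure_count >= @failure_threshold
    raise e
  end

  private

  # The circuit stays open while the cool-down window is still running
  def open?
    return false unless @opened_at

    if Time.now - @opened_at < @reset_timeout
      true
    else
      # Cool-down elapsed: reset and allow requests again
      # (a fuller implementation would use a half-open trial state)
      @opened_at = nil
      @failure_count = 0
      false
    end
  end
end

# Usage
breaker = CircuitBreaker.new(failure_threshold: 3, reset_timeout: 30)

begin
  breaker.call { HTTParty.get('https://example.com') }
rescue CircuitOpenError => e
  puts e.message
rescue StandardError => e
  puts "Request failed: #{e.message}"
end
```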
Understanding how to handle errors in Puppeteer can provide additional insights for browser-based scraping scenarios, while handling timeouts in Puppeteer offers complementary timeout management strategies.
By implementing these comprehensive error handling strategies, your Ruby web scraping applications will be more robust, reliable, and maintainable in production environments.