What are Ruby's Built-in HTTP Libraries and When Should I Use Them for Scraping?
Ruby provides several built-in HTTP libraries that are powerful tools for web scraping without requiring external dependencies. Understanding when and how to use each library can significantly improve your scraping projects' efficiency and maintainability. This guide covers Ruby's primary HTTP libraries and their optimal use cases.
Ruby's Built-in HTTP Libraries Overview
Ruby includes several HTTP libraries in its standard library, each with distinct strengths:
- Net::HTTP - The foundational HTTP client library
- OpenURI - Simplified interface for opening URLs
- Net::HTTP.start - Block form of Net::HTTP that keeps one persistent (keep-alive) connection open across multiple requests
- URI - URL parsing and manipulation utilities
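URI does no fetching of its own, but the other libraries lean on it for parsing and building URLs. A quick sketch of the helpers used throughout this guide (the example URL is just a placeholder):
require 'uri'

uri = URI.parse('https://example.com/products?page=1')
uri.host    # => "example.com"
uri.path    # => "/products"
uri.query   # => "page=1"

# Build absolute URLs and query strings
URI.join('https://example.com/products/', 'item-1').to_s  # => "https://example.com/products/item-1"
URI.encode_www_form(q: 'ruby', page: 2)                   # => "q=ruby&page=2"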
Net::HTTP: The Foundation
Net::HTTP is Ruby's core HTTP library, offering fine-grained control over HTTP requests and responses. It's ideal for complex scraping scenarios requiring custom headers, authentication, and detailed error handling.
Basic Net::HTTP Usage
require 'net/http'
require 'uri'
def scrape_with_net_http(url)
uri = URI(url)
Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
request = Net::HTTP::Get.new(uri)
request['User-Agent'] = 'Mozilla/5.0 (compatible; Ruby scraper)'
response = http.request(request)
case response
when Net::HTTPSuccess
response.body
when Net::HTTPRedirection
# Resolve the Location header, which may be a relative URL
location = URI.join(url, response['Location']).to_s
puts "Redirected to: #{location}"
scrape_with_net_http(location)
else
raise "HTTP Error: #{response.code} #{response.message}"
end
end
end
# Usage
html_content = scrape_with_net_http('https://example.com')
Advanced Net::HTTP with Custom Headers
require 'net/http'
require 'json'
class WebScraper
def initialize
@headers = {
'User-Agent' => 'Mozilla/5.0 (compatible; Ruby scraper)',
'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language' => 'en-US,en;q=0.5'
# Accept-Encoding and Connection are deliberately omitted: Net::HTTP
# negotiates gzip/deflate and manages the connection on its own
}
end
def get_with_session(url, cookies = nil)
uri = URI(url)
Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
request = Net::HTTP::Get.new(uri)
@headers.each { |key, value| request[key] = value }
request['Cookie'] = cookies if cookies
response = http.request(request)
{
body: response.body,
cookies: response.get_fields('Set-Cookie'),
status: response.code.to_i
}
end
end
def post_form_data(url, form_data)
uri = URI(url)
Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
request = Net::HTTP::Post.new(uri)
request.set_form_data(form_data)
@headers.each { |key, value| request[key] = value }
response = http.request(request)
response.body
end
end
end
# Usage
scraper = WebScraper.new
result = scraper.get_with_session('https://httpbin.org/get')
puts result[:body]
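A quick usage sketch for post_form_data as well; httpbin.org/post simply echoes the submitted form fields, and the field values here are placeholders:
echoed = scraper.post_form_data('https://httpbin.org/post', { 'q' => 'ruby scraping' })
puts echoed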
OpenURI: Simplified HTTP Access
OpenURI provides a simple interface for opening URLs, making it perfect for straightforward scraping tasks. It automatically handles redirects and supports basic authentication.
Basic OpenURI Usage
require 'open-uri'
def simple_scrape(url)
begin
URI.open(url) do |response|
response.read
end
rescue OpenURI::HTTPError => e
puts "HTTP Error: #{e.message}"
nil
rescue => e
puts "Error: #{e.message}"
nil
end
end
# Usage
content = simple_scrape('https://example.com')
puts content if content
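As noted above, OpenURI also supports HTTP Basic authentication via the :http_basic_authentication option. A minimal sketch, where the URL and credentials are placeholders:
require 'open-uri'

def scrape_with_basic_auth(url, user, password)
  URI.open(url, http_basic_authentication: [user, password]) do |response|
    response.read
  end
rescue OpenURI::HTTPError => e
  puts "HTTP Error: #{e.message}"
  nil
end

# Usage
content = scrape_with_basic_auth('https://example.com/protected', 'user', 'secret')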
OpenURI with Custom Options
require 'open-uri'
def scrape_with_options(url)
options = {
'User-Agent' => 'Mozilla/5.0 (compatible; Ruby scraper)',
'Referer' => 'https://google.com',
read_timeout: 10,
open_timeout: 5
}
begin
URI.open(url, options) do |response|
{
content: response.read,
content_type: response.content_type,
charset: response.charset,
last_modified: response.last_modified
}
end
rescue => e
puts "Error scraping #{url}: #{e.message}"
nil
end
end
# Usage with metadata
result = scrape_with_options('https://example.com')
if result
puts "Content-Type: #{result[:content_type]}"
puts "Charset: #{result[:charset]}"
puts "Content length: #{result[:content].length}"
end
When to Use Each Library
Use Net::HTTP When:
- Complex Authentication Required
# OAuth or custom authentication
request['Authorization'] = "Bearer #{access_token}"
- Session Management Needed
# Maintaining cookies across requests
Net::HTTP.start(host, port) do |http|
# Multiple requests with persistent connection
end
- Custom Request Methods Required
# PATCH, PUT, DELETE requests
request = Net::HTTP::Patch.new(uri)
request.body = JSON.generate(data)
- Detailed Error Handling (a fuller backoff sketch follows this list)
case response
when Net::HTTPUnauthorized
refresh_token_and_retry
when Net::HTTPTooManyRequests
implement_backoff_strategy
end
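Putting the last point into practice, here is a minimal backoff sketch for rate-limited responses; honoring Retry-After and capping the attempt count are illustrative choices, not requirements of any particular site:
require 'net/http'

def fetch_with_backoff(url, max_attempts = 5)
  uri = URI(url)
  max_attempts.times do |attempt|
    response = Net::HTTP.get_response(uri)
    return response.body if response.is_a?(Net::HTTPSuccess)

    if response.is_a?(Net::HTTPTooManyRequests)
      # Honor Retry-After when the server sends it; otherwise back off exponentially
      wait = (response['Retry-After'] || 2 ** attempt).to_i
      sleep(wait)
    else
      raise "HTTP Error: #{response.code} #{response.message}"
    end
  end
  raise "Gave up after #{max_attempts} attempts: #{url}"
end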
Use OpenURI When:
- Simple GET Requests
- Prototype Development
- One-off Data Fetching
- File Downloads
# Simple file download
URI.open('https://example.com/image.jpg', 'rb') do |file|
File.open('downloaded_image.jpg', 'wb') do |output|
output.write(file.read)
end
end
Complete Scraping Example
Here's a practical example combining both libraries for a real scraping scenario:
require 'net/http'
require 'open-uri'
require 'nokogiri'
require 'json'
class ProductScraper
def initialize
@session_cookies = nil
end
def scrape_product_list(base_url)
# Use OpenURI for simple page fetching
html = URI.open(base_url,
'User-Agent' => 'Mozilla/5.0 (compatible; Ruby scraper)'
).read
doc = Nokogiri::HTML(html)
product_links = doc.css('a.product-link').map do |link|
URI.join(base_url, link['href']).to_s
end
# Use Net::HTTP for detailed product scraping
products = product_links.map do |url|
scrape_product_details(url)
end
products.compact
end
private
def scrape_product_details(url)
uri = URI(url)
Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
request = Net::HTTP::Get.new(uri)
request['User-Agent'] = 'Mozilla/5.0 (compatible; Ruby scraper)'
request['Cookie'] = @session_cookies if @session_cookies
response = http.request(request)
if response.is_a?(Net::HTTPSuccess)
# Store cookies for session continuity, keeping only the name=value pairs
# and dropping attributes such as Path and Expires
@session_cookies = response.get_fields('Set-Cookie')&.map { |c| c.split(';').first }&.join('; ')
parse_product_data(response.body)
else
puts "Failed to fetch #{url}: #{response.code}"
nil
end
end
rescue => e
puts "Error scraping #{url}: #{e.message}"
nil
end
def parse_product_data(html)
doc = Nokogiri::HTML(html)
{
title: doc.css('h1.product-title').text.strip,
price: doc.css('.price').text.strip,
description: doc.css('.product-description').text.strip,
images: doc.css('img.product-image').map { |img| img['src'] }
}
end
end
# Usage
scraper = ProductScraper.new
products = scraper.scrape_product_list('https://example-store.com/products')
puts JSON.pretty_generate(products)
Best Practices and Performance Tips
1. Connection Reuse
# Efficient: reuse one connection for several requests to the same host
Net::HTTP.start(host, port) do |http|
paths.each do |path|
response = http.get(path)
process_response(response)
end
end
# Inefficient: New connection per request
urls.each do |url|
Net::HTTP.get_response(URI(url))
end
2. Timeout Configuration
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true if uri.scheme == 'https' # must be set before the request is made
http.open_timeout = 5  # seconds to wait for the connection to open
http.read_timeout = 10 # seconds to wait for each read
response = http.get(uri.request_uri)
3. Error Handling and Retries
def fetch_with_retry(url, max_retries = 3)
retries = 0
begin
URI.open(url).read
rescue Net::OpenTimeout, Net::ReadTimeout, SocketError => e
retries += 1
if retries <= max_retries
sleep(2 ** retries) # Exponential backoff
retry
else
raise e
end
end
end
Comparison with External Libraries
While Ruby's built-in libraries are powerful, consider external alternatives for advanced scenarios:
- HTTParty: Simpler API than Net::HTTP
- Faraday: Middleware-based HTTP client
- RestClient: Simple REST API client
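For a sense of the difference, the same GET with HTTParty (an external gem, installed with gem install httparty) is roughly a one-liner; shown here only for comparison:
require 'httparty'

response = HTTParty.get('https://example.com',
                        headers: { 'User-Agent' => 'Mozilla/5.0 (compatible; Ruby scraper)' })
puts response.code
puts response.body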
However, built-in libraries offer:
- Zero dependencies: No gem management required
- Stability: Part of Ruby's standard library, maintained and tested with the language itself
- Performance: No wrapper overhead; gems such as HTTParty and Faraday build on top of Net::HTTP anyway
For scenarios requiring advanced browser automation capabilities, consider using headless browser solutions, though Ruby's HTTP libraries excel at API-based scraping and simple HTML retrieval.
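As a small illustration of the API-style fetching these libraries handle well (httpbin.org/json is just a stand-in endpoint):
require 'net/http'
require 'json'

uri = URI('https://httpbin.org/json')
response = Net::HTTP.get_response(uri)
data = JSON.parse(response.body) if response.is_a?(Net::HTTPSuccess)
puts data.inspect if data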
Conclusion
Ruby's built-in HTTP libraries provide a robust foundation for web scraping projects. Use Net::HTTP for complex scenarios requiring fine-grained control over requests, sessions, and error handling. Choose OpenURI for simple, straightforward scraping tasks. Both libraries offer solid performance and reliability without external dependencies, making them sensible choices for production scraping applications.
When building scalable scraping solutions, combining these libraries with proper error handling, rate limiting, and respectful scraping practices ensures both effectiveness and maintainability in your Ruby applications.