What is the Proper Way to Handle HTTP Requests with Nokogiri?
Nokogiri is a powerful HTML and XML parser for Ruby, but it doesn't handle HTTP requests directly. To properly use Nokogiri for web scraping, you need to combine it with an HTTP client library. This guide covers the best practices for making HTTP requests and parsing responses with Nokogiri.
Understanding Nokogiri's Role
Nokogiri is primarily a parsing library that processes HTML or XML content. It doesn't fetch web pages itself, so you need to:
- Make HTTP requests using an HTTP client library
- Pass the response body to Nokogiri for parsing
- Extract data using Nokogiri's CSS selectors or XPath (a minimal sketch of these steps follows)
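As a quick illustration of that division of labor, here is a minimal sketch of the three steps, using Ruby's built-in Net::HTTP and a placeholder URL:

```ruby
require 'nokogiri'
require 'net/http'
require 'uri'

# 1. Fetch the page with an HTTP client (Net::HTTP here)
body = Net::HTTP.get(URI('https://example.com'))

# 2. Hand the raw HTML to Nokogiri for parsing
doc = Nokogiri::HTML(body)

# 3. Extract data with CSS selectors or XPath
puts doc.css('h1').map(&:text)            # CSS selector
puts doc.xpath('//a/@href').map(&:value)  # the same idea expressed in XPath
```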
Recommended HTTP Libraries for Nokogiri
1. Net::HTTP (Built-in)
Ruby's built-in Net::HTTP library is suitable for simple requests:
```ruby
require 'nokogiri'
require 'net/http'
require 'uri'

def fetch_with_net_http(url)
  uri = URI(url)
  response = Net::HTTP.get_response(uri)

  if response.code == '200'
    Nokogiri::HTML(response.body)
  else
    raise "HTTP Error: #{response.code}"
  end
end

# Usage
doc = fetch_with_net_http('https://example.com')
title = doc.css('title').text
puts title
```
2. HTTParty (Recommended)
HTTParty provides a more user-friendly interface and better error handling:
```ruby
require 'nokogiri'
require 'httparty'

class WebScraper
  include HTTParty

  def self.scrape_page(url)
    response = get(url, {
      headers: {
        'User-Agent' => 'Mozilla/5.0 (compatible; WebScraper/1.0)'
      },
      timeout: 30
    })

    if response.success?
      Nokogiri::HTML(response.body)
    else
      raise "Failed to fetch #{url}: #{response.code}"
    end
  end
end

# Usage
doc = WebScraper.scrape_page('https://example.com')
links = doc.css('a').map { |link| link['href'] }
```
3. Faraday (Most Flexible)
Faraday offers the most flexibility with middleware support:
```ruby
require 'nokogiri'
require 'faraday'
require 'faraday/follow_redirects'

def create_http_client
  Faraday.new do |faraday|
    faraday.request :url_encoded
    faraday.response :follow_redirects
    faraday.response :raise_error
    faraday.adapter Faraday.default_adapter
    faraday.options.timeout = 30
    faraday.options.open_timeout = 10
  end
end

def scrape_with_faraday(url)
  client = create_http_client
  response = client.get(url) do |req|
    req.headers['User-Agent'] = 'Mozilla/5.0 (compatible; WebScraper/1.0)'
  end

  Nokogiri::HTML(response.body)
end

# Usage
doc = scrape_with_faraday('https://example.com')
```
Best Practices for HTTP Requests with Nokogiri
1. Proper Error Handling
Always implement comprehensive error handling for HTTP requests:
```ruby
require 'nokogiri'
require 'httparty'
require 'timeout'

class RobustScraper
  include HTTParty

  def self.safe_scrape(url, retries: 3)
    attempt = 0
    begin
      attempt += 1
      response = get(url, {
        headers: { 'User-Agent' => generate_user_agent },
        timeout: 30,
        follow_redirects: true
      })

      case response.code
      when 200
        Nokogiri::HTML(response.body)
      when 404
        raise "Page not found: #{url}"
      when 429
        raise "Rate limited after #{retries} attempts" if attempt >= retries
        sleep(2**attempt)      # Exponential backoff
        raise 'Rate limited'   # Rescued below, which triggers a retry
      when 500..599
        raise "Server error: #{response.code}"
      else
        raise "Unexpected response: #{response.code}"
      end
    rescue Net::OpenTimeout, Net::ReadTimeout, Timeout::Error
      if attempt < retries
        sleep(1)
        retry
      else
        raise "Timeout after #{retries} attempts"
      end
    rescue StandardError => e
      if attempt < retries
        sleep(1)
        retry
      else
        raise "Failed to scrape #{url}: #{e.message}"
      end
    end
  end

  def self.generate_user_agent
    agents = [
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
    ]
    agents.sample
  end

  private_class_method :generate_user_agent
end
```
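A usage example in the same style as the earlier snippets (the URL is a placeholder):

```ruby
# Usage
doc = RobustScraper.safe_scrape('https://example.com', retries: 3)
puts doc.css('h1').text if doc
```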
2. Session Management
For sites requiring authentication or session management:
```ruby
require 'nokogiri'
require 'httparty'

class SessionScraper
  include HTTParty

  def initialize
    @cookies = {}
  end

  def login(login_url, username, password)
    # Get the login form and extract the CSRF token if present
    doc = fetch_page(login_url)
    csrf_token = doc.at_css('input[name="authenticity_token"]')&.[]('value')

    # Submit the login form
    login_data = { username: username, password: password }
    login_data[:authenticity_token] = csrf_token if csrf_token

    response = self.class.post(login_url, {
      body: login_data,
      headers: headers_with_cookies,
      follow_redirects: false
    })

    # Store session cookies from the Set-Cookie response headers
    store_cookies(response.headers.get_fields('set-cookie'))
    response.success?
  end

  def fetch_page(url)
    response = self.class.get(url, headers: headers_with_cookies)
    Nokogiri::HTML(response.body) if response.success?
  end

  private

  def headers_with_cookies
    headers = { 'User-Agent' => 'Mozilla/5.0 (compatible; WebScraper/1.0)' }
    headers['Cookie'] = format_cookies unless @cookies.empty?
    headers
  end

  def store_cookies(cookie_headers)
    return unless cookie_headers

    cookie_headers.each do |cookie_string|
      name, value = cookie_string.split(';').first.split('=', 2)
      @cookies[name] = value
    end
  end

  def format_cookies
    @cookies.map { |name, value| "#{name}=#{value}" }.join('; ')
  end
end
```
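A hypothetical usage example; the URLs, form field names, and credentials are placeholders and will differ from site to site:

```ruby
# Usage (placeholder URLs and credentials)
scraper = SessionScraper.new
if scraper.login('https://example.com/login', 'user@example.com', 'secret')
  dashboard = scraper.fetch_page('https://example.com/dashboard')
  puts dashboard.css('h1').text
end
```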
3. Handling Different Content Types
Ensure proper handling of different response content types:
```ruby
require 'nokogiri'
require 'json'

def parse_response(response)
  content_type = response.headers['content-type']

  case content_type
  when /html/
    Nokogiri::HTML(response.body)
  when /xml/
    Nokogiri::XML(response.body)
  when /json/
    JSON.parse(response.body)
  else
    response.body
  end
end
```
4. Rate Limiting and Delays
Implement proper rate limiting to avoid overwhelming servers:
```ruby
require 'nokogiri'
require 'httparty'

class PoliteScraper
  def initialize(delay: 1)
    @delay = delay
    @last_request_time = Time.now - delay
  end

  def scrape(url)
    enforce_delay
    response = HTTParty.get(url, headers: default_headers)
    @last_request_time = Time.now
    Nokogiri::HTML(response.body) if response.success?
  end

  private

  def enforce_delay
    time_since_last = Time.now - @last_request_time
    sleep(@delay - time_since_last) if time_since_last < @delay
  end

  def default_headers
    {
      'User-Agent' => 'Mozilla/5.0 (compatible; WebScraper/1.0)',
      'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
    }
  end
end
```
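A usage sketch that spaces requests roughly one delay interval apart (placeholder URLs):

```ruby
# Usage: roughly one request per second against the placeholder URLs
scraper = PoliteScraper.new(delay: 1)
%w[https://example.com/a https://example.com/b].each do |url|
  doc = scraper.scrape(url)
  puts doc.css('title').text if doc
end
```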
Advanced Techniques
1. Proxy Support
For scraping that requires IP rotation:
```ruby
require 'nokogiri'
require 'httparty'

class ProxyScraper
  def initialize(proxies)
    @proxies = proxies
    @current_proxy = 0
  end

  def scrape_with_proxy(url)
    # Rotate through the proxy list on each request
    proxy = @proxies[@current_proxy]
    @current_proxy = (@current_proxy + 1) % @proxies.length

    response = HTTParty.get(url, {
      http_proxyaddr: proxy[:host],
      http_proxyport: proxy[:port],
      http_proxyuser: proxy[:username],
      http_proxypass: proxy[:password],
      headers: default_headers
    })

    Nokogiri::HTML(response.body) if response.success?
  end

  private

  def default_headers
    { 'User-Agent' => 'Mozilla/5.0 (compatible; WebScraper/1.0)' }
  end
end
```
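A usage sketch; the proxy hosts and credentials are placeholders:

```ruby
# Usage (placeholder proxy hosts and credentials)
proxies = [
  { host: 'proxy1.example.com', port: 8080, username: 'user', password: 'pass' },
  { host: 'proxy2.example.com', port: 8080, username: 'user', password: 'pass' }
]
scraper = ProxyScraper.new(proxies)
doc = scraper.scrape_with_proxy('https://example.com')
puts doc.css('title').text if doc
```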
2. Concurrent Requests
For improved performance when scraping multiple pages:
```ruby
require 'concurrent'
require 'nokogiri'
require 'httparty'

def scrape_urls_concurrently(urls, max_threads: 5)
  pool = Concurrent::ThreadPoolExecutor.new(
    min_threads: 1,
    max_threads: max_threads,
    max_queue: urls.length
  )

  futures = urls.map do |url|
    Concurrent::Future.execute(executor: pool) do
      begin
        response = HTTParty.get(url, timeout: 30)
        {
          url: url,
          doc: Nokogiri::HTML(response.body),
          success: true
        }
      rescue StandardError => e
        {
          url: url,
          error: e.message,
          success: false
        }
      end
    end
  end

  futures.map(&:value)
ensure
  pool.shutdown
  pool.wait_for_termination(30)
end
```
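A usage sketch with placeholder URLs; each result hash carries either a parsed document or an error message:

```ruby
# Usage (placeholder URLs)
urls = ['https://example.com/page1', 'https://example.com/page2']
results = scrape_urls_concurrently(urls, max_threads: 2)

results.each do |result|
  if result[:success]
    puts "#{result[:url]}: #{result[:doc].css('title').text}"
  else
    puts "#{result[:url]} failed: #{result[:error]}"
  end
end
```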
Integration with Modern Web Scraping
While Nokogiri excels at parsing static HTML, modern web applications often render content with JavaScript, which Nokogiri cannot execute. For dynamic content you can rely on specialized scraping tools, or combine Nokogiri with a headless browser: let the browser render the page, then hand the resulting HTML to Nokogiri for parsing.
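One possible combination, sketched here assuming the selenium-webdriver gem and a local Chrome/chromedriver installation (none of which are covered above):

```ruby
require 'nokogiri'
require 'selenium-webdriver'

# Assumes the selenium-webdriver gem plus Chrome and chromedriver are installed
def scrape_rendered_page(url)
  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument('--headless=new') # run Chrome without a visible window

  driver = Selenium::WebDriver.for(:chrome, options: options)
  begin
    driver.get(url)
    # page_source contains the DOM after JavaScript has run
    Nokogiri::HTML(driver.page_source)
  ensure
    driver.quit
  end
end
```

The browser only does the rendering; Nokogiri still does the parsing, so your CSS and XPath extraction code stays the same.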
Common Pitfalls to Avoid
- Not setting User-Agent headers - Many sites block requests without proper User-Agent headers
- Ignoring rate limits - Always implement delays between requests
- Poor error handling - Network requests can fail for many reasons
- Not handling redirects - Use libraries that automatically follow redirects
- Ignoring content encoding - Some responses use gzip compression (see the sketch after this list)
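On the last point: recent Net::HTTP versions and clients built on it generally decompress gzip bodies for you, but if you set the Accept-Encoding header yourself you have to decompress the body manually. A minimal sketch using Ruby's built-in Zlib, under that assumption:

```ruby
require 'net/http'
require 'zlib'
require 'stringio'
require 'nokogiri'

# Illustrative only: needed when you opt into gzip explicitly,
# since Net::HTTP skips automatic decoding for user-set headers.
def fetch_and_decompress(url)
  uri = URI(url)
  request = Net::HTTP::Get.new(uri)
  request['Accept-Encoding'] = 'gzip'

  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end

  body = if response['content-encoding'] == 'gzip'
           Zlib::GzipReader.new(StringIO.new(response.body)).read
         else
           response.body
         end

  Nokogiri::HTML(body)
end
```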
Testing Your HTTP/Nokogiri Integration
Stub HTTP responses with WebMock so your parsing logic can be exercised without real network calls (the spec below reuses fetch_with_net_http from the Net::HTTP example):

```ruby
require 'webmock/rspec'

RSpec.describe 'Web scraping with Nokogiri' do
  before do
    WebMock.disable_net_connect!(allow_localhost: true)
  end

  it 'parses HTML correctly' do
    html_content = '<html><title>Test Page</title></html>'
    stub_request(:get, 'https://example.com')
      .to_return(status: 200, body: html_content)

    doc = fetch_with_net_http('https://example.com')
    expect(doc.css('title').text).to eq('Test Page')
  end

  it 'handles HTTP errors gracefully' do
    stub_request(:get, 'https://example.com')
      .to_return(status: 404)

    expect {
      fetch_with_net_http('https://example.com')
    }.to raise_error(/HTTP Error: 404/)
  end
end
```
Conclusion
Properly handling HTTP requests with Nokogiri requires combining a reliable HTTP client library with robust error handling, rate limiting, and proper header management. Whether you choose Net::HTTP for simplicity, HTTParty for ease of use, or Faraday for flexibility, the key is implementing comprehensive error handling and respecting the target server's resources.
Remember to always check robots.txt files, implement appropriate delays between requests, and handle various HTTP status codes gracefully. With these practices, you'll build reliable web scraping applications that work consistently across different websites and scenarios.