What is the Proper Way to Handle HTTP Requests with Nokogiri?

Nokogiri is a powerful HTML and XML parser for Ruby, but it doesn't handle HTTP requests directly. To properly use Nokogiri for web scraping, you need to combine it with an HTTP client library. This guide covers the best practices for making HTTP requests and parsing responses with Nokogiri.

Understanding Nokogiri's Role

Nokogiri is primarily a parsing library that processes HTML or XML content. It doesn't fetch web pages itself, so the typical workflow (sketched after this list) is to:

  1. Make HTTP requests using an HTTP client library
  2. Pass the response body to Nokogiri for parsing
  3. Extract data using Nokogiri's CSS selectors or XPath
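
A minimal sketch of that workflow, using Ruby's built-in OpenURI to fetch and Nokogiri to parse (any HTTP client works for step 1):

require 'nokogiri'
require 'open-uri'

# 1. Fetch the page (OpenURI ships with Ruby's standard library)
html = URI.open('https://example.com').read

# 2. Parse the response body with Nokogiri
doc = Nokogiri::HTML(html)

# 3. Extract data with CSS selectors or XPath
puts doc.css('h1').map(&:text)
puts doc.xpath('//a/@href').map(&:value)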

Recommended HTTP Libraries for Nokogiri

1. Net::HTTP (Built-in)

Ruby's built-in Net::HTTP library is suitable for simple requests:

require 'nokogiri'
require 'net/http'
require 'uri'

def fetch_with_net_http(url)
  uri = URI(url)
  response = Net::HTTP.get_response(uri)

  if response.code == '200'
    doc = Nokogiri::HTML(response.body)
    return doc
  else
    raise "HTTP Error: #{response.code}"
  end
end

# Usage
doc = fetch_with_net_http('https://example.com')
title = doc.css('title').text
puts title

2. HTTParty (Recommended)

HTTParty provides a more user-friendly interface and better error handling:

require 'nokogiri'
require 'httparty'

class WebScraper
  include HTTParty

  def self.scrape_page(url)
    response = get(url, {
      headers: {
        'User-Agent' => 'Mozilla/5.0 (compatible; WebScraper/1.0)'
      },
      timeout: 30
    })

    if response.success?
      Nokogiri::HTML(response.body)
    else
      raise "Failed to fetch #{url}: #{response.code}"
    end
  end
end

# Usage
doc = WebScraper.scrape_page('https://example.com')
links = doc.css('a').map { |link| link['href'] }

3. Faraday (Most Flexible)

Faraday offers the most flexibility with middleware support:

require 'nokogiri'
require 'faraday'
require 'faraday/follow_redirects'

def create_http_client
  Faraday.new do |faraday|
    faraday.request :url_encoded
    faraday.response :follow_redirects
    faraday.response :raise_error
    faraday.adapter Faraday.default_adapter

    faraday.options.timeout = 30
    faraday.options.open_timeout = 10
  end
end

def scrape_with_faraday(url)
  client = create_http_client
  response = client.get(url) do |req|
    req.headers['User-Agent'] = 'Mozilla/5.0 (compatible; WebScraper/1.0)'
  end

  Nokogiri::HTML(response.body)
end

# Usage
doc = scrape_with_faraday('https://example.com')

Best Practices for HTTP Requests with Nokogiri

1. Proper Error Handling

Always implement comprehensive error handling for HTTP requests:

require 'nokogiri'
require 'httparty'
require 'timeout'

class RobustScraper
  include HTTParty

  def self.safe_scrape(url, retries: 3)
    attempt = 0

    begin
      attempt += 1

      response = get(url, {
        headers: { 'User-Agent' => generate_user_agent },
        timeout: 30,
        follow_redirects: true
      })

      case response.code
      when 200
        return Nokogiri::HTML(response.body)
      when 404
        raise "Page not found: #{url}"
      when 429
        # `retry` is only valid inside a rescue clause, so back off and raise;
        # the StandardError rescue below performs the actual retry
        sleep(2 ** attempt) # Exponential backoff
        raise "Rate limited (attempt #{attempt}/#{retries})"
      when 500..599
        raise "Server error: #{response.code}"
      else
        raise "Unexpected response: #{response.code}"
      end

    rescue Net::OpenTimeout, Net::ReadTimeout, Timeout::Error
      if attempt < retries
        sleep(1)
        retry
      else
        raise "Timeout after #{retries} attempts"
      end
    rescue StandardError => e
      if attempt < retries
        sleep(1)
        retry
      else
        raise "Failed to scrape #{url}: #{e.message}"
      end
    end
  end

  # NOTE: `private` does not apply to methods defined with `def self.`,
  # so mark the helper private explicitly
  private_class_method def self.generate_user_agent
    agents = [
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
    ]
    agents.sample
  end
end

2. Session Management

For sites requiring authentication or session management:

require 'nokogiri'
require 'httparty'

class SessionScraper
  include HTTParty

  def initialize
    @cookies = {}
  end

  def login(login_url, username, password)
    # Get login form
    doc = fetch_page(login_url)

    # Extract CSRF token if present
    csrf_token = doc.css('input[name="authenticity_token"]').first&.[]('value')

    # Submit login form
    login_data = {
      username: username,
      password: password
    }
    login_data[:authenticity_token] = csrf_token if csrf_token

    response = self.class.post(login_url, {
      body: login_data,
      headers: headers_with_cookies,
      follow_redirects: false
    })

    # Store session cookies
    # headers['set-cookie'] collapses multiple cookies into a single string;
    # get_fields returns each Set-Cookie header as an array element
    store_cookies(response.headers.get_fields('Set-Cookie'))
    response.success?
  end

  def fetch_page(url)
    response = self.class.get(url, headers: headers_with_cookies)
    Nokogiri::HTML(response.body) if response.success?
  end

  private

  def headers_with_cookies
    headers = { 'User-Agent' => 'Mozilla/5.0 (compatible; WebScraper/1.0)' }
    headers['Cookie'] = format_cookies unless @cookies.empty?
    headers
  end

  def store_cookies(cookie_header)
    return unless cookie_header

    cookie_header.each do |cookie_string|
      name, value = cookie_string.split(';').first.split('=', 2)
      @cookies[name] = value
    end
  end

  def format_cookies
    @cookies.map { |name, value| "#{name}=#{value}" }.join('; ')
  end
end

3. Handling Different Content Types

Ensure proper handling of different response content types:

require 'json'
require 'nokogiri'

def parse_response(response)
  content_type = response.headers['content-type']

  case content_type
  when /html/
    Nokogiri::HTML(response.body)
  when /xml/
    Nokogiri::XML(response.body)
  when /json/
    JSON.parse(response.body)
  else
    response.body
  end
end

4. Rate Limiting and Delays

Implement proper rate limiting to avoid overwhelming servers:

require 'nokogiri'
require 'httparty'

class PoliteScraper
  def initialize(delay: 1)
    @delay = delay
    @last_request_time = Time.now - delay
  end

  def scrape(url)
    enforce_delay

    response = HTTParty.get(url, headers: default_headers)
    @last_request_time = Time.now

    Nokogiri::HTML(response.body) if response.success?
  end

  private

  def enforce_delay
    time_since_last = Time.now - @last_request_time
    sleep(@delay - time_since_last) if time_since_last < @delay
  end

  def default_headers
    {
      'User-Agent' => 'Mozilla/5.0 (compatible; WebScraper/1.0)',
      'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
    }
  end
end

Advanced Techniques

1. Proxy Support

For scraping that requires IP rotation:

require 'nokogiri'
require 'httparty'

class ProxyScraper
  def initialize(proxies)
    @proxies = proxies
    @current_proxy = 0
  end

  def scrape_with_proxy(url)
    proxy = @proxies[@current_proxy]
    @current_proxy = (@current_proxy + 1) % @proxies.length

    response = HTTParty.get(url, {
      http_proxyaddr: proxy[:host],
      http_proxyport: proxy[:port],
      http_proxyuser: proxy[:username],
      http_proxypass: proxy[:password],
      headers: default_headers
    })

    Nokogiri::HTML(response.body) if response.success?
  end

  private

  # Default request headers (defined here because this class doesn't share them)
  def default_headers
    {
      'User-Agent' => 'Mozilla/5.0 (compatible; WebScraper/1.0)',
      'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
    }
  end
end

2. Concurrent Requests

For improved performance when scraping multiple pages:

require 'concurrent'
require 'nokogiri'
require 'httparty'

def scrape_urls_concurrently(urls, max_threads: 5)
  pool = Concurrent::ThreadPoolExecutor.new(
    min_threads: 1,
    max_threads: max_threads,
    max_queue: urls.length
  )

  futures = urls.map do |url|
    Concurrent::Future.execute(executor: pool) do
      begin
        response = HTTParty.get(url, timeout: 30)
        {
          url: url,
          doc: Nokogiri::HTML(response.body),
          success: true
        }
      rescue StandardError => e
        {
          url: url,
          error: e.message,
          success: false
        }
      end
    end
  end

  futures.map(&:value)
ensure
  pool.shutdown
  pool.wait_for_termination(30)
end
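
Each future resolves to a hash, so the results can be consumed like this:

# Usage
results = scrape_urls_concurrently(['https://example.com', 'https://example.org'])

results.each do |result|
  if result[:success]
    puts "#{result[:url]}: #{result[:doc].css('title').text}"
  else
    puts "#{result[:url]} failed: #{result[:error]}"
  end
end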

Integration with Modern Web Scraping

While Nokogiri excels at parsing static HTML, modern web applications often require JavaScript execution. For comprehensive web scraping solutions that handle dynamic content, consider using specialized tools. For applications requiring both static parsing and dynamic content handling, you might need to combine Nokogiri with headless browser solutions.
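
As one hedged illustration, the sketch below assumes the selenium-webdriver gem and a local headless Chrome install: the browser renders the JavaScript, and Nokogiri parses the resulting HTML.

require 'nokogiri'
require 'selenium-webdriver'

# Render a JavaScript-heavy page in headless Chrome, then hand the resulting
# HTML to Nokogiri (assumes selenium-webdriver and Chrome are installed)
def scrape_dynamic_page(url)
  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument('--headless=new')

  driver = Selenium::WebDriver.for(:chrome, options: options)
  driver.navigate.to(url)

  Nokogiri::HTML(driver.page_source)
ensure
  driver&.quit
end

# Usage
doc = scrape_dynamic_page('https://example.com')
puts doc.css('h1').map(&:text)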

Common Pitfalls to Avoid

  1. Not setting User-Agent headers - Many sites block requests without proper User-Agent headers
  2. Ignoring rate limits - Always implement delays between requests
  3. Poor error handling - Network requests can fail for many reasons
  4. Not handling redirects - Use libraries that automatically follow redirects
  5. Ignoring content encoding - Some responses use gzip compression (see the sketch below)
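
On the last point: Net::HTTP (and libraries built on it, such as HTTParty) usually decompress gzip bodies automatically, but if you set Accept-Encoding yourself or use a lower-level client you may receive raw gzip. A minimal sketch of manual decompression with Ruby's built-in Zlib, assuming a response object that exposes headers and body:

require 'zlib'
require 'stringio'
require 'nokogiri'

# Decompress a gzip-encoded body before parsing (assumes `response`
# exposes `headers` and `body`, e.g. an HTTParty response)
def parse_possibly_gzipped(response)
  body = response.body
  if response.headers['content-encoding'].to_s.include?('gzip')
    body = Zlib::GzipReader.new(StringIO.new(body)).read
  end
  Nokogiri::HTML(body)
end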

Testing Your HTTP/Nokogiri Integration

Use WebMock to stub HTTP responses so you can test your parsing logic (here, the fetch_with_net_http helper defined earlier) without hitting the network:

require 'webmock/rspec'

RSpec.describe 'Web scraping with Nokogiri' do
  before do
    WebMock.disable_net_connect!(allow_localhost: true)
  end

  it 'parses HTML correctly' do
    html_content = '<html><title>Test Page</title></html>'

    stub_request(:get, 'https://example.com')
      .to_return(status: 200, body: html_content)

    doc = fetch_with_net_http('https://example.com')
    expect(doc.css('title').text).to eq('Test Page')
  end

  it 'handles HTTP errors gracefully' do
    stub_request(:get, 'https://example.com')
      .to_return(status: 404)

    expect {
      fetch_with_net_http('https://example.com')
    }.to raise_error(/HTTP Error: 404/)
  end
end

Conclusion

Properly handling HTTP requests with Nokogiri requires combining a reliable HTTP client library with robust error handling, rate limiting, and proper header management. Whether you choose Net::HTTP for simplicity, HTTParty for ease of use, or Faraday for flexibility, the key is implementing comprehensive error handling and respecting the target server's resources.

Remember to always check robots.txt files, implement appropriate delays between requests, and handle various HTTP status codes gracefully. With these practices, you'll build reliable web scraping applications that work consistently across different websites and scenarios.
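
For the robots.txt check, here is a deliberately simplified sketch: it only looks at Disallow rules in the wildcard user-agent group and ignores Allow rules and wildcards, so treat it as an illustration rather than a complete parser.

require 'net/http'
require 'uri'

# Very simplified robots.txt check against "User-agent: *" Disallow rules
def allowed_by_robots?(url)
  uri = URI(url)
  robots = Net::HTTP.get_response(URI("#{uri.scheme}://#{uri.host}/robots.txt"))
  return true unless robots.code == '200'

  disallowed = []
  applies = false
  robots.body.each_line do |line|
    line = line.strip
    if line =~ /\AUser-agent:\s*(.+)\z/i
      applies = ($1.strip == '*')
    elsif applies && line =~ /\ADisallow:\s*(\S+)/i
      disallowed << $1
    end
  end

  disallowed.none? { |path| uri.path.start_with?(path) }
end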

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
