What are Ruby's Built-in HTTP Libraries and When Should I Use Them for Scraping?

Ruby provides several built-in HTTP libraries that are powerful tools for web scraping without requiring external dependencies. Understanding when and how to use each library can significantly improve your scraping projects' efficiency and maintainability. This guide covers Ruby's primary HTTP libraries and their optimal use cases.

Ruby's Built-in HTTP Libraries Overview

Ruby includes several HTTP libraries in its standard library, each with distinct strengths:

  • Net::HTTP - The foundational HTTP client library
  • OpenURI - Simplified interface for opening URLs
  • Net::HTTPSession - a legacy alias for Net::HTTP; persistent (keep-alive) sessions come from Net::HTTP.start rather than a separate class
  • URI - URL parsing and manipulation utilities (see the sketch below)
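
The URI module underpins the other libraries and is worth a quick look on its own. A short sketch of the parsing and URL-building calls used throughout this guide:

require 'uri'

uri = URI.parse('https://example.com/products?page=2')
uri.scheme  # => "https"
uri.host    # => "example.com"
uri.path    # => "/products"
uri.query   # => "page=2"

# Build a query string from a hash and resolve a relative link against a base URL
uri.query = URI.encode_www_form(page: 3, sort: 'price')
uri.to_s                                                    # => "https://example.com/products?page=3&sort=price"
URI.join('https://example.com/products/', 'item/42').to_s  # => "https://example.com/products/item/42"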

Net::HTTP: The Foundation

Net::HTTP is Ruby's core HTTP library, offering fine-grained control over HTTP requests and responses. It's ideal for complex scraping scenarios requiring custom headers, authentication, and detailed error handling.

Basic Net::HTTP Usage

require 'net/http'
require 'uri'

def scrape_with_net_http(url)
  uri = URI(url)

  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    request = Net::HTTP::Get.new(uri)
    request['User-Agent'] = 'Mozilla/5.0 (compatible; Ruby scraper)'

    response = http.request(request)

    case response
    when Net::HTTPSuccess
      response.body
    when Net::HTTPRedirection
      location = response['Location']
      puts "Redirected to: #{location}"
      scrape_with_net_http(URI.join(url, location).to_s) # handle relative Location headers
    else
      raise "HTTP Error: #{response.code} #{response.message}"
    end
  end
end

# Usage
html_content = scrape_with_net_http('https://example.com')

Advanced Net::HTTP with Custom Headers

require 'net/http'
require 'json'

class WebScraper
  def initialize
    @headers = {
      'User-Agent' => 'Mozilla/5.0 (compatible; Ruby scraper)',
      'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language' => 'en-US,en;q=0.5',
      # Leave Accept-Encoding unset: Net::HTTP requests gzip/deflate by default and
      # only decompresses the body automatically when it set that header itself.
      'Connection' => 'keep-alive'
    }
  end

  def get_with_session(url, cookies = nil)
    uri = URI(url)

    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
      request = Net::HTTP::Get.new(uri)

      @headers.each { |key, value| request[key] = value }
      request['Cookie'] = cookies if cookies

      response = http.request(request)

      {
        body: response.body,
        cookies: response.get_fields('Set-Cookie'),
        status: response.code.to_i
      }
    end
  end

  def post_form_data(url, form_data)
    uri = URI(url)

    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
      request = Net::HTTP::Post.new(uri)
      request.set_form_data(form_data)
      @headers.each { |key, value| request[key] = value }

      response = http.request(request)
      response.body
    end
  end
end

# Usage
scraper = WebScraper.new
result = scraper.get_with_session('https://httpbin.org/get')
puts result[:body]

OpenURI: Simplified HTTP Access

OpenURI provides a simple interface for opening URLs, making it perfect for straightforward scraping tasks. It automatically handles redirects and supports basic authentication.

Basic OpenURI Usage

require 'open-uri'

def simple_scrape(url)
  begin
    URI.open(url) do |response|
      response.read
    end
  rescue OpenURI::HTTPError => e
    puts "HTTP Error: #{e.message}"
    nil
  rescue => e
    puts "Error: #{e.message}"
    nil
  end
end

# Usage
content = simple_scrape('https://example.com')
puts content if content

OpenURI with Custom Options

require 'open-uri'

def scrape_with_options(url)
  options = {
    'User-Agent' => 'Mozilla/5.0 (compatible; Ruby scraper)',
    'Referer' => 'https://google.com',
    read_timeout: 10,
    open_timeout: 5
  }

  begin
    URI.open(url, options) do |response|
      {
        content: response.read,
        content_type: response.content_type,
        charset: response.charset,
        last_modified: response.last_modified
      }
    end
  rescue => e
    puts "Error scraping #{url}: #{e.message}"
    nil
  end
end

# Usage with metadata
result = scrape_with_options('https://example.com')
if result
  puts "Content-Type: #{result[:content_type]}"
  puts "Charset: #{result[:charset]}"
  puts "Content length: #{result[:content].length}"
end
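
OpenURI also covers the basic-authentication case mentioned earlier. A minimal sketch with placeholder credentials:

require 'open-uri'

# Fetch a password-protected page; redirects are still followed automatically
URI.open('https://example.com/protected',
         http_basic_authentication: ['user', 'secret']) do |response|
  puts response.status.inspect # e.g. ["200", "OK"]
  puts response.read[0, 200]
end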

When to Use Each Library

Use Net::HTTP When:

  1. Complex Authentication Required
   # OAuth or custom authentication
   request['Authorization'] = "Bearer #{access_token}"
  2. Session Management Needed
   # Maintaining cookies across requests
   Net::HTTP.start(host, port) do |http|
     # Multiple requests with persistent connection
   end
  3. Custom Request Methods Required
   # PATCH, PUT, DELETE requests
   request = Net::HTTP::Patch.new(uri)
   request.body = JSON.generate(data)
  4. Detailed Error Handling (see the rate-limit sketch after this list)
   case response
   when Net::HTTPUnauthorized
     refresh_token_and_retry
   when Net::HTTPTooManyRequests
     implement_backoff_strategy
   end
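
The implement_backoff_strategy placeholder above can be as simple as honoring the server's Retry-After header. A minimal sketch (the helper name and retry count are illustrative, not part of the standard library):

require 'net/http'

# Retry on 429 responses, waiting for Retry-After (or a default) between attempts
def get_with_rate_limit(uri, attempts = 3)
  attempts.times do
    response = Net::HTTP.get_response(uri)
    return response unless response.is_a?(Net::HTTPTooManyRequests)

    delay = response['Retry-After'].to_i
    delay = 5 if delay.zero? # default when the header is missing or not numeric
    sleep(delay)
  end
  raise "Rate limited: gave up after #{attempts} attempts"
end

response = get_with_rate_limit(URI('https://example.com'))
puts response.code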

Use OpenURI When:

  1. Simple GET Requests
  2. Prototype Development
  3. One-off Data Fetching
  4. File Downloads
# Simple file download
URI.open('https://example.com/image.jpg', 'rb') do |file|
  File.open('downloaded_image.jpg', 'wb') do |output|
    output.write(file.read)
  end
end
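
For larger downloads, OpenURI's content_length_proc and progress_proc options report progress as data arrives. A minimal sketch (URL and filenames are placeholders):

require 'open-uri'

total = nil
URI.open('https://example.com/archive.zip', 'rb',
         content_length_proc: ->(size) { total = size },
         progress_proc: ->(bytes) { print "\rDownloaded #{bytes} of #{total || '?'} bytes" }) do |remote|
  # OpenURI buffers large responses to a Tempfile; copy it to its final location
  File.open('archive.zip', 'wb') { |local| IO.copy_stream(remote, local) }
end
puts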

Complete Scraping Example

Here's a practical example combining both libraries for a real scraping scenario:

require 'net/http'
require 'open-uri'
require 'nokogiri'
require 'json'

class ProductScraper
  def initialize
    @session_cookies = nil
  end

  def scrape_product_list(base_url)
    # Use OpenURI for simple page fetching
    html = URI.open(base_url, 
      'User-Agent' => 'Mozilla/5.0 (compatible; Ruby scraper)'
    ).read

    doc = Nokogiri::HTML(html)

    product_links = doc.css('a.product-link').map do |link|
      URI.join(base_url, link['href']).to_s
    end

    # Use Net::HTTP for detailed product scraping
    products = product_links.map do |url|
      scrape_product_details(url)
    end

    products.compact
  end

  private

  def scrape_product_details(url)
    uri = URI(url)

    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
      request = Net::HTTP::Get.new(uri)
      request['User-Agent'] = 'Mozilla/5.0 (compatible; Ruby scraper)'
      request['Cookie'] = @session_cookies if @session_cookies

      response = http.request(request)

      if response.is_a?(Net::HTTPSuccess)
        # Store cookies for session continuity (naive: Set-Cookie attributes such as
        # Path and Expires are re-sent verbatim; use a real cookie jar in production)
        @session_cookies = response.get_fields('Set-Cookie')&.join('; ')

        parse_product_data(response.body)
      else
        puts "Failed to fetch #{url}: #{response.code}"
        nil
      end
    end
  rescue => e
    puts "Error scraping #{url}: #{e.message}"
    nil
  end

  def parse_product_data(html)
    doc = Nokogiri::HTML(html)

    {
      title: doc.css('h1.product-title').text.strip,
      price: doc.css('.price').text.strip,
      description: doc.css('.product-description').text.strip,
      images: doc.css('img.product-image').map { |img| img['src'] }
    }
  end
end

# Usage
scraper = ProductScraper.new
products = scraper.scrape_product_list('https://example-store.com/products')
puts JSON.pretty_generate(products)

Best Practices and Performance Tips

1. Connection Reuse

# Efficient: Reuse connections
Net::HTTP.start(host, port) do |http|
  paths.each do |path|
    response = http.get(path)
    process_response(response)
  end
end

# Inefficient: New connection per request
urls.each do |url|
  Net::HTTP.get_response(URI(url))
end

2. Timeout Configuration

http = Net::HTTP.new(uri.host, uri.port)
http.open_timeout = 5    # Connection timeout
http.read_timeout = 10   # Read timeout
http.use_ssl = true if uri.scheme == 'https'

3. Error Handling and Retries

def fetch_with_retry(url, max_retries = 3)
  retries = 0

  begin
    URI.open(url).read
  rescue Net::OpenTimeout, Net::ReadTimeout, SocketError => e
    retries += 1
    if retries <= max_retries
      sleep(2 ** retries) # Exponential backoff
      retry
    else
      raise e
    end
  end
end

Comparison with External Libraries

While Ruby's built-in libraries are powerful, consider external alternatives for advanced scenarios:

  • HTTParty: Simpler API than Net::HTTP (see the sketch after this list)
  • Faraday: Middleware-based HTTP client
  • RestClient: Simple REST API client
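
As a point of comparison, the custom-header GET from earlier becomes a one-liner with HTTParty. A sketch that assumes the httparty gem is installed:

require 'httparty'

# Same GET with a custom User-Agent, using the external httparty gem
response = HTTParty.get('https://example.com',
                        headers: { 'User-Agent' => 'Mozilla/5.0 (compatible; Ruby scraper)' })
puts response.code
puts response.body[0, 200]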

However, the built-in libraries offer:

  • Zero dependencies - no gem management required
  • Stability - part of Ruby's standard library and thoroughly tested
  • Fewer layers - most HTTP gems (including the three above) wrap Net::HTTP, so using it directly avoids extra abstraction

For scenarios requiring advanced browser automation capabilities, consider using headless browser solutions, though Ruby's HTTP libraries excel at API-based scraping and simple HTML retrieval.

Conclusion

Ruby's built-in HTTP libraries provide a robust foundation for web scraping projects. Use Net::HTTP for complex scenarios requiring fine-grained control over requests, sessions, and error handling. Choose OpenURI for simple, straightforward scraping tasks. Both libraries offer excellent performance and reliability without external dependencies, making them ideal choices for production scraping applications.

When building scalable scraping solutions, combining these libraries with proper error handling, rate limiting, and respectful scraping practices ensures both effectiveness and maintainability in your Ruby applications.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
