What is the difference between open-uri and net/http in Ruby for web scraping?
When it comes to web scraping in Ruby, developers often choose between two primary HTTP libraries: open-uri and net/http. Both libraries allow you to make HTTP requests and retrieve web content, but they differ significantly in their approach, complexity, and capabilities. Understanding these differences is crucial for selecting the right tool for your web scraping projects.
Overview of open-uri vs net/http
open-uri is a high-level wrapper around Ruby's net/http library that provides a simplified interface for making HTTP requests. It's designed to make common HTTP operations as simple as opening a local file, hence the name. In contrast, net/http is Ruby's standard, low-level HTTP client library that offers more granular control over HTTP requests and responses.
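Because open-uri delegates to net/http under the hood, the object returned by URI.open is extended with OpenURI::Meta, which exposes response metadata alongside the body. A minimal sketch (the URL is a placeholder):
require 'open-uri'
# URI.open returns a StringIO (or a Tempfile for large bodies)
# extended with OpenURI::Meta accessors
page = URI.open('https://example.com')
puts page.status.inspect    # e.g. ["200", "OK"]
puts page.content_type      # e.g. "text/html"
puts page.base_uri          # final URI after any redirects
html = page.read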
Basic Syntax Comparison
Using open-uri
require 'open-uri'
require 'nokogiri'
# Simple GET request
html = URI.open('https://example.com').read
doc = Nokogiri::HTML(html)
# With basic options
html = URI.open('https://example.com',
  'User-Agent' => 'Mozilla/5.0 (compatible; Ruby scraper)',
  'Accept' => 'text/html'
).read
Using net/http
require 'net/http'
require 'nokogiri'
require 'uri'
# Simple GET request
uri = URI('https://example.com')
response = Net::HTTP.get_response(uri)
html = response.body
doc = Nokogiri::HTML(html)
# With more control
uri = URI('https://example.com')
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true if uri.scheme == 'https'
request = Net::HTTP::Get.new(uri)
request['User-Agent'] = 'Mozilla/5.0 (compatible; Ruby scraper)'
request['Accept'] = 'text/html'
response = http.request(request)
html = response.body
Key Differences
1. Simplicity and Ease of Use
open-uri wins in simplicity:
- Minimal code required for basic requests
- Automatic handling of redirects (redirect loops raise an error)
- Built-in support for basic authentication via the :http_basic_authentication option (see the sketch after these lists)
- Automatic SSL/TLS handling

net/http requires more boilerplate but offers control:
- More verbose syntax for basic operations
- Manual configuration of SSL, redirects, and other options
- Explicit connection management
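A quick illustration of these conveniences, as a minimal sketch: open-uri follows redirects on its own and attaches Basic auth credentials with a single option. The URL and credentials below are placeholders.
require 'open-uri'
# Redirects are followed automatically; a redirect loop raises an error.
# :http_basic_authentication adds the Authorization header for you.
html = URI.open(
  'https://example.com/protected',                  # placeholder URL
  http_basic_authentication: ['user', 'secret']     # placeholder credentials
).read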
2. Feature Set and Flexibility
open-uri limitations:
- Only supports GET requests
- Limited customization options
- Cannot reuse connections
- No built-in cookie handling
- Coarse timeout control (only the :open_timeout and :read_timeout options)

net/http advantages (see the sketch after this list):
- Supports all HTTP methods (GET, POST, PUT, DELETE, etc.)
- Full control over headers, timeouts, and connection parameters
- Connection reuse and persistent connections
- Cookie handling via the Cookie and Set-Cookie headers (the standard library has no built-in cookie jar)
- Custom SSL certificate handling
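Because each verb has its own request class, non-GET requests follow the same pattern as GET. A minimal sketch of a JSON PUT request; the endpoint and payload are placeholders:
require 'net/http'
require 'uri'
require 'json'
uri = URI('https://api.example.com/items/1')    # placeholder endpoint
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
# Net::HTTP::Post, Put, Patch, and Delete all work like Net::HTTP::Get
request = Net::HTTP::Put.new(uri)
request['Content-Type'] = 'application/json'
request.body = { name: 'updated' }.to_json      # placeholder payload
response = http.request(request)
puts "#{response.code} #{response.message}"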
3. Performance Considerations
For single requests, both libraries perform similarly. However, for multiple requests:
# net/http with connection reuse (faster for multiple requests)
uri = URI('https://api.example.com')
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
http.start do |connection|
  10.times do |i|
    request = Net::HTTP::Get.new("/data/#{i}")
    response = connection.request(request)
    # Process the response here
  end
end
# open-uri creates new connection each time (slower)
10.times do |i|
  html = URI.open("https://api.example.com/data/#{i}").read
  # Process the html here
end
When to Use Each Library
Choose open-uri when:
- Simple GET requests: You only need to fetch web pages or API responses
- Prototyping: Quick scripts and proof-of-concepts
- Basic scraping: Scraping static websites without complex requirements
- Minimal dependencies: You want to use Ruby's standard library
require 'open-uri'
require 'json'
# Perfect for simple API calls
api_response = URI.open('https://api.github.com/users/octocat').read
user_data = JSON.parse(api_response)
puts user_data['name']
Choose net/http when:
- Form submissions: You need to POST data or submit forms
- Session management: Working with login sessions and cookies
- Performance-critical applications: Making many requests to the same host
- Complex authentication: OAuth, custom headers, or certificate-based auth
- Error handling: You need detailed response information
require 'net/http'
require 'json'
# Better for form submissions and session handling
uri = URI('https://api.example.com/login')
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
# POST login credentials
login_request = Net::HTTP::Post.new(uri)
login_request.set_form_data('username' => 'user', 'password' => 'pass')
login_response = http.request(login_request)
# Reuse session for authenticated requests
if login_response.code == '200'
  # Naive cookie passing: forwards the raw Set-Cookie values, attributes included
  cookies = login_response.get_fields('Set-Cookie')
  data_request = Net::HTTP::Get.new('/protected-data')
  data_request['Cookie'] = cookies.join('; ') if cookies
  data_response = http.request(data_request)
end
Advanced Web Scraping Patterns
Handling Redirects and Errors
# open-uri handles redirects automatically
begin
  content = URI.open('https://example.com/redirect-page').read
rescue OpenURI::HTTPError => e
  puts "HTTP Error: #{e.message}"
rescue => e
  puts "Other error: #{e.message}"
end
# net/http requires manual redirect handling
def fetch_with_redirects(uri, limit = 5)
  raise 'Too many redirects' if limit == 0

  response = Net::HTTP.get_response(uri)
  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # URI.join resolves relative Location headers against the current URI
    new_uri = URI.join(uri.to_s, response['location'])
    fetch_with_redirects(new_uri, limit - 1)
  else
    raise "HTTP Error: #{response.code} #{response.message}"
  end
end
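Usage is then a single call (placeholder URL):
body = fetch_with_redirects(URI('https://example.com/redirect-page'))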
Setting Custom Headers and User Agents
# open-uri approach
headers = {
  'User-Agent' => 'Mozilla/5.0 (compatible; Ruby Bot)',
  'Accept' => 'text/html,application/xhtml+xml',
  'Accept-Language' => 'en-US,en;q=0.9'
}
content = URI.open('https://example.com', headers).read
# net/http approach with more control
uri = URI('https://example.com')
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
http.read_timeout = 30
http.open_timeout = 10
request = Net::HTTP::Get.new(uri)
request['User-Agent'] = 'Mozilla/5.0 (compatible; Ruby Bot)'
request['Accept'] = 'text/html,application/xhtml+xml'
request['Accept-Language'] = 'en-US,en;q=0.9'
response = http.request(request)
Error Handling and Debugging
open-uri Error Handling
require 'open-uri'
begin
  content = URI.open('https://example.com/page').read
rescue OpenURI::HTTPError => e
  # Handle HTTP errors (4xx, 5xx)
  puts "HTTP Error: #{e.message}"
  puts "Response body: #{e.io.read}" if e.io.respond_to?(:read)
rescue SocketError => e
  # Handle network errors
  puts "Network Error: #{e.message}"
rescue => e
  # Handle other errors
  puts "Unexpected error: #{e.message}"
end
net/http Error Handling
require 'net/http'
uri = URI('https://example.com/page')
begin
  response = Net::HTTP.get_response(uri)
  case response
  when Net::HTTPSuccess
    content = response.body
  when Net::HTTPClientError
    puts "Client Error (4xx): #{response.code} #{response.message}"
  when Net::HTTPServerError
    puts "Server Error (5xx): #{response.code} #{response.message}"
  else
    puts "Unexpected response: #{response.code} #{response.message}"
  end
rescue SocketError => e
  puts "Network Error: #{e.message}"
rescue Timeout::Error => e
  # Covers Net::OpenTimeout and Net::ReadTimeout
  puts "Timeout Error: #{e.message}"
end
Performance Optimization Tips
Connection Reuse with net/http
# Efficient scraping of multiple pages from the same domain
class WebScraper
  def initialize(base_url)
    @uri = URI(base_url)
    @http = Net::HTTP.new(@uri.host, @uri.port)
    @http.use_ssl = true if @uri.scheme == 'https'
    @http.start
  end

  def scrape_page(path)
    request = Net::HTTP::Get.new(path)
    request['User-Agent'] = 'Ruby Web Scraper'
    response = @http.request(request)
    response.body if response.is_a?(Net::HTTPSuccess)
  end

  def close
    @http.finish if @http.started?
  end
end
# Usage
scraper = WebScraper.new('https://example.com')
pages = ['/page1', '/page2', '/page3']
contents = pages.map { |page| scraper.scrape_page(page) }
scraper.close
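In a longer-running script you may want the connection closed even when a request raises; a minimal sketch wrapping the same hypothetical WebScraper in begin/ensure:
scraper = WebScraper.new('https://example.com')
begin
  contents = ['/page1', '/page2'].map { |page| scraper.scrape_page(page) }
ensure
  scraper.close   # runs even if scrape_page raises
end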
Integration with Popular Parsing Libraries
Both libraries work seamlessly with HTML parsing libraries like Nokogiri:
require 'nokogiri'
# With open-uri
html = URI.open('https://example.com').read
doc = Nokogiri::HTML(html)
titles = doc.css('h1, h2, h3').map(&:text)
# With net/http
response = Net::HTTP.get_response(URI('https://example.com'))
doc = Nokogiri::HTML(response.body)
titles = doc.css('h1, h2, h3').map(&:text)
For complex web scraping scenarios that require JavaScript execution, neither library is enough on its own; you'll need a browser automation tool. While Ruby has no direct equivalent of Puppeteer, you can use headless browser libraries like Watir or integrate with services that provide similar capabilities.
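As a rough illustration, a sketch using the watir gem (not part of the standard library; it requires a browser and a matching driver to be installed, and the URL is a placeholder):
require 'watir'   # gem install watir
browser = Watir::Browser.new(:chrome, headless: true)
browser.goto('https://example.com')   # placeholder URL
html = browser.html                   # DOM serialized after JavaScript has run
browser.close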
Security Considerations
When scraping websites, always consider security implications:
# Validate SSL certificates (default behavior)
uri = URI('https://example.com')
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_PEER
# For development only - disable SSL verification
# http.verify_mode = OpenSSL::SSL::VERIFY_NONE
Conclusion
The choice between open-uri and net/http depends on your specific requirements:
- Use open-uri for simple, one-off GET requests where ease of use is paramount
- Use net/http for production applications requiring POST requests, session management, connection reuse, or fine-grained control over HTTP behavior
For most web scraping projects that involve form submissions, authentication, or high-performance requirements, net/http is the better choice despite its additional complexity. However, open-uri remains perfect for quick scripts and simple data fetching tasks.
Both libraries are part of Ruby's standard library, so you don't need additional dependencies. Consider combining them in larger applications: use open-uri for simple tasks and net/http for complex operations requiring more control.