How do I scrape data from websites using proxy servers with Ruby?

Using proxy servers is essential for web scraping when you need to avoid IP blocking, access geo-restricted content, or distribute requests across multiple IP addresses. Ruby provides several libraries and techniques to implement proxy support in your web scraping projects.

Why Use Proxy Servers for Web Scraping?

Proxy servers act as intermediaries between your scraper and the target website, offering several benefits:

  • IP rotation: Distribute requests across multiple IP addresses to avoid rate limiting
  • Geographic diversity: Access content restricted to specific regions
  • Anonymity: Hide your real IP address from target websites
  • Load distribution: Prevent overwhelming a single IP with too many requests
  • Bypass blocking: Circumvent IP-based blocking mechanisms

Basic Proxy Setup with Net::HTTP

Ruby's built-in Net::HTTP library provides native proxy support:

require 'net/http'
require 'uri'

# Define proxy configuration
proxy_host = 'proxy.example.com'
proxy_port = 8080
proxy_user = 'username'  # Optional
proxy_pass = 'password'  # Optional

# Target URL
target_url = 'https://httpbin.org/ip'
uri = URI(target_url)

# Create HTTP connection through proxy
http = Net::HTTP.new(uri.host, uri.port, proxy_host, proxy_port, proxy_user, proxy_pass)
http.use_ssl = true if uri.scheme == 'https'

# Make request
request = Net::HTTP::Get.new(uri)
response = http.request(request)

puts response.body
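
Net::HTTP can also pick the proxy up from the standard environment variables, which is convenient when the proxy is configured outside the script. A minimal sketch, assuming a placeholder proxy URL in http_proxy:

require 'net/http'
require 'uri'

# Placeholder proxy; Net::HTTP reads http_proxy from the environment
# because :ENV is the default value of the proxy argument to Net::HTTP.new
ENV['http_proxy'] = 'http://username:password@proxy.example.com:8080'

uri = URI('https://httpbin.org/ip')
http = Net::HTTP.new(uri.host, uri.port)  # proxy settings come from ENV
http.use_ssl = true

puts http.proxy?  # => true when a proxy was detected
puts http.get(uri.request_uri).body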

Using HTTParty with Proxies

HTTParty is a popular Ruby gem that simplifies HTTP requests and provides clean proxy support:

require 'httparty'

class ProxyClient
  include HTTParty

  # Set default options including proxy
  default_options.update(
    http_proxyaddr: 'proxy.example.com',
    http_proxyport: 8080,
    http_proxyuser: 'username',
    http_proxypass: 'password'
  )
end

# Make request through proxy
response = ProxyClient.get('https://httpbin.org/ip')
puts response.body

For dynamic proxy configuration:

require 'httparty'

def scrape_with_proxy(url, proxy_config)
  options = {
    http_proxyaddr: proxy_config[:host],
    http_proxyport: proxy_config[:port]
  }

  # Add authentication if provided
  if proxy_config[:username]
    options[:http_proxyuser] = proxy_config[:username]
    options[:http_proxypass] = proxy_config[:password]
  end

  HTTParty.get(url, options)
end

# Usage
proxy = {
  host: 'proxy.example.com',
  port: 8080,
  username: 'user',
  password: 'pass'
}

response = scrape_with_proxy('https://httpbin.org/ip', proxy)
puts response.body

Advanced Proxy Management with Faraday

Faraday offers more sophisticated HTTP client capabilities with excellent proxy support:

require 'faraday'

# Create connection with proxy
conn = Faraday.new do |f|
  f.proxy = {
    uri: 'http://username:password@proxy.example.com:8080'
  }
  f.adapter Faraday.default_adapter
end

# Make request
response = conn.get('https://httpbin.org/ip')
puts response.body
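
Faraday also accepts the proxy as a constructor option, which keeps the block free for middleware. A minimal sketch using the same placeholder proxy:

require 'faraday'

# Placeholder proxy passed directly to the constructor
conn = Faraday.new(
  url: 'https://httpbin.org',
  proxy: 'http://username:password@proxy.example.com:8080'
)

response = conn.get('/ip')
puts response.body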

For SOCKS proxy support, Faraday needs a SOCKS-capable adapter; the faraday-net_http_socks gem provides one:

require 'faraday'
require 'socksify/http'           # SOCKS support for Net::HTTP (socksify gem)
require 'faraday/net_http_socks'  # registers the :net_http_socks adapter

# Configure SOCKS proxy
conn = Faraday.new do |f|
  f.proxy = {
    uri: 'socks5://127.0.0.1:1080'
  }
  f.adapter :net_http_socks
end

response = conn.get('https://httpbin.org/ip')
puts response.body

Implementing Proxy Rotation

Rotating proxies helps distribute load and avoid detection:

require 'httparty'

class ProxyRotator
  def initialize(proxies)
    @proxies = proxies
    @current_index = 0
  end

  def next_proxy
    proxy = @proxies[@current_index]
    @current_index = (@current_index + 1) % @proxies.length
    proxy
  end

  def make_request(url, max_retries: 3)
    retries = 0

    while retries < max_retries
      proxy = next_proxy

      begin
        options = {
          http_proxyaddr: proxy[:host],
          http_proxyport: proxy[:port],
          timeout: 10
        }

        # Add authentication if available
        if proxy[:username]
          options[:http_proxyuser] = proxy[:username]
          options[:http_proxypass] = proxy[:password]
        end

        response = HTTParty.get(url, options)

        if response.success?
          return response
        else
          raise "HTTP Error: #{response.code}"
        end

      rescue => e
        puts "Request failed with proxy #{proxy[:host]}:#{proxy[:port]} - #{e.message}"
        retries += 1
        sleep(1) # Brief delay before retry
      end
    end

    raise "All proxy attempts failed for #{url}"
  end
end

# Define proxy pool
proxies = [
  { host: 'proxy1.example.com', port: 8080, username: 'user1', password: 'pass1' },
  { host: 'proxy2.example.com', port: 8080, username: 'user2', password: 'pass2' },
  { host: 'proxy3.example.com', port: 8080, username: 'user3', password: 'pass3' }
]

# Usage
rotator = ProxyRotator.new(proxies)
response = rotator.make_request('https://httpbin.org/ip')
puts response.body
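
If several threads share one rotator, the index update in next_proxy becomes a race condition. A minimal thread-safe sketch using a Mutex, assuming the same proxy hash format as above:

class ThreadSafeProxyRotator
  def initialize(proxies)
    @proxies = proxies
    @current_index = 0
    @mutex = Mutex.new
  end

  # Advance the index under a lock so concurrent callers never
  # read and write it at the same time
  def next_proxy
    @mutex.synchronize do
      proxy = @proxies[@current_index]
      @current_index = (@current_index + 1) % @proxies.length
      proxy
    end
  end
end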

Handling Proxy Authentication

Different proxy types require various authentication methods:

Basic HTTP Authentication

require 'httparty'

def request_with_basic_auth(url, proxy_host, proxy_port, username, password)
  options = {
    http_proxyaddr: proxy_host,
    http_proxyport: proxy_port,
    http_proxyuser: username,
    http_proxypass: password,
    headers: {
      'User-Agent' => 'Mozilla/5.0 (compatible; Ruby scraper)'
    }
  }

  HTTParty.get(url, options)
end

Custom Headers for Proxy Authentication

require 'faraday'

def request_with_custom_auth(url, proxy_uri, auth_token)
  conn = Faraday.new do |f|
    f.proxy = proxy_uri
    f.adapter Faraday.default_adapter
  end

  # Add custom authentication header
  auth_header = "Bearer #{auth_token}"

  conn.get(url) do |req|
    req.headers['Proxy-Authorization'] = auth_header
    req.headers['User-Agent'] = 'Ruby/Faraday Scraper'
  end
end
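
Many proxies expect the standard Basic scheme rather than a bearer token; the header value is simply "Basic " followed by the Base64 encoding of "username:password". A minimal sketch:

require 'base64'

# Build a Basic Proxy-Authorization header value from plain credentials
def basic_proxy_auth_header(username, password)
  'Basic ' + Base64.strict_encode64("#{username}:#{password}")
end

# Usage inside a request block:
# req.headers['Proxy-Authorization'] = basic_proxy_auth_header('user', 'pass')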

Error Handling and Proxy Validation

Implement robust error handling for proxy-related issues:

require 'httparty'
require 'timeout'

class RobustProxyScraper
  PROXY_ERRORS = [
    Timeout::Error,       # parent of Net::OpenTimeout and Net::ReadTimeout
    Errno::ECONNREFUSED,
    Errno::ECONNRESET,
    HTTParty::Error,
    SocketError
  ].freeze

  def initialize(proxies, timeout: 30)
    @proxies = proxies
    @timeout = timeout
    @working_proxies = []
    validate_proxies
  end

  def validate_proxies
    @proxies.each do |proxy|
      if proxy_working?(proxy)
        @working_proxies << proxy
        puts "✓ Proxy #{proxy[:host]}:#{proxy[:port]} is working"
      else
        puts "✗ Proxy #{proxy[:host]}:#{proxy[:port]} failed validation"
      end
    end

    raise "No working proxies found" if @working_proxies.empty?
  end

  def proxy_working?(proxy)
    begin
      Timeout::timeout(@timeout) do
        options = build_proxy_options(proxy)
        response = HTTParty.get('https://httpbin.org/ip', options)
        response.success?
      end
    rescue *PROXY_ERRORS => e
      puts "Proxy validation error: #{e.message}"
      false
    end
  end

  def scrape(url)
    @working_proxies.each do |proxy|
      begin
        Timeout::timeout(@timeout) do
          options = build_proxy_options(proxy)
          response = HTTParty.get(url, options)

          if response.success?
            return response
          end
        end
      rescue *PROXY_ERRORS => e
        puts "Request failed with proxy #{proxy[:host]}:#{proxy[:port]} - #{e.message}"
        next
      end
    end

    raise "All proxies failed for #{url}"
  end

  private

  def build_proxy_options(proxy)
    options = {
      http_proxyaddr: proxy[:host],
      http_proxyport: proxy[:port],
      timeout: @timeout,
      headers: {
        'User-Agent' => 'Mozilla/5.0 (compatible; Ruby scraper)'
      }
    }

    if proxy[:username]
      options[:http_proxyuser] = proxy[:username]
      options[:http_proxypass] = proxy[:password]
    end

    options
  end
end
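
A short usage sketch for the class above; the proxy hosts are placeholders, and validation runs in the constructor, raising if none of the proxies work:

proxies = [
  { host: 'proxy1.example.com', port: 8080, username: 'user1', password: 'pass1' },
  { host: 'proxy2.example.com', port: 8080 }  # proxy without authentication
]

scraper = RobustProxyScraper.new(proxies, timeout: 15)
response = scraper.scrape('https://httpbin.org/ip')
puts response.body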

Testing Proxy Configuration

Always test your proxy setup before running large scraping operations:

require 'httparty'

def test_proxy(proxy_config)
  puts "Testing proxy: #{proxy_config[:host]}:#{proxy_config[:port]}"

  # Test basic connectivity
  test_urls = [
    'https://httpbin.org/ip',           # Shows your IP
    'https://httpbin.org/user-agent',   # Shows your user agent
    'https://httpbin.org/headers'       # Shows all headers
  ]

  test_urls.each do |url|
    begin
      options = {
        http_proxyaddr: proxy_config[:host],
        http_proxyport: proxy_config[:port],
        timeout: 10
      }

      if proxy_config[:username]
        options[:http_proxyuser] = proxy_config[:username]
        options[:http_proxypass] = proxy_config[:password]
      end

      response = HTTParty.get(url, options)

      if response.success?
        puts "✓ #{url} - Success"
        puts "Response: #{response.body[0..200]}..."
      else
        puts "✗ #{url} - HTTP #{response.code}"
      end

    rescue => e
      puts "✗ #{url} - Error: #{e.message}"
    end

    sleep(1) # Be respectful to test endpoints
  end
end

# Test your proxy
proxy = {
  host: 'your-proxy.com',
  port: 8080,
  username: 'your-username',
  password: 'your-password'
}

test_proxy(proxy)

Best Practices for Proxy-Based Scraping

1. Respect Rate Limits

require 'httparty'

class RateLimitedScraper
  def initialize(requests_per_minute: 60)
    @requests_per_minute = requests_per_minute
    @last_request_time = Time.now
  end

  def make_request(url, options = {})
    enforce_rate_limit
    HTTParty.get(url, options)
  end

  private

  def enforce_rate_limit
    time_since_last = Time.now - @last_request_time
    min_interval = 60.0 / @requests_per_minute

    if time_since_last < min_interval
      sleep(min_interval - time_since_last)
    end

    @last_request_time = Time.now
  end
end
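
A brief usage sketch combining the rate limiter with the HTTParty proxy options used earlier (the proxy address is a placeholder):

scraper = RateLimitedScraper.new(requests_per_minute: 30)

proxy_options = {
  http_proxyaddr: 'proxy.example.com',
  http_proxyport: 8080
}

3.times do
  response = scraper.make_request('https://httpbin.org/ip', proxy_options)
  puts response.code
end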

2. Monitor Proxy Health

class ProxyHealthMonitor
  def initialize(proxies)
    @proxies = proxies
    @proxy_stats = {}
    initialize_stats
  end

  def record_result(proxy, success)
    key = "#{proxy[:host]}:#{proxy[:port]}"
    @proxy_stats[key][:total] += 1
    @proxy_stats[key][:success] += 1 if success
    @proxy_stats[key][:last_used] = Time.now
  end

  def get_best_proxy
    @proxies.max_by do |proxy|
      key = "#{proxy[:host]}:#{proxy[:port]}"
      stats = @proxy_stats[key]
      success_rate = stats[:success].to_f / [stats[:total], 1].max
      success_rate
    end
  end

  def print_stats
    @proxy_stats.each do |key, stats|
      success_rate = (stats[:success].to_f / [stats[:total], 1].max * 100).round(2)
      puts "#{key}: #{success_rate}% success (#{stats[:success]}/#{stats[:total]})"
    end
  end

  private

  def initialize_stats
    @proxies.each do |proxy|
      key = "#{proxy[:host]}:#{proxy[:port]}"
      @proxy_stats[key] = { total: 0, success: 0, last_used: nil }
    end
  end
end
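
A short usage sketch for the monitor; it assumes the proxies array and HTTParty proxy options from the earlier examples, recording one result per request:

require 'httparty'

monitor = ProxyHealthMonitor.new(proxies)

# Pick the proxy with the best success rate, make a request, record the outcome
proxy = monitor.get_best_proxy
begin
  response = HTTParty.get('https://httpbin.org/ip',
                          http_proxyaddr: proxy[:host],
                          http_proxyport: proxy[:port],
                          timeout: 10)
  monitor.record_result(proxy, response.success?)
rescue StandardError
  monitor.record_result(proxy, false)
end

monitor.print_stats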

Integration with Popular Scraping Libraries

When working with more complex scraping scenarios, you might need to integrate proxies with browser automation tools. While this article focuses on Ruby's HTTP libraries, understanding how to handle browser sessions in Puppeteer can be valuable for JavaScript-heavy sites that require proxy support through headless browsers.

For comprehensive web scraping solutions that handle proxy management automatically, consider using dedicated scraping APIs that rotate proxies and handle anti-bot measures transparently.

Conclusion

Using proxy servers with Ruby for web scraping requires careful consideration of authentication, error handling, and rotation strategies. The examples provided demonstrate various approaches from basic proxy usage with Net::HTTP to sophisticated proxy management systems with health monitoring and automatic failover.

Remember to always respect website terms of service, implement appropriate rate limiting, and consider the legal implications of your scraping activities. Proper proxy usage not only helps avoid technical blocks but also demonstrates responsible scraping practices.

Start with simple proxy configurations and gradually implement more advanced features like rotation and health monitoring as your scraping requirements grow. This approach ensures reliable data collection while maintaining good relationships with target websites and proxy providers.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
