How do I scrape data from websites using proxy servers with Ruby?
Using proxy servers is essential for web scraping when you need to avoid IP blocking, access geo-restricted content, or distribute requests across multiple IP addresses. Ruby provides several libraries and techniques to implement proxy support in your web scraping projects.
Why Use Proxy Servers for Web Scraping?
Proxy servers act as intermediaries between your scraper and the target website, offering several benefits:
- IP rotation: Distribute requests across multiple IP addresses to avoid rate limiting
- Geographic diversity: Access content restricted to specific regions
- Anonymity: Hide your real IP address from target websites
- Load distribution: Prevent overwhelming a single IP with too many requests
- Bypass blocking: Circumvent IP-based blocking mechanisms
Basic Proxy Setup with Net::HTTP
Ruby's built-in Net::HTTP library provides native proxy support:
require 'net/http'
require 'uri'
# Define proxy configuration
proxy_host = 'proxy.example.com'
proxy_port = 8080
proxy_user = 'username' # Optional
proxy_pass = 'password' # Optional
# Target URL
target_url = 'https://httpbin.org/ip'
uri = URI(target_url)
# Create HTTP connection through proxy
http = Net::HTTP.new(uri.host, uri.port, proxy_host, proxy_port, proxy_user, proxy_pass)
http.use_ssl = true if uri.scheme == 'https'
# Make request
request = Net::HTTP::Get.new(uri)
response = http.request(request)
puts response.body
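If the proxy is configured at the system level, Net::HTTP can also pick it up from the http_proxy environment variable; a minimal sketch, assuming the variable is exported in your shell:
require 'net/http'
require 'uri'

# Assumes the proxy is set in the environment, for example:
#   export http_proxy=http://proxy.example.com:8080
uri = URI('https://httpbin.org/ip')

# With no explicit proxy arguments, Net::HTTP reads the proxy settings
# from the http_proxy environment variable (the proxy address defaults to :ENV)
response = Net::HTTP.get_response(uri)
puts response.body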
Using HTTParty with Proxies
HTTParty is a popular Ruby gem that simplifies HTTP requests and provides clean proxy support:
require 'httparty'
class ProxyClient
include HTTParty
# Set default options including proxy
default_options.update(
http_proxyaddr: 'proxy.example.com',
http_proxyport: 8080,
http_proxyuser: 'username',
http_proxypass: 'password'
)
end
# Make request through proxy
response = ProxyClient.get('https://httpbin.org/ip')
puts response.body
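HTTParty also exposes an http_proxy class helper that sets the same options in a single call, so the class above can be written more concisely:
require 'httparty'

class ProxyClient
  include HTTParty

  # Equivalent to updating default_options: address, port, then optional credentials
  http_proxy 'proxy.example.com', 8080, 'username', 'password'
end

response = ProxyClient.get('https://httpbin.org/ip')
puts response.body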
For dynamic proxy configuration:
require 'httparty'
def scrape_with_proxy(url, proxy_config)
options = {
http_proxyaddr: proxy_config[:host],
http_proxyport: proxy_config[:port]
}
# Add authentication if provided
if proxy_config[:username]
options[:http_proxyuser] = proxy_config[:username]
options[:http_proxypass] = proxy_config[:password]
end
HTTParty.get(url, options)
end
# Usage
proxy = {
host: 'proxy.example.com',
port: 8080,
username: 'user',
password: 'pass'
}
response = scrape_with_proxy('https://httpbin.org/ip', proxy)
puts response.body
Advanced Proxy Management with Faraday
Faraday is a more flexible HTTP client that supports middleware and offers straightforward proxy configuration:
require 'faraday'
# Create connection with proxy
conn = Faraday.new do |f|
f.proxy = {
uri: 'http://username:password@proxy.example.com:8080'
}
f.adapter Faraday.default_adapter
end
# Make request
response = conn.get('https://httpbin.org/ip')
puts response.body
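The proxy can also be passed straight to the constructor as a connection option, which reads a bit more compactly:
require 'faraday'

# Proxy credentials and base URL are set once on the connection
conn = Faraday.new('https://httpbin.org',
                   proxy: 'http://username:password@proxy.example.com:8080')

response = conn.get('/ip')
puts response.body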
Faraday has no built-in SOCKS support, but with the socksify gem and a SOCKS-capable adapter you can route requests through a SOCKS5 proxy:
require 'faraday'
require 'socksify/http'
# Configure SOCKS proxy
conn = Faraday.new do |f|
f.proxy = {
uri: 'socks5://127.0.0.1:1080'
}
f.adapter :net_http_socks
end
response = conn.get('https://httpbin.org/ip')
puts response.body
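If Faraday is not a hard requirement, the socksify gem can also tunnel a plain Net::HTTP connection through SOCKS via its Net::HTTP.SOCKSProxy helper; a minimal sketch, assuming a SOCKS5 proxy is listening locally on port 1080:
require 'net/http'
require 'socksify/http'
require 'uri'

uri = URI('https://httpbin.org/ip')

# Tunnel the connection through the local SOCKS5 proxy at 127.0.0.1:1080
Net::HTTP.SOCKSProxy('127.0.0.1', 1080).start(uri.host, uri.port, use_ssl: true) do |http|
  response = http.get(uri.request_uri)
  puts response.body
end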
Implementing Proxy Rotation
Rotating proxies helps distribute load and avoid detection:
require 'httparty'
class ProxyRotator
def initialize(proxies)
@proxies = proxies
@current_index = 0
end
def next_proxy
proxy = @proxies[@current_index]
@current_index = (@current_index + 1) % @proxies.length
proxy
end
def make_request(url, max_retries: 3)
retries = 0
while retries < max_retries
proxy = next_proxy
begin
options = {
http_proxyaddr: proxy[:host],
http_proxyport: proxy[:port],
timeout: 10
}
# Add authentication if available
if proxy[:username]
options[:http_proxyuser] = proxy[:username]
options[:http_proxypass] = proxy[:password]
end
response = HTTParty.get(url, options)
if response.success?
return response
else
raise "HTTP Error: #{response.code}"
end
rescue => e
puts "Request failed with proxy #{proxy[:host]}:#{proxy[:port]} - #{e.message}"
retries += 1
sleep(1) # Brief delay before retry
end
end
raise "All proxy attempts failed for #{url}"
end
end
# Define proxy pool
proxies = [
{ host: 'proxy1.example.com', port: 8080, username: 'user1', password: 'pass1' },
{ host: 'proxy2.example.com', port: 8080, username: 'user2', password: 'pass2' },
{ host: 'proxy3.example.com', port: 8080, username: 'user3', password: 'pass3' }
]
# Usage
rotator = ProxyRotator.new(proxies)
response = rotator.make_request('https://httpbin.org/ip')
puts response.body
Handling Proxy Authentication
Different proxies and providers expect different authentication methods:
Basic HTTP Authentication
require 'httparty'
def request_with_basic_auth(url, proxy_host, proxy_port, username, password)
options = {
http_proxyaddr: proxy_host,
http_proxyport: proxy_port,
http_proxyuser: username,
http_proxypass: password,
headers: {
'User-Agent' => 'Mozilla/5.0 (compatible; Ruby scraper)'
}
}
HTTParty.get(url, options)
end
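Calling the helper with placeholder credentials:
response = request_with_basic_auth(
  'https://httpbin.org/ip',
  'proxy.example.com', 8080,
  'username', 'password'
)
puts response.body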
Custom Headers for Proxy Authentication
require 'faraday'
require 'base64'
def request_with_custom_auth(url, proxy_uri, auth_token)
conn = Faraday.new do |f|
f.proxy = proxy_uri
f.adapter Faraday.default_adapter
end
# Add custom authentication header
auth_header = "Bearer #{auth_token}"
conn.get(url) do |req|
req.headers['Proxy-Authorization'] = auth_header
req.headers['User-Agent'] = 'Ruby/Faraday Scraper'
end
end
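A quick call to the helper above (whether a provider accepts a bearer token in Proxy-Authorization depends entirely on that provider):
# The proxy URI and token below are placeholders
response = request_with_custom_auth(
  'https://httpbin.org/ip',
  'http://proxy.example.com:8080',
  'your-token-here'
)
puts response.body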
Error Handling and Proxy Validation
Implement robust error handling for proxy-related issues:
require 'httparty'
require 'timeout'
class RobustProxyScraper
PROXY_ERRORS = [
Net::OpenTimeout,
Net::ReadTimeout,
Timeout::Error,
Errno::ECONNREFUSED,
Errno::ECONNRESET,
HTTParty::Error,
SocketError
].freeze
def initialize(proxies, timeout: 30)
@proxies = proxies
@timeout = timeout
@working_proxies = []
validate_proxies
end
def validate_proxies
@proxies.each do |proxy|
if proxy_working?(proxy)
@working_proxies << proxy
puts "✓ Proxy #{proxy[:host]}:#{proxy[:port]} is working"
else
puts "✗ Proxy #{proxy[:host]}:#{proxy[:port]} failed validation"
end
end
raise "No working proxies found" if @working_proxies.empty?
end
def proxy_working?(proxy)
begin
Timeout::timeout(@timeout) do
options = build_proxy_options(proxy)
response = HTTParty.get('https://httpbin.org/ip', options)
response.success?
end
rescue *PROXY_ERRORS => e
puts "Proxy validation error: #{e.message}"
false
end
end
def scrape(url)
@working_proxies.each do |proxy|
begin
Timeout::timeout(@timeout) do
options = build_proxy_options(proxy)
response = HTTParty.get(url, options)
if response.success?
return response
end
end
rescue *PROXY_ERRORS => e
puts "Request failed with proxy #{proxy[:host]}:#{proxy[:port]} - #{e.message}"
next
end
end
raise "All proxies failed for #{url}"
end
private
def build_proxy_options(proxy)
options = {
http_proxyaddr: proxy[:host],
http_proxyport: proxy[:port],
timeout: @timeout,
headers: {
'User-Agent' => 'Mozilla/5.0 (compatible; Ruby scraper)'
}
}
if proxy[:username]
options[:http_proxyuser] = proxy[:username]
options[:http_proxypass] = proxy[:password]
end
options
end
end
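A short usage sketch for the class above, reusing the proxy hash format from the earlier examples:
# Same proxy hash format used throughout this article
proxies = [
  { host: 'proxy1.example.com', port: 8080, username: 'user1', password: 'pass1' },
  { host: 'proxy2.example.com', port: 8080 }
]

scraper = RobustProxyScraper.new(proxies, timeout: 15)
response = scraper.scrape('https://httpbin.org/ip')
puts response.body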
Testing Proxy Configuration
Always test your proxy setup before running large scraping operations:
require 'httparty'
def test_proxy(proxy_config)
puts "Testing proxy: #{proxy_config[:host]}:#{proxy_config[:port]}"
# Test basic connectivity
test_urls = [
'https://httpbin.org/ip', # Shows your IP
'https://httpbin.org/user-agent', # Shows your user agent
'https://httpbin.org/headers' # Shows all headers
]
test_urls.each do |url|
begin
options = {
http_proxyaddr: proxy_config[:host],
http_proxyport: proxy_config[:port],
timeout: 10
}
if proxy_config[:username]
options[:http_proxyuser] = proxy_config[:username]
options[:http_proxypass] = proxy_config[:password]
end
response = HTTParty.get(url, options)
if response.success?
puts "✓ #{url} - Success"
puts "Response: #{response.body[0..200]}..."
else
puts "✗ #{url} - HTTP #{response.code}"
end
rescue => e
puts "✗ #{url} - Error: #{e.message}"
end
sleep(1) # Be respectful to test endpoints
end
end
# Test your proxy
proxy = {
host: 'your-proxy.com',
port: 8080,
username: 'your-username',
password: 'your-password'
}
test_proxy(proxy)
Best Practices for Proxy-Based Scraping
1. Respect Rate Limits
require 'httparty'
class RateLimitedScraper
def initialize(requests_per_minute: 60)
@requests_per_minute = requests_per_minute
@last_request_time = Time.now
end
def make_request(url, options = {})
enforce_rate_limit
HTTParty.get(url, options)
end
private
def enforce_rate_limit
time_since_last = Time.now - @last_request_time
min_interval = 60.0 / @requests_per_minute
if time_since_last < min_interval
sleep(min_interval - time_since_last)
end
@last_request_time = Time.now
end
end
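The limiter composes naturally with the HTTParty proxy options used earlier (the host and port here are placeholders):
scraper = RateLimitedScraper.new(requests_per_minute: 30)

proxy_options = {
  http_proxyaddr: 'proxy.example.com',
  http_proxyport: 8080
}

# Each call sleeps just long enough to stay under 30 requests per minute
3.times do
  response = scraper.make_request('https://httpbin.org/ip', proxy_options)
  puts response.code
end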
2. Monitor Proxy Health
class ProxyHealthMonitor
def initialize(proxies)
@proxies = proxies
@proxy_stats = {}
initialize_stats
end
def record_result(proxy, success)
key = "#{proxy[:host]}:#{proxy[:port]}"
@proxy_stats[key][:total] += 1
@proxy_stats[key][:success] += 1 if success
@proxy_stats[key][:last_used] = Time.now
end
def get_best_proxy
@proxies.max_by do |proxy|
key = "#{proxy[:host]}:#{proxy[:port]}"
stats = @proxy_stats[key]
success_rate = stats[:success].to_f / [stats[:total], 1].max
success_rate
end
end
def print_stats
@proxy_stats.each do |key, stats|
success_rate = (stats[:success].to_f / [stats[:total], 1].max * 100).round(2)
puts "#{key}: #{success_rate}% success (#{stats[:success]}/#{stats[:total]})"
end
end
private
def initialize_stats
@proxies.each do |proxy|
key = "#{proxy[:host]}:#{proxy[:port]}"
@proxy_stats[key] = { total: 0, success: 0, last_used: nil }
end
end
end
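For example:
proxies = [
  { host: 'proxy1.example.com', port: 8080 },
  { host: 'proxy2.example.com', port: 8080 }
]

monitor = ProxyHealthMonitor.new(proxies)

# Record outcomes as requests complete, then pick the proxy with the best success rate
monitor.record_result(proxies[0], true)
monitor.record_result(proxies[1], false)

best = monitor.get_best_proxy
puts "Best proxy: #{best[:host]}:#{best[:port]}"
monitor.print_stats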
Integration with Popular Scraping Libraries
When working with more complex scraping scenarios, you might need to integrate proxies with browser automation tools. While this article focuses on Ruby's HTTP libraries, understanding how to handle browser sessions in Puppeteer can be valuable for JavaScript-heavy sites that require proxy support through headless browsers.
For comprehensive web scraping solutions that handle proxy management automatically, consider using dedicated scraping APIs that rotate proxies and handle anti-bot measures transparently.
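As a rough illustration only (the endpoint, parameter names, and environment variable below are hypothetical placeholders, not any particular provider's API), such services typically accept the target URL and an API key and return the fetched page:
require 'httparty'

# Hypothetical scraping API; check your provider's documentation for real parameters
API_ENDPOINT = 'https://api.scraping-service.example/v1/scrape'

response = HTTParty.get(API_ENDPOINT, query: {
  api_key: ENV['SCRAPING_API_KEY'],
  url: 'https://example.com',
  country: 'us' # geo-targeting handled by the provider's proxy pool
})

puts response.body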
Conclusion
Using proxy servers with Ruby for web scraping requires careful consideration of authentication, error handling, and rotation strategies. The examples provided demonstrate various approaches from basic proxy usage with Net::HTTP to sophisticated proxy management systems with health monitoring and automatic failover.
Remember to always respect website terms of service, implement appropriate rate limiting, and consider the legal implications of your scraping activities. Proper proxy usage not only helps avoid technical blocks but also demonstrates responsible scraping practices.
Start with simple proxy configurations and gradually implement more advanced features like rotation and health monitoring as your scraping requirements grow. This approach ensures reliable data collection while maintaining good relationships with target websites and proxy providers.