What Proxy Configuration Options Are Available in Mechanize?
Mechanize provides solid proxy support for web scraping, letting you route requests through proxy servers for anonymity, to bypass geographic restrictions, or to spread load. This guide covers the proxy configuration options available in the Ruby Mechanize library.
Basic HTTP Proxy Configuration
The most straightforward way to configure a proxy in Mechanize is with the set_proxy method:
require 'mechanize'
agent = Mechanize.new
agent.set_proxy('proxy.example.com', 8080)
# Now all requests will go through the proxy
page = agent.get('https://httpbin.org/ip')
puts page.body
You can also configure the proxy during initialization:
agent = Mechanize.new do |a|
a.set_proxy('proxy.example.com', 8080)
end
HTTPS Proxy Configuration
For proxies reached over HTTPS, you can specify the scheme explicitly; Mechanize 2.x also accepts a full proxy URI:
agent = Mechanize.new
agent.set_proxy('https://secure-proxy.example.com', 8080)
# Or using the full proxy URL
agent.set_proxy('https://secure-proxy.example.com:8080')
Proxy Authentication
Many proxy servers require authentication. Mechanize supports username/password (HTTP Basic) proxy credentials via set_proxy:
require 'mechanize'
agent = Mechanize.new
# Basic authentication with username and password
agent.set_proxy('proxy.example.com', 8080, 'username', 'password')
# Note: the individual proxy_addr/proxy_port/proxy_user/proxy_pass
# accessors existed only in legacy Mechanize 1.x; in Mechanize 2.x,
# set_proxy is the supported way to change these settings.
SOCKS Proxy Support
Mechanize has no built-in SOCKS support, but it can work with SOCKS proxies via the socksify gem, which monkey-patches TCPSocket so that every outgoing TCP connection (including Mechanize's) is routed through the SOCKS server:
require 'mechanize'
require 'socksify'
# Configure the SOCKS proxy; note these settings are process-wide
Socksify::debug = true # log the SOCKS handshake for debugging
TCPSocket::socks_server = "127.0.0.1"
TCPSocket::socks_port = 1080
agent = Mechanize.new
page = agent.get('https://httpbin.org/ip')
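Because socksify patches TCPSocket globally, the settings above affect every TCP connection in the process. If you only want SOCKS routing for a specific piece of work, the gem also offers a block-scoped helper, Socksify.proxy; a minimal sketch, assuming a local SOCKS server on port 1080:

require 'mechanize'
require 'socksify'

agent = Mechanize.new

# Only connections opened inside the block go through the SOCKS server;
# the previous (direct) settings are restored afterwards.
Socksify.proxy('127.0.0.1', 1080) do
  page = agent.get('https://httpbin.org/ip')
  puts page.body
end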
Advanced Proxy Configuration
Switching Proxies Between Requests
You can point the same agent at a different proxy for each request:
agent = Mechanize.new
# Use proxy for specific request
agent.set_proxy('proxy1.example.com', 8080)
page1 = agent.get('https://example.com')
# Change proxy for next request
agent.set_proxy('proxy2.example.com', 8080)
page2 = agent.get('https://another-site.com')
# Disable proxy
agent.set_proxy(nil)
page3 = agent.get('https://direct-connection.com')
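Mutating one shared agent works for sequential scripts, but it is race-prone if requests run concurrently. A common alternative is to keep one pre-configured agent per proxy; a minimal sketch with placeholder proxy hosts:

require 'mechanize'

# One agent per proxy avoids mutating shared state between requests.
agents = {
  'proxy1' => Mechanize.new { |a| a.set_proxy('proxy1.example.com', 8080) },
  'proxy2' => Mechanize.new { |a| a.set_proxy('proxy2.example.com', 8080) }
}

page = agents['proxy1'].get('https://httpbin.org/ip')
puts page.body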
Proxy with Custom Headers
You can combine proxy usage with custom headers for better anonymity:
agent = Mechanize.new
agent.set_proxy('proxy.example.com', 8080, 'user', 'pass')
# Set custom headers
agent.user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
agent.request_headers = {
'Accept-Language' => 'en-US,en;q=0.9',
'Accept-Encoding' => 'gzip, deflate' # omit 'br': Mechanize cannot decode Brotli
}
page = agent.get('https://target-site.com')
Proxy Rotation Strategy
For large-scale scraping, implementing proxy rotation is crucial:
class ProxyRotator
def initialize(proxies)
@proxies = proxies
@current_index = 0
end
def next_proxy
proxy = @proxies[@current_index]
@current_index = (@current_index + 1) % @proxies.size
proxy
end
end
# Define proxy list
proxies = [
{ host: 'proxy1.example.com', port: 8080, user: 'user1', pass: 'pass1' },
{ host: 'proxy2.example.com', port: 8080, user: 'user2', pass: 'pass2' },
{ host: 'proxy3.example.com', port: 8080, user: 'user3', pass: 'pass3' }
]
rotator = ProxyRotator.new(proxies)
agent = Mechanize.new
urls = ['https://site1.com', 'https://site2.com', 'https://site3.com']
urls.each do |url|
proxy = rotator.next_proxy
agent.set_proxy(proxy[:host], proxy[:port], proxy[:user], proxy[:pass])
begin
page = agent.get(url)
puts "Successfully scraped #{url} using #{proxy[:host]}"
rescue => e
puts "Error with #{proxy[:host]}: #{e.message}"
end
end
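The loop above logs a failure and moves on; a natural extension is to retry the same URL through the next proxy a bounded number of times. A hedged sketch building on the ProxyRotator and agent defined above (fetch_with_rotation is a hypothetical helper):

def fetch_with_rotation(agent, rotator, url, max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    proxy = rotator.next_proxy
    agent.set_proxy(proxy[:host], proxy[:port], proxy[:user], proxy[:pass])
    agent.get(url)
  rescue StandardError => e
    retry if attempts < max_attempts
    raise e
  end
end

page = fetch_with_rotation(agent, rotator, 'https://httpbin.org/ip')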
Error Handling and Proxy Validation
Robust proxy handling includes error management and validation:
def test_proxy(host, port, user = nil, pass = nil)
agent = Mechanize.new
agent.read_timeout = 10
agent.open_timeout = 10
begin
agent.set_proxy(host, port, user, pass)
response = agent.get('https://httpbin.org/ip')
# By default Mechanize raises Mechanize::ResponseCodeError for non-2xx
# responses, so the else branch below is mostly defensive.
if response.code == '200'
puts "Proxy #{host}:#{port} is working"
return true
else
puts "Proxy #{host}:#{port} returned status #{response.code}"
return false
end
rescue => e
puts "Proxy #{host}:#{port} failed: #{e.message}"
return false
end
end
# Test multiple proxies
proxies = [
['proxy1.example.com', 8080],
['proxy2.example.com', 8080],
['proxy3.example.com', 8080]
]
working_proxies = proxies.select { |host, port| test_proxy(host, port) }
puts "Found #{working_proxies.size} working proxies"
Environment-Based Proxy Configuration
You can configure proxies based on environment variables for flexibility:
require 'mechanize'
agent = Mechanize.new
# Configure proxy from environment variables
if ENV['HTTP_PROXY']
proxy_uri = URI.parse(ENV['HTTP_PROXY'])
agent.set_proxy(
proxy_uri.host,
proxy_uri.port,
proxy_uri.user,
proxy_uri.password
)
end
Note that Mechanize does not read HTTP_PROXY automatically the way some HTTP clients do, so parsing the environment yourself, as above, is the reliable approach.
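In practice the conventional variables are scheme-specific (http_proxy/HTTP_PROXY and https_proxy/HTTPS_PROXY, in both casings). A hedged sketch that checks the common spellings and configures the agent from the first one found:

require 'mechanize'
require 'uri'

agent = Mechanize.new

# Prefer the HTTPS proxy variables, fall back to the HTTP ones.
raw = ENV['HTTPS_PROXY'] || ENV['https_proxy'] ||
      ENV['HTTP_PROXY']  || ENV['http_proxy']

if raw
  proxy_uri = URI.parse(raw)
  agent.set_proxy(proxy_uri.host, proxy_uri.port,
                  proxy_uri.user, proxy_uri.password)
end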
Proxy Configuration for Different Protocols
Handle different protocols with appropriate proxy settings:
class MultiProtocolScraper
def initialize
@agent = Mechanize.new
@agent.verify_mode = OpenSSL::SSL::VERIFY_NONE # For development only
end
def configure_proxy_for_protocol(url)
uri = URI.parse(url)
case uri.scheme
when 'https'
@agent.set_proxy('https-proxy.example.com', 8080)
when 'http'
@agent.set_proxy('http-proxy.example.com', 8080)
when 'ftp'
  # Mechanize itself only speaks HTTP(S); fetching FTP URLs would need
  # a different client, so there is no meaningful proxy to set here.
  nil
end
end
def scrape(url)
configure_proxy_for_protocol(url)
@agent.get(url)
end
end
scraper = MultiProtocolScraper.new
page = scraper.scrape('https://secure-site.com')
Best Practices for Proxy Usage
1. Connection Pooling with Proxies
class PooledProxyAgent
def initialize(proxies, pool_size = 5)
@proxies = proxies
@agents = Array.new(pool_size) do
agent = Mechanize.new
proxy = @proxies.sample
agent.set_proxy(proxy[:host], proxy[:port], proxy[:user], proxy[:pass])
agent
end
@current_agent = 0
end
def get(url)
agent = @agents[@current_agent]
@current_agent = (@current_agent + 1) % @agents.size
agent.get(url)
end
end
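Usage looks the same as a single agent; a brief example wiring the pool to the proxies list defined in the rotation section:

pool = PooledProxyAgent.new(proxies, 3)

# Requests round-robin across the pooled agents (and their proxies).
3.times do
  page = pool.get('https://httpbin.org/ip')
  puts page.body
end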
2. Proxy Health Monitoring
class ProxyMonitor
def initialize(proxies)
@proxies = proxies
@healthy_proxies = []
check_proxy_health
end
def check_proxy_health
@healthy_proxies = @proxies.select do |proxy|
test_proxy(proxy[:host], proxy[:port], proxy[:user], proxy[:pass])
end
end
def get_healthy_proxy
@healthy_proxies.sample
end
private
def test_proxy(host, port, user, pass)
  # Same logic as the standalone test_proxy example shown earlier
  agent = Mechanize.new
  agent.open_timeout = agent.read_timeout = 10
  agent.set_proxy(host, port, user, pass)
  agent.get('https://httpbin.org/ip').code == '200'
rescue StandardError
  false
end
end
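A sketch of how the monitor might plug into a scraping loop, re-checking proxy health periodically (the 100-request interval and the urls list are assumptions for illustration):

monitor = ProxyMonitor.new(proxies)
agent = Mechanize.new
count = 0

urls.each do |url|
  # Re-validate the proxy list every 100 requests.
  monitor.check_proxy_health if (count % 100).zero?

  proxy = monitor.get_healthy_proxy
  next unless proxy # skip while no proxy is healthy

  agent.set_proxy(proxy[:host], proxy[:port], proxy[:user], proxy[:pass])
  page = agent.get(url)
  count += 1
end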
In more complex scraping setups you may also need to handle authentication in browser automation tools or manage network requests across many concurrent connections as part of a broader scraping strategy.
Troubleshooting Common Proxy Issues
Connection Timeouts
agent = Mechanize.new
agent.open_timeout = 30 # Connection timeout
agent.read_timeout = 60 # Read timeout
agent.set_proxy('slow-proxy.example.com', 8080)
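Depending on your Ruby and Mechanize versions, timeouts typically surface as Net::OpenTimeout or Net::ReadTimeout; a small sketch that retries a bounded number of times before giving up:

attempts = 0
begin
  attempts += 1
  page = agent.get('https://httpbin.org/ip')
rescue Net::OpenTimeout, Net::ReadTimeout => e
  retry if attempts < 3
  puts "Giving up after #{attempts} timeouts: #{e.message}"
end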
SSL Certificate Issues
agent = Mechanize.new
agent.verify_mode = OpenSSL::SSL::VERIFY_NONE # Use cautiously
agent.cert_store = OpenSSL::X509::Store.new
agent.cert_store.set_default_paths
Proxy Authentication Failures
begin
agent.set_proxy('proxy.example.com', 8080, 'user', 'wrongpass')
page = agent.get('https://httpbin.org/ip')
rescue Mechanize::ResponseCodeError => e
  # A 407 means the proxy itself rejected the credentials; a 401 from
  # the target site raises the Mechanize::UnauthorizedError subclass.
  puts "Proxy authentication failed: #{e.message}" if e.response_code == '407'
  # Implement retry logic or switch proxy, as sketched below
end
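To make the "switch proxy" comment concrete, here is a hedged sketch that falls through an assumed fallback_proxies list until one set of credentials is accepted:

fallback_proxies = [
  { host: 'proxy.example.com',  port: 8080, user: 'user', pass: 'pass' },
  { host: 'backup.example.com', port: 8080, user: 'user', pass: 'pass' }
]

page = nil
fallback_proxies.each do |proxy|
  agent.set_proxy(proxy[:host], proxy[:port], proxy[:user], proxy[:pass])
  begin
    page = agent.get('https://httpbin.org/ip')
    break
  rescue Mechanize::ResponseCodeError => e
    next if e.response_code == '407' # these credentials were rejected
    raise
  end
end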
Conclusion
Mechanize offers flexible proxy configuration options suitable for various web scraping scenarios. From basic HTTP proxies to advanced rotation strategies, proper proxy configuration is essential for successful large-scale scraping operations. Remember to always respect website terms of service and implement appropriate rate limiting when using proxies.
The key is to combine proxy usage with other scraping best practices like proper error handling, request throttling, and respectful scraping patterns to build robust and reliable web scraping applications.