What Proxy Configuration Options Are Available in Mechanize?
Mechanize provides solid proxy support for web scraping, letting you route requests through proxy servers for anonymity, to bypass geographic restrictions, or to spread load. This guide covers the proxy configuration options available in the Ruby Mechanize library.
Basic HTTP Proxy Configuration
The most straightforward way to configure a proxy in Mechanize is with the set_proxy method:
require 'mechanize'
agent = Mechanize.new
agent.set_proxy('proxy.example.com', 8080)
# Now all requests will go through the proxy
page = agent.get('https://httpbin.org/ip')
puts page.body
You can also configure the proxy during initialization:
agent = Mechanize.new do |a|
a.set_proxy('proxy.example.com', 8080)
end
HTTPS Proxy Configuration
For proxies reached over HTTPS, you can specify the scheme explicitly; Mechanize 2.x also accepts a full proxy URI:
agent = Mechanize.new
agent.set_proxy('https://secure-proxy.example.com', 8080)
# Or using the full proxy URL
agent.set_proxy('https://secure-proxy.example.com:8080')
Proxy Authentication
Many proxy servers require authentication. Mechanize supports username/password (HTTP Basic) proxy credentials via set_proxy:
require 'mechanize'
agent = Mechanize.new
# Basic authentication with username and password
agent.set_proxy('proxy.example.com', 8080, 'username', 'password')
# Note: the individual proxy_addr/proxy_port/proxy_user/proxy_pass
# accessors existed only in legacy Mechanize 1.x; in Mechanize 2.x,
# set_proxy is the supported way to change these settings.
SOCKS Proxy Support
Mechanize has no built-in SOCKS support, but it can work with SOCKS proxies via the socksify gem, which monkey-patches TCPSocket so that every outgoing TCP connection (including Mechanize's) is routed through the SOCKS server:
require 'mechanize'
require 'socksify'
# Configure the SOCKS proxy; note these settings are process-wide
Socksify::debug = true # log the SOCKS handshake for debugging
TCPSocket::socks_server = "127.0.0.1"
TCPSocket::socks_port = 1080
agent = Mechanize.new
page = agent.get('https://httpbin.org/ip')
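Because socksify patches TCPSocket globally, the settings above affect every TCP connection in the process. If you only want SOCKS routing for a specific piece of work, the gem also offers a block-scoped helper, Socksify.proxy; a minimal sketch, assuming a local SOCKS server on port 1080:

require 'mechanize'
require 'socksify'

agent = Mechanize.new

# Only connections opened inside the block go through the SOCKS server;
# the previous (direct) settings are restored afterwards.
Socksify.proxy('127.0.0.1', 1080) do
  page = agent.get('https://httpbin.org/ip')
  puts page.body
end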
Advanced Proxy Configuration
Switching Proxies Between Requests
You can point the same agent at a different proxy for each request:
agent = Mechanize.new
# Use proxy for specific request
agent.set_proxy('proxy1.example.com', 8080)
page1 = agent.get('https://example.com')
# Change proxy for next request
agent.set_proxy('proxy2.example.com', 8080)
page2 = agent.get('https://another-site.com')
# Disable proxy
agent.set_proxy(nil)
page3 = agent.get('https://direct-connection.com')
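Mutating one shared agent works for sequential scripts, but it is race-prone if requests run concurrently. A common alternative is to keep one pre-configured agent per proxy; a minimal sketch with placeholder proxy hosts:

require 'mechanize'

# One agent per proxy avoids mutating shared state between requests.
agents = {
  'proxy1' => Mechanize.new { |a| a.set_proxy('proxy1.example.com', 8080) },
  'proxy2' => Mechanize.new { |a| a.set_proxy('proxy2.example.com', 8080) }
}

page = agents['proxy1'].get('https://httpbin.org/ip')
puts page.body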
Proxy with Custom Headers
You can combine proxy usage with custom headers for better anonymity:
agent = Mechanize.new
agent.set_proxy('proxy.example.com', 8080, 'user', 'pass')
# Set custom headers
agent.user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
agent.request_headers = {
'Accept-Language' => 'en-US,en;q=0.9',
'Accept-Encoding' => 'gzip, deflate' # omit 'br': Mechanize cannot decode Brotli
}
page = agent.get('https://target-site.com')
Proxy Rotation Strategy
For large-scale scraping, implementing proxy rotation is crucial:
class ProxyRotator
def initialize(proxies)
@proxies = proxies
@current_index = 0
end
def next_proxy
proxy = @proxies[@current_index]
@current_index = (@current_index + 1) % @proxies.size
proxy
end
end
# Define proxy list
proxies = [
{ host: 'proxy1.example.com', port: 8080, user: 'user1', pass: 'pass1' },
{ host: 'proxy2.example.com', port: 8080, user: 'user2', pass: 'pass2' },
{ host: 'proxy3.example.com', port: 8080, user: 'user3', pass: 'pass3' }
]
rotator = ProxyRotator.new(proxies)
agent = Mechanize.new
urls = ['https://site1.com', 'https://site2.com', 'https://site3.com']
urls.each do |url|
proxy = rotator.next_proxy
agent.set_proxy(proxy[:host], proxy[:port], proxy[:user], proxy[:pass])
begin
page = agent.get(url)
puts "Successfully scraped #{url} using #{proxy[:host]}"
rescue => e
puts "Error with #{proxy[:host]}: #{e.message}"
end
end
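The loop above logs a failure and moves on; a natural extension is to retry the same URL through the next proxy a bounded number of times. A hedged sketch building on the ProxyRotator and agent defined above (fetch_with_rotation is a hypothetical helper):

def fetch_with_rotation(agent, rotator, url, max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    proxy = rotator.next_proxy
    agent.set_proxy(proxy[:host], proxy[:port], proxy[:user], proxy[:pass])
    agent.get(url)
  rescue StandardError => e
    retry if attempts < max_attempts
    raise e
  end
end

page = fetch_with_rotation(agent, rotator, 'https://httpbin.org/ip')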
Error Handling and Proxy Validation
Robust proxy handling includes error management and validation:
def test_proxy(host, port, user = nil, pass = nil)
agent = Mechanize.new
agent.read_timeout = 10
agent.open_timeout = 10
begin
agent.set_proxy(host, port, user, pass)
response = agent.get('https://httpbin.org/ip')
# By default Mechanize raises Mechanize::ResponseCodeError for non-2xx
# responses, so the else branch below is mostly defensive.
if response.code == '200'
puts "Proxy #{host}:#{port} is working"
return true
else
puts "Proxy #{host}:#{port} returned status #{response.code}"
return false
end
rescue => e
puts "Proxy #{host}:#{port} failed: #{e.message}"
return false
end
end
# Test multiple proxies
proxies = [
['proxy1.example.com', 8080],
['proxy2.example.com', 8080],
['proxy3.example.com', 8080]
]
working_proxies = proxies.select { |host, port| test_proxy(host, port) }
puts "Found #{working_proxies.size} working proxies"
Environment-Based Proxy Configuration
You can configure proxies based on environment variables for flexibility:
require 'mechanize'
agent = Mechanize.new
# Configure proxy from environment variables
if ENV['HTTP_PROXY']
proxy_uri = URI.parse(ENV['HTTP_PROXY'])
agent.set_proxy(
proxy_uri.host,
proxy_uri.port,
proxy_uri.user,
proxy_uri.password
)
end
Note that Mechanize does not read HTTP_PROXY automatically the way some HTTP clients do, so parsing the environment yourself, as above, is the reliable approach.
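In practice the conventional variables are scheme-specific (http_proxy/HTTP_PROXY and https_proxy/HTTPS_PROXY, in both casings). A hedged sketch that checks the common spellings and configures the agent from the first one found:

require 'mechanize'
require 'uri'

agent = Mechanize.new

# Prefer the HTTPS proxy variables, fall back to the HTTP ones.
raw = ENV['HTTPS_PROXY'] || ENV['https_proxy'] ||
      ENV['HTTP_PROXY']  || ENV['http_proxy']

if raw
  proxy_uri = URI.parse(raw)
  agent.set_proxy(proxy_uri.host, proxy_uri.port,
                  proxy_uri.user, proxy_uri.password)
end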
Proxy Configuration for Different Protocols
Handle different protocols with appropriate proxy settings:
class MultiProtocolScraper
def initialize
@agent = Mechanize.new
@agent.verify_mode = OpenSSL::SSL::VERIFY_NONE # For development only
end
def configure_proxy_for_protocol(url)
uri = URI.parse(url)
case uri.scheme
when 'https'
@agent.set_proxy('https-proxy.example.com', 8080)
when 'http'
@agent.set_proxy('http-proxy.example.com', 8080)
when 'ftp'
  # Mechanize itself only speaks HTTP(S); fetching FTP URLs would need
  # a different client, so there is no meaningful proxy to set here.
  nil
end
end
def scrape(url)
configure_proxy_for_protocol(url)
@agent.get(url)
end
end
scraper = MultiProtocolScraper.new
page = scraper.scrape('https://secure-site.com')
Best Practices for Proxy Usage
1. Connection Pooling with Proxies
class PooledProxyAgent
def initialize(proxies, pool_size = 5)
@proxies = proxies
@agents = Array.new(pool_size) do
agent = Mechanize.new
proxy = @proxies.sample
agent.set_proxy(proxy[:host], proxy[:port], proxy[:user], proxy[:pass])
agent
end
@current_agent = 0
end
def get(url)
agent = @agents[@current_agent]
@current_agent = (@current_agent + 1) % @agents.size
agent.get(url)
end
end
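Usage looks the same as a single agent; a brief example wiring the pool to the proxies list defined in the rotation section:

pool = PooledProxyAgent.new(proxies, 3)

# Requests round-robin across the pooled agents (and their proxies).
3.times do
  page = pool.get('https://httpbin.org/ip')
  puts page.body
end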
2. Proxy Health Monitoring
class ProxyMonitor
def initialize(proxies)
@proxies = proxies
@healthy_proxies = []
check_proxy_health
end
def check_proxy_health
@healthy_proxies = @proxies.select do |proxy|
test_proxy(proxy[:host], proxy[:port], proxy[:user], proxy[:pass])
end
end
def get_healthy_proxy
@healthy_proxies.sample
end
private
def test_proxy(host, port, user, pass)
  # Same logic as the standalone test_proxy example shown earlier
  agent = Mechanize.new
  agent.open_timeout = agent.read_timeout = 10
  agent.set_proxy(host, port, user, pass)
  agent.get('https://httpbin.org/ip').code == '200'
rescue StandardError
  false
end
end
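A sketch of how the monitor might plug into a scraping loop, re-checking proxy health periodically (the 100-request interval and the urls list are assumptions for illustration):

monitor = ProxyMonitor.new(proxies)
agent = Mechanize.new
count = 0

urls.each do |url|
  # Re-validate the proxy list every 100 requests.
  monitor.check_proxy_health if (count % 100).zero?

  proxy = monitor.get_healthy_proxy
  next unless proxy # skip while no proxy is healthy

  agent.set_proxy(proxy[:host], proxy[:port], proxy[:user], proxy[:pass])
  page = agent.get(url)
  count += 1
end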
In more complex scraping setups you may also need to handle authentication in browser automation tools or manage network requests across many concurrent connections as part of a broader scraping strategy.
Troubleshooting Common Proxy Issues
Connection Timeouts
agent = Mechanize.new
agent.open_timeout = 30 # Connection timeout
agent.read_timeout = 60 # Read timeout
agent.set_proxy('slow-proxy.example.com', 8080)
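Depending on your Ruby and Mechanize versions, timeouts typically surface as Net::OpenTimeout or Net::ReadTimeout; a small sketch that retries a bounded number of times before giving up:

attempts = 0
begin
  attempts += 1
  page = agent.get('https://httpbin.org/ip')
rescue Net::OpenTimeout, Net::ReadTimeout => e
  retry if attempts < 3
  puts "Giving up after #{attempts} timeouts: #{e.message}"
end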
SSL Certificate Issues
agent = Mechanize.new
agent.verify_mode = OpenSSL::SSL::VERIFY_NONE # Use cautiously
agent.cert_store = OpenSSL::X509::Store.new
agent.cert_store.set_default_paths
Proxy Authentication Failures
begin
agent.set_proxy('proxy.example.com', 8080, 'user', 'wrongpass')
page = agent.get('https://httpbin.org/ip')
rescue Mechanize::ResponseCodeError => e
  # A 407 means the proxy itself rejected the credentials; a 401 from
  # the target site raises the Mechanize::UnauthorizedError subclass.
  puts "Proxy authentication failed: #{e.message}" if e.response_code == '407'
  # Implement retry logic or switch proxy, as sketched below
end
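To make the "switch proxy" comment concrete, here is a hedged sketch that falls through an assumed fallback_proxies list until one set of credentials is accepted:

fallback_proxies = [
  { host: 'proxy.example.com',  port: 8080, user: 'user', pass: 'pass' },
  { host: 'backup.example.com', port: 8080, user: 'user', pass: 'pass' }
]

page = nil
fallback_proxies.each do |proxy|
  agent.set_proxy(proxy[:host], proxy[:port], proxy[:user], proxy[:pass])
  begin
    page = agent.get('https://httpbin.org/ip')
    break
  rescue Mechanize::ResponseCodeError => e
    next if e.response_code == '407' # these credentials were rejected
    raise
  end
end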
Conclusion
Mechanize offers flexible proxy configuration options suitable for various web scraping scenarios. From basic HTTP proxies to advanced rotation strategies, proper proxy configuration is essential for successful large-scale scraping operations. Remember to always respect website terms of service and implement appropriate rate limiting when using proxies.
The key is to combine proxy usage with other scraping best practices like proper error handling, request throttling, and respectful scraping patterns to build robust and reliable web scraping applications.