# How do I set up a proxy with MechanicalSoup?
Setting up a proxy with MechanicalSoup is essential for many web scraping scenarios, whether you need to bypass geographical restrictions, distribute requests across multiple IP addresses, or simply route traffic through a corporate proxy. MechanicalSoup, being built on top of the Requests library, provides flexible proxy configuration options that support both HTTP and SOCKS proxies.
## Understanding Proxy Configuration in MechanicalSoup
MechanicalSoup uses the underlying Requests library for HTTP communications, which means proxy configuration follows the same patterns. You can configure proxies at the browser level during initialization or modify proxy settings for specific requests.
## Basic Proxy Setup

### HTTP Proxy Configuration

The most straightforward way to set up a proxy with MechanicalSoup is during browser initialization:

```python
import mechanicalsoup

# Basic HTTP proxy setup
browser = mechanicalsoup.StatefulBrowser()
browser.session.proxies = {
    'http': 'http://proxy-server.com:8080',
    'https': 'http://proxy-server.com:8080'
}

# Navigate to a website through the proxy; keep the response object,
# since browser.page is only populated for HTML content (httpbin
# returns JSON here)
response = browser.open('https://httpbin.org/ip')
print(response.text)
```
### HTTPS Proxy Configuration

For an HTTPS proxy (one you connect to over TLS), specify the `https://` scheme in the proxy URL itself:

```python
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.session.proxies = {
    'http': 'https://secure-proxy.com:8443',
    'https': 'https://secure-proxy.com:8443'
}

browser.open('https://example.com')
```
## Proxy Authentication

Many proxy servers require authentication. MechanicalSoup supports basic credentials embedded in the proxy URL as well as explicit proxy-auth headers.

### URL-Based Authentication

```python
import mechanicalsoup

# Proxy with embedded credentials
proxy_url = 'http://username:password@proxy-server.com:8080'

browser = mechanicalsoup.StatefulBrowser()
browser.session.proxies = {
    'http': proxy_url,
    'https': proxy_url
}

browser.open('https://httpbin.org/ip')
```

### Using requests.auth for Advanced Authentication

If you prefer to keep credentials out of the proxy URL, Requests ships an `HTTPProxyAuth` helper. Note that it works by attaching a `Proxy-Authorization` header to each request, which only reaches the proxy for plain-HTTP traffic; for HTTPS requests (tunneled via CONNECT), embed the credentials in the proxy URL as shown above.

```python
import mechanicalsoup
from requests.auth import HTTPProxyAuth

browser = mechanicalsoup.StatefulBrowser()

# Set proxy without embedded credentials
browser.session.proxies = {
    'http': 'http://proxy-server.com:8080',
    'https': 'http://proxy-server.com:8080'
}

# Adds a Proxy-Authorization header to each request (HTTP proxying only)
browser.session.auth = HTTPProxyAuth('username', 'password')

browser.open('http://example.com')
```
## SOCKS Proxy Configuration

SOCKS proxies require an additional dependency. Install it first:

```bash
pip install requests[socks]
```

Then configure the SOCKS proxy:

```python
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
# Use socks5h:// instead of socks5:// if DNS resolution should also
# happen on the proxy side
browser.session.proxies = {
    'http': 'socks5://socks-proxy.com:1080',
    'https': 'socks5://socks-proxy.com:1080'
}

browser.open('https://example.com')
```

For SOCKS proxies with authentication:

```python
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.session.proxies = {
    'http': 'socks5://username:password@socks-proxy.com:1080',
    'https': 'socks5://username:password@socks-proxy.com:1080'
}

browser.open('https://example.com')
```
## Advanced Proxy Configuration

### Environment Variable Configuration

You can also configure proxies through environment variables, which is convenient in deployment scenarios:

```bash
export HTTP_PROXY=http://proxy-server.com:8080
export HTTPS_PROXY=http://proxy-server.com:8080
```

```python
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()

# The underlying Requests session honors HTTP_PROXY/HTTPS_PROXY by default
browser.open('https://example.com')
```
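Conversely, if proxy environment variables are set on the machine but you want a particular session to ignore them, Requests exposes a `trust_env` flag; since `browser.session` is a `requests.Session`, the same flag works on a MechanicalSoup browser. A minimal sketch (the proxy host is a placeholder):

```python
import os

import requests

# Simulate a machine-wide proxy setting (placeholder host)
os.environ['HTTP_PROXY'] = 'http://env-proxy.example:8080'

session = requests.Session()
# With trust_env=True (the default), Requests resolves proxies from the
# environment; requests.utils.getproxies() shows what it would pick up
print(requests.utils.getproxies().get('http'))

# Disable environment lookup so only session.proxies (if any) is used
session.trust_env = False
```

The same idea applies to a browser: `browser.session.trust_env = False` makes it ignore `HTTP_PROXY`/`HTTPS_PROXY` entirely.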
### Custom Session Configuration

For more granular control, create a Requests session yourself and hand it to the browser:

```python
import mechanicalsoup
import requests

# Create a custom session
session = requests.Session()
session.proxies = {
    'http': 'http://proxy1.com:8080',
    'https': 'http://proxy2.com:8080'
}

# Set additional session parameters
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (compatible; MechanicalSoup)'
})

# Create the browser around the custom session
browser = mechanicalsoup.StatefulBrowser(session=session)
browser.open('https://example.com')
```
## Proxy Rotation and Load Balancing

For large-scale scraping operations, you might want to rotate between multiple proxies:

```python
import random

import mechanicalsoup


class ProxyRotator:
    def __init__(self, proxy_list):
        self.proxies = proxy_list
        self.current_proxy = None

    def get_random_proxy(self):
        proxy = random.choice(self.proxies)
        self.current_proxy = proxy
        return {
            'http': proxy,
            'https': proxy
        }

    def create_browser(self):
        browser = mechanicalsoup.StatefulBrowser()
        browser.session.proxies = self.get_random_proxy()
        return browser


# Usage example
proxy_list = [
    'http://proxy1.com:8080',
    'http://proxy2.com:8080',
    'http://proxy3.com:8080'
]

rotator = ProxyRotator(proxy_list)

# Use a different proxy for each request
for url in ['https://example1.com', 'https://example2.com']:
    browser = rotator.create_browser()
    browser.open(url)
    print(f"Used proxy: {rotator.current_proxy}")
    print(f"Page title: {browser.page.title}")
```
## Error Handling and Troubleshooting

Proxy connections can fail for various reasons. Implement proper error handling:

```python
import mechanicalsoup
from requests.exceptions import ConnectTimeout, ProxyError


def safe_browse_with_proxy(url, proxy_config, timeout=30):
    browser = mechanicalsoup.StatefulBrowser()
    browser.session.proxies = proxy_config

    try:
        # requests sessions have no timeout attribute, so pass the
        # timeout per call; browser.open() forwards it to Requests
        browser.open(url, timeout=timeout)
        return browser.page
    except ProxyError as e:
        print(f"Proxy error: {e}")
        return None
    except ConnectTimeout as e:
        print(f"Connection timeout: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None


# Usage
proxy_config = {
    'http': 'http://proxy-server.com:8080',
    'https': 'http://proxy-server.com:8080'
}

page = safe_browse_with_proxy('https://example.com', proxy_config)
if page:
    print("Successfully retrieved page through proxy")
else:
    print("Failed to retrieve page")
```
## Testing Proxy Configuration

Always verify that your proxy is actually being used:

```python
import json

import mechanicalsoup


def test_proxy_configuration(proxy_config):
    browser = mechanicalsoup.StatefulBrowser()
    browser.session.proxies = proxy_config

    try:
        # Test with a service that returns IP information; use the raw
        # response, since browser.page is only set for HTML content
        response = browser.open('https://httpbin.org/ip')
        print(f"External IP: {json.loads(response.text)['origin']}")

        # Check the user agent the target site sees
        response = browser.open('https://httpbin.org/user-agent')
        print(f"User Agent: {json.loads(response.text)['user-agent']}")

        return True
    except Exception as e:
        print(f"Proxy test failed: {e}")
        return False


# Test your proxy
proxy_config = {
    'http': 'http://your-proxy.com:8080',
    'https': 'http://your-proxy.com:8080'
}

if test_proxy_configuration(proxy_config):
    print("Proxy configuration is working correctly")
else:
    print("Proxy configuration needs adjustment")
```
## Performance Considerations

When using proxies with MechanicalSoup, consider these performance factors:

### Connection Pooling

```python
import mechanicalsoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

browser = mechanicalsoup.StatefulBrowser()

# Configure a retry strategy (urllib3 >= 1.26 uses allowed_methods;
# the parameter was called method_whitelist in older releases)
retry_strategy = Retry(
    total=3,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["HEAD", "GET", "OPTIONS"]
)

# Mount an adapter with connection pooling
adapter = HTTPAdapter(
    pool_connections=10,
    pool_maxsize=20,
    max_retries=retry_strategy
)
browser.session.mount("http://", adapter)
browser.session.mount("https://", adapter)

# Set the proxy
browser.session.proxies = {
    'http': 'http://proxy-server.com:8080',
    'https': 'http://proxy-server.com:8080'
}
```
### Timeout Configuration

Note that `requests.Session` has no `timeout` attribute; assigning one is silently ignored. Pass the timeout with each call instead, since `browser.open()` forwards extra keyword arguments to Requests:

```python
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.session.proxies = {
    'http': 'http://proxy-server.com:8080',
    'https': 'http://proxy-server.com:8080'
}

# Set reasonable timeouts per request: (connect_timeout, read_timeout)
browser.open('https://example.com', timeout=(5, 30))
```
## Best Practices
- Always test proxy configuration before running production scraping tasks
- Implement proper error handling for proxy failures
- Use connection pooling for better performance with multiple requests
- Rotate proxies to distribute load and avoid rate limiting
- Monitor proxy performance and replace slow or unreliable proxies
- Respect proxy provider's terms of service and usage limits
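The monitoring point above can be sketched with a small latency check. Here `fetch` stands in for whatever request callable you use (for example, the `open` method of a `StatefulBrowser` whose `session.proxies` points at one proxy), so the helper itself is library-agnostic; the threshold is an assumption to tune for your workload:

```python
import time


def time_fetch(fetch, url):
    """Return elapsed seconds for one fetch, or None if it raised."""
    start = time.monotonic()
    try:
        fetch(url)
    except Exception:
        return None
    return time.monotonic() - start


def filter_proxies(proxies, make_fetch, url, max_seconds=5.0):
    """Keep only proxies that answer within max_seconds."""
    healthy = []
    for proxy in proxies:
        elapsed = time_fetch(make_fetch(proxy), url)
        if elapsed is not None and elapsed <= max_seconds:
            healthy.append(proxy)
    return healthy
```

In practice `make_fetch(proxy)` would build a browser with that proxy configured and return its `open` method; proxies that error out or exceed the threshold simply drop out of the rotation pool.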
Similar to how you might handle authentication in Puppeteer for browser-based scraping, proxy authentication in MechanicalSoup requires careful consideration of security and session management.
## Conclusion
Setting up a proxy with MechanicalSoup is straightforward and provides essential functionality for robust web scraping applications. Whether you need basic HTTP proxy support or advanced SOCKS proxy configuration with authentication, MechanicalSoup's integration with the Requests library offers flexible options to meet your requirements.
Remember to always test your proxy configuration thoroughly and implement proper error handling to ensure reliable operation. For large-scale operations, consider implementing proxy rotation and monitoring to maintain optimal performance and avoid service disruptions.