How do I set up a proxy with MechanicalSoup?

Setting up a proxy with MechanicalSoup is essential for many web scraping scenarios, whether you need to bypass geographical restrictions, distribute requests across multiple IP addresses, or simply route traffic through a corporate proxy. MechanicalSoup, being built on top of the Requests library, provides flexible proxy configuration options that support both HTTP and SOCKS proxies.

Understanding Proxy Configuration in MechanicalSoup

MechanicalSoup uses the underlying Requests library for HTTP communications, which means proxy configuration follows the same patterns. You can configure proxies at the browser level during initialization or modify proxy settings for specific requests.
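
Because StatefulBrowser.open() passes extra keyword arguments through to the underlying Requests call, you can also override the proxy for a single request without touching the session defaults. A minimal sketch, with placeholder proxy addresses:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()

# Session-level default proxy (placeholder address)
browser.session.proxies = {
    'http': 'http://proxy-server.com:8080',
    'https': 'http://proxy-server.com:8080'
}

# Per-request override: keyword arguments are forwarded to Requests,
# and request-level proxies take precedence over session settings
browser.open('https://httpbin.org/ip',
             proxies={'https': 'http://other-proxy.com:8080'})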

Basic Proxy Setup

HTTP Proxy Configuration

The most straightforward way to set up a proxy with MechanicalSoup is during browser initialization:

import mechanicalsoup

# Basic HTTP proxy setup
browser = mechanicalsoup.StatefulBrowser()
browser.session.proxies = {
    'http': 'http://proxy-server.com:8080',
    'https': 'http://proxy-server.com:8080'
}

# Navigate to a website through the proxy. open() returns the
# underlying requests.Response; we read it directly because
# MechanicalSoup only builds a soup for HTML responses
response = browser.open('https://httpbin.org/ip')
print(response.text)

HTTPS Proxy Configuration

For HTTPS proxies (where the connection to the proxy itself is TLS-encrypted), specify the https:// scheme explicitly; note that recent urllib3 releases are required for TLS connections to the proxy:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.session.proxies = {
    'http': 'https://secure-proxy.com:8443',
    'https': 'https://secure-proxy.com:8443'
}

browser.open('https://example.com')

Proxy Authentication

Many proxy servers require authentication. MechanicalSoup supports credentials embedded directly in the proxy URL, as well as header-based proxy authentication through Requests' auth helpers.

URL-Based Authentication

import mechanicalsoup

# Proxy with embedded credentials
proxy_url = 'http://username:password@proxy-server.com:8080'

browser = mechanicalsoup.StatefulBrowser()
browser.session.proxies = {
    'http': proxy_url,
    'https': proxy_url
}

browser.open('https://httpbin.org/ip')

Using requests.auth for Advanced Authentication

For plain-HTTP traffic, you can attach proxy credentials as a Proxy-Authorization header with HTTPProxyAuth instead of embedding them in the URL. Note that this header travels with each request rather than with the CONNECT tunnel, so it does not authenticate HTTPS proxying; for HTTPS, prefer URL-embedded credentials:

import mechanicalsoup
from requests.auth import HTTPProxyAuth

browser = mechanicalsoup.StatefulBrowser()

# Set proxy without embedded credentials
browser.session.proxies = {
    'http': 'http://proxy-server.com:8080',
    'https': 'http://proxy-server.com:8080'
}

# Adds a Proxy-Authorization header to every request
# (effective for plain-HTTP proxying only)
browser.session.auth = HTTPProxyAuth('username', 'password')

browser.open('http://example.com')

SOCKS Proxy Configuration

SOCKS proxies require additional dependencies. Install the required package first:

pip install "requests[socks]"

Then configure the SOCKS proxy:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.session.proxies = {
    'http': 'socks5://socks-proxy.com:1080',
    'https': 'socks5://socks-proxy.com:1080'
}

browser.open('https://example.com')

For SOCKS proxies with authentication:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.session.proxies = {
    'http': 'socks5://username:password@socks-proxy.com:1080',
    'https': 'socks5://username:password@socks-proxy.com:1080'
}

browser.open('https://example.com')
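
With the socks5:// scheme, DNS lookups happen locally. If hostnames should be resolved by the proxy instead (useful when the target only resolves inside the proxy's network), use the socks5h:// scheme, which Requests also supports:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
# socks5h:// resolves DNS on the proxy server rather than locally
browser.session.proxies = {
    'http': 'socks5h://socks-proxy.com:1080',
    'https': 'socks5h://socks-proxy.com:1080'
}

browser.open('https://example.com')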

Advanced Proxy Configuration

Environment Variable Configuration

You can also configure proxies using environment variables, which is useful for deployment scenarios:

export HTTP_PROXY=http://proxy-server.com:8080
export HTTPS_PROXY=http://proxy-server.com:8080

With the variables exported, the session picks them up without any explicit configuration:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
# Requests honors HTTP_PROXY/HTTPS_PROXY by default (trust_env=True),
# so no proxy setup is needed in code
browser.open('https://example.com')
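
Conversely, if the environment defines proxies that a particular browser should not use, you can opt out at the session level (the NO_PROXY variable can likewise exclude individual hosts). A minimal sketch:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
# Ignore HTTP_PROXY/HTTPS_PROXY/NO_PROXY environment variables
# for this session only
browser.session.trust_env = False

browser.open('https://example.com')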

Custom Session Configuration

For more granular control, you can create a custom session with specific proxy configurations:

import mechanicalsoup
import requests

# Create a custom session
session = requests.Session()
session.proxies = {
    'http': 'http://proxy1.com:8080',
    'https': 'http://proxy2.com:8080'
}

# Set additional session parameters
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (compatible; MechanicalSoup)'
})

# Create browser with custom session
browser = mechanicalsoup.StatefulBrowser(session=session)
browser.open('https://example.com')

Proxy Rotation and Load Balancing

For large-scale scraping operations, you might want to rotate between multiple proxies:

import mechanicalsoup
import random

class ProxyRotator:
    def __init__(self, proxy_list):
        self.proxies = proxy_list
        self.current_proxy = None

    def get_random_proxy(self):
        proxy = random.choice(self.proxies)
        self.current_proxy = proxy
        return {
            'http': proxy,
            'https': proxy
        }

    def create_browser(self):
        browser = mechanicalsoup.StatefulBrowser()
        browser.session.proxies = self.get_random_proxy()
        return browser

# Usage example
proxy_list = [
    'http://proxy1.com:8080',
    'http://proxy2.com:8080',
    'http://proxy3.com:8080'
]

rotator = ProxyRotator(proxy_list)

# Use different proxy for each request
for url in ['https://example1.com', 'https://example2.com']:
    browser = rotator.create_browser()
    browser.open(url)
    print(f"Used proxy: {rotator.current_proxy}")
    print(f"Response: {browser.page.title}")

Error Handling and Troubleshooting

Proxy connections can fail for various reasons. Implement proper error handling:

import mechanicalsoup
from requests.exceptions import ProxyError, ConnectTimeout

def safe_browse_with_proxy(url, proxy_config, timeout=30):
    browser = mechanicalsoup.StatefulBrowser()
    browser.session.proxies = proxy_config

    try:
        # Requests sessions have no global timeout attribute,
        # so pass the timeout with each request
        browser.open(url, timeout=timeout)
        return browser.page
    except ProxyError as e:
        print(f"Proxy error: {e}")
        return None
    except ConnectTimeout as e:
        print(f"Connection timeout: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None

# Usage
proxy_config = {
    'http': 'http://proxy-server.com:8080',
    'https': 'http://proxy-server.com:8080'
}

page = safe_browse_with_proxy('https://example.com', proxy_config)
if page:
    print("Successfully retrieved page through proxy")
else:
    print("Failed to retrieve page")

Testing Proxy Configuration

Always verify that your proxy is working correctly:

import mechanicalsoup

def test_proxy_configuration(proxy_config):
    browser = mechanicalsoup.StatefulBrowser()
    browser.session.proxies = proxy_config

    try:
        # Test with a service that returns IP information; read the
        # response directly since the body is JSON, not HTML
        response = browser.open('https://httpbin.org/ip')
        print(f"External IP: {response.json()['origin']}")

        # Test user agent
        response = browser.open('https://httpbin.org/user-agent')
        print(f"User Agent: {response.json()['user-agent']}")

        return True
    except Exception as e:
        print(f"Proxy test failed: {e}")
        return False

# Test your proxy
proxy_config = {
    'http': 'http://your-proxy.com:8080',
    'https': 'http://your-proxy.com:8080'
}

if test_proxy_configuration(proxy_config):
    print("Proxy configuration is working correctly")
else:
    print("Proxy configuration needs adjustment")

Performance Considerations

When using proxies with MechanicalSoup, consider these performance factors:

Connection Pooling

import mechanicalsoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

browser = mechanicalsoup.StatefulBrowser()

# Configure retry strategy (urllib3 >= 1.26 renamed
# method_whitelist to allowed_methods)
retry_strategy = Retry(
    total=3,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["HEAD", "GET", "OPTIONS"]
)

# Mount adapter with connection pooling
adapter = HTTPAdapter(
    pool_connections=10,
    pool_maxsize=20,
    max_retries=retry_strategy
)

browser.session.mount("http://", adapter)
browser.session.mount("https://", adapter)

# Set proxy
browser.session.proxies = {
    'http': 'http://proxy-server.com:8080',
    'https': 'http://proxy-server.com:8080'
}

Timeout Configuration

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.session.proxies = {
    'http': 'http://proxy-server.com:8080',
    'https': 'http://proxy-server.com:8080'
}

# Requests has no session-wide timeout setting; pass a
# (connect_timeout, read_timeout) tuple per request instead
browser.open('https://example.com', timeout=(5, 30))

Best Practices

  1. Always test proxy configuration before running production scraping tasks
  2. Implement proper error handling for proxy failures
  3. Use connection pooling for better performance with multiple requests
  4. Rotate proxies to distribute load and avoid rate limiting
  5. Monitor proxy performance and replace slow or unreliable proxies (a simple latency check is sketched after this list)
  6. Respect proxy provider's terms of service and usage limits
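
As a starting point for item 5, here is a minimal sketch that ranks candidate proxies by round-trip time against httpbin.org; the proxy addresses and the 10-second timeout are placeholders:

import time
import mechanicalsoup
import requests

def measure_proxy_latency(proxy_url, test_url='https://httpbin.org/ip'):
    """Return the round-trip time in seconds, or None if the proxy fails."""
    browser = mechanicalsoup.StatefulBrowser()
    browser.session.proxies = {'http': proxy_url, 'https': proxy_url}
    start = time.monotonic()
    try:
        browser.open(test_url, timeout=10)
        return time.monotonic() - start
    except requests.RequestException:
        return None

# Rank placeholder proxies by measured latency and drop the dead ones
proxies = ['http://proxy1.com:8080', 'http://proxy2.com:8080']
timings = {p: measure_proxy_latency(p) for p in proxies}
alive = sorted((p for p, t in timings.items() if t is not None),
               key=timings.get)
print(f"Fastest first: {alive}")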

Similar to how you might handle authentication in Puppeteer for browser-based scraping, proxy authentication in MechanicalSoup requires careful consideration of security and session management.

Conclusion

Setting up a proxy with MechanicalSoup is straightforward and provides essential functionality for robust web scraping applications. Whether you need basic HTTP proxy support or advanced SOCKS proxy configuration with authentication, MechanicalSoup's integration with the Requests library offers flexible options to meet your requirements.

Remember to always test your proxy configuration thoroughly and implement proper error handling to ensure reliable operation. For large-scale operations, consider implementing proxy rotation and monitoring to maintain optimal performance and avoid service disruptions.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
