How do I use proxy servers with Scrapy?

Proxy servers are essential tools for web scraping, especially when dealing with websites that implement rate limiting, IP blocking, or geographical restrictions. Scrapy provides several built-in mechanisms and third-party solutions for adding proxy support to your spiders.

Why Use Proxy Servers with Scrapy?

Proxy servers offer several benefits for web scraping:

  • IP Rotation: Distribute requests across multiple IP addresses to avoid rate limiting
  • Geographical Access: Access geo-restricted content by using proxies from different locations
  • Anonymity: Hide your real IP address from target websites
  • Scalability: Handle large-scale scraping operations without triggering anti-bot measures
  • Redundancy: Continue scraping even if some proxy servers become unavailable

Basic Proxy Configuration

Setting a Single Proxy

The simplest way to use a proxy with Scrapy is to set the proxy key in the request's meta dictionary:

import scrapy

class MySpider(scrapy.Spider):
    name = 'proxy_spider'
    start_urls = ['https://httpbin.org/ip']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                meta={'proxy': 'http://proxy-server:8080'},
                callback=self.parse
            )

    def parse(self, response):
        self.logger.info(f"Response from {response.url}: {response.text}")

Using Proxy with Authentication

For proxies that require authentication, include credentials in the proxy URL:

def start_requests(self):
    proxy = 'http://username:password@proxy-server:8080'
    for url in self.start_urls:
        yield scrapy.Request(
            url=url,
            meta={'proxy': proxy},
            callback=self.parse
        )
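
If your proxy provider expects the credentials in a Proxy-Authorization header instead of the URL, you can set that header on the request yourself. A minimal sketch using basic_auth_header from w3lib (installed alongside Scrapy); the endpoint and credentials are placeholders:

from w3lib.http import basic_auth_header

def start_requests(self):
    proxy = 'http://proxy-server:8080'  # placeholder endpoint, credentials sent via header
    for url in self.start_urls:
        request = scrapy.Request(url=url, meta={'proxy': proxy}, callback=self.parse)
        # Equivalent to embedding username:password in the proxy URL
        request.headers['Proxy-Authorization'] = basic_auth_header('username', 'password')
        yield request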

Implementing Proxy Rotation

Custom Proxy Middleware

Create a custom middleware to rotate between multiple proxy servers:

# middlewares.py
import random


class RotateProxyMiddleware:
    """Pick a random proxy for every outgoing request."""

    def __init__(self):
        self.proxies = [
            'http://proxy1:8080',
            'http://proxy2:8080',
            'http://proxy3:8080',
            'http://username:password@proxy4:8080',
        ]

    def process_request(self, request, spider):
        # The built-in HttpProxyMiddleware (priority 750) reads request.meta['proxy']
        # and performs the actual proxying, so this middleware only has to set it.
        proxy = random.choice(self.proxies)
        request.meta['proxy'] = proxy
        spider.logger.info(f"Using proxy: {proxy}")
        return None

Enable the middleware in your settings.py:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotateProxyMiddleware': 350,
}

Advanced Proxy Rotation with Failure Handling

Implement a more sophisticated proxy rotation system that handles failed proxies:

# middlewares.py
import random

from scrapy.exceptions import NotConfigured


class SmartProxyMiddleware:
    def __init__(self, proxy_list=None, proxy_auth=None):
        self.proxies = proxy_list or []
        self.proxy_auth = proxy_auth or {}
        self.failed_proxies = set()

        if not self.proxies:
            raise NotConfigured('No proxies configured')

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        proxy_list = settings.getlist('PROXY_LIST')
        proxy_auth = settings.getdict('PROXY_AUTH', {})

        return cls(proxy_list=proxy_list, proxy_auth=proxy_auth)

    def get_random_proxy(self):
        available_proxies = [p for p in self.proxies if p not in self.failed_proxies]
        if not available_proxies:
            # Reset failed proxies if all have failed
            self.failed_proxies.clear()
            available_proxies = self.proxies

        return random.choice(available_proxies)

    def process_request(self, request, spider):
        if 'proxy' not in request.meta:
            proxy = self.get_random_proxy()
            request.meta['proxy'] = proxy

            # Add authentication if required
            proxy_host = proxy.split('://')[1].split(':')[0]
            if proxy_host in self.proxy_auth:
                auth = self.proxy_auth[proxy_host]
                request.meta['proxy'] = f"http://{auth['username']}:{auth['password']}@{proxy_host}:{auth['port']}"

        return None

    def process_exception(self, request, exception, spider):
        proxy = request.meta.get('proxy')
        if proxy:
            spider.logger.warning(f"Proxy {proxy} failed with exception: {exception}")
            self.failed_proxies.add(proxy)

        # Retry with a different proxy; the returned request is rescheduled through
        # the scheduler, so mark it dont_filter so the dupefilter does not drop it
        new_proxy = self.get_random_proxy()
        request.meta['proxy'] = new_proxy
        request.dont_filter = True
        spider.logger.info(f"Retrying with proxy: {new_proxy}")

        return request

Configure the middleware in settings.py:

# settings.py
PROXY_LIST = [
    'http://proxy1:8080',
    'http://proxy2:8080',
    'http://proxy3:8080',
]

PROXY_AUTH = {
    'proxy1': {'username': 'user1', 'password': 'pass1', 'port': 8080},
    'proxy2': {'username': 'user2', 'password': 'pass2', 'port': 8080},
}

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SmartProxyMiddleware': 350,
}

Using Third-Party Proxy Services

Rotating Proxy Services

Many commercial proxy services provide a single rotating endpoint that switches the exit IP for you, so no custom rotation middleware is needed. For JavaScript-rendered content that requires full browser automation, you might instead consider using Puppeteer with proxy servers:

# For services like ProxyMesh, Luminati, or Smartproxy
import scrapy

class CommercialProxySpider(scrapy.Spider):
    name = 'commercial_proxy'
    start_urls = ['https://httpbin.org/ip']

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        }
    }

    def start_requests(self):
        # The provider rotates the exit IP behind this single endpoint
        proxy = 'http://username:password@rotating-residential.proxymesh.com:31280'

        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                meta={'proxy': proxy},
                callback=self.parse
            )

    def parse(self, response):
        self.logger.info(f"Response from {response.url}: {response.text}")

Free Proxy Integration

For testing purposes, you can integrate free proxy services:

import requests
import scrapy
import random

class FreeProxySpider(scrapy.Spider):
    name = 'free_proxy'
    start_urls = ['https://httpbin.org/ip']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.proxy_list = self.get_free_proxies()

    def get_free_proxies(self):
        """Fetch free proxies from a public API"""
        try:
            response = requests.get(
                'https://api.proxyscrape.com/v2/?request=get&protocol=http&timeout=10000&country=all&ssl=all&anonymity=all',
                timeout=10
            )
            proxies = response.text.strip().split('\n')
            return [f'http://{proxy}' for proxy in proxies if proxy]
        except Exception as e:
            self.logger.error(f"Failed to fetch free proxies: {e}")
            return []

    def start_requests(self):
        for url in self.start_urls:
            if self.proxy_list:
                proxy = random.choice(self.proxy_list)
                yield scrapy.Request(
                    url=url,
                    meta={'proxy': proxy},
                    callback=self.parse,
                    dont_filter=True
                )

    def parse(self, response):
        self.logger.info(f"Response via free proxy: {response.text}")

Handling HTTPS with Proxies

Most proxies tunnel HTTPS traffic through their regular http:// endpoint via CONNECT, but some providers expose separate endpoints or need specific TLS settings, so additional configuration may be required:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Optionally pin the TLS version used for downloader connections
DOWNLOADER_CLIENT_TLS_METHOD = 'TLSv1.2'

# middlewares.py - route requests to scheme-specific proxy endpoints,
# useful when your provider exposes separate HTTP and HTTPS entry points
class HttpsProxyMiddleware:
    def process_request(self, request, spider):
        if request.url.startswith('https://'):
            request.meta['proxy'] = 'https://proxy-server:8080'
        else:
            request.meta['proxy'] = 'http://proxy-server:8080'

Testing and Monitoring Proxy Performance

Proxy Health Check

Implement a system to monitor proxy health:

import time
import requests
from concurrent.futures import ThreadPoolExecutor

class ProxyTester:
    def __init__(self, proxies, test_url='http://httpbin.org/ip', timeout=10):
        self.proxies = proxies
        self.test_url = test_url
        self.timeout = timeout

    def test_proxy(self, proxy):
        """Test a single proxy"""
        try:
            start_time = time.time()
            response = requests.get(
                self.test_url,
                proxies={'http': proxy, 'https': proxy},
                timeout=self.timeout
            )
            response_time = time.time() - start_time

            if response.status_code == 200:
                return {
                    'proxy': proxy,
                    'status': 'working',
                    'response_time': response_time,
                    'ip': response.json().get('origin', 'unknown')
                }
            # Count non-200 responses as failures as well
            return {
                'proxy': proxy,
                'status': 'failed',
                'error': f'HTTP {response.status_code}'
            }
        except Exception as e:
            return {
                'proxy': proxy,
                'status': 'failed',
                'error': str(e)
            }

    def test_all_proxies(self, max_workers=10):
        """Test all proxies concurrently"""
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            results = list(executor.map(self.test_proxy, self.proxies))

        working_proxies = [r for r in results if r and r['status'] == 'working']
        failed_proxies = [r for r in results if r and r['status'] == 'failed']

        return {
            'working': working_proxies,
            'failed': failed_proxies,
            'success_rate': len(working_proxies) / len(self.proxies) * 100
        }

# Usage example
proxies = [
    'http://proxy1:8080',
    'http://proxy2:8080',
    'http://proxy3:8080',
]

tester = ProxyTester(proxies)
results = tester.test_all_proxies()
print(f"Success rate: {results['success_rate']:.2f}%")

Command Line Configuration

You can also set a default proxy from the command line through environment variables; Scrapy's built-in HttpProxyMiddleware reads the standard http_proxy and https_proxy variables automatically (it does not use Scrapy settings for this):

# Single proxy
export http_proxy=http://proxy-server:8080
scrapy crawl myspider

# With authentication
export http_proxy=http://username:password@proxy-server:8080
scrapy crawl myspider

# Different proxies for HTTP and HTTPS traffic
export http_proxy=http://proxy1:8080
export https_proxy=http://proxy2:8080
scrapy crawl myspider

Best Practices for Proxy Usage

1. Implement Proper Error Handling

class RobustProxyMiddleware:
    def process_exception(self, request, exception, spider):
        # Log proxy failures
        proxy = request.meta.get('proxy')
        spider.logger.warning(f"Request failed with proxy {proxy}: {exception}")

        # Retry up to 3 times with a different proxy
        retry_times = request.meta.get('retry_times', 0)
        if retry_times < 3:
            new_request = request.copy()
            new_request.meta['retry_times'] = retry_times + 1
            # get_alternative_proxy() is your own helper that returns an unused proxy
            new_request.meta['proxy'] = self.get_alternative_proxy()
            # Bypass the dupefilter so the retried request is not dropped
            new_request.dont_filter = True
            return new_request

2. Monitor Proxy Performance

# settings.py
EXTENSIONS = {
    'myproject.extensions.ProxyStatsExtension': 500,
}

# extensions.py
from scrapy import signals


class ProxyStatsExtension:
    def __init__(self, crawler):
        self.crawler = crawler
        self.proxy_stats = {}

    @classmethod
    def from_crawler(cls, crawler):
        # Connect to Scrapy signals so the callbacks below actually fire
        ext = cls(crawler)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.response_received, signal=signals.response_received)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        spider.logger.info("Proxy monitoring started")

    def response_received(self, response, request, spider):
        proxy = request.meta.get('proxy')
        if proxy:
            stats = self.proxy_stats.setdefault(proxy, {'success': 0, 'total': 0})
            stats['total'] += 1
            if response.status == 200:
                stats['success'] += 1

    def spider_closed(self, spider):
        for proxy, stats in self.proxy_stats.items():
            success_rate = stats['success'] / stats['total'] * 100
            spider.logger.info(f"Proxy {proxy}: {success_rate:.2f}% success rate")

3. Respect Rate Limits

Much like handling timeouts in Puppeteer, implementing proper delays is crucial for proxy-based scraping:

# settings.py
DOWNLOAD_DELAY = 1  # 1 second delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # wait between 0.5 * and 1.5 * DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # Limit concurrent requests
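
If fixed delays feel too rigid, Scrapy's AutoThrottle extension adjusts the delay dynamically based on observed response times; a minimal configuration sketch:

# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1  # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10  # maximum delay when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per remote site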

Troubleshooting Common Issues

Connection Timeouts

# settings.py
DOWNLOAD_TIMEOUT = 30  # Increase timeout for proxy connections
RETRY_TIMES = 3  # Retry failed requests
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

Authentication Errors

import logging

def handle_proxy_auth_error(response):
    if response.status == 407:  # Proxy Authentication Required
        # Log authentication failure
        proxy = response.request.meta.get('proxy')
        logging.error(f"Proxy authentication failed for {proxy}")

        # alternative_proxy_request() is your own helper that rebuilds the
        # request with a different proxy or refreshed credentials
        return alternative_proxy_request(response.request)

Integration with Other Tools

When dealing with JavaScript-heavy websites, you might need to combine Scrapy with browser automation tools. Consider running multiple pages in parallel with Puppeteer for complex scenarios that require both proxy support and JavaScript execution.
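
If you prefer to stay inside a single Scrapy project, the scrapy-playwright package is one option for rendering JavaScript while forwarding a proxy to the headless browser. A minimal settings sketch, assuming scrapy-playwright is installed; the proxy address and credentials are placeholders:

# settings.py
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

# Playwright launches the browser with this proxy for rendered requests
PLAYWRIGHT_LAUNCH_OPTIONS = {
    'proxy': {
        'server': 'http://proxy-server:8080',
        'username': 'username',
        'password': 'password',
    },
}

Requests that need rendering then set meta={'playwright': True}, while ordinary requests keep using the standard downloader and the proxy middleware configured earlier.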

Conclusion

Using proxy servers with Scrapy is essential for large-scale web scraping operations. By implementing proper proxy rotation, error handling, and monitoring, you can create robust scraping systems that can handle complex anti-bot measures while maintaining high performance and reliability. Whether you choose to implement your own proxy management system or use commercial proxy services, the key is to ensure proper configuration and monitoring for optimal results.

Remember to always respect websites' terms of service and robots.txt files, implement appropriate delays between requests, and monitor your proxy performance to maintain ethical and effective scraping practices.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
