How do I use proxies with Scrapy?

Scrapy routes requests through a proxy with its built-in HttpProxyMiddleware. You can configure the proxy either through the http_proxy and https_proxy environment variables or per request through the proxy key in Request.meta.

Here's how you can do it:

Option 1:

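Step 1: Make sure HttpProxyMiddleware is enabled

HttpProxyMiddleware is enabled by default (at order 750), so in most projects there is nothing to do here. If your project has disabled it, re-enable it in your settings.py file:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}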

Step 2: Set http_proxy environment variable

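Then set the http_proxy environment variable when you run your Scrapy spider. It applies to http:// URLs only, so set https_proxy as well if you crawl https:// URLs:

http_proxy=http://yourproxy:yourport https_proxy=http://yourproxy:yourport scrapy crawl yourspider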
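
For context, HttpProxyMiddleware appears to pick these variables up through the standard library's urllib.request.getproxies(). The following is a minimal sketch of what that lookup returns, not Scrapy's own code. The address proxy.example.com:8080 is a placeholder assumed for illustration.

```python
import os
from urllib.request import getproxies

# Placeholder proxy address (an assumption); replace it with your own.
os.environ["http_proxy"] = "http://proxy.example.com:8080"
os.environ["https_proxy"] = "http://proxy.example.com:8080"

# getproxies() maps each URL scheme to the proxy configured for it.
proxies = getproxies()
print(proxies.get("http"))
print(proxies.get("https"))
```

If either value prints as None, the variable was not visible to the process, which is a quick way to check why a spider is not using the proxy.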

Option 2:

Another way is to set the proxy directly in your spider, through the proxy key in the request's meta dictionary:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # The proxy meta key is read by HttpProxyMiddleware for this request only.
        yield scrapy.Request('http://example.com', meta={'proxy': 'http://yourproxy:yourport'})

Using Rotating Proxies:

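If you want to use rotating proxies (a list of proxies), you can modify the start_requests method to select a random proxy for each request it yields. This covers only the initial requests; any request you yield from a callback needs the proxy meta key set in the same way.

import random

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']
    proxies = [
        'http://proxy1.com:port',
        'http://proxy2.com:port',
        # ... more proxies ...
    ]

    def start_requests(self):
        for url in self.start_urls:
            # Pick a different proxy at random for every initial request.
            yield scrapy.Request(url, meta={'proxy': random.choice(self.proxies)})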
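
To rotate proxies for every request, including those yielded from callbacks, one option is a small custom downloader middleware. The sketch below is one possible shape for it. The class name RandomProxyMiddleware and the PROXY_LIST setting are hypothetical names chosen for illustration, not part of Scrapy.

```python
import random


class RandomProxyMiddleware:
    """Downloader middleware that assigns a random proxy to each request."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("proxy list must not be empty")
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical custom setting, not a built-in one.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Respect a proxy that was set explicitly on the request.
        if "proxy" not in request.meta:
            request.meta["proxy"] = random.choice(self.proxies)
        # Returning None lets Scrapy continue processing the request.
        return None
```

To try it, register the class in DOWNLOADER_MIDDLEWARES with an order below 750 (for example 740) so that it runs before HttpProxyMiddleware, and define PROXY_LIST in settings.py. A retried request keeps its meta, so it reuses the same proxy unless you remove the key yourself.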

Remember, the proxy string should be in the following format: http://yourproxy:yourport. If the proxy requires authentication, include the username and password in the URL like this: http://user:password@yourproxy:yourport. HttpProxyMiddleware extracts these credentials and sends them to the proxy in a Proxy-Authorization header.
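
Credentials that contain characters such as @ or : would break the URL format above unless they are percent-encoded. Here is a minimal sketch of building such a URL with the standard library. The helper name build_proxy_url is hypothetical, and it assumes Scrapy decodes the credentials before sending them, which matches my reading of HttpProxyMiddleware but is worth verifying against your Scrapy version.

```python
from urllib.parse import quote


def build_proxy_url(user, password, host, port, scheme="http"):
    """Return a proxy URL with percent-encoded credentials."""
    # safe="" also encodes ':', '@' and '/', which would otherwise
    # be mistaken for URL delimiters.
    user_part = quote(user, safe="")
    password_part = quote(password, safe="")
    return f"{scheme}://{user_part}:{password_part}@{host}:{port}"


print(build_proxy_url("user", "p@ss:word", "yourproxy", 8080))
# http://user:p%40ss%3Aword@yourproxy:8080
```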

In case you want to use different proxies for different types of requests, you can set the proxy meta key for each Request object individually.
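
As a rough illustration of choosing a proxy per request, the sketch below picks one based on the target host. The mapping, the proxy addresses, and the helper name proxy_for are all placeholders assumed for the example.

```python
from urllib.parse import urlparse

# Hypothetical mapping from target host to proxy; all addresses are placeholders.
PROXY_BY_HOST = {
    "example.com": "http://proxy1.example.net:8080",
    "example.org": "http://proxy2.example.net:8080",
}
DEFAULT_PROXY = "http://proxy3.example.net:8080"


def proxy_for(url):
    """Return the proxy assigned to the URL's host, or the default one."""
    host = urlparse(url).hostname or ""
    return PROXY_BY_HOST.get(host, DEFAULT_PROXY)
```

In a spider, this would be used as scrapy.Request(url, meta={'proxy': proxy_for(url)}).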

Please note that proxies add latency, because every request takes an extra hop through a third-party server. In return, they can help you avoid getting blocked by the target website.
