Using proxies with Scrapy is quite straightforward: you can enable the built-in HttpProxyMiddleware and set the http_proxy environment variable, or set a proxy directly on each request. Here's how you can do it:
Option 1:
Step 1: Enable HttpProxyMiddleware
First, make sure the HttpProxyMiddleware is active. It is enabled by default in Scrapy, but if you have overridden DOWNLOADER_MIDDLEWARES you can enable it explicitly in your settings.py file:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1,
}
Step 2: Set the http_proxy environment variable
Then, set the http_proxy environment variable when you run your Scrapy spider:
http_proxy=http://yourproxy:yourport scrapy crawl yourspider
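Note that http_proxy only covers plain-HTTP requests; for sites served over HTTPS, the https_proxy variable is consulted instead. A minimal sketch, using the same placeholder proxy address for both schemes:

```shell
# Route both HTTP and HTTPS requests through the proxy (placeholder address)
export http_proxy=http://yourproxy:yourport
export https_proxy=http://yourproxy:yourport
scrapy crawl yourspider
```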
Option 2:
Another way is to set the proxy directly in your spider. Here's how you can do it:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request('http://example.com',
                             meta={'proxy': 'http://yourproxy:yourport'})
Using Rotating Proxies:
If you want to use rotating proxies (a list of proxies), you can modify the start_requests method to select a random proxy each time a request is made:
import random
class MySpider(scrapy.Spider):
name = 'myspider'
proxies = [
'http://proxy1.com:port',
'http://proxy2.com:port',
# ... more proxies ...
]
def start_requests(self):
for url in self.start_urls:
yield scrapy.Request(url, meta={'proxy': random.choice(self.proxies)})
Remember, the proxy string should be in the following format: http://yourproxy:yourport. Also, if the proxy requires authentication, you can include the username and password like this: http://user:password@yourproxy:yourport.
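One gotcha with authenticated proxies: if the username or password contains characters like '@' or ':', they must be percent-encoded or the URL won't parse correctly. A small sketch using the standard library (the credentials below are placeholders):

```python
from urllib.parse import quote

# Placeholder credentials -- substitute your own.
user = 'user'
password = 'p@ss:word'  # contains characters that must be percent-encoded

# quote() escapes '@' and ':' in the credentials so they don't
# break the structure of the proxy URL.
proxy_url = f"http://{quote(user, safe='')}:{quote(password, safe='')}@yourproxy:8080"
print(proxy_url)  # http://user:p%40ss%3Aword@yourproxy:8080
```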
In case you want to use different proxies for different types of requests, you can set the proxy meta key individually on each Request object.
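One way to organize this is a small helper that maps a request's host to a proxy, falling back to a default. This is just a sketch; the host names and proxy addresses are hypothetical placeholders:

```python
from urllib.parse import urlparse

# Hypothetical mapping from target host to proxy; adjust to your setup.
PROXY_BY_HOST = {
    'images.example.com': 'http://imageproxy:8080',
    'api.example.com': 'http://apiproxy:8080',
}
DEFAULT_PROXY = 'http://yourproxy:yourport'

def proxy_for(url):
    """Return the proxy to use for a given request URL."""
    host = urlparse(url).hostname
    return PROXY_BY_HOST.get(host, DEFAULT_PROXY)

# In a spider you would then write:
#   yield scrapy.Request(url, meta={'proxy': proxy_for(url)})
```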
Please note that using proxies can slow down your web scraping, since every request is routed through a third-party server. However, it can help you avoid getting blocked by the target website.