Rotating user agents is a common way to reduce the risk of being blocked while scraping websites with Scrapy. Here is how you can do it:
- First, create a list of user agents in your settings.py file, because the middleware reads the list from the project settings. You can find lists of user agents on the internet or write your own. Here is a sample list:
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    # ...
]
- Then, create a downloader middleware that rotates the user agents. Open your project's middlewares.py (create the file if it does not exist) and add the following code:
import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Read the list defined as USER_AGENT_LIST in settings.py.
        user_agents = spider.settings.getlist('USER_AGENT_LIST')
        if user_agents:
            # setdefault keeps a User-Agent that the request already carries.
            request.headers.setdefault('User-Agent', random.choice(user_agents))
This code creates a new middleware class that inherits from UserAgentMiddleware. In the process_request method, it selects a random user agent from the list and sets it as the User-Agent header of the request, unless the request already has one.
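The selection step can be tried outside Scrapy. The following is a minimal sketch, not Scrapy's own code: apply_random_user_agent is a hypothetical helper, the agent strings are shortened placeholders, and a plain dict stands in for Scrapy's case-insensitive headers object.

```python
import random

# Hypothetical stand-in for the USER_AGENT_LIST setting (placeholder strings).
USER_AGENT_LIST = [
    "example-agent/1.0",
    "example-agent/2.0",
    "example-agent/3.0",
]


def apply_random_user_agent(headers, user_agents, rng=random):
    """Set a random User-Agent on `headers` unless one is already present."""
    if not user_agents:
        return headers  # nothing to choose from; leave the headers untouched
    headers.setdefault("User-Agent", rng.choice(user_agents))
    return headers


plain = apply_random_user_agent({}, USER_AGENT_LIST)
preset = apply_random_user_agent({"User-Agent": "custom/1.0"}, USER_AGENT_LIST)

print(plain["User-Agent"] in USER_AGENT_LIST)  # True
print(preset["User-Agent"])                    # custom/1.0
```

The second call shows why setdefault matters: a request that already sets its own User-Agent keeps it.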
- Finally, enable this middleware by adding the following lines to your settings.py file:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotateUserAgentMiddleware': 110,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
The DOWNLOADER_MIDDLEWARES setting disables the default UserAgentMiddleware by mapping it to None and enables your custom middleware. The number 110 sets the middleware's position in the chain: for process_request, middlewares with lower numbers run first.
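To make the ordering concrete, here is a simplified sketch of how such a setting could be resolved into a processing order. It only approximates what Scrapy does internally: build_chain is a hypothetical helper, and BASE is a small subset of Scrapy's built-in defaults whose order values should be treated as illustrative.

```python
def build_chain(base, custom):
    """Merge two {path: order} dicts, drop None entries, sort by order."""
    merged = {**base, **custom}  # custom entries override base entries
    enabled = {path: order for path, order in merged.items() if order is not None}
    return sorted(enabled, key=enabled.get)


# Illustrative subset of the built-in downloader middlewares.
BASE = {
    "scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware": 400,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 500,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
}

# The setting from settings.py.
CUSTOM = {
    "myproject.middlewares.RotateUserAgentMiddleware": 110,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
}

chain = build_chain(BASE, CUSTOM)
for path in chain:
    print(path)
```

In this sketch the custom middleware comes first and the default UserAgentMiddleware is absent from the chain.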
Now, for every request, Scrapy picks a user agent at random from your list. Because the choice is random, two consecutive requests can occasionally use the same user agent.
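Since random.choice can return the same entry twice in a row, a variant that excludes the previous pick may be useful when consecutive requests must differ. This is only a sketch: pick_different is a hypothetical helper, not part of Scrapy, and it assumes a non-empty list.

```python
import random


def pick_different(user_agents, previous=None, rng=random):
    """Return a random entry that differs from `previous` when possible."""
    candidates = [ua for ua in user_agents if ua != previous]
    # Fall back to the full list if excluding `previous` leaves nothing.
    return rng.choice(candidates or user_agents)


agents = ["example-agent/1.0", "example-agent/2.0", "example-agent/3.0"]

previous = None
picks = []
for _ in range(10):
    previous = pick_different(agents, previous)
    picks.append(previous)

# No two consecutive picks are equal.
print(all(a != b for a, b in zip(picks, picks[1:])))  # True
```

Inside a middleware, the previous pick could be stored on the middleware instance between calls to process_request.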