Scrapy is a powerful and flexible web scraping framework that covers most common situations in the scraping process out of the box, including redirects. Here's a detailed guide to handling redirects in Scrapy.
Scrapy has built-in support for dealing with redirects. It automatically follows HTTP 3xx redirection responses based on the REDIRECT_ENABLED setting, which is True by default.
You can configure the maximum number of consecutive redirects allowed with the REDIRECT_MAX_TIMES setting, which defaults to 20.
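In a project's settings.py, those two settings look like this (the values shown are the defaults):

REDIRECT_ENABLED = True   # True by default; set to False to disable the RedirectMiddleware
REDIRECT_MAX_TIMES = 20   # 20 by default; a request is discarded after this many consecutive redirects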
If you wish to have certain redirect status codes left alone (delivered to your spider rather than followed), you can use the handle_httpstatus_list spider attribute; it is honored both by the RedirectMiddleware and by the HttpErrorMiddleware, which normally filters out non-2xx responses before they reach your callbacks.
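The same can be done per request through the meta dict instead of spider-wide, using the dont_redirect and handle_httpstatus_list meta keys. A small sketch (the URL and the callback name are placeholders):

yield scrapy.Request(
    'https://example.com/some-page',
    # dont_redirect stops the RedirectMiddleware from following the 3xx;
    # handle_httpstatus_list lets the response through the HttpErrorMiddleware
    meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
    callback=self.parse_redirected,
)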
If you need to handle redirections differently, you can override the RedirectMiddleware.
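For instance, here is a minimal sketch of a subclass (MyRedirectMiddleware and the myproject module path are hypothetical names) that logs each redirect before delegating to the default behavior:

from scrapy.downloadermiddlewares.redirect import RedirectMiddleware

class MyRedirectMiddleware(RedirectMiddleware):
    def process_response(self, request, response, spider):
        # Log 3xx responses, then let the stock middleware decide what to do
        if 300 <= response.status < 400:
            spider.logger.info('Redirect %s on %s', response.status, request.url)
        return super().process_response(request, response, spider)

To activate it, swap it into DOWNLOADER_MIDDLEWARES in place of the built-in one (600 is the slot the stock RedirectMiddleware occupies by default):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'myproject.middlewares.MyRedirectMiddleware': 600,
}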
Here's an example:
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://httpbin.org/redirect/1']  # example target
    handle_httpstatus_list = [301, 302]  # Add any other status you need

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        if failure.check(HttpError):
            # HttpError failures carry the response that triggered them
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # non-HTTP failures carry the original request instead
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
In this example, the handle_httpstatus_list attribute specifies which HTTP status codes the spider will handle itself, which allows the parse_httpbin method to receive them. If an error occurs while performing the request, the errback_httpbin method handles it.
Remember that Scrapy's built-in RedirectMiddleware processes responses before your spider does, so if you set a status code in handle_httpstatus_list, responses with that status won't be processed by the RedirectMiddleware.
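Conversely, when the RedirectMiddleware does follow a redirect chain, it records it in the request meta (the redirect_urls and redirect_times keys), which you can inspect from a spider callback. A small sketch:

def parse(self, response):
    # The RedirectMiddleware stores the intermediate URLs and the chain length
    redirect_urls = response.meta.get('redirect_urls', [])
    redirect_times = response.meta.get('redirect_times', 0)
    self.logger.info('Arrived at %s after %d redirect(s) via %r',
                     response.url, redirect_times, redirect_urls)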
The dont_filter=True argument to scrapy.Request tells the duplicate filter to let these requests through even if their URLs were already visited. By default, Scrapy filters out duplicate requests to URLs already visited, which avoids a common pitfall where a page redirects to itself and creates an infinite loop; passing dont_filter=True opts out of that protection for a given request, so use it deliberately (REDIRECT_MAX_TIMES still caps consecutive redirects).
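As a quick illustration of that default behavior (assuming the default RFPDupeFilter, and that the URL has already been requested once; parse_page is a placeholder callback), the first request below is silently dropped while the second bypasses the filter:

def parse(self, response):
    # Dropped by the duplicate filter: this URL was already requested
    yield scrapy.Request('https://example.com/page', callback=self.parse_page)
    # Scheduled regardless, because dont_filter=True bypasses the filter
    yield scrapy.Request('https://example.com/page', callback=self.parse_page,
                         dont_filter=True)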