How do I handle redirects in Scrapy?

Scrapy is a powerful and flexible web scraping framework that handles most common situations in web scraping, including redirects. Here's a detailed guide to handling redirects in Scrapy.

Scrapy has built-in support for dealing with redirects. It automatically follows HTTP 3xx redirection responses based on the REDIRECT_ENABLED setting, which is True by default.

You can configure the maximum number of consecutive redirects allowed with the REDIRECT_MAX_TIMES setting, which defaults to 20.
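For example, you could tune both settings in your project's settings.py. This is a minimal sketch; the values shown are illustrative, not recommendations:

# settings.py
REDIRECT_ENABLED = True    # the default; set to False to disable redirect handling entirely
REDIRECT_MAX_TIMES = 5     # default is 20; lower it to fail faster on long redirect chains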

If you wish to ignore certain redirect status codes, you can use the handle_httpstatus_list spider attribute, or the HTTPERROR_ALLOWED_CODES setting used by the HttpErrorMiddleware.
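You can also control this per request with the dont_redirect and handle_httpstatus_list meta keys, which the RedirectMiddleware and HttpErrorMiddleware respect. A minimal sketch (the spider name and URL are placeholders):

import scrapy

class NoRedirectSpider(scrapy.Spider):
    name = 'no_redirect'  # placeholder name

    def start_requests(self):
        # dont_redirect stops RedirectMiddleware from following the 3xx;
        # handle_httpstatus_list lets the 301/302 response reach the callback
        # instead of being filtered out by HttpErrorMiddleware.
        yield scrapy.Request(
            'https://example.com/old-page',  # placeholder URL
            meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info('Got %s, Location header: %s',
                         response.status, response.headers.get('Location'))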

If you need to handle redirections differently, you can override the RedirectMiddleware.
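For instance, you could subclass the built-in middleware to log every redirect before delegating to the stock behavior. This is a sketch rather than Scrapy's own code, and the myproject.middlewares path in DOWNLOADER_MIDDLEWARES is hypothetical:

# middlewares.py
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware

class LoggingRedirectMiddleware(RedirectMiddleware):
    def process_response(self, request, response, spider):
        if 300 <= response.status < 400:
            spider.logger.info('Redirect %s: %s -> %s',
                               response.status, request.url,
                               response.headers.get('Location'))
        # Delegate to the stock middleware, which follows the redirect
        # (or returns the response unchanged when redirects don't apply).
        return super().process_response(request, response, spider)

# settings.py -- replace the built-in middleware at its default priority (600)
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'myproject.middlewares.LoggingRedirectMiddleware': 600,
}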

Here's an example:

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://httpbin.org/']  # replace with your target URLs

    # These statuses reach the callback instead of being redirected;
    # add any other status you need.
    handle_httpstatus_list = [301, 302]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        if failure.check(HttpError):
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

In this example, the handle_httpstatus_list attribute specifies which HTTP status codes the spider will handle itself, so responses with those statuses are passed to the parse_httpbin callback instead of being redirected. If an error occurs while performing the request, the errback_httpbin method handles it.

Remember that Scrapy's built-in RedirectMiddleware will process the responses before your spider, so if you set a status code in handle_httpstatus_list, that response won't be processed by the RedirectMiddleware.

The dont_filter=True argument on scrapy.Request tells Scrapy not to drop the request as a duplicate, so it is made even if the URL was previously visited. By default, Scrapy filters out duplicate requests to already-visited URLs, which also guards against a common pitfall where a page redirects to itself and creates an infinite loop, so disable the filter with care.
