Handling HTTP errors is crucial when developing a robust web scraper with Scrapy. Scrapy processes HTTP response errors through its built-in HttpErrorMiddleware, a spider middleware that is enabled by default and filters out unsuccessful (non-2xx) responses before they reach your callbacks.
Here's a step-by-step guide to handle HTTP errors in Scrapy:
Step 1: Enable HttpErrorMiddleware
First, make sure that HttpErrorMiddleware is enabled in your settings.py file. It is a spider middleware and is enabled by default, but if it has been disabled you can re-enable it by adding the following line:

SPIDER_MIDDLEWARES = {'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50}
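The middleware can also be told to let selected error responses through to your callbacks. If your spider legitimately needs to process certain status codes (for example, pages that intentionally return 404), you can whitelist them in settings.py with Scrapy's HTTPERROR_ALLOWED_CODES setting. The snippet below is a minimal sketch; the specific codes are just examples:

# settings.py
# Let 404 and 410 responses reach spider callbacks instead of being
# filtered out by HttpErrorMiddleware.
HTTPERROR_ALLOWED_CODES = [404, 410]

# Alternatively, pass every response through regardless of status
# (use with care -- your callbacks must then check response.status):
# HTTPERROR_ALLOW_ALL = True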
Step 2: Handle HTTP Errors
By default, HttpErrorMiddleware filters out unsuccessful (non-2xx) responses, so they never reach the callback specified in your request. To react to these errors, attach an errback function to the request; Scrapy calls it whenever an error occurs while processing the request, including HTTP errors raised by the middleware.
Here's an example of how to handle HTTP errors:
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://httpbin.org/status/404']

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u,
                                 callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from %s', response.url)
        # Normal processing...

    def errback_httpbin(self, failure):
        # Log all failures
        self.logger.error(repr(failure))

        # In case you want to do something special for some errors,
        # you may need the failure's type:
        if failure.check(HttpError):
            # HttpError failures carry the original response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)
        elif failure.check(DNSLookupError):
            # DNSLookupError failures carry the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)
        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
In this example, if the response status is not in the 2xx range, HttpErrorMiddleware raises an HttpError and Scrapy calls the errback_httpbin method instead of the callback. The errback logs the failure, checks its type, and handles each kind of error differently.
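Sometimes you would rather have particular error responses delivered to your regular callback than to the errback. Scrapy supports this per spider via the handle_httpstatus_list attribute (or per request via the handle_httpstatus_list key in Request.meta). Here's a short sketch; the spider name and URL are just placeholders:

import scrapy

class StatusAwareSpider(scrapy.Spider):
    name = 'statusaware'
    start_urls = ['http://httpbin.org/status/404']
    # Non-2xx codes listed here are passed to the callback
    # instead of being dropped by HttpErrorMiddleware.
    handle_httpstatus_list = [404]

    def parse(self, response):
        if response.status == 404:
            self.logger.warning('Page missing: %s', response.url)
            return
        # Normal processing for successful responses...

This approach works well when an error status carries meaning for your crawl (for example, detecting removed pages), while the errback remains the right place for genuinely unexpected failures.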
Step 3: Handle Specific HTTP Errors
If you want to handle specific HTTP status codes, you can trap the HttpError with the failure's check method (provided by Twisted's Failure class) and then inspect the status of the attached response.
Here's an example:
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

# Inside your Spider subclass:
def errback_httpbin(self, failure):
    # Log all failures
    self.logger.error(repr(failure))

    if failure.check(HttpError):
        response = failure.value.response
        if response.status == 404:
            self.logger.error('404 error on %s', response.url)
        elif response.status == 500:
            self.logger.error('500 error on %s', response.url)
In this example, the errback_httpbin
method checks if the response code is 404 or 500 and logs a different message for each one.
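For transient server-side errors such as 500 or 503, you often want to retry rather than just log. Scrapy ships with a RetryMiddleware that does this automatically; as a rough sketch, you could tune it in settings.py like so (the values below are illustrative, not recommendations):

# settings.py
RETRY_ENABLED = True
# Retry each failing request up to 3 times (in addition to the first try)
RETRY_TIMES = 3
# Only retry these HTTP status codes; others fail immediately
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]

Once the retries are exhausted, the failure is handed to your errback as before, so the two mechanisms work together.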
In conclusion, handling HTTP errors in Scrapy is a straightforward process. By using the HttpErrorMiddleware
and the errback
function, you can ensure your spider handles HTTP errors gracefully and logs useful information when an error occurs.