How do I implement retry logic in Scrapy?
Implementing retry logic in Scrapy is crucial for building robust web scraping applications that can handle temporary network failures, server errors, and rate limiting. Scrapy provides built-in retry mechanisms through middleware, and you can also implement custom retry logic for specific scenarios.
Built-in Retry Middleware
Scrapy comes with a built-in RetryMiddleware that automatically retries failed requests. This middleware is enabled by default and handles most common retry scenarios.
Basic Configuration
Configure retry settings in your settings.py file:
# settings.py
RETRY_ENABLED = True
RETRY_TIMES = 3 # Number of retry attempts
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429] # HTTP codes to retry
RETRY_PRIORITY_ADJUST = -1 # Priority adjustment for retried requests
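Beyond these status codes, the built-in middleware also retries a set of network-level exceptions (timeouts, lost connections, DNS failures). Newer Scrapy releases (2.10+) expose a RETRY_EXCEPTIONS setting for customizing that list; the following is a minimal sketch assuming such a version, with illustrative entries:
# settings.py (Scrapy 2.10+ only)
RETRY_EXCEPTIONS = [
    'twisted.internet.error.TimeoutError',
    'twisted.internet.error.ConnectionRefusedError',
    'scrapy.core.downloader.handlers.http11.TunnelError',
]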
Spider-level Configuration
You can also configure retry settings at the spider level:
import scrapy

class MySpider(scrapy.Spider):
    name = 'example_spider'

    custom_settings = {
        'RETRY_TIMES': 5,
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 408, 429, 403],
        'RETRY_PRIORITY_ADJUST': -2,
    }

    def start_requests(self):
        urls = ['https://example.com/page1', 'https://example.com/page2']
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Your parsing logic here
        yield {'title': response.css('title::text').get()}
Custom Retry Middleware
For more advanced retry logic, you can create custom retry middleware:
# middlewares.py
import random
import time
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message
class CustomRetryMiddleware(RetryMiddleware):
    def __init__(self, settings):
        super().__init__(settings)
        self.max_retry_times = settings.getint("RETRY_TIMES")
        self.retry_http_codes = set(int(x) for x in settings.getlist("RETRY_HTTP_CODES"))
        self.priority_adjust = settings.getint("RETRY_PRIORITY_ADJUST")

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response

        # Retry on configured HTTP status codes
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response

        # Custom retry condition: suspiciously short response body
        if len(response.body) < 100:
            reason = "Response body too short"
            return self._retry(request, reason, spider) or response

        return response

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) and not request.meta.get('dont_retry', False):
            return self._retry(request, exception, spider)

    def _retry(self, request, reason, spider):
        retries = request.meta.get('retry_times', 0) + 1
        if retries <= self.max_retry_times:
            spider.logger.debug(f"Retrying {request.url} (failed {retries} times): {reason}")
            # Exponential backoff with jitter.
            # Note: time.sleep() blocks the whole crawler; acceptable for small
            # crawls, but it reduces throughput on large ones.
            delay = min(300, (2 ** retries) + random.uniform(0, 1))
            time.sleep(delay)
            retryreq = request.copy()
            retryreq.meta['retry_times'] = retries
            retryreq.priority = request.priority + self.priority_adjust
            return retryreq
        else:
            spider.logger.debug(f"Gave up retrying {request.url} (failed {retries} times): {reason}")
Enable your custom middleware in settings:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,  # Disable the built-in middleware
    'myproject.middlewares.CustomRetryMiddleware': 550,          # Enable the custom one
}
Request-level Retry Control
You can control retry behavior for individual requests using meta parameters:
import scrapy

class MySpider(scrapy.Spider):
    name = 'selective_retry_spider'

    def start_requests(self):
        # Request with custom retry settings.
        # Note: 'max_retry_times' is understood by the built-in RetryMiddleware;
        # 'retry_http_codes' in meta takes effect only if custom middleware reads it.
        yield scrapy.Request(
            'https://important-page.com',
            callback=self.parse_important,
            meta={
                'max_retry_times': 10,  # Override default retry times
                'retry_http_codes': [500, 502, 503, 504, 429]
            }
        )

        # Request with no retries
        yield scrapy.Request(
            'https://optional-page.com',
            callback=self.parse_optional,
            meta={'dont_retry': True}
        )

    def parse_important(self, response):
        # Handle critical data
        yield {'important_data': response.css('div.content::text').get()}

    def parse_optional(self, response):
        # Handle non-critical data
        yield {'optional_data': response.css('div.sidebar::text').get()}
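When a request ultimately fails (for example, after its retries are exhausted on an exception, or when an error status is filtered by HttpErrorMiddleware), the failure is routed to the request's errback if one is set. The sketch below follows Scrapy's documented errback pattern; handle_error is just an illustrative name:
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TCPTimedOutError, TimeoutError

class ErrbackSpider(scrapy.Spider):
    name = 'errback_spider'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/flaky-page',
            callback=self.parse,
            errback=self.handle_error,  # called when the request ultimately fails
        )

    def parse(self, response):
        yield {'title': response.css('title::text').get()}

    def handle_error(self, failure):
        # failure is a twisted Failure wrapping the original exception
        if failure.check(HttpError):
            self.logger.error('HttpError on %s', failure.value.response.url)
        elif failure.check(DNSLookupError):
            self.logger.error('DNSLookupError on %s', failure.request.url)
        elif failure.check(TimeoutError, TCPTimedOutError):
            self.logger.error('TimeoutError on %s', failure.request.url)
        else:
            self.logger.error('Unhandled failure: %r', failure)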
Retry with Different Strategies
Exponential Backoff
Implement exponential backoff to avoid overwhelming servers:
import random
import time

from scrapy.downloadermiddlewares.retry import RetryMiddleware

class ExponentialBackoffRetryMiddleware(RetryMiddleware):
    def _retry(self, request, reason, spider):
        retries = request.meta.get('retry_times', 0) + 1
        if retries <= self.max_retry_times:
            # Exponential backoff: 2^retries seconds plus random jitter, capped at 300s
            delay = min(300, (2 ** retries) + random.uniform(0, 1))
            spider.logger.info(
                f"Retrying {request.url} in {delay:.2f} seconds "
                f"(attempt {retries}/{self.max_retry_times}): {reason}"
            )
            time.sleep(delay)
            retryreq = request.copy()
            retryreq.meta['retry_times'] = retries
            retryreq.priority = request.priority + self.priority_adjust
            return retryreq
        else:
            spider.logger.error(f"Max retries exceeded for {request.url}: {reason}")
            return None
Conditional Retry Based on Response Content
from scrapy.downloadermiddlewares.retry import RetryMiddleware

class ContentBasedRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response

        # Retry if the response body contains known error indicators
        if self._should_retry_response(response):
            reason = "Response contains error indicators"
            return self._retry(request, reason, spider) or response

        return response

    def _should_retry_response(self, response):
        error_indicators = [
            'temporarily unavailable',
            'rate limit exceeded',
            'please try again later',
            'service unavailable',
        ]
        response_text = response.text.lower()
        return any(indicator in response_text for indicator in error_indicators)
Handling Rate Limiting
Implement smart retry logic for rate-limited responses:
import time

from scrapy.downloadermiddlewares.retry import RetryMiddleware

class RateLimitRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        if response.status == 429:  # Too Many Requests
            retry_after = response.headers.get('Retry-After')
            if retry_after:
                # Assumes a delay in seconds (Retry-After may also be an HTTP date)
                delay = int(retry_after)
            else:
                # Default delay if no Retry-After header is present
                delay = 60
            spider.logger.info(f"Rate limited. Waiting {delay} seconds before retry.")
            time.sleep(delay)
            return self._retry(request, "Rate limited", spider) or response
        return super().process_response(request, response, spider)
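Retrying 429s treats the symptom; slowing the crawl down avoids many of them in the first place. Scrapy's built-in AutoThrottle extension adjusts download delays dynamically and works well alongside retry middleware; a minimal settings sketch with illustrative values:
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0          # highest delay to back off to under load
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per remote host
DOWNLOAD_DELAY = 0.5                   # lower bound for the per-request delay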
Advanced Retry Patterns
Circuit Breaker Pattern
Implement a circuit breaker to temporarily stop requests to failing domains:
import time
from collections import defaultdict

from scrapy.downloadermiddlewares.retry import RetryMiddleware

class CircuitBreakerRetryMiddleware(RetryMiddleware):
    def __init__(self, settings):
        super().__init__(settings)
        self.failure_counts = defaultdict(int)
        self.circuit_open_until = defaultdict(float)
        self.failure_threshold = 5
        self.circuit_timeout = 300  # 5 minutes

    def process_response(self, request, response, spider):
        domain = request.url.split('/')[2]  # Naive domain extraction (urllib.parse.urlsplit is more robust)

        # While the circuit is open for this domain, pass the response through without retrying
        if time.time() < self.circuit_open_until[domain]:
            spider.logger.warning(f"Circuit breaker open for {domain}. Not retrying.")
            return response

        if response.status in self.retry_http_codes:
            self.failure_counts[domain] += 1

            # Open the circuit once the failure threshold is exceeded
            if self.failure_counts[domain] >= self.failure_threshold:
                self.circuit_open_until[domain] = time.time() + self.circuit_timeout
                spider.logger.error(f"Circuit breaker opened for {domain}")
                return response

            return self._retry(request, f"HTTP {response.status}", spider) or response
        else:
            # Reset the failure count on a successful response
            self.failure_counts[domain] = 0
            return response
Best Practices for Retry Logic
1. Use Appropriate Retry Codes
Only retry on recoverable errors:
# Good practice: Retry on server errors and timeouts
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
# Avoid retrying client errors (usually permanent)
# Don't include: 400, 401, 403, 404
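For client errors that you deliberately do not retry, it can still be useful to receive the response in your callback rather than have it silently dropped. A short sketch using Scrapy's handle_httpstatus_list spider attribute (URL is illustrative):
import scrapy

class TolerantSpider(scrapy.Spider):
    name = 'tolerant_spider'
    start_urls = ['https://example.com/maybe-missing']
    # Allow 404 responses to reach parse() instead of being filtered out
    handle_httpstatus_list = [404]

    def parse(self, response):
        if response.status == 404:
            self.logger.info('Page gone, skipping: %s', response.url)
            return
        yield {'title': response.css('title::text').get()}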
2. Implement Logging and Monitoring
Add comprehensive logging to track retry behavior:
from scrapy.downloadermiddlewares.retry import RetryMiddleware

class LoggingRetryMiddleware(RetryMiddleware):
    def _retry(self, request, reason, spider):
        retries = request.meta.get('retry_times', 0) + 1
        if retries <= self.max_retry_times:
            spider.logger.info(
                f"RETRY: {request.url} | Attempt: {retries}/{self.max_retry_times} | Reason: {reason}"
            )
            # Record retries in Scrapy's stats collector (reported at the end of the crawl)
            spider.crawler.stats.inc_value('retry_count')
            spider.crawler.stats.inc_value(f'retry_reason/{reason}')
            return super()._retry(request, reason, spider)
        else:
            spider.logger.error(f"RETRY_FAILED: {request.url} | Max retries exceeded")
            spider.crawler.stats.inc_value('retry_max_reached')
3. Handle Different Types of Failures
Different types of failures call for different retry strategies: a request timeout is usually transient, while a DNS failure may signal a longer outage. Much as when handling network timeouts in Puppeteer, you need to consider each failure mode separately. The sketch below maps common download exceptions to different delays:
import time

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from twisted.internet.error import ConnectionLost, ConnectionRefusedError, DNSLookupError, TCPTimedOutError, TimeoutError

class SmartRetryMiddleware(RetryMiddleware):
    def process_exception(self, request, exception, spider):
        if isinstance(exception, (TimeoutError, TCPTimedOutError, ConnectionRefusedError, ConnectionLost)):
            # Transient network issues - retry after a longer delay
            return self._retry_with_delay(request, "Network timeout", spider, delay=30)
        elif isinstance(exception, DNSLookupError):
            # DNS issues often indicate a longer outage - wait even longer
            return self._retry_with_delay(request, "DNS lookup failed", spider, delay=120)
        else:
            return super().process_exception(request, exception, spider)

    def _retry_with_delay(self, request, reason, spider, delay):
        time.sleep(delay)  # blocking, as in the earlier backoff examples
        return self._retry(request, reason, spider)
JavaScript Code Example for Comparison
For developers familiar with Node.js, here's how similar retry logic might look in JavaScript:
const axios = require('axios');

class RetryClient {
  constructor(maxRetries = 3, baseDelay = 1000) {
    this.maxRetries = maxRetries;
    this.baseDelay = baseDelay;
  }

  async fetchWithRetry(url, options = {}) {
    for (let attempt = 0; attempt <= this.maxRetries; attempt++) {
      try {
        const response = await axios.get(url, options);
        return response.data;
      } catch (error) {
        if (attempt === this.maxRetries) {
          throw error;
        }
        // Only retry errors that are likely to be transient
        if (this.isRetryableError(error)) {
          const delay = this.calculateDelay(attempt);
          console.log(`Retrying ${url} in ${delay}ms (attempt ${attempt + 1})`);
          await this.sleep(delay);
        } else {
          throw error;
        }
      }
    }
  }

  isRetryableError(error) {
    if (!error.response) return true; // Network error
    const status = error.response.status;
    return [500, 502, 503, 504, 408, 429].includes(status);
  }

  calculateDelay(attempt) {
    // Exponential backoff with jitter, capped at 5 minutes
    return Math.min(300000, (Math.pow(2, attempt) * this.baseDelay) + Math.random() * 1000);
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Usage
const client = new RetryClient(5, 1000);
client.fetchWithRetry('https://api.example.com/data')
  .then(data => console.log(data))
  .catch(error => console.error('Failed after all retries:', error));
Testing Retry Logic
Create tests to verify your retry implementation:
# test_retry.py
import unittest
from unittest.mock import patch

from scrapy.http import Request, Response
from scrapy.settings import Settings
from scrapy.spiders import Spider

from myproject.middlewares import CustomRetryMiddleware

class TestRetryMiddleware(unittest.TestCase):
    def setUp(self):
        self.spider = Spider('test')
        # RetryMiddleware expects a Settings object, not a plain dict
        self.middleware = CustomRetryMiddleware(Settings({
            'RETRY_TIMES': 3,
            'RETRY_HTTP_CODES': [500, 502, 503],
            'RETRY_PRIORITY_ADJUST': -1,
        }))

    def test_retry_on_server_error(self):
        request = Request('http://example.com')
        response = Response('http://example.com', status=500)
        with patch('time.sleep'):  # skip the backoff delay in tests
            result = self.middleware.process_response(request, response, self.spider)
        self.assertIsInstance(result, Request)
        self.assertEqual(result.meta['retry_times'], 1)

    def test_no_retry_on_success(self):
        request = Request('http://example.com')
        # Body long enough to avoid the "response body too short" retry condition
        response = Response('http://example.com', status=200, body=b'x' * 200)
        result = self.middleware.process_response(request, response, self.spider)
        self.assertEqual(result, response)

    def test_max_retries_exceeded(self):
        request = Request('http://example.com', meta={'retry_times': 3})
        response = Response('http://example.com', status=500)
        result = self.middleware.process_response(request, response, self.spider)
        self.assertEqual(result, response)  # Should not retry anymore
Command Line Testing
Test your retry logic using Scrapy's built-in tools:
# Run spider with verbose logging to see retry attempts
scrapy crawl myspider -L DEBUG
# Test specific URLs with retry logic
scrapy shell "https://httpstat.us/500"
# Monitor retry statistics
scrapy crawl myspider -s RETRY_TIMES=5 -s LOG_LEVEL=INFO
Monitoring and Debugging
Set up monitoring to track retry performance:
# middlewares.py - custom stats collection
from scrapy.downloadermiddlewares.retry import RetryMiddleware

class RetryStatsMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        # Track response codes
        spider.crawler.stats.inc_value(f'response_status_count/{response.status}')

        if response.status in self.retry_http_codes:
            spider.crawler.stats.inc_value('retry_triggered_count')
            spider.crawler.stats.inc_value(f'retry_status/{response.status}')

        return super().process_response(request, response, spider)
Stats collection is enabled by default through MemoryStatsCollector; the collected values, including any custom retry counters, are dumped to the log when the spider finishes:
# Explicitly select the in-memory stats collector (the default) and view stats after completion
scrapy crawl myspider -s STATS_CLASS=scrapy.statscollectors.MemoryStatsCollector
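To act on those counters programmatically (for example, to flag runs with unusually many retries), you can read the stats when the spider closes. A minimal sketch using the standard closed() hook; the threshold is illustrative:
import scrapy

class MonitoredSpider(scrapy.Spider):
    name = 'monitored_spider'

    def closed(self, reason):
        # All stats collected during the run, including the retry_* counters above
        stats = self.crawler.stats.get_stats()
        retries = stats.get('retry_triggered_count', 0)
        self.logger.info('Spider finished (%s) with %d retries triggered', reason, retries)
        if retries > 100:  # illustrative threshold
            self.logger.warning('Unusually high retry count; the target site may be struggling')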
Conclusion
Implementing effective retry logic in Scrapy involves understanding the built-in retry middleware, creating custom retry strategies for specific needs, and following best practices for robust web scraping. Key considerations include:
- Using appropriate HTTP status codes for retries
- Implementing exponential backoff to avoid overwhelming servers
- Adding comprehensive logging and monitoring
- Testing retry logic thoroughly
- Considering different failure modes and recovery strategies
Much like implementing error handling patterns in other scraping tools, proper retry logic ensures your Scrapy spiders can handle the unpredictable nature of web scraping while maintaining efficiency and respecting target websites.
By combining Scrapy's built-in retry capabilities with custom middleware tailored to your specific requirements, you can build resilient scrapers that gracefully handle failures and maximize data collection success rates. Whether you're dealing with temporary network issues, rate limiting, or server errors, a well-implemented retry strategy is essential for production-ready web scraping applications.