How do I implement retry logic in Scrapy?

Implementing retry logic in Scrapy is crucial for building robust web scraping applications that can handle temporary network failures, server errors, and rate limiting. Scrapy provides built-in retry mechanisms through middleware, and you can also implement custom retry logic for specific scenarios.

Built-in Retry Middleware

Scrapy comes with a built-in RetryMiddleware that automatically retries failed requests. This middleware is enabled by default and handles most common retry scenarios.

Basic Configuration

Configure retry settings in your settings.py file:

# settings.py
RETRY_ENABLED = True
RETRY_TIMES = 3  # Number of retry attempts
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]  # HTTP codes to retry
RETRY_PRIORITY_ADJUST = -1  # Priority adjustment for retried requests

Spider-level Configuration

You can also configure retry settings at the spider level:

import scrapy

class MySpider(scrapy.Spider):
    name = 'example_spider'
    custom_settings = {
        'RETRY_TIMES': 5,
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 408, 429, 403],  # 403 added deliberately for sites that return it on transient anti-bot blocks
        'RETRY_PRIORITY_ADJUST': -2,
    }

    def start_requests(self):
        urls = ['https://example.com/page1', 'https://example.com/page2']
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Your parsing logic here
        yield {'title': response.css('title::text').get()}

Custom Retry Middleware

For more advanced retry logic, you can create custom retry middleware:

# middlewares.py
import random
import time
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

class CustomRetryMiddleware(RetryMiddleware):

    def __init__(self, settings):
        super().__init__(settings)
        self.max_retry_times = settings.getint("RETRY_TIMES")
        self.retry_http_codes = set(int(x) for x in settings.getlist("RETRY_HTTP_CODES"))
        self.priority_adjust = settings.getint("RETRY_PRIORITY_ADJUST")

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response

        # Check if response should be retried
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response

        # Custom retry condition: empty response body
        if len(response.body) < 100:
            reason = "Response body too short"
            return self._retry(request, reason, spider) or response

        return response

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) and not request.meta.get('dont_retry', False):
            return self._retry(request, exception, spider)

    def _retry(self, request, reason, spider):
        retries = request.meta.get('retry_times', 0) + 1

        if retries <= self.max_retry_times:
            spider.logger.debug(f"Retrying {request.url} (failed {retries} times): {reason}")

            # Exponential backoff with jitter (note: time.sleep blocks Scrapy's
            # reactor, pausing all in-flight requests while it waits)
            delay = min(300, (2 ** retries) + random.uniform(0, 1))
            time.sleep(delay)

            retryreq = request.copy()
            retryreq.meta['retry_times'] = retries
            retryreq.dont_filter = True  # bypass the dupefilter so the retried copy is not dropped
            retryreq.priority = request.priority + self.priority_adjust

            return retryreq
        else:
            spider.logger.debug(f"Gave up retrying {request.url} (failed {retries} times): {reason}")

Enable your custom middleware in settings:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,  # Disable built-in
    'myproject.middlewares.CustomRetryMiddleware': 550,  # Enable custom
}

Request-level Retry Control

You can control retry behavior for individual requests using the dont_retry and max_retry_times meta keys, which the built-in RetryMiddleware honors:

import scrapy

class MySpider(scrapy.Spider):
    name = 'selective_retry_spider'

    def start_requests(self):
        # Request with a higher per-request retry limit
        yield scrapy.Request(
            'https://important-page.com',
            callback=self.parse_important,
            # max_retry_times overrides RETRY_TIMES for this request only;
            # per-request HTTP code lists are not supported by the built-in
            # middleware and would require custom middleware
            meta={'max_retry_times': 10}
        )

        # Request with no retries
        yield scrapy.Request(
            'https://optional-page.com',
            callback=self.parse_optional,
            meta={'dont_retry': True}
        )

    def parse_important(self, response):
        # Handle critical data
        yield {'important_data': response.css('div.content::text').get()}

    def parse_optional(self, response):
        # Handle non-critical data
        yield {'optional_data': response.css('div.sidebar::text').get()}
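
If you need to trigger a retry from inside a callback (for example, when a page loads but its content looks like a soft failure), Scrapy 2.5+ ships a get_retry_request() helper that applies the same retry bookkeeping as the middleware. A minimal sketch, assuming Scrapy 2.5 or newer:

import scrapy
from scrapy.downloadermiddlewares.retry import get_retry_request

class CallbackRetrySpider(scrapy.Spider):
    name = 'callback_retry_spider'
    start_urls = ['https://example.com/page1']

    def parse(self, response):
        title = response.css('title::text').get()
        if not title:
            # Soft failure: ask Scrapy to retry this request with standard retry accounting
            retry_request = get_retry_request(
                response.request,
                spider=self,
                reason='empty title',
            )
            if retry_request is not None:  # None once max_retry_times is exhausted
                yield retry_request
            return
        yield {'title': title}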

Retry with Different Strategies

Exponential Backoff

Implement exponential backoff to avoid overwhelming servers:

import random
import time
from scrapy.downloadermiddlewares.retry import RetryMiddleware

class ExponentialBackoffRetryMiddleware(RetryMiddleware):

    def _retry(self, request, reason, spider):
        retries = request.meta.get('retry_times', 0) + 1

        if retries <= self.max_retry_times:
            # Exponential backoff: 2^retries + random jitter
            delay = min(300, (2 ** retries) + random.uniform(0, 1))

            spider.logger.info(
                f"Retrying {request.url} in {delay:.2f} seconds "
                f"(attempt {retries}/{self.max_retry_times}): {reason}"
            )

            time.sleep(delay)

            retryreq = request.copy()
            retryreq.meta['retry_times'] = retries
            retryreq.dont_filter = True  # bypass the dupefilter so the retried copy is not dropped
            retryreq.priority = request.priority + self.priority_adjust

            return retryreq
        else:
            spider.logger.error(f"Max retries exceeded for {request.url}: {reason}")
            return None

Conditional Retry Based on Response Content

from scrapy.downloadermiddlewares.retry import RetryMiddleware

class ContentBasedRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response

        # Retry if response contains error indicators
        if self._should_retry_response(response):
            reason = "Response contains error indicators"
            return self._retry(request, reason, spider) or response

        return response

    def _should_retry_response(self, response):
        error_indicators = [
            'temporarily unavailable',
            'rate limit exceeded',
            'please try again later',
            'service unavailable'
        ]

        response_text = response.text.lower()
        return any(indicator in response_text for indicator in error_indicators)

Handling Rate Limiting

Implement smart retry logic for rate-limited responses:

import time
from scrapy.downloadermiddlewares.retry import RetryMiddleware

class RateLimitRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        if response.status == 429:  # Too Many Requests
            retry_after = response.headers.get('Retry-After')

            # Honor the server's Retry-After header if present, otherwise fall back to 60s
            delay = int(retry_after) if retry_after else 60
            spider.logger.info(f"Rate limited. Waiting {delay} seconds before retry.")
            time.sleep(delay)

            return self._retry(request, "Rate limited", spider) or response

        return super().process_response(request, response, spider)
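
Keep in mind that time.sleep() inside a downloader middleware blocks Scrapy's reactor, pausing every in-flight request while it waits. For sustained rate limiting it is usually gentler to combine retries with Scrapy's built-in AutoThrottle extension, which adapts the crawl rate automatically. A possible settings sketch:

# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # maximum delay under heavy latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per remote server
DOWNLOAD_DELAY = 1                     # baseline delay between requests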

Advanced Retry Patterns

Circuit Breaker Pattern

Implement a circuit breaker to temporarily stop requests to failing domains:

import time
from collections import defaultdict
from scrapy.downloadermiddlewares.retry import RetryMiddleware

class CircuitBreakerRetryMiddleware(RetryMiddleware):

    def __init__(self, settings):
        super().__init__(settings)
        self.failure_counts = defaultdict(int)
        self.circuit_open_until = defaultdict(float)
        self.failure_threshold = 5
        self.circuit_timeout = 300  # 5 minutes

    def process_response(self, request, response, spider):
        domain = request.url.split('/')[2]  # Extract domain

        # If the circuit is open for this domain, pass the response through without retrying
        if time.time() < self.circuit_open_until[domain]:
            spider.logger.warning(f"Circuit breaker open for {domain}. Not retrying.")
            return response

        if response.status in self.retry_http_codes:
            self.failure_counts[domain] += 1

            # Open circuit if failure threshold exceeded
            if self.failure_counts[domain] >= self.failure_threshold:
                self.circuit_open_until[domain] = time.time() + self.circuit_timeout
                spider.logger.error(f"Circuit breaker opened for {domain}")
                return response

            return self._retry(request, f"HTTP {response.status}", spider) or response
        else:
            # Reset failure count on successful response
            self.failure_counts[domain] = 0

        return response

Best Practices for Retry Logic

1. Use Appropriate Retry Codes

Only retry on recoverable errors:

# Good practice: Retry on server errors and timeouts
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

# Avoid retrying client errors (usually permanent)
# Don't include: 400, 401, 403, 404

2. Implement Logging and Monitoring

Add comprehensive logging to track retry behavior:

import logging
from scrapy.downloadermiddlewares.retry import RetryMiddleware

class LoggingRetryMiddleware(RetryMiddleware):

    def _retry(self, request, reason, spider):
        retries = request.meta.get('retry_times', 0) + 1

        if retries <= self.max_retry_times:
            spider.logger.info(
                f"RETRY: {request.url} | Attempt: {retries}/{self.max_retry_times} | Reason: {reason}"
            )

            # Log to external monitoring system
            spider.crawler.stats.inc_value('retry_count')
            spider.crawler.stats.inc_value(f'retry_reason/{reason}')

            return super()._retry(request, reason, spider)
        else:
            spider.logger.error(f"RETRY_FAILED: {request.url} | Max retries exceeded")
            spider.crawler.stats.inc_value('retry_max_reached')

3. Handle Different Types of Failures

Different types of failures call for different retry strategies: a connection timeout is usually transient, while a DNS failure may take much longer to resolve. Much as you would distinguish timeouts from other errors in a headless-browser tool like Puppeteer, in Scrapy you can branch on the exception type:

# middlewares.py
import time

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from twisted.internet.error import (
    ConnectionRefusedError,
    DNSLookupError,
    TCPTimedOutError,
    TimeoutError as TxTimeoutError,
)

class SmartRetryMiddleware(RetryMiddleware):

    def process_exception(self, request, exception, spider):
        if isinstance(exception, (TxTimeoutError, TCPTimedOutError, ConnectionRefusedError)):
            # Transient network issues - retry after a longer pause
            return self._retry_with_delay(request, "Network timeout", spider, delay=30)
        elif isinstance(exception, DNSLookupError):
            # DNS issues - back off much longer before retrying
            return self._retry_with_delay(request, "DNS lookup failed", spider, delay=120)
        else:
            return super().process_exception(request, exception, spider)

    def _retry_with_delay(self, request, reason, spider, delay):
        # Note: time.sleep blocks the reactor; see the AutoThrottle note above for alternatives
        time.sleep(delay)
        return self._retry(request, reason, spider)

JavaScript Code Example for Comparison

For developers familiar with Node.js, here's how similar retry logic might look in JavaScript:

const axios = require('axios');

class RetryClient {
    constructor(maxRetries = 3, baseDelay = 1000) {
        this.maxRetries = maxRetries;
        this.baseDelay = baseDelay;
    }

    async fetchWithRetry(url, options = {}) {
        let lastError;

        for (let attempt = 0; attempt <= this.maxRetries; attempt++) {
            try {
                const response = await axios.get(url, options);
                return response.data;
            } catch (error) {
                lastError = error;

                if (attempt === this.maxRetries) {
                    throw error;
                }

                // Check if error is retryable
                if (this.isRetryableError(error)) {
                    const delay = this.calculateDelay(attempt);
                    console.log(`Retrying ${url} in ${delay}ms (attempt ${attempt + 1})`);
                    await this.sleep(delay);
                } else {
                    throw error;
                }
            }
        }
    }

    isRetryableError(error) {
        if (!error.response) return true; // Network error
        const status = error.response.status;
        return [500, 502, 503, 504, 408, 429].includes(status);
    }

    calculateDelay(attempt) {
        // Exponential backoff with jitter
        return Math.min(300000, (Math.pow(2, attempt) * this.baseDelay) + Math.random() * 1000);
    }

    sleep(ms) {
        return new Promise(resolve => setTimeout(resolve, ms));
    }
}

// Usage
const client = new RetryClient(5, 1000);
client.fetchWithRetry('https://api.example.com/data')
    .then(data => console.log(data))
    .catch(error => console.error('Failed after all retries:', error));

Testing Retry Logic

Create tests to verify your retry implementation:

# test_retry.py
import unittest
from scrapy.http import Request, Response
from scrapy.settings import Settings
from scrapy.spiders import Spider
from myproject.middlewares import CustomRetryMiddleware

class TestRetryMiddleware(unittest.TestCase):

    def setUp(self):
        self.spider = Spider('test')
        # RetryMiddleware expects a Settings object, not a plain dict
        settings = Settings({
            'RETRY_ENABLED': True,
            'RETRY_TIMES': 3,
            'RETRY_HTTP_CODES': [500, 502, 503],
            'RETRY_PRIORITY_ADJUST': -1,
        })
        self.middleware = CustomRetryMiddleware(settings)

    def test_retry_on_server_error(self):
        request = Request('http://example.com')
        response = Response('http://example.com', status=500)

        result = self.middleware.process_response(request, response, self.spider)

        self.assertIsInstance(result, Request)
        self.assertEqual(result.meta['retry_times'], 1)

    def test_no_retry_on_success(self):
        request = Request('http://example.com')
        # Body must be long enough to pass the middleware's "too short" check
        response = Response('http://example.com', status=200, body=b'x' * 200)

        result = self.middleware.process_response(request, response, self.spider)

        self.assertEqual(result, response)

    def test_max_retries_exceeded(self):
        request = Request('http://example.com', meta={'retry_times': 3})
        response = Response('http://example.com', status=500)

        result = self.middleware.process_response(request, response, self.spider)

        self.assertEqual(result, response)  # Should not retry anymore
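
Run the tests with the standard unittest runner (assuming test_retry.py sits next to your Scrapy project package so that myproject is importable):

python -m unittest test_retry -v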

Command Line Testing

Test your retry logic using Scrapy's built-in tools:

# Run spider with verbose logging to see retry attempts
scrapy crawl myspider -L DEBUG

# Test specific URLs with retry logic
scrapy shell "https://httpstat.us/500"

# Monitor retry statistics
scrapy crawl myspider -s RETRY_TIMES=5 -s LOG_LEVEL=INFO

Monitoring and Debugging

Set up monitoring to track retry performance:

# middlewares.py - custom stats collection
from scrapy.downloadermiddlewares.retry import RetryMiddleware

class RetryStatsMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        # Track response codes
        spider.crawler.stats.inc_value(f'response_status_count/{response.status}')

        if response.status in self.retry_http_codes:
            spider.crawler.stats.inc_value('retry_triggered_count')
            spider.crawler.stats.inc_value(f'retry_status/{response.status}')

        return super().process_response(request, response, spider)

Stats collection is on by default (via MemoryStatsCollector), and the counters are dumped to the log when the spider finishes:

# The "Dumping Scrapy stats" block at the end of the crawl includes the retry counters
scrapy crawl myspider -s LOG_LEVEL=INFO
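
You can also read the collected counters programmatically, for example in the spider's closed() callback, via the stats collector's get_stats() method. A minimal sketch:

import scrapy

class StatsAwareSpider(scrapy.Spider):
    name = 'stats_aware_spider'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {'title': response.css('title::text').get()}

    def closed(self, reason):
        # Log only the retry-related counters collected during the crawl
        stats = self.crawler.stats.get_stats()
        retry_stats = {k: v for k, v in stats.items() if k.startswith('retry')}
        self.logger.info(f"Retry stats: {retry_stats}")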

Conclusion

Implementing effective retry logic in Scrapy involves understanding the built-in retry middleware, creating custom retry strategies for specific needs, and following best practices for robust web scraping. Key considerations include:

  • Using appropriate HTTP status codes for retries
  • Implementing exponential backoff to avoid overwhelming servers
  • Adding comprehensive logging and monitoring
  • Testing retry logic thoroughly
  • Considering different failure modes and recovery strategies

Much like implementing error handling patterns in other scraping tools, proper retry logic ensures your Scrapy spiders can handle the unpredictable nature of web scraping while maintaining efficiency and respecting target websites.

By combining Scrapy's built-in retry capabilities with custom middleware tailored to your specific requirements, you can build resilient scrapers that gracefully handle failures and maximize data collection success rates. Whether you're dealing with temporary network issues, rate limiting, or server errors, a well-implemented retry strategy is essential for production-ready web scraping applications.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
