How do I implement rate limiting in Scrapy?

Rate limiting is crucial for responsible web scraping to prevent overwhelming target servers and avoid getting blocked. Scrapy provides several built-in mechanisms and customization options to control the rate of requests sent to websites. This guide covers various approaches from basic delays to advanced throttling strategies.

Understanding Rate Limiting in Scrapy

Rate limiting controls how frequently your spider sends requests to a website. Without proper rate limiting, you risk:

  • Getting blocked by anti-bot systems
  • Overloading the target server
  • Receiving HTTP 429 (Too Many Requests) errors
  • Poor server performance for other users

Method 1: Basic Download Delay

The simplest approach is setting a fixed delay between requests using the DOWNLOAD_DELAY setting.

Global Download Delay

Add this to your settings.py file:

# settings.py
DOWNLOAD_DELAY = 2  # 2 seconds delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # wait between 0.5 * and 1.5 * DOWNLOAD_DELAY (boolean, default True)

Per-Spider Download Delay

You can also set delays specific to individual spiders:

# my_spider.py
import scrapy

class MySpider(scrapy.Spider):
    name = 'rate_limited_spider'
    download_delay = 3  # 3 seconds delay for this spider

    # Randomization (0.5x to 1.5x) is controlled by the RANDOMIZE_DOWNLOAD_DELAY
    # setting rather than a spider attribute, so apply it via custom_settings
    custom_settings = {'RANDOMIZE_DOWNLOAD_DELAY': True}

    def start_requests(self):
        urls = ['https://example.com/page1', 'https://example.com/page2']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Extract data here
        yield {'title': response.css('title::text').get()}

Method 2: AutoThrottle Extension (Recommended)

AutoThrottle is Scrapy's adaptive rate limiting extension. It automatically adjusts the download delay based on the latency of responses and the concurrency you target, and it keeps the delay between DOWNLOAD_DELAY (never lower) and AUTOTHROTTLE_MAX_DELAY (never higher).
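
The adjustment rule is straightforward: the target delay is the observed latency divided by AUTOTHROTTLE_TARGET_CONCURRENCY, the next delay is the average of the previous delay and that target, and non-200 responses are never allowed to lower the delay. A simplified sketch of this logic (for illustration only, not Scrapy's actual source):

# Rough illustration of AutoThrottle's delay adjustment
def next_delay(prev_delay, latency, target_concurrency=1.0,
               min_delay=1.0, max_delay=60.0):
    target = latency / target_concurrency             # target delay for this response
    new_delay = (prev_delay + target) / 2.0            # smooth towards the target
    return max(min_delay, min(new_delay, max_delay))   # clamp to configured bounds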

Enabling AutoThrottle

Add these settings to your settings.py:

# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1  # Initial delay
AUTOTHROTTLE_MAX_DELAY = 60   # Maximum delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # Average concurrent requests
AUTOTHROTTLE_DEBUG = True     # Enable to see throttling stats

# Optional: Custom concurrency settings
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8

AutoThrottle Configuration Options

  • AUTOTHROTTLE_START_DELAY: Initial delay (in seconds)
  • AUTOTHROTTLE_MAX_DELAY: Maximum delay cap
  • AUTOTHROTTLE_TARGET_CONCURRENCY: Target number of concurrent requests
  • AUTOTHROTTLE_DEBUG: Shows throttling statistics in logs

Advanced AutoThrottle Configuration

# settings.py - Advanced AutoThrottle setup
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.5
AUTOTHROTTLE_MAX_DELAY = 30
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
AUTOTHROTTLE_DEBUG = True

# Fine-tune concurrency
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_TIMEOUT = 180

Method 3: Custom Rate Limiting with Middleware

For more sophisticated rate limiting, create custom middleware:

# middlewares.py
import random
import time
from urllib.parse import urlsplit


class CustomRateLimitMiddleware:
    """Enforce a minimum delay between requests to the same domain.

    Note: time.sleep() blocks the Twisted reactor, so this pauses all
    in-flight requests, not just those for the throttled domain. Simple
    and predictable, but prefer AutoThrottle for high-concurrency crawls.
    """

    def __init__(self, delay=1.0, randomize=False):
        self.delay = delay
        self.randomize = randomize
        self.last_request_time = {}

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        delay = settings.getfloat('CUSTOM_DELAY', 1.0)
        randomize = settings.getbool('CUSTOM_RANDOMIZE', False)
        return cls(delay, randomize)

    def process_request(self, request, spider):
        domain = urlsplit(request.url).netloc
        current_time = time.time()

        if domain in self.last_request_time:
            elapsed = current_time - self.last_request_time[domain]
            if elapsed < self.delay:
                sleep_time = self.delay - elapsed
                if self.randomize:
                    sleep_time *= random.uniform(0.5, 1.5)
                time.sleep(sleep_time)

        self.last_request_time[domain] = time.time()
        return None

Enable the middleware in settings:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomRateLimitMiddleware': 585,
}

CUSTOM_DELAY = 2.0
CUSTOM_RANDOMIZE = True

Method 4: Dynamic Rate Limiting Based on Response

Implement adaptive rate limiting that responds to server behavior:

# middlewares.py
import time
from urllib.parse import urlsplit


class AdaptiveRateLimitMiddleware:
    def __init__(self):
        self.delays = {}
        self.default_delay = 1.0
        self.max_delay = 10.0
        self.min_delay = 0.1

    def process_response(self, request, response, spider):
        domain = self._get_domain(request.url)

        if response.status == 429:  # Too Many Requests
            # Back off: double the delay for this domain, up to the cap
            current_delay = self.delays.get(domain, self.default_delay)
            new_delay = min(current_delay * 2, self.max_delay)
            self.delays[domain] = new_delay
            spider.logger.info(f"Rate limited! Increasing delay to {new_delay}s for {domain}")

        elif response.status == 200:
            # Gradually decrease the delay on successful requests
            current_delay = self.delays.get(domain, self.default_delay)
            new_delay = max(current_delay * 0.9, self.min_delay)
            self.delays[domain] = new_delay

        return response

    def process_request(self, request, spider):
        domain = self._get_domain(request.url)
        delay = self.delays.get(domain, self.default_delay)

        if delay > self.min_delay:
            # Note: time.sleep() blocks the Twisted reactor and pauses all requests
            time.sleep(delay)

        return None

    def _get_domain(self, url):
        return urlsplit(url).netloc
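
Like any downloader middleware, this class must be registered before Scrapy will use it; the module path below assumes a project named myproject:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.AdaptiveRateLimitMiddleware': 585,
}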

Method 5: Per-Domain Rate Limiting

Different websites may require different crawl rates. Note that stock Scrapy does not act on a download_delay key in request.meta, so the pattern below assumes a custom middleware (for example, an extension of the one from Method 3) that reads this key and applies the delay:

# spiders/multi_domain_spider.py
import scrapy

class MultiDomainSpider(scrapy.Spider):
    name = 'multi_domain'

    custom_settings = {
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_START_DELAY': 1,
        'AUTOTHROTTLE_MAX_DELAY': 10,
        'AUTOTHROTTLE_TARGET_CONCURRENCY': 1.0,
    }

    def start_requests(self):
        # Different delays for different domains
        domains = {
            'https://fast-site.com': 0.5,
            'https://slow-site.com': 3.0,
            'https://strict-site.com': 5.0,
        }

        for base_url, delay in domains.items():
            # Custom meta key for your own middleware; Scrapy itself does not read it
            meta = {'download_delay': delay}
            yield scrapy.Request(
                url=f"{base_url}/page1",
                callback=self.parse,
                meta=meta
            )

    def parse(self, response):
        # Process response
        yield {'url': response.url, 'title': response.css('title::text').get()}
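
If you are on Scrapy 2.8 or newer, the built-in DOWNLOAD_SLOTS setting can configure per-slot (by default, per-domain) delays and concurrency without any custom code. A minimal sketch, using slot names that match the example domains above (check the documentation for your Scrapy version):

# settings.py (Scrapy 2.8+)
DOWNLOAD_SLOTS = {
    'fast-site.com': {'delay': 0.5},
    'slow-site.com': {'delay': 3.0},
    'strict-site.com': {'delay': 5.0, 'concurrency': 1},
}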

Method 6: Time-Based Rate Limiting

Implement time-window-based rate limiting (e.g., max 10 requests per minute):

# middlewares.py
import time
from collections import defaultdict, deque

class TimeWindowRateLimitMiddleware:
    def __init__(self, max_requests=10, time_window=60):
        self.max_requests = max_requests
        self.time_window = time_window
        self.request_times = defaultdict(deque)

    def process_request(self, request, spider):
        domain = request.url.split('/')[2]
        current_time = time.time()
        domain_requests = self.request_times[domain]

        # Remove old requests outside the time window
        while domain_requests and domain_requests[0] <= current_time - self.time_window:
            domain_requests.popleft()

        # Check if we've exceeded the rate limit
        if len(domain_requests) >= self.max_requests:
            # Calculate how long to wait
            oldest_request = domain_requests[0]
            wait_time = self.time_window - (current_time - oldest_request)
            spider.logger.info(f"Rate limit reached for {domain}. Waiting {wait_time:.2f}s")
            time.sleep(wait_time)

        # Record this request
        domain_requests.append(current_time)
        return None
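
To make the window configurable and wire the middleware into a project, you can add a from_crawler hook and register the class. The setting names MAX_REQUESTS_PER_WINDOW and RATE_LIMIT_WINDOW are arbitrary names invented for this example, and the module path assumes a project named myproject:

# middlewares.py (inside TimeWindowRateLimitMiddleware)
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            max_requests=crawler.settings.getint('MAX_REQUESTS_PER_WINDOW', 10),
            time_window=crawler.settings.getint('RATE_LIMIT_WINDOW', 60),
        )

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.TimeWindowRateLimitMiddleware': 590,
}
MAX_REQUESTS_PER_WINDOW = 10
RATE_LIMIT_WINDOW = 60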

Best Practices for Rate Limiting

1. Start Conservative

Begin with longer delays and gradually optimize:

# Conservative starting point
DOWNLOAD_DELAY = 3
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

2. Monitor Server Response

Check server response times and adjust accordingly:

# Enable detailed logging
AUTOTHROTTLE_DEBUG = True
LOG_LEVEL = 'INFO'

# Inside your spider class: monitor response times
def parse(self, response):
    download_latency = response.meta.get('download_latency', 0)
    if download_latency > 5:  # If response took more than 5 seconds
        self.logger.warning(f"Slow response: {download_latency}s for {response.url}")

3. Respect robots.txt

Many websites specify crawl delays in their robots.txt. Enabling ROBOTSTXT_OBEY makes Scrapy honour the Allow/Disallow rules, but Scrapy does not apply the Crawl-delay directive automatically:

# settings.py
ROBOTSTXT_OBEY = True  # Respects allow/disallow rules; Crawl-delay is NOT applied
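
If you want to respect Crawl-delay as well, one approach is to read it yourself at start-up and use it as the spider's delay. A minimal sketch using Python's standard urllib.robotparser (the robots.txt URL is a placeholder):

# spiders/polite_spider.py
import urllib.robotparser

import scrapy


class PoliteSpider(scrapy.Spider):
    name = 'polite'
    start_urls = ['https://example.com/']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        parser = urllib.robotparser.RobotFileParser()
        parser.set_url('https://example.com/robots.txt')
        parser.read()  # one blocking fetch at start-up
        crawl_delay = parser.crawl_delay('*')  # None if no Crawl-delay directive
        if crawl_delay is not None:
            self.download_delay = float(crawl_delay)

    def parse(self, response):
        yield {'title': response.css('title::text').get()}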

4. Handle Rate Limit Responses

Properly handle 429 and other rate limiting responses:

# settings.py
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
RETRY_TIMES = 3

# middlewares.py - custom retry middleware for 429 responses
import time

from scrapy.downloadermiddlewares.retry import RetryMiddleware


class RateLimitRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        if response.status == 429:
            retry_after = response.headers.get('Retry-After')
            # Retry-After may be missing or an HTTP date; only handle the
            # plain-seconds form here. time.sleep() blocks the Twisted reactor.
            if retry_after and retry_after.isdigit():
                delay = int(retry_after)
                spider.logger.info(f"Rate limited. Retrying after {delay} seconds")
                time.sleep(delay)

        return super().process_response(request, response, spider)
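
Because this class extends the built-in retry behaviour, register it in place of the default RetryMiddleware (which sits at priority 550); the module path assumes a project named myproject:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,  # disable the default
    'myproject.middlewares.RateLimitRetryMiddleware': 550,
}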

Testing Rate Limiting

Test your rate limiting configuration:

# Run with debug info
scrapy crawl myspider -s AUTOTHROTTLE_DEBUG=True -L INFO

# Watch delay-related log lines (Scrapy logs go to stderr, hence 2>&1)
scrapy crawl myspider -s LOG_LEVEL=DEBUG 2>&1 | grep -i "delay"

Common Rate Limiting Scenarios

E-commerce Sites

DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = 0.5
CONCURRENT_REQUESTS_PER_DOMAIN = 1

News Websites

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

APIs with Rate Limits

# For APIs with explicit rate limits (e.g., 100 requests/hour)
DOWNLOAD_DELAY = 36  # 3600 seconds / 100 requests = 36 seconds
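
The same arithmetic works for any quota. A tiny illustrative helper (not part of Scrapy):

# Delay needed to stay under a request quota
def delay_for_quota(requests_allowed, period_seconds):
    return period_seconds / requests_allowed

DOWNLOAD_DELAY = delay_for_quota(100, 3600)  # 100 requests/hour -> 36.0 s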

Rate limiting is essential for sustainable web scraping. Scrapy's built-in throttling mechanisms provide robust ways to control request rates, and combined with proper error handling they keep your scraping operations reliable and respectful.

Remember that effective rate limiting balances scraping efficiency with server respect, helping maintain long-term access to target websites while avoiding detection and blocks.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
