How do I implement rate limiting in Scrapy?
Rate limiting is crucial for responsible web scraping to prevent overwhelming target servers and avoid getting blocked. Scrapy provides several built-in mechanisms and customization options to control the rate of requests sent to websites. This guide covers various approaches from basic delays to advanced throttling strategies.
Understanding Rate Limiting in Scrapy
Rate limiting controls how frequently your spider sends requests to a website. Without proper rate limiting, you risk:
- Getting blocked by anti-bot systems
- Overloading the target server
- Receiving HTTP 429 (Too Many Requests) errors
- Poor server performance for other users
Method 1: Basic Download Delay
The simplest approach is setting a fixed delay between requests using the DOWNLOAD_DELAY setting.
Global Download Delay
Add this to your settings.py file:
# settings.py
DOWNLOAD_DELAY = 2 # 2 seconds delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # Randomize each delay to between 0.5 * and 1.5 * DOWNLOAD_DELAY (enabled by default)
Per-Spider Download Delay
You can also set delays specific to individual spiders:
# my_spider.py
import scrapy

class MySpider(scrapy.Spider):
    name = 'rate_limited_spider'
    download_delay = 3  # 3 seconds delay for this spider
    custom_settings = {
        # Randomization is a setting, not a spider attribute
        'RANDOMIZE_DOWNLOAD_DELAY': True,
    }

    def start_requests(self):
        urls = ['https://example.com/page1', 'https://example.com/page2']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Extract data here
        yield {'title': response.css('title::text').get()}
Method 2: AutoThrottle Extension (Recommended)
AutoThrottle is Scrapy's adaptive rate limiting extension. It automatically adjusts delays based on the download latency it observes from each server.
Enabling AutoThrottle
Add these settings to your settings.py:
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1 # Initial delay
AUTOTHROTTLE_MAX_DELAY = 60 # Maximum delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 # Average concurrent requests
AUTOTHROTTLE_DEBUG = True # Enable to see throttling stats
# Optional: Custom concurrency settings
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
AutoThrottle Configuration Options
- AUTOTHROTTLE_START_DELAY: Initial download delay (in seconds)
- AUTOTHROTTLE_MAX_DELAY: Maximum delay cap (in seconds)
- AUTOTHROTTLE_TARGET_CONCURRENCY: Average number of requests to send in parallel to each remote server
- AUTOTHROTTLE_DEBUG: Shows throttling statistics in the logs
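Under the hood, AutoThrottle tries to keep roughly AUTOTHROTTLE_TARGET_CONCURRENCY requests in flight per server by deriving the next delay from each response's latency. Here is an illustrative sketch of that relationship (a simplification, not Scrapy's exact source):

# Illustrative sketch of the AutoThrottle adjustment (not Scrapy's exact code)
def next_download_delay(current_delay, latency, target_concurrency, min_delay, max_delay):
    # Target delay that would keep ~target_concurrency requests in flight per server
    target_delay = latency / target_concurrency
    # The new delay is averaged with the previous one and clamped to the configured bounds
    new_delay = (current_delay + target_delay) / 2.0
    return max(min_delay, min(new_delay, max_delay))

# Example: a 0.4s latency with AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
# yields a target delay of about 0.2s between requests to that server.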
Advanced AutoThrottle Configuration
# settings.py - Advanced AutoThrottle setup
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.5
AUTOTHROTTLE_MAX_DELAY = 30
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
AUTOTHROTTLE_DEBUG = True
# Fine-tune concurrency
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_TIMEOUT = 180
Method 3: Custom Rate Limiting with Middleware
For more sophisticated rate limiting, create custom middleware:
# middlewares.py
import random
import time


class CustomRateLimitMiddleware:
    def __init__(self, delay=1.0, randomize=False):
        self.delay = delay
        self.randomize = randomize
        self.last_request_time = {}

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        delay = settings.getfloat('CUSTOM_DELAY', 1.0)
        randomize = settings.getbool('CUSTOM_RANDOMIZE', False)
        return cls(delay, randomize)

    def process_request(self, request, spider):
        domain = request.url.split('/')[2]
        current_time = time.time()

        if domain in self.last_request_time:
            elapsed = current_time - self.last_request_time[domain]
            if elapsed < self.delay:
                sleep_time = self.delay - elapsed
                if self.randomize:
                    sleep_time *= random.uniform(0.5, 1.5)
                # Note: time.sleep() blocks Scrapy's event loop; keep delays short
                time.sleep(sleep_time)

        self.last_request_time[domain] = time.time()
        return None
Enable the middleware in settings:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomRateLimitMiddleware': 585,
}
CUSTOM_DELAY = 2.0
CUSTOM_RANDOMIZE = True
Method 4: Dynamic Rate Limiting Based on Response
Implement adaptive rate limiting that responds to server behavior:
# middlewares.py
import time


class AdaptiveRateLimitMiddleware:
    def __init__(self):
        self.delays = {}
        self.default_delay = 1.0
        self.max_delay = 10.0
        self.min_delay = 0.1

    def process_response(self, request, response, spider):
        domain = self._get_domain(request.url)

        if response.status == 429:  # Too Many Requests
            # Back off: double the delay for this domain, up to max_delay
            current_delay = self.delays.get(domain, self.default_delay)
            new_delay = min(current_delay * 2, self.max_delay)
            self.delays[domain] = new_delay
            spider.logger.info(f"Rate limited! Increasing delay to {new_delay}s for {domain}")
        elif response.status == 200:
            # Gradually decrease the delay on successful requests
            current_delay = self.delays.get(domain, self.default_delay)
            new_delay = max(current_delay * 0.9, self.min_delay)
            self.delays[domain] = new_delay

        return response

    def process_request(self, request, spider):
        domain = self._get_domain(request.url)
        delay = self.delays.get(domain, self.default_delay)
        if delay > self.min_delay:
            time.sleep(delay)
        return None

    def _get_domain(self, url):
        return url.split('/')[2]
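As with the previous middleware, register it in DOWNLOADER_MIDDLEWARES; the priority value of 590 below is an arbitrary choice for illustration, and the module path assumes the same project layout used above:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.AdaptiveRateLimitMiddleware': 590,
}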
Method 5: Per-Domain Rate Limiting
Different websites may require different rate limiting strategies:
# spiders/multi_domain_spider.py
import scrapy

class MultiDomainSpider(scrapy.Spider):
    name = 'multi_domain'

    custom_settings = {
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_START_DELAY': 1,
        'AUTOTHROTTLE_MAX_DELAY': 10,
        'AUTOTHROTTLE_TARGET_CONCURRENCY': 1.0,
    }

    def start_requests(self):
        # Different delays for different domains.
        # Note: 'download_delay' in request.meta is not applied by Scrapy itself;
        # a custom middleware (see the sketch after this example) has to read it.
        domains = {
            'https://fast-site.com': 0.5,
            'https://slow-site.com': 3.0,
            'https://strict-site.com': 5.0,
        }
        for base_url, delay in domains.items():
            meta = {'download_delay': delay}
            yield scrapy.Request(
                url=f"{base_url}/page1",
                callback=self.parse,
                meta=meta
            )

    def parse(self, response):
        # Process response
        yield {'url': response.url, 'title': response.css('title::text').get()}
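Because Scrapy does not act on a download_delay key in request.meta on its own, here is a minimal, hypothetical middleware sketch that applies it; the class name, module path, and priority are illustrative only. If you are on a recent Scrapy version, also check the docs for the DOWNLOAD_SLOTS setting, which covers per-domain delays without blocking sleeps.

# middlewares.py - hypothetical helper that applies meta['download_delay']
import time


class MetaDelayMiddleware:
    def process_request(self, request, spider):
        # Sleep for the per-request delay, if the spider attached one
        delay = request.meta.get('download_delay')
        if delay:
            time.sleep(delay)
        return None

# settings.py
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.MetaDelayMiddleware': 560}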
Method 6: Time-Based Rate Limiting
Implement time-window-based rate limiting (e.g., max 10 requests per minute):
# middlewares.py
import time
from collections import defaultdict, deque


class TimeWindowRateLimitMiddleware:
    def __init__(self, max_requests=10, time_window=60):
        self.max_requests = max_requests
        self.time_window = time_window
        self.request_times = defaultdict(deque)

    def process_request(self, request, spider):
        domain = request.url.split('/')[2]
        current_time = time.time()
        domain_requests = self.request_times[domain]

        # Remove old requests outside the time window
        while domain_requests and domain_requests[0] <= current_time - self.time_window:
            domain_requests.popleft()

        # Check if we've exceeded the rate limit
        if len(domain_requests) >= self.max_requests:
            # Calculate how long to wait until the oldest request leaves the window
            oldest_request = domain_requests[0]
            wait_time = self.time_window - (current_time - oldest_request)
            spider.logger.info(f"Rate limit reached for {domain}. Waiting {wait_time:.2f}s")
            time.sleep(wait_time)

        # Record this request
        domain_requests.append(current_time)
        return None
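To use it, register the middleware as before; the priority shown is an arbitrary example, and the default window of 10 requests per 60 seconds comes from the constructor above:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.TimeWindowRateLimitMiddleware': 595,
}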
Best Practices for Rate Limiting
1. Start Conservative
Begin with longer delays and gradually optimize:
# Conservative starting point
DOWNLOAD_DELAY = 3
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
2. Monitor Server Response
Check server response times and adjust accordingly:
# Enable detailed logging (settings.py)
AUTOTHROTTLE_DEBUG = True
LOG_LEVEL = 'INFO'

# Monitor response times in your spider
def parse(self, response):
    download_latency = response.meta.get('download_latency', 0)
    if download_latency > 5:  # If the response took more than 5 seconds
        self.logger.warning(f"Slow response: {download_latency}s for {response.url}")
3. Respect robots.txt
Many websites specify a Crawl-delay in their robots.txt. Scrapy's robots.txt support enforces allow/disallow rules, but it does not automatically apply the Crawl-delay directive, so you still need to configure delays yourself:
# settings.py
ROBOTSTXT_OBEY = True  # Respect robots.txt allow/disallow rules (Crawl-delay is not applied automatically)
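If you want to honor Crawl-delay anyway, you can read it once at spider start and use it as the spider's delay. A minimal sketch using Python's standard urllib.robotparser; the spider name and domain are placeholders:

# Hypothetical sketch: read Crawl-delay from robots.txt and use it as the spider delay
import urllib.robotparser

import scrapy


class PoliteSpider(scrapy.Spider):
    name = 'polite_spider'
    start_urls = ['https://example.com/']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        rp = urllib.robotparser.RobotFileParser('https://example.com/robots.txt')
        rp.read()  # Note: blocking HTTP call, made once at startup
        crawl_delay = rp.crawl_delay('*')
        if crawl_delay:
            self.download_delay = float(crawl_delay)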
4. Handle Rate Limit Responses
Properly handle 429 and other rate limiting responses:
# settings.py
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
RETRY_TIMES = 3
# middlewares.py - custom retry handling for 429 responses
import time

from scrapy.downloadermiddlewares.retry import RetryMiddleware


class RateLimitRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        if response.status == 429:
            retry_after = response.headers.get('Retry-After')
            if retry_after:
                # Retry-After is usually a number of seconds (it can also be an HTTP date)
                delay = int(retry_after)
                spider.logger.info(f"Rate limited. Retrying after {delay} seconds")
                time.sleep(delay)
        return super().process_response(request, response, spider)
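To activate it, swap it in for the built-in RetryMiddleware. The priority 550 mirrors the default RetryMiddleware position, and the module path assumes the project layout used above:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,  # disable the default
    'myproject.middlewares.RateLimitRetryMiddleware': 550,       # use the 429-aware version instead
}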
Testing Rate Limiting
Test your rate limiting configuration:
# Run with debug info
scrapy crawl myspider -s AUTOTHROTTLE_DEBUG=True -L INFO
# Monitor request frequency (AutoThrottle's debug output includes the current delay)
scrapy crawl myspider -s AUTOTHROTTLE_DEBUG=True -s LOG_LEVEL=DEBUG | grep "delay"
Common Rate Limiting Scenarios
E-commerce Sites
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS_PER_DOMAIN = 1
News Websites
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
APIs with Rate Limits
# For APIs with explicit rate limits (e.g., 100 requests/hour)
DOWNLOAD_DELAY = 36 # 3600 seconds / 100 requests = 36 seconds
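For strict API quotas, it can help to pin down the whole configuration rather than rely on adaptive throttling. A sketch, assuming a 100 requests/hour limit on a single API host:

# settings.py - sketch for an API limited to 100 requests/hour
DOWNLOAD_DELAY = 36               # 3600 s / 100 requests
RANDOMIZE_DOWNLOAD_DELAY = False  # keep the spacing predictable
CONCURRENT_REQUESTS_PER_DOMAIN = 1
AUTOTHROTTLE_ENABLED = False      # a fixed quota leaves nothing to adapt to
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]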
Rate limiting is essential for sustainable web scraping. Scrapy's built-in throttling mechanisms, from a simple DOWNLOAD_DELAY to AutoThrottle and custom middleware, give you robust control over request rates. Combined with proper error handling, these techniques ensure reliable and respectful scraping operations.
Remember that effective rate limiting balances scraping efficiency with server respect, helping maintain long-term access to target websites while avoiding detection and blocks.