What are Scrapy middlewares and how do I use them?
Scrapy middlewares are powerful components that sit between Scrapy's engine and spiders, allowing you to process requests and responses globally across your scraping project. They provide a clean way to implement cross-cutting concerns like authentication, proxy rotation, user agent switching, and custom request/response processing without cluttering your spider code.
Understanding Scrapy Middleware Types
Scrapy provides two middleware types, plus a closely related extension mechanism:
1. Downloader Middlewares
These process requests before they're sent to websites and responses before they reach your spider.
2. Spider Middlewares
These process spider input (responses) and output (items and requests).
3. Extensions
Strictly speaking these are not middlewares, but extensions hook into the same crawler lifecycle and provide additional functionality like stats collection, logging, and the telnet console.
How Middlewares Work
Middlewares follow a pipeline pattern where each middleware can:
- Process requests before they're sent to the target website
- Process responses before they reach your spider
- Process exceptions when requests fail
- Filter or modify the data flowing through the pipeline
The processing order is determined by the integer value you assign to each middleware in the settings: lower values sit closer to the engine and higher values closer to the downloader, so process_request() runs in ascending order of those values, while process_response() and process_exception() run in descending order.
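As a concrete illustration (the middleware names here are made up for the example), requests pass through FirstMiddleware before SecondMiddleware, while responses travel back through them in the opposite order:

# settings.py (illustrative names, not real middlewares)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.FirstMiddleware': 100,   # process_request runs first, process_response runs last
    'myproject.middlewares.SecondMiddleware': 200,  # process_request runs second, process_response runs first
}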
Creating Custom Downloader Middlewares
Here's how to create a custom downloader middleware:
# middlewares.py
import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    """Middleware to rotate User-Agent headers"""

    def __init__(self, user_agent=''):
        self.user_agent = user_agent
        self.user_agent_list = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
        ]

    def process_request(self, request, spider):
        """Called for each request before it's sent"""
        user_agent = random.choice(self.user_agent_list)
        request.headers['User-Agent'] = user_agent
        return None

    def process_response(self, request, response, spider):
        """Called for each response after it's received"""
        # Log successful responses
        spider.logger.info(f"Response {response.status} from {response.url}")
        return response

    def process_exception(self, request, exception, spider):
        """Called when a request generates an exception"""
        spider.logger.error(f"Exception {exception} for {request.url}")
        return None
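Hard-coding the list is fine for a quick start, but you may prefer to load it from your project settings. A minimal sketch, assuming you define a USER_AGENT_LIST setting (a custom setting you add yourself, not a built-in one) in settings.py:

# middlewares.py (variation of the middleware above)
import random


class SettingsUserAgentMiddleware:
    """Rotate User-Agent values taken from the custom USER_AGENT_LIST setting"""

    def __init__(self, user_agent_list):
        self.user_agent_list = user_agent_list

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is a custom setting you would add to settings.py yourself
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        if self.user_agent_list:
            request.headers['User-Agent'] = random.choice(self.user_agent_list)
        return None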
Proxy Rotation Middleware
Here's a middleware for rotating proxy servers:
# middlewares.py
import random

from scrapy.exceptions import NotConfigured


class ProxyMiddleware:
    """Middleware for rotating proxy servers"""

    def __init__(self, proxy_list):
        self.proxy_list = proxy_list

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist("PROXY_LIST")
        if not proxy_list:
            raise NotConfigured("PROXY_LIST setting is required")
        return cls(proxy_list)

    def process_request(self, request, spider):
        proxy = random.choice(self.proxy_list)
        request.meta['proxy'] = proxy
        spider.logger.info(f"Using proxy {proxy} for {request.url}")
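Proxy rotation usually needs some failure handling as well. One possible extension (a sketch, not a complete solution) is a process_exception() method on the same class that drops a proxy once it fails, leaving the retry itself to Scrapy's built-in RetryMiddleware:

    # Additional method on ProxyMiddleware (sketch)
    def process_exception(self, request, exception, spider):
        failed_proxy = request.meta.get('proxy')
        # Stop using a proxy that just failed, as long as others remain
        if failed_proxy in self.proxy_list and len(self.proxy_list) > 1:
            self.proxy_list.remove(failed_proxy)
            spider.logger.warning(f"Removed failing proxy {failed_proxy}")
        # Returning None lets RetryMiddleware decide whether to retry the request
        return None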
Authentication Middleware
For handling authentication across requests:
# middlewares.py
class AuthenticationMiddleware:
    """Middleware for handling authentication"""

    def __init__(self, api_key):
        self.api_key = api_key

    @classmethod
    def from_crawler(cls, crawler):
        api_key = crawler.settings.get("API_KEY")
        return cls(api_key)

    def process_request(self, request, spider):
        if self.api_key:
            request.headers['Authorization'] = f'Bearer {self.api_key}'
        return None
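Because the middleware reads API_KEY from settings, you can also scope the key to a single spider with custom_settings. A brief sketch (the spider name and URL are placeholders):

# spiders/api_spider.py
import scrapy


class ApiSpider(scrapy.Spider):
    name = 'api_spider'
    start_urls = ['https://api.example.com/items']  # placeholder URL

    # Per-spider settings override the project-wide settings.py values
    custom_settings = {
        'API_KEY': 'spider-specific-key',
    }

    def parse(self, response):
        yield {'status': response.status}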
Custom Spider Middleware
Spider middlewares sit between the engine and your spiders; this one validates the items a spider yields:
# middlewares.py
class ItemValidationMiddleware:
    """Middleware to validate scraped items"""

    def process_spider_output(self, response, result, spider):
        for item in result:
            # Requests and non-dict output pass through untouched
            if isinstance(item, dict):
                # Skip items missing required fields (DropItem is reserved for
                # item pipelines, so here we simply don't yield the item)
                if not item.get('title') or not item.get('price'):
                    spider.logger.warning(f"Dropping invalid item: {item}")
                    continue
                # Clean data
                item['price'] = item['price'].replace('$', '').strip()
            yield item
Configuring Middlewares in Settings
Add your middlewares to the settings.py file:
# settings.py

# Downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotateUserAgentMiddleware': 400,
    'myproject.middlewares.ProxyMiddleware': 350,
    'myproject.middlewares.AuthenticationMiddleware': 300,
}

# Spider middlewares
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.ItemValidationMiddleware': 800,
}

# Custom settings for middlewares
PROXY_LIST = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
]
API_KEY = 'your-api-key-here'
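If the project lives in version control, you may not want the key hard-coded in settings.py; one common alternative is to read it from an environment variable instead:

# settings.py
import os

API_KEY = os.environ.get('API_KEY', '')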
Useful Built-in Middlewares
Scrapy ships with a number of built-in downloader middlewares that are enabled by default. In DOWNLOADER_MIDDLEWARES you can disable one by setting it to None, or change where it sits in the chain:
# Adjust built-in middlewares
DOWNLOADER_MIDDLEWARES = {
    # Disable default User-Agent handling (our rotation middleware replaces it)
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # Move retry handling earlier in the chain
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
}
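Many built-in middlewares are configured through regular settings rather than by subclassing; for example, RetryMiddleware honours these options:

# settings.py
RETRY_ENABLED = True
RETRY_TIMES = 3                               # Retry failed requests up to 3 times
RETRY_HTTP_CODES = [500, 502, 503, 504, 429]  # Status codes that trigger a retry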
Advanced Middleware Patterns
Conditional Processing
Process requests based on specific conditions:
class ConditionalMiddleware:
    def process_request(self, request, spider):
        # Only process certain domains
        if 'example.com' in request.url:
            request.headers['Special-Header'] = 'value'
        return None
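Conditions don't have to be based on the URL; a middleware can also check an attribute on the spider itself. A small sketch, where add_special_header is a made-up spider attribute:

class SpiderOptInMiddleware:
    def process_request(self, request, spider):
        # Only spiders that set add_special_header = True are affected
        if getattr(spider, 'add_special_header', False):
            request.headers['Special-Header'] = 'value'
        return None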
Rate Limiting
Implement custom rate limiting:
import time
from collections import defaultdict
from urllib.parse import urlparse


class RateLimitMiddleware:
    def __init__(self):
        self.last_request_time = defaultdict(float)
        self.delay = 1.0  # 1 second delay between requests per domain

    def process_request(self, request, spider):
        domain = urlparse(request.url).netloc
        current_time = time.time()
        time_since_last = current_time - self.last_request_time[domain]
        if time_since_last < self.delay:
            # Note: time.sleep() blocks Scrapy's event loop and pauses all
            # concurrent requests; for larger crawls prefer the built-in
            # settings shown below
            time.sleep(self.delay - time_since_last)
        self.last_request_time[domain] = time.time()
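Because time.sleep() stalls the whole crawl, Scrapy's built-in throttling is usually the better tool; a similar effect can often be achieved with settings alone:

# settings.py
DOWNLOAD_DELAY = 1.0              # Minimum delay between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True   # Vary the delay to look less mechanical
AUTOTHROTTLE_ENABLED = True       # Adapt the delay to the server's response times
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0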
Testing Middlewares
Create unit tests for your middlewares:
# test_middlewares.py
import unittest

from scrapy.http import Request
from scrapy.spiders import Spider

from myproject.middlewares import RotateUserAgentMiddleware


class TestRotateUserAgentMiddleware(unittest.TestCase):
    def setUp(self):
        self.middleware = RotateUserAgentMiddleware()
        self.spider = Spider(name='test')

    def test_user_agent_rotation(self):
        request = Request('http://example.com')
        self.middleware.process_request(request, self.spider)
        self.assertIn('User-Agent', request.headers)
        user_agent = request.headers['User-Agent'].decode()
        self.assertIn('Mozilla', user_agent)
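For middlewares built via from_crawler(), scrapy.utils.test.get_crawler can construct a crawler carrying the settings the test needs. A short sketch testing the ProxyMiddleware from earlier:

# test_middlewares.py
import unittest

from scrapy.utils.test import get_crawler

from myproject.middlewares import ProxyMiddleware


class TestProxyMiddleware(unittest.TestCase):
    def test_from_crawler_reads_proxy_list(self):
        crawler = get_crawler(settings_dict={'PROXY_LIST': ['http://proxy1.example.com:8080']})
        middleware = ProxyMiddleware.from_crawler(crawler)
        self.assertEqual(middleware.proxy_list, ['http://proxy1.example.com:8080'])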
Best Practices for Middleware Development
1. Keep Middlewares Focused
Each middleware should have a single responsibility:
# Good: Focused on one concern
class ProxyRotationMiddleware:
    pass


# Bad: Mixing concerns
class ProxyAndUserAgentAndAuthMiddleware:
    pass
2. Handle Errors Gracefully
Always include proper error handling:
def process_request(self, request, spider):
    try:
        # Middleware logic here
        pass
    except Exception as e:
        spider.logger.error(f"Middleware error: {e}")
        return None  # Let the request continue
3. Use Appropriate Priority Values
Set priorities to ensure correct execution order:
DOWNLOADER_MIDDLEWARES = {
    'auth.AuthMiddleware': 100,       # First
    'proxy.ProxyMiddleware': 200,     # Second
    'useragent.UAMiddleware': 300,    # Third
}
Common Use Cases
Middlewares are particularly useful for:
- Rotating proxies and user agents to avoid detection
- Implementing authentication across all requests
- Adding custom headers for API access
- Handling rate limiting and delays
- Processing responses before they reach spiders
- Logging and monitoring request/response cycles
- Filtering invalid requests or responses
Similar to how browser automation tools handle authentication, Scrapy middlewares provide a centralized way to manage authentication and other cross-cutting concerns across your entire scraping project.
Debugging Middlewares
Enable detailed logging to debug middleware behavior:
# settings.py
LOG_LEVEL = 'DEBUG'

# In your middleware
class MyMiddleware:
    def process_request(self, request, spider):
        # spider.logger ties the message to the running spider
        spider.logger.debug(f"Processing request: {request.url}")
        # Middleware logic
Scrapy middlewares are essential for building robust, scalable web scraping solutions. They provide the flexibility to implement complex request/response processing logic while keeping your spider code clean and focused on data extraction. Whether you're rotating proxies, handling authentication, or implementing custom rate limiting, middlewares give you the power to customize Scrapy's behavior at every step of the scraping process.