What are Scrapy middlewares and how do I use them?

Scrapy middlewares are powerful hooks that sit between Scrapy's engine and its downloader (downloader middlewares) or your spiders (spider middlewares), allowing you to process requests and responses globally across your scraping project. They provide a clean way to implement cross-cutting concerns like authentication, proxy rotation, user agent switching, and custom request/response processing without cluttering your spider code.

Understanding Scrapy Middleware Types

Scrapy provides two main types of middlewares, along with a related extension mechanism:

1. Downloader Middlewares

These process requests before they're sent to websites and responses before they reach your spider.

2. Spider Middlewares

These process spider input (responses) and output (items and requests).

3. Extensions

Strictly speaking, extensions aren't middlewares, but they hook into the same crawler lifecycle and provide additional functionality like stats collection, logging, and the telnet console.
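
Before diving into the details, it can help to see the hook methods each type exposes. The skeletons below are a minimal sketch using Scrapy's documented method signatures; the class names are placeholders and the bodies simply pass data through:

# middlewares.py
class ExampleDownloaderMiddleware:
    """Skeleton of the downloader middleware hooks"""

    def process_request(self, request, spider):
        return None  # None means "keep processing this request normally"

    def process_response(self, request, response, spider):
        return response  # must return a Response, a Request, or raise IgnoreRequest

    def process_exception(self, request, exception, spider):
        return None  # None lets other middlewares handle the exception


class ExampleSpiderMiddleware:
    """Skeleton of the spider middleware hooks"""

    def process_spider_input(self, response, spider):
        return None  # called for each response passed to the spider

    def process_spider_output(self, response, result, spider):
        yield from result  # items and requests produced by the spider callbacks

    def process_spider_exception(self, response, exception, spider):
        return None  # or return an iterable of items/requests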

How Middlewares Work

Middlewares follow a pipeline pattern where each middleware can:

  • Process requests before they're sent to the target website
  • Process responses before they reach your spider
  • Process exceptions when requests fail
  • Filter or modify the data flowing through the pipeline

The processing order is determined by each middleware's order value in the settings: lower numbers run closer to the engine, so they process requests first, while higher numbers run closer to the downloader, so they process responses first.
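
What process_request returns also steers the flow: None continues to the next middleware, a Response short-circuits the download, a Request reschedules it, and raising IgnoreRequest drops the request. Here is a minimal sketch; the class name and meta keys are hypothetical, but the return-value contract is Scrapy's documented behavior:

# middlewares.py
from scrapy.exceptions import IgnoreRequest
from scrapy.http import HtmlResponse

class FlowControlMiddleware:
    """Hypothetical middleware illustrating the process_request contract"""

    def process_request(self, request, spider):
        if request.meta.get('use_cached'):
            # Returning a Response skips the download; the response still flows
            # through the process_response chain and into the spider
            return HtmlResponse(url=request.url, body=b'<html></html>', encoding='utf-8')
        if request.meta.get('blocked'):
            # Raising IgnoreRequest drops the request (its errback is called)
            raise IgnoreRequest(f"Blocked URL: {request.url}")
        # Returning None passes the request to the next middleware / the downloader;
        # returning a new Request would stop here and reschedule that request instead
        return None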

Creating Custom Downloader Middlewares

Here's how to create a custom downloader middleware:

# middlewares.py
import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    """Middleware to rotate User-Agent headers"""

    def __init__(self, user_agent=''):
        self.user_agent = user_agent
        self.user_agent_list = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
        ]

    def process_request(self, request, spider):
        """Called for each request before it's sent"""
        user_agent = random.choice(self.user_agent_list)
        request.headers['User-Agent'] = user_agent
        return None

    def process_response(self, request, response, spider):
        """Called for each response after it's received"""
        # Log successful responses
        spider.logger.info(f"Response {response.status} from {response.url}")
        return response

    def process_exception(self, request, exception, spider):
        """Called when a request generates an exception"""
        spider.logger.error(f"Exception {exception} for {request.url}")
        return None

Proxy Rotation Middleware

Here's a middleware for rotating proxy servers:

# middlewares.py
import random
from scrapy.exceptions import NotConfigured

class ProxyMiddleware:
    """Middleware for rotating proxy servers"""

    def __init__(self, proxy_list):
        self.proxy_list = proxy_list

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist("PROXY_LIST")
        if not proxy_list:
            raise NotConfigured("PROXY_LIST setting is required")
        return cls(proxy_list)

    def process_request(self, request, spider):
        proxy = random.choice(self.proxy_list)
        request.meta['proxy'] = proxy
        spider.logger.info(f"Using proxy {proxy} for {request.url}")

Authentication Middleware

For handling authentication across requests:

# middlewares.py
class AuthenticationMiddleware:
    """Middleware for handling authentication"""

    def __init__(self, api_key):
        self.api_key = api_key

    @classmethod
    def from_crawler(cls, crawler):
        api_key = crawler.settings.get("API_KEY")
        return cls(api_key)

    def process_request(self, request, spider):
        if self.api_key:
            request.headers['Authorization'] = f'Bearer {self.api_key}'
        return None

Custom Spider Middleware

Spider middlewares process spider output:

# middlewares.py
class ItemValidationMiddleware:
    """Spider middleware to validate scraped items"""

    def process_spider_output(self, response, result, spider):
        for item in result:
            if isinstance(item, dict):
                # Validate required fields; skip invalid items instead of yielding them
                # (raising DropItem only works in item pipelines, not spider middlewares)
                if not item.get('title') or not item.get('price'):
                    spider.logger.warning(f"Dropping invalid item: {item}")
                    continue
                # Clean data
                item['price'] = item['price'].replace('$', '').strip()
            yield item

Configuring Middlewares in Settings

Add your middlewares to the settings.py file:

# settings.py

# Downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotateUserAgentMiddleware': 400,
    'myproject.middlewares.ProxyMiddleware': 350,
    'myproject.middlewares.AuthenticationMiddleware': 300,
    # Disable the built-in user agent middleware since the custom one replaces it
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

# Spider middlewares  
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.ItemValidationMiddleware': 800,
}

# Custom settings for middlewares
PROXY_LIST = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
]

API_KEY = 'your-api-key-here'

Useful Built-in Middlewares

Scrapy ships with several built-in downloader middlewares that are enabled by default. You can re-order them or disable one by setting its value to None:

# Adjust built-in middlewares (values shown are their default positions)
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # None disables it
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
}

Advanced Middleware Patterns

Conditional Processing

Process requests based on specific conditions:

class ConditionalMiddleware:
    def process_request(self, request, spider):
        # Only process certain domains
        if 'example.com' in request.url:
            request.headers['Special-Header'] = 'value'
        return None

Rate Limiting

Implement custom rate limiting:

import time
from urllib.parse import urlparse
from collections import defaultdict

class RateLimitMiddleware:
    def __init__(self):
        self.last_request_time = defaultdict(float)
        self.delay = 1.0  # 1 second delay between requests to the same domain

    def process_request(self, request, spider):
        # Note: time.sleep() blocks Scrapy's event loop, so for production use
        # prefer the built-in DOWNLOAD_DELAY or AutoThrottle settings
        domain = urlparse(request.url).netloc
        current_time = time.time()
        time_since_last = current_time - self.last_request_time[domain]

        if time_since_last < self.delay:
            time.sleep(self.delay - time_since_last)

        self.last_request_time[domain] = time.time()

Testing Middlewares

Create unit tests for your middlewares:

# test_middlewares.py
import unittest
from scrapy.http import Request, Response
from scrapy.spiders import Spider
from myproject.middlewares import RotateUserAgentMiddleware

class TestRotateUserAgentMiddleware(unittest.TestCase):
    def setUp(self):
        self.middleware = RotateUserAgentMiddleware()
        self.spider = Spider('test')

    def test_user_agent_rotation(self):
        request = Request('http://example.com')
        self.middleware.process_request(request, self.spider)

        self.assertIn('User-Agent', request.headers)
        user_agent = request.headers['User-Agent'].decode()
        self.assertIn('Mozilla', user_agent)
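
You can run these tests with python -m unittest test_middlewares (or with a runner such as pytest).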

Best Practices for Middleware Development

1. Keep Middlewares Focused

Each middleware should have a single responsibility:

# Good: Focused on one concern
class ProxyRotationMiddleware:
    pass

# Bad: Mixing concerns  
class ProxyAndUserAgentAndAuthMiddleware:
    pass

2. Handle Errors Gracefully

Always include proper error handling:

def process_request(self, request, spider):
    try:
        # Middleware logic here
        pass
    except Exception as e:
        spider.logger.error(f"Middleware error: {e}")
        return None  # Let request continue

3. Use Appropriate Priority Values

Set priorities to ensure correct execution order:

DOWNLOADER_MIDDLEWARES = {
    'auth.AuthMiddleware': 100,        # First
    'proxy.ProxyMiddleware': 200,      # Second  
    'useragent.UAMiddleware': 300,     # Third
}

Common Use Cases

Middlewares are particularly useful for:

  • Rotating proxies and user agents to avoid detection
  • Implementing authentication across all requests
  • Adding custom headers for API access
  • Handling rate limiting and delays
  • Processing responses before they reach spiders
  • Logging and monitoring request/response cycles
  • Filtering invalid requests or responses
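
Several of these use cases, such as filtering invalid responses before they reach the spider, fit naturally into a downloader middleware's process_response hook. Here is a minimal sketch; the class name, status codes, and retry choice are illustrative assumptions rather than a fixed recipe:

# middlewares.py
from scrapy.exceptions import IgnoreRequest

class ResponseFilterMiddleware:
    """Hypothetical middleware that filters or retries unwanted responses"""

    def process_response(self, request, response, spider):
        if response.status in (403, 429):
            # Returning a Request reschedules it instead of passing the response on
            spider.logger.warning(f"Got {response.status} from {response.url}, retrying")
            return request.replace(dont_filter=True)
        if not response.body:
            # Raising IgnoreRequest stops an empty response from reaching the spider
            raise IgnoreRequest(f"Empty body from {response.url}")
        return response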

Similar to how browser automation tools handle authentication, Scrapy middlewares provide a centralized way to manage authentication and other cross-cutting concerns across your entire scraping project.

Debugging Middlewares

Enable detailed logging to debug middleware behavior:

# settings.py
LOG_LEVEL = 'DEBUG'

# In your middleware
class MyMiddleware:
    def process_request(self, request, spider):
        spider.logger.debug(f"Processing request: {request.url}")
        # Middleware logic
        return None
Scrapy middlewares are essential for building robust, scalable web scraping solutions. They provide the flexibility to implement complex request/response processing logic while keeping your spider code clean and focused on data extraction. Whether you're rotating proxies, handling authentication, or implementing custom rate limiting, middlewares give you the power to customize Scrapy's behavior at every step of the scraping process.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
