How do I set custom headers in Scrapy?
Setting custom headers in Scrapy is essential for successful web scraping: it lets you mimic real browser behavior, authenticate requests, and bypass basic anti-bot measures. This guide covers the main ways to set custom headers in your Scrapy projects.
Why Custom Headers Matter in Web Scraping
Custom headers serve several critical purposes in web scraping:
- User-Agent spoofing: Mimic real browsers to avoid detection
- Authentication: Include API keys, tokens, or session cookies
- Content negotiation: Specify preferred response formats
- Referrer spoofing: Simulate natural browsing patterns
- Anti-bot evasion: Bypass basic detection mechanisms
Method 1: Setting Headers in Spider Requests
The most straightforward approach is to set headers directly in your spider's request methods:
import scrapy

class MySpider(scrapy.Spider):
    name = 'example_spider'
    start_urls = ['https://example.com']

    def start_requests(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'DNT': '1',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1'
        }
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                headers=headers,
                callback=self.parse
            )

    def parse(self, response):
        # Follow links from the page, overriding headers per request
        for link in response.css('a::attr(href)').getall():
            yield scrapy.Request(
                url=response.urljoin(link),
                headers={
                    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
                    'Referer': response.url
                },
                callback=self.parse_detail
            )

    def parse_detail(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'url': response.url
        }
Method 2: Using Custom Settings
Define default headers at the spider or project level using Scrapy settings. These defaults are applied by Scrapy's DefaultHeadersMiddleware only when a request does not already set the header, so per-request headers from Method 1 take precedence:
# In your spider class
class MySpider(scrapy.Spider):
    name = 'example_spider'

    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'User-Agent': 'MyBot 1.0',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en',
        }
    }
Or in your project's settings.py file:

# settings.py
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)',
}
Method 3: Using Middleware for Dynamic Headers
Create custom middleware to set headers dynamically based on request properties:
# middlewares.py
import random
import time

class CustomHeadersMiddleware:
    def __init__(self):
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
        ]

    def process_request(self, request, spider):
        # Set a random User-Agent on every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agents)

        # Set custom headers based on the target domain
        if 'api.example.com' in request.url:
            request.headers['Authorization'] = 'Bearer YOUR_API_TOKEN'
            request.headers['Content-Type'] = 'application/json'

        # Add a timestamp header
        request.headers['X-Timestamp'] = str(int(time.time()))

        # Returning None lets the request continue through the middleware chain
        return None
Enable the middleware in settings.py:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomHeadersMiddleware': 543,
}
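Rather than hardcoding credentials in the middleware, a common pattern is to read them from settings via Scrapy's from_crawler classmethod. Below is a minimal sketch; the API_TOKEN setting name and the api.example.com domain check are illustrative assumptions, not fixed Scrapy names:

# middlewares.py
class TokenHeadersMiddleware:
    def __init__(self, api_token):
        self.api_token = api_token

    @classmethod
    def from_crawler(cls, crawler):
        # API_TOKEN is a hypothetical custom setting you would define in settings.py
        return cls(api_token=crawler.settings.get('API_TOKEN'))

    def process_request(self, request, spider):
        # Only attach credentials to requests for the API's own domain
        if self.api_token and 'api.example.com' in request.url:
            request.headers['Authorization'] = f'Bearer {self.api_token}'
        return None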
Method 4: Authentication Headers
For APIs requiring authentication, set appropriate headers:
import scrapy

class APISpider(scrapy.Spider):
    name = 'api_spider'

    def start_requests(self):
        headers = {
            'Authorization': 'Bearer your_access_token_here',
            'Content-Type': 'application/json',
            'Accept': 'application/json',
            'X-API-Key': 'your_api_key_here'
        }
        yield scrapy.Request(
            url='https://api.example.com/data',
            headers=headers,
            callback=self.parse_api_response
        )

    def parse_api_response(self, response):
        data = response.json()
        for item in data.get('results', []):
            yield item
Method 5: Rotating Headers with Scrapy-User-Agents
Install and use the scrapy-user-agents package for automatic User-Agent rotation:
pip install scrapy-user-agents
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in User-Agent middleware so it cannot override the rotation
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
Advanced Header Management
Cookie Headers
Handle cookies explicitly when needed. Keep in mind that Scrapy's CookiesMiddleware manages the Cookie header itself, and depending on your Scrapy version a manually set Cookie header may be ignored or merged, so the cookies= argument is generally the safer route when cookies are enabled (the default):
def start_requests(self):
    cookies = {
        'session_id': 'abc123',
        'csrf_token': 'xyz789'
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)',
        'Cookie': '; '.join([f'{k}={v}' for k, v in cookies.items()])
    }
    yield scrapy.Request(
        url='https://example.com/protected',
        headers=headers,
        cookies=cookies,  # generally preferred over a manual Cookie header
        callback=self.parse
    )
Conditional Headers
Set headers based on request conditions:
def make_request(self, url, is_mobile=False):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
    }
    if is_mobile:
        headers.update({
            'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
        })
    else:
        headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    return scrapy.Request(url=url, headers=headers, callback=self.parse)
Best Practices for Header Management
1. Header Consistency
Maintain consistent headers that match real browser behavior:
REALISTIC_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Cache-Control': 'max-age=0'
}
2. Header Rotation
Implement header rotation to avoid detection patterns. While Scrapy handles many aspects of web scraping efficiently, for more complex scenarios involving JavaScript-heavy sites, you might want to explore browser automation tools for handling dynamic content.
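As a minimal sketch of one way to do this, the middleware below rotates complete header profiles so that each request sends an internally consistent set; the profile contents are illustrative, and you would enable it in DOWNLOADER_MIDDLEWARES as in Method 3:

# middlewares.py
import random

HEADER_PROFILES = [
    {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
    },
    {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
        'Accept-Language': 'en-GB,en;q=0.8',
    },
]

class HeaderProfileMiddleware:
    def process_request(self, request, spider):
        # Apply one whole profile per request instead of mixing headers
        # from different browsers, which is itself a detection signal
        for name, value in random.choice(HEADER_PROFILES).items():
            request.headers[name] = value
        return None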
3. Debugging Headers
Log headers for debugging purposes:
import scrapy

class DebugHeadersSpider(scrapy.Spider):
    name = 'debug_spider'
    start_urls = ['https://example.com']

    def parse(self, response):
        self.logger.info(f"Request headers: {response.request.headers}")
        self.logger.info(f"Response headers: {response.headers}")

        # Check whether a specific header was sent (header values are bytes)
        user_agent = response.request.headers.get('User-Agent')
        self.logger.info(f"Sent User-Agent: {user_agent}")
Common Header Scenarios
E-commerce Sites
ECOMMERCE_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Referer': 'https://www.google.com/',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}
API Endpoints
API_HEADERS = {
    'User-Agent': 'MyApp/1.0.0',
    'Accept': 'application/json',
    'Content-Type': 'application/json',
    'Authorization': 'Bearer token_here',
    'X-Requested-With': 'XMLHttpRequest'
}
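Since Content-Type only matters when a request carries a body, here is a minimal sketch of a JSON POST using these headers inside a spider method; the endpoint and payload are illustrative assumptions:

import json
import scrapy

def start_requests(self):
    payload = {'query': 'example', 'page': 1}  # illustrative payload
    yield scrapy.Request(
        url='https://api.example.com/search',  # hypothetical endpoint
        method='POST',
        headers=API_HEADERS,
        body=json.dumps(payload),
        callback=self.parse_api_response,
    )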
Testing Header Configuration
Verify your headers are working correctly:
def start_requests(self):
    # Test headers using httpbin.org, which echoes them back
    test_headers = {
        'User-Agent': 'Custom-Bot/1.0',
        'Custom-Header': 'test-value'
    }
    yield scrapy.Request(
        url='http://httpbin.org/headers',
        headers=test_headers,
        callback=self.verify_headers
    )

def verify_headers(self, response):
    headers_data = response.json()
    sent_headers = headers_data.get('headers', {})
    self.logger.info(f"Headers received by server: {sent_headers}")
For more advanced anti-detection techniques, you might also want to learn about session management strategies that complement proper header configuration.
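Within Scrapy itself, one building block for session management is the cookiejar meta key, which lets the cookies middleware keep several independent sessions. A minimal sketch of spider methods using it (URLs are placeholders):

def start_requests(self):
    # Start three independent sessions, each with its own cookie jar
    for session_id in range(3):
        yield scrapy.Request(
            url='https://example.com/login',
            meta={'cookiejar': session_id},
            callback=self.after_login,
        )

def after_login(self, response):
    # Carry the same cookiejar forward so the session's cookies are reused
    yield scrapy.Request(
        url='https://example.com/account',
        meta={'cookiejar': response.meta['cookiejar']},
        callback=self.parse,
    )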
Conclusion
Setting custom headers in Scrapy is crucial for successful web scraping projects. Whether you need simple User-Agent spoofing or complex authentication headers, Scrapy provides multiple flexible approaches. Start with basic header setting in requests, then implement middleware for more complex scenarios. Always test your headers and monitor for detection to ensure your scraping operations remain effective.
Remember to respect websites' robots.txt files and terms of service, and implement appropriate delays between requests to avoid overwhelming target servers.