How do I set custom headers in Scrapy?

Setting custom headers in Scrapy is essential for successful web scraping, allowing you to mimic real browser behavior, authenticate requests, and bypass basic anti-bot measures. This guide covers the main ways to set custom headers in your Scrapy projects, from per-request overrides to project-wide defaults and dynamic middleware.

Why Custom Headers Matter in Web Scraping

Custom headers serve several critical purposes in web scraping:

  • User-Agent spoofing: Mimic real browsers to avoid detection
  • Authentication: Include API keys, tokens, or session cookies
  • Content negotiation: Specify preferred response formats
  • Referrer spoofing: Simulate natural browsing patterns
  • Anti-bot evasion: Bypass basic detection mechanisms

Method 1: Setting Headers in Spider Requests

The most straightforward approach is to set headers directly in your spider's request methods:

import scrapy

class MySpider(scrapy.Spider):
    name = 'example_spider'
    start_urls = ['https://example.com']

    def start_requests(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'DNT': '1',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1'
        }

        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                headers=headers,
                callback=self.parse
            )

    def parse(self, response):
        # Extract data here
        for link in response.css('a::attr(href)').getall():
            yield scrapy.Request(
                url=response.urljoin(link),
                headers={
                    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
                    'Referer': response.url
                },
                callback=self.parse_detail
            )

    def parse_detail(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'url': response.url
        }
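
Scrapy's response.follow helper accepts the same headers argument and resolves relative URLs for you, so the link-following part of parse above could equally be written as:

def parse(self, response):
    for href in response.css('a::attr(href)'):
        yield response.follow(
            href,
            headers={'Referer': response.url},
            callback=self.parse_detail
        )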

Method 2: Using Custom Settings

Define default headers at the spider or project level using Scrapy settings:

# In your spider class
class MySpider(scrapy.Spider):
    name = 'example_spider'

    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'User-Agent': 'MyBot 1.0',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en',
        }
    }

Or in your settings.py file:

# settings.py
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)',
}
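
Headers passed directly to a Request still take precedence, because Scrapy's DefaultHeadersMiddleware only fills in headers a request does not already carry. And if a project-wide User-Agent is all you need, the dedicated USER_AGENT setting is the simplest option:

# settings.py
USER_AGENT = 'Mozilla/5.0 (compatible; MyBot/1.0)'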

Method 3: Using Middleware for Dynamic Headers

Create custom middleware to set headers dynamically based on request properties:

# middlewares.py
import random
import time

class CustomHeadersMiddleware:
    def __init__(self):
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
        ]

    def process_request(self, request, spider):
        # Set a random User-Agent
        request.headers['User-Agent'] = random.choice(self.user_agents)

        # Set custom headers based on the domain
        if 'api.example.com' in request.url:
            request.headers['Authorization'] = 'Bearer YOUR_API_TOKEN'
            request.headers['Content-Type'] = 'application/json'

        # Add a timestamp header
        request.headers['X-Timestamp'] = str(int(time.time()))

        # Returning None lets the request continue through the middleware chain
        return None

Enable the middleware in settings.py:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomHeadersMiddleware': 543,
}
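
Rather than hardcoding credentials in the middleware, you can load them from settings through Scrapy's from_crawler hook. Here is a minimal sketch, assuming a custom API_TOKEN setting that you define yourself:

# middlewares.py
class AuthHeadersMiddleware:
    def __init__(self, api_token):
        self.api_token = api_token

    @classmethod
    def from_crawler(cls, crawler):
        # API_TOKEN is a custom setting you add to settings.py
        return cls(api_token=crawler.settings.get('API_TOKEN'))

    def process_request(self, request, spider):
        if self.api_token and 'api.example.com' in request.url:
            request.headers['Authorization'] = f'Bearer {self.api_token}'
        return None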

Method 4: Authentication Headers

For APIs requiring authentication, set appropriate headers:

class APISpider(scrapy.Spider):
    name = 'api_spider'

    def start_requests(self):
        headers = {
            'Authorization': 'Bearer your_access_token_here',
            'Content-Type': 'application/json',
            'Accept': 'application/json',
            'X-API-Key': 'your_api_key_here'
        }

        yield scrapy.Request(
            url='https://api.example.com/data',
            headers=headers,
            callback=self.parse_api_response
        )

    def parse_api_response(self, response):
        data = response.json()
        for item in data.get('results', []):
            yield item
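
The same headers work for endpoints that expect a JSON POST; a sketch, with the endpoint and payload as illustrative placeholders:

import json

def start_requests(self):
    payload = {'query': 'laptops', 'page': 1}
    yield scrapy.Request(
        url='https://api.example.com/search',
        method='POST',
        body=json.dumps(payload),
        headers={
            'Authorization': 'Bearer your_access_token_here',
            'Content-Type': 'application/json',
            'Accept': 'application/json'
        },
        callback=self.parse_api_response
    )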

Method 5: Rotating Headers with Scrapy-User-Agents

Install and use the scrapy-user-agents package for automatic User-Agent rotation:

pip install scrapy-user-agents

Then enable it in settings.py, replacing Scrapy's built-in UserAgentMiddleware:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}

Advanced Header Management

Cookie Headers

Handle cookies explicitly when needed:

def start_requests(self):
    cookies = {
        'session_id': 'abc123',
        'csrf_token': 'xyz789'
    }

    headers = {
        'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)'
    }

    # Prefer the cookies parameter: when Scrapy's CookiesMiddleware is
    # enabled (the default), it manages the Cookie header itself, and a
    # manually built Cookie header may be ignored or overwritten. Only
    # construct the raw header if you run with COOKIES_ENABLED = False:
    # headers['Cookie'] = '; '.join(f'{k}={v}' for k, v in cookies.items())

    yield scrapy.Request(
        url='https://example.com/protected',
        headers=headers,
        cookies=cookies,
        callback=self.parse
    )

Conditional Headers

Set headers based on request conditions:

def make_request(self, url, is_mobile=False):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
    }

    if is_mobile:
        headers.update({
            'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
        })
    else:
        headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'

    return scrapy.Request(url=url, headers=headers, callback=self.parse)
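
A helper like this is then called from start_requests or any callback, for example:

def start_requests(self):
    yield self.make_request('https://example.com/', is_mobile=True)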

Best Practices for Header Management

1. Header Consistency

Maintain consistent headers that match real browser behavior:

REALISTIC_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Cache-Control': 'max-age=0'
}

2. Header Rotation

Implement header rotation to avoid detection patterns. While Scrapy handles many aspects of web scraping efficiently, for more complex scenarios involving JavaScript-heavy sites, you might want to explore browser automation tools for handling dynamic content.
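
One way to rotate without breaking that consistency is to swap complete header profiles, so the User-Agent always travels with matching companion headers. A minimal middleware sketch, with the two profiles as illustrative samples:

# middlewares.py
import random

HEADER_PROFILES = [
    {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept-Language': 'en-US,en;q=0.9'
    },
    {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
        'Accept-Language': 'en-GB,en;q=0.8'
    }
]

class HeaderProfileMiddleware:
    def process_request(self, request, spider):
        # Apply one internally consistent profile per request
        for name, value in random.choice(HEADER_PROFILES).items():
            request.headers[name] = value
        return None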

3. Debugging Headers

Log headers for debugging purposes:

class DebugHeadersSpider(scrapy.Spider):
    name = 'debug_spider'
    start_urls = ['https://example.com']

    def parse(self, response):
        self.logger.info(f"Request headers: {response.request.headers}")
        self.logger.info(f"Response headers: {response.headers}")

        # Check whether specific headers were sent
        # (Scrapy stores header values as bytes)
        user_agent = response.request.headers.get('User-Agent')
        self.logger.info(f"Sent User-Agent: {user_agent}")

Common Header Scenarios

E-commerce Sites

ECOMMERCE_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Referer': 'https://www.google.com/',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}

API Endpoints

API_HEADERS = {
    'User-Agent': 'MyApp/1.0.0',
    'Accept': 'application/json',
    'Content-Type': 'application/json',
    'Authorization': 'Bearer token_here',
    'X-Requested-With': 'XMLHttpRequest'
}

Testing Header Configuration

Verify your headers are working correctly:

def start_requests(self):
    # Test headers using httpbin.org
    test_headers = {
        'User-Agent': 'Custom-Bot/1.0',
        'Custom-Header': 'test-value'
    }

    yield scrapy.Request(
        url='https://httpbin.org/headers',
        headers=test_headers,
        callback=self.verify_headers
    )

def verify_headers(self, response):
    headers_data = response.json()
    sent_headers = headers_data.get('headers', {})
    self.logger.info(f"Headers received by server: {sent_headers}")

For more advanced anti-detection techniques, you might also want to learn about session management strategies that complement proper header configuration.

Conclusion

Setting custom headers in Scrapy is crucial for successful web scraping projects. Whether you need simple User-Agent spoofing or complex authentication headers, Scrapy provides multiple flexible approaches. Start with basic header setting in requests, then implement middleware for more complex scenarios. Always test your headers and monitor for detection to ensure your scraping operations remain effective.

Remember to respect websites' robots.txt files and terms of service, and implement appropriate delays between requests to avoid overwhelming target servers.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
