How do I set custom headers in Scrapy?
Setting custom headers in Scrapy is essential for successful web scraping: it lets you mimic real browser behavior, authenticate requests, and bypass basic anti-bot measures. This guide covers the main ways to set custom headers in your Scrapy projects.
Why Custom Headers Matter in Web Scraping
Custom headers serve several critical purposes in web scraping:
- User-Agent spoofing: Mimic real browsers to avoid detection
- Authentication: Include API keys, tokens, or session cookies
- Content negotiation: Specify preferred response formats
- Referrer spoofing: Simulate natural browsing patterns
- Anti-bot evasion: Bypass basic detection mechanisms
Method 1: Setting Headers in Spider Requests
The most straightforward approach is to set headers directly in your spider's request methods:
import scrapy

class MySpider(scrapy.Spider):
    name = 'example_spider'
    start_urls = ['https://example.com']

    def start_requests(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'DNT': '1',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1'
        }
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                headers=headers,
                callback=self.parse
            )

    def parse(self, response):
        # Follow links from the page, overriding headers per request
        for link in response.css('a::attr(href)').getall():
            yield scrapy.Request(
                url=response.urljoin(link),
                headers={
                    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
                    'Referer': response.url
                },
                callback=self.parse_detail
            )

    def parse_detail(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'url': response.url
        }
Method 2: Using Custom Settings
Define default headers at the spider or project level using Scrapy settings. These defaults are applied by Scrapy's DefaultHeadersMiddleware only when a request does not already set the header, so per-request headers from Method 1 take precedence:
# In your spider class
class MySpider(scrapy.Spider):
    name = 'example_spider'

    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'User-Agent': 'MyBot 1.0',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en',
        }
    }
Or in your project's settings.py file:

# settings.py
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)',
}
Method 3: Using Middleware for Dynamic Headers
Create custom middleware to set headers dynamically based on request properties:
# middlewares.py
import random
import time

class CustomHeadersMiddleware:
    def __init__(self):
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
        ]

    def process_request(self, request, spider):
        # Set a random User-Agent on every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agents)

        # Set custom headers based on the target domain
        if 'api.example.com' in request.url:
            request.headers['Authorization'] = 'Bearer YOUR_API_TOKEN'
            request.headers['Content-Type'] = 'application/json'

        # Add a timestamp header
        request.headers['X-Timestamp'] = str(int(time.time()))

        # Returning None lets the request continue through the middleware chain
        return None
Enable the middleware in settings.py:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomHeadersMiddleware': 543,
}
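Rather than hardcoding credentials in the middleware, a common pattern is to read them from settings via Scrapy's from_crawler classmethod. Below is a minimal sketch; the API_TOKEN setting name and the api.example.com domain check are illustrative assumptions, not fixed Scrapy names:

# middlewares.py
class TokenHeadersMiddleware:
    def __init__(self, api_token):
        self.api_token = api_token

    @classmethod
    def from_crawler(cls, crawler):
        # API_TOKEN is a hypothetical custom setting you would define in settings.py
        return cls(api_token=crawler.settings.get('API_TOKEN'))

    def process_request(self, request, spider):
        # Only attach credentials to requests for the API's own domain
        if self.api_token and 'api.example.com' in request.url:
            request.headers['Authorization'] = f'Bearer {self.api_token}'
        return None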
Method 4: Authentication Headers
For APIs requiring authentication, set appropriate headers:
import scrapy

class APISpider(scrapy.Spider):
    name = 'api_spider'

    def start_requests(self):
        headers = {
            'Authorization': 'Bearer your_access_token_here',
            'Content-Type': 'application/json',
            'Accept': 'application/json',
            'X-API-Key': 'your_api_key_here'
        }
        yield scrapy.Request(
            url='https://api.example.com/data',
            headers=headers,
            callback=self.parse_api_response
        )

    def parse_api_response(self, response):
        data = response.json()
        for item in data.get('results', []):
            yield item
Method 5: Rotating Headers with Scrapy-User-Agents
Install and use the scrapy-user-agents package for automatic User-Agent rotation:
pip install scrapy-user-agents
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in User-Agent middleware so it cannot override the rotation
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
Advanced Header Management
Cookie Headers
Handle cookies explicitly when needed. Keep in mind that Scrapy's CookiesMiddleware manages the Cookie header itself, and depending on your Scrapy version a manually set Cookie header may be ignored or merged, so the cookies= argument is generally the safer route when cookies are enabled (the default):
def start_requests(self):
    cookies = {
        'session_id': 'abc123',
        'csrf_token': 'xyz789'
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)',
        'Cookie': '; '.join([f'{k}={v}' for k, v in cookies.items()])
    }
    yield scrapy.Request(
        url='https://example.com/protected',
        headers=headers,
        cookies=cookies,  # generally preferred over a manual Cookie header
        callback=self.parse
    )
Conditional Headers
Set headers based on request conditions:
def make_request(self, url, is_mobile=False):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
    }
    if is_mobile:
        headers.update({
            'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
        })
    else:
        headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    return scrapy.Request(url=url, headers=headers, callback=self.parse)
Best Practices for Header Management
1. Header Consistency
Maintain consistent headers that match real browser behavior:
REALISTIC_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Cache-Control': 'max-age=0'
}
2. Header Rotation
Implement header rotation to avoid detection patterns. While Scrapy handles many aspects of web scraping efficiently, for more complex scenarios involving JavaScript-heavy sites, you might want to explore browser automation tools for handling dynamic content.
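As a minimal sketch of one way to do this, the middleware below rotates complete header profiles so that each request sends an internally consistent set; the profile contents are illustrative, and you would enable it in DOWNLOADER_MIDDLEWARES as in Method 3:

# middlewares.py
import random

HEADER_PROFILES = [
    {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
    },
    {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
        'Accept-Language': 'en-GB,en;q=0.8',
    },
]

class HeaderProfileMiddleware:
    def process_request(self, request, spider):
        # Apply one whole profile per request instead of mixing headers
        # from different browsers, which is itself a detection signal
        for name, value in random.choice(HEADER_PROFILES).items():
            request.headers[name] = value
        return None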
3. Debugging Headers
Log headers for debugging purposes:
import scrapy

class DebugHeadersSpider(scrapy.Spider):
    name = 'debug_spider'
    start_urls = ['https://example.com']

    def parse(self, response):
        self.logger.info(f"Request headers: {response.request.headers}")
        self.logger.info(f"Response headers: {response.headers}")

        # Check whether a specific header was sent (header values are bytes)
        user_agent = response.request.headers.get('User-Agent')
        self.logger.info(f"Sent User-Agent: {user_agent}")
Common Header Scenarios
E-commerce Sites
ECOMMERCE_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Referer': 'https://www.google.com/',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}
API Endpoints
API_HEADERS = {
    'User-Agent': 'MyApp/1.0.0',
    'Accept': 'application/json',
    'Content-Type': 'application/json',
    'Authorization': 'Bearer token_here',
    'X-Requested-With': 'XMLHttpRequest'
}
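Since Content-Type only matters when a request carries a body, here is a minimal sketch of a JSON POST using these headers inside a spider method; the endpoint and payload are illustrative assumptions:

import json
import scrapy

def start_requests(self):
    payload = {'query': 'example', 'page': 1}  # illustrative payload
    yield scrapy.Request(
        url='https://api.example.com/search',  # hypothetical endpoint
        method='POST',
        headers=API_HEADERS,
        body=json.dumps(payload),
        callback=self.parse_api_response,
    )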
Testing Header Configuration
Verify your headers are working correctly:
def start_requests(self):
    # Test headers using httpbin.org, which echoes them back
    test_headers = {
        'User-Agent': 'Custom-Bot/1.0',
        'Custom-Header': 'test-value'
    }
    yield scrapy.Request(
        url='http://httpbin.org/headers',
        headers=test_headers,
        callback=self.verify_headers
    )

def verify_headers(self, response):
    headers_data = response.json()
    sent_headers = headers_data.get('headers', {})
    self.logger.info(f"Headers received by server: {sent_headers}")
For more advanced anti-detection techniques, you might also want to learn about session management strategies that complement proper header configuration.
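Within Scrapy itself, one building block for session management is the cookiejar meta key, which lets the cookies middleware keep several independent sessions. A minimal sketch of spider methods using it (URLs are placeholders):

def start_requests(self):
    # Start three independent sessions, each with its own cookie jar
    for session_id in range(3):
        yield scrapy.Request(
            url='https://example.com/login',
            meta={'cookiejar': session_id},
            callback=self.after_login,
        )

def after_login(self, response):
    # Carry the same cookiejar forward so the session's cookies are reused
    yield scrapy.Request(
        url='https://example.com/account',
        meta={'cookiejar': response.meta['cookiejar']},
        callback=self.parse,
    )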
Conclusion
Setting custom headers in Scrapy is crucial for successful web scraping projects. Whether you need simple User-Agent spoofing or complex authentication headers, Scrapy provides multiple flexible approaches. Start with basic header setting in requests, then implement middleware for more complex scenarios. Always test your headers and monitor for detection to ensure your scraping operations remain effective.
Remember to respect websites' robots.txt files and terms of service, and implement appropriate delays between requests to avoid overwhelming target servers.