How do I set custom headers with MechanicalSoup?
Setting custom headers in MechanicalSoup is essential for successful web scraping, as it allows you to mimic real browser behavior, handle authentication, and avoid detection. MechanicalSoup provides several methods to set headers at different levels - globally for the browser instance, per session, or for individual requests.
Understanding Headers in Web Scraping
HTTP headers are key-value pairs that provide additional information about requests and responses. Common headers used in web scraping include:
- User-Agent: Identifies the browser/client making the request
- Accept: Specifies content types the client can handle
- Accept-Language: Indicates preferred languages
- Referer: Shows the previous page URL
- Authorization: Contains authentication credentials
- Cookie: Stores session and tracking information
Basic Header Configuration
Setting Headers on Browser Creation
The most straightforward way to set headers in MechanicalSoup is during browser instantiation:
import mechanicalsoup
# Create browser with custom headers
browser = mechanicalsoup.StatefulBrowser(
user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
)
# Add additional headers
browser.session.headers.update({
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1'
})
# Navigate to a page
browser.open('https://example.com')
Using Session Headers
MechanicalSoup uses the requests
library under the hood, so you can leverage session-level headers:
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
# Set headers for all requests in this session
browser.session.headers = {
'User-Agent': 'MyCustomBot/1.0',
'Accept': 'application/json, text/plain, */*',
'Content-Type': 'application/json',
'X-Custom-Header': 'custom-value'
}
# All subsequent requests will include these headers
response = browser.open('https://api.example.com/data')
Advanced Header Management
Dynamic Header Updates
You can modify headers dynamically during your scraping session:
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
# Initial headers
browser.session.headers.update({
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
})
# Navigate to login page
browser.open('https://example.com/login')
# Update headers after login
browser.session.headers.update({
'X-Requested-With': 'XMLHttpRequest',
'Referer': 'https://example.com/login'
})
# Continue with authenticated requests
browser.open('https://example.com/dashboard')
Request-Specific Headers
For scenarios where you need different headers for specific requests, use the underlying session directly:
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
# Standard headers for most requests
browser.session.headers.update({
'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)'
})
# Special headers for API endpoint
api_headers = {
'Authorization': 'Bearer your-token-here',
'Content-Type': 'application/json',
'X-API-Version': 'v2'
}
# Make request with specific headers
response = browser.session.get(
'https://api.example.com/users',
headers=api_headers
)
Authentication Headers
Basic Authentication
For sites requiring basic authentication, set the Authorization header:
import mechanicalsoup
import base64
browser = mechanicalsoup.StatefulBrowser()
# Create basic auth header
username = 'your_username'
password = 'your_password'
credentials = base64.b64encode(f'{username}:{password}'.encode()).decode()
browser.session.headers.update({
'Authorization': f'Basic {credentials}',
'User-Agent': 'Mozilla/5.0 (compatible; AuthBot/1.0)'
})
response = browser.open('https://example.com/protected')
Bearer Token Authentication
For API authentication using bearer tokens:
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
# Set bearer token
token = 'your-jwt-token-here'
browser.session.headers.update({
'Authorization': f'Bearer {token}',
'Content-Type': 'application/json',
'User-Agent': 'API-Client/1.0'
})
# Access protected API
response = browser.open('https://api.example.com/protected-endpoint')
Browser Simulation Headers
Mimicking Real Browsers
To avoid detection, use realistic browser headers:
import mechanicalsoup
import random
# Define realistic user agents
user_agents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0'
]
browser = mechanicalsoup.StatefulBrowser()
# Rotate user agents
browser.session.headers.update({
'User-Agent': random.choice(user_agents),
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
'DNT': '1',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none'
})
Mobile Browser Simulation
To simulate mobile devices:
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
# Mobile browser headers
browser.session.headers.update({
'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.9',
'Accept-Encoding': 'gzip, deflate',
'Connection': 'keep-alive'
})
Form Submission with Custom Headers
When submitting forms, you might need specific headers:
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
# Set headers for form submission
browser.session.headers.update({
'User-Agent': 'Mozilla/5.0 (compatible; FormBot/1.0)',
'Referer': 'https://example.com/form-page',
'Origin': 'https://example.com'
})
# Navigate to form page
browser.open('https://example.com/contact')
# Find and fill form
form = browser.select_form('form[action="/submit"]')
form.set('name', 'John Doe')
form.set('email', 'john@example.com')
# Submit with headers
response = browser.submit_selected()
Header Debugging and Monitoring
Inspecting Request Headers
To debug header issues, inspect what headers are being sent:
import mechanicalsoup
import requests
browser = mechanicalsoup.StatefulBrowser()
# Set custom headers
browser.session.headers.update({
'User-Agent': 'DebugBot/1.0',
'X-Debug': 'true'
})
# Use httpbin to see headers being sent
response = browser.open('https://httpbin.org/headers')
print(response.text) # Shows all headers sent to server
Logging Headers
For production debugging:
import mechanicalsoup
import logging
# Enable debug logging
logging.basicConfig(level=logging.DEBUG)
browser = mechanicalsoup.StatefulBrowser()
browser.session.headers.update({
'User-Agent': 'LoggedBot/1.0'
})
# Headers will be logged in debug mode
response = browser.open('https://example.com')
Best Practices
1. Realistic Headers
Always use realistic, up-to-date browser headers to avoid detection.
2. Header Rotation
For large-scale scraping, rotate headers to appear more human-like.
3. Respect Rate Limits
When using custom headers for authentication, implement proper rate limiting similar to handling browser sessions in Puppeteer.
4. Security Considerations
Never hardcode sensitive authentication tokens. Use environment variables:
import os
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.session.headers.update({
'Authorization': f'Bearer {os.getenv("API_TOKEN")}',
'User-Agent': 'SecureBot/1.0'
})
Troubleshooting Common Issues
Headers Not Being Sent
If headers aren't being applied, ensure you're setting them on the session:
# Correct way
browser.session.headers.update({'Custom-Header': 'value'})
# Incorrect - won't persist
browser.open('https://example.com', headers={'Custom-Header': 'value'})
Encoding Issues
For headers with special characters:
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.session.headers.update({
'User-Agent': 'MyBot/1.0',
'Accept-Language': 'en-US,en;q=0.9,es;q=0.8'
})
Integration with Other Tools
MechanicalSoup's header management integrates well with other Python libraries. For more complex scenarios requiring JavaScript execution, consider how to handle authentication in Puppeteer for browser-based solutions.
By properly configuring headers in MechanicalSoup, you can create robust web scraping solutions that handle authentication, avoid detection, and maintain reliable connections to target websites. Remember to always respect websites' terms of service and implement appropriate delays between requests.