How do I manage user agents with MechanicalSoup?
Managing user agents effectively is crucial for successful web scraping with MechanicalSoup. A realistic user agent makes your scraper look like an ordinary browser, reducing the chance of being blocked by anti-bot measures. This guide covers setting, rotating, and managing user agents in MechanicalSoup.
What is a User Agent?
A user agent is a string that identifies the client software (browser, crawler, or application) making HTTP requests to a web server. Websites use this information to deliver appropriate content and sometimes to block automated requests. By default, MechanicalSoup sends the requests library's default user agent (python-requests/x.y.z), which clearly identifies your script as an automated tool.
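To see what a user-agent string actually encodes, the snippet below pulls the platform and browser tokens out of the Chrome example used later in this guide. The parsing is a rough illustrative heuristic, not a full UA parser:

```python
import re

def describe_user_agent(ua):
    """Extract the platform and browser token from a user-agent string (rough heuristic)."""
    # The first parenthesized segment names the platform
    platform = re.search(r"\(([^)]*)\)", ua)
    # Chrome UAs end in "Safari/..." for historical reasons, so match the real browser token first
    browser = re.search(r"(Chrome|Firefox|Edg)/[\d.]+", ua)
    return {
        "platform": platform.group(1) if platform else None,
        "browser": browser.group(0) if browser else None,
    }

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
print(describe_user_agent(ua))
# {'platform': 'Windows NT 10.0; Win64; x64', 'browser': 'Chrome/120.0.0.0'}
```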
Setting a Custom User Agent
Basic User Agent Configuration
The simplest way to set a user agent in MechanicalSoup is during browser initialization:
import mechanicalsoup
# Create browser with custom user agent
browser = mechanicalsoup.StatefulBrowser(
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
# Navigate to a website
browser.open("https://example.com")
Setting User Agent After Browser Creation
You can also modify the user agent after creating the browser instance:
import mechanicalsoup
# Create browser instance
browser = mechanicalsoup.StatefulBrowser()
# Set custom user agent
browser.session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
})
# Make requests with the new user agent
response = browser.open("https://httpbin.org/user-agent")
print(response.text)
Popular User Agent Strings
Here are some commonly used user agent strings for different browsers and platforms:
user_agents = {
    'chrome_windows': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'firefox_windows': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/121.0',
    'safari_mac': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15',
    'chrome_mac': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'edge_windows': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'
}
# Use a specific user agent
browser = mechanicalsoup.StatefulBrowser(user_agent=user_agents['chrome_windows'])
Implementing User Agent Rotation
User agent rotation helps avoid detection by varying the browser identity across requests. Here's how to implement it:
Simple Random Rotation
import mechanicalsoup
import random
class RotatingUserAgentBrowser:
    def __init__(self):
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/121.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15'
        ]
        self.browser = mechanicalsoup.StatefulBrowser()

    def get_random_user_agent(self):
        return random.choice(self.user_agents)

    def open_with_rotation(self, url):
        # Set a random user agent before each request
        user_agent = self.get_random_user_agent()
        self.browser.session.headers.update({'User-Agent': user_agent})
        return self.browser.open(url)
# Usage
rotating_browser = RotatingUserAgentBrowser()
response = rotating_browser.open_with_rotation("https://example.com")
Advanced Rotation with Weighted Selection
import mechanicalsoup
import random
class WeightedUserAgentBrowser:
    def __init__(self):
        # User agents with weights (higher weight = more likely to be selected)
        self.weighted_user_agents = [
            ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36', 40),
            ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36', 25),
            ('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/121.0', 20),
            ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15', 15)
        ]
        self.browser = mechanicalsoup.StatefulBrowser()

    def get_weighted_user_agent(self):
        user_agents, weights = zip(*self.weighted_user_agents)
        return random.choices(user_agents, weights=weights)[0]

    def make_request(self, url):
        user_agent = self.get_weighted_user_agent()
        self.browser.session.headers.update({'User-Agent': user_agent})
        return self.browser.open(url)
# Usage
weighted_browser = WeightedUserAgentBrowser()
response = weighted_browser.make_request("https://example.com")
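To sanity-check that `random.choices` really honors the weights above, you can simulate the selection many times with the standard library alone. The short labels here are stand-ins for the full user-agent strings:

```python
import random
from collections import Counter

# Same weights as WeightedUserAgentBrowser, with short labels for readability
weighted = [("chrome_win", 40), ("chrome_mac", 25), ("firefox_win", 20), ("safari_mac", 15)]
labels, weights = zip(*weighted)

rng = random.Random(0)  # seeded so the demonstration is reproducible
counts = Counter(rng.choices(labels, weights=weights, k=10_000))

# Observed frequencies should sit close to 0.40 / 0.25 / 0.20 / 0.15
for label in labels:
    print(label, counts[label] / 10_000)
```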
Using External User Agent Libraries
For more comprehensive user agent management, consider using specialized libraries:
Using fake-useragent Library
import mechanicalsoup
from fake_useragent import UserAgent  # Install first: pip install fake-useragent
class FakeUserAgentBrowser:
    def __init__(self):
        self.ua = UserAgent()
        self.browser = mechanicalsoup.StatefulBrowser()

    def open_with_fake_ua(self, url, browser_type=None):
        if browser_type:
            user_agent = getattr(self.ua, browser_type)
        else:
            user_agent = self.ua.random
        self.browser.session.headers.update({'User-Agent': user_agent})
        return self.browser.open(url)
# Usage
fake_ua_browser = FakeUserAgentBrowser()
# Use random user agent
response1 = fake_ua_browser.open_with_fake_ua("https://example.com")
# Use specific browser type
response2 = fake_ua_browser.open_with_fake_ua("https://example.com", "chrome")
response3 = fake_ua_browser.open_with_fake_ua("https://example.com", "firefox")
Best Practices for User Agent Management
1. Keep User Agents Updated
Regularly update your user agent strings to match current browser versions:
import random
import mechanicalsoup
from datetime import datetime, timedelta
class UpdatedUserAgentBrowser:
    def __init__(self):
        self.browser = mechanicalsoup.StatefulBrowser()
        self.last_update = datetime.now()
        self.update_interval = timedelta(days=7)  # Update weekly
        self.current_user_agents = self.get_current_user_agents()

    def get_current_user_agents(self):
        # In practice, you might fetch these from an API or update them manually
        return [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/121.0'
        ]

    def should_update_user_agents(self):
        return datetime.now() - self.last_update > self.update_interval

    def make_request(self, url):
        if self.should_update_user_agents():
            self.current_user_agents = self.get_current_user_agents()
            self.last_update = datetime.now()
        user_agent = random.choice(self.current_user_agents)
        self.browser.session.headers.update({'User-Agent': user_agent})
        return self.browser.open(url)
2. Match User Agent with Other Headers
Ensure your user agent is consistent with other browser headers:
import mechanicalsoup
def create_realistic_browser():
    browser = mechanicalsoup.StatefulBrowser()
    # Set comprehensive headers that match the user agent
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Cache-Control': 'max-age=0'
    }
    browser.session.headers.update(headers)
    return browser
# Usage
realistic_browser = create_realistic_browser()
response = realistic_browser.open("https://example.com")
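A small lint function can catch obvious mismatches between the user agent and the rest of the headers before you send a request. The two rules below are illustrative assumptions; real browser fingerprinting checks many more signals:

```python
def check_header_consistency(headers):
    """Flag header combinations that contradict the claimed User-Agent.

    Illustrative heuristic only -- not an exhaustive fingerprint check.
    """
    ua = headers.get('User-Agent', '')
    problems = []
    # Sec-Fetch-* headers are only sent by modern browsers
    if any(name.startswith('Sec-Fetch-') for name in headers) and 'Mozilla/5.0' not in ua:
        problems.append('Sec-Fetch-* headers present, but the User-Agent is not a modern browser')
    # Chrome advertises brotli support, so a Chrome UA without 'br' looks inconsistent
    if 'Chrome/' in ua and 'br' not in headers.get('Accept-Encoding', ''):
        problems.append("Chrome User-Agent without 'br' in Accept-Encoding")
    return problems

# A consistent Chrome-style header set passes cleanly
print(check_header_consistency({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate, br',
    'Sec-Fetch-Mode': 'navigate',
}))  # []
```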
3. Testing User Agent Effectiveness
Always test your user agents to ensure they're working correctly:
import mechanicalsoup
def test_user_agent(user_agent):
    browser = mechanicalsoup.StatefulBrowser(user_agent=user_agent)
    # Test with httpbin.org, which echoes back the user agent
    response = browser.open("https://httpbin.org/user-agent")
    if response.status_code == 200:
        returned_ua = response.json().get('user-agent', '')
        print(f"Set UA: {user_agent}")
        print(f"Returned UA: {returned_ua}")
        print(f"Match: {user_agent == returned_ua}")
        return user_agent == returned_ua
    return False

# Test different user agents
test_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15'
]

for agent in test_agents:
    success = test_user_agent(agent)
    print(f"User agent test {'passed' if success else 'failed'}\n")
Integration with Session Management
When working with websites that require login or session management, it's important to maintain consistent user agents throughout the session, similar to how you would handle authentication with MechanicalSoup:
import mechanicalsoup
class SessionAwareBrowser:
    def __init__(self, user_agent=None):
        self.browser = mechanicalsoup.StatefulBrowser()
        if user_agent:
            self.browser.session.headers.update({'User-Agent': user_agent})
        self.logged_in = False

    def login(self, login_url, username, password):
        # Open the login page
        self.browser.open(login_url)
        # Fill and submit the login form
        login_form = self.browser.select_form('form[action*="login"]')
        login_form.set("username", username)
        login_form.set("password", password)
        response = self.browser.submit_selected()
        # Crude success check; adapt it to the target site
        if "dashboard" in response.url or "welcome" in response.text.lower():
            self.logged_in = True
            print("Login successful")
        return self.logged_in

    def scrape_protected_content(self, url):
        if not self.logged_in:
            raise Exception("Must be logged in to access protected content")
        return self.browser.open(url)
# Usage
session_browser = SessionAwareBrowser(
    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
)

# Log in and maintain the session with a consistent user agent
if session_browser.login("https://example.com/login", "username", "password"):
    protected_content = session_browser.scrape_protected_content("https://example.com/dashboard")
Common Issues and Troubleshooting
Issue 1: User Agent Not Being Set
import mechanicalsoup
# Problem: assigning to an attribute does not change the headers that are sent
browser = mechanicalsoup.StatefulBrowser()
browser.user_agent = "Custom User Agent"  # This won't work!

# Solution: update the session headers, or call browser.set_user_agent(...)
browser = mechanicalsoup.StatefulBrowser()
browser.session.headers.update({'User-Agent': 'Custom User Agent'})
Issue 2: User Agent Verification
Always verify that your user agent is being sent correctly:
import mechanicalsoup
def verify_user_agent(browser, expected_ua):
    response = browser.open("https://httpbin.org/headers")
    headers = response.json().get('headers', {})
    actual_ua = headers.get('User-Agent', '')
    print(f"Expected: {expected_ua}")
    print(f"Actual: {actual_ua}")
    return expected_ua in actual_ua
# Usage
browser = mechanicalsoup.StatefulBrowser()
custom_ua = "Mozilla/5.0 (Custom Browser)"
browser.session.headers.update({'User-Agent': custom_ua})
if verify_user_agent(browser, custom_ua):
    print("User agent set correctly")
else:
    print("User agent not set properly")
Conclusion
Effective user agent management in MechanicalSoup is essential for successful web scraping. By implementing proper user agent rotation, keeping your strings updated, and ensuring consistency with other browser headers, you can significantly improve your scraping success rate while maintaining ethical scraping practices.
Remember to always respect robots.txt files, implement appropriate delays between requests, and consider the legal and ethical implications of your scraping activities. When combined with proper session and cookie management, user agent management becomes a powerful tool in your web scraping toolkit.
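For the robots.txt check mentioned above, the standard library's urllib.robotparser is enough. This sketch parses inline rules so it runs offline; in practice you would point set_url() at the site's real robots.txt and call read():

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
# Inline rules for demonstration; normally:
#   parser.set_url("https://example.com/robots.txt"); parser.read()
parser.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
])

ua = "MyScraper/1.0"
print(parser.can_fetch(ua, "https://example.com/public/page"))   # True
print(parser.can_fetch(ua, "https://example.com/private/page"))  # False
print(parser.crawl_delay(ua))                                    # 5
```

Checking crawl_delay() before each request gives you a site-sanctioned minimum delay to honor between fetches.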