How do I manage user agents with MechanicalSoup?

Managing user agents effectively is crucial for successful web scraping with MechanicalSoup. A realistic user agent makes your scraper's requests look like ordinary browser traffic, reducing the chances of being blocked by anti-bot measures. This guide covers setting, rotating, and managing user agents in MechanicalSoup.

What is a User Agent?

A user agent is a string that identifies the client software (browser, crawler, or application) making HTTP requests to a web server. Websites use this information to deliver appropriate content and sometimes to block automated requests. By default, MechanicalSoup sends the requests library's user agent string with a MechanicalSoup suffix (something like python-requests/2.31.0 (MechanicalSoup/1.3.0)), which clearly identifies it as an automated tool.
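
You can check what MechanicalSoup sends by default by inspecting the session headers (the exact version numbers depend on your installed packages):

import mechanicalsoup

# A StatefulBrowser created without arguments keeps the default user agent
browser = mechanicalsoup.StatefulBrowser()
print(browser.session.headers.get('User-Agent'))
# Prints something like: python-requests/2.31.0 (MechanicalSoup/1.3.0)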

Setting a Custom User Agent

Basic User Agent Configuration

The simplest way to set a user agent in MechanicalSoup is during browser initialization:

import mechanicalsoup

# Create browser with custom user agent
browser = mechanicalsoup.StatefulBrowser(
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)

# Navigate to a website
browser.open("https://example.com")

Setting User Agent After Browser Creation

You can also modify the user agent after creating the browser instance:

import mechanicalsoup

# Create browser instance
browser = mechanicalsoup.StatefulBrowser()

# Set custom user agent
browser.session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
})

# Make requests with the new user agent
response = browser.open("https://httpbin.org/user-agent")
print(response.text)

Popular User Agent Strings

Here are some commonly used user agent strings for different browsers and platforms:

user_agents = {
    'chrome_windows': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'firefox_windows': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/121.0',
    'safari_mac': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15',
    'chrome_mac': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'edge_windows': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'
}

# Use a specific user agent
browser = mechanicalsoup.StatefulBrowser(user_agent=user_agents['chrome_windows'])

Implementing User Agent Rotation

User agent rotation helps avoid detection by varying the browser identity across requests. Here's how to implement it:

Simple Random Rotation

import mechanicalsoup
import random

class RotatingUserAgentBrowser:
    def __init__(self):
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/121.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15'
        ]
        self.browser = mechanicalsoup.StatefulBrowser()

    def get_random_user_agent(self):
        return random.choice(self.user_agents)

    def open_with_rotation(self, url):
        # Set random user agent before each request
        user_agent = self.get_random_user_agent()
        self.browser.session.headers.update({'User-Agent': user_agent})
        return self.browser.open(url)

# Usage
rotating_browser = RotatingUserAgentBrowser()
response = rotating_browser.open_with_rotation("https://example.com")

Advanced Rotation with Weighted Selection

import mechanicalsoup
import random

class WeightedUserAgentBrowser:
    def __init__(self):
        # User agents with weights (higher weight = more likely to be selected)
        self.weighted_user_agents = [
            ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36', 40),
            ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36', 25),
            ('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/121.0', 20),
            ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15', 15)
        ]
        self.browser = mechanicalsoup.StatefulBrowser()

    def get_weighted_user_agent(self):
        user_agents, weights = zip(*self.weighted_user_agents)
        return random.choices(user_agents, weights=weights)[0]

    def make_request(self, url):
        user_agent = self.get_weighted_user_agent()
        self.browser.session.headers.update({'User-Agent': user_agent})
        return self.browser.open(url)

# Usage
weighted_browser = WeightedUserAgentBrowser()
response = weighted_browser.make_request("https://example.com")

Using External User Agent Libraries

For more comprehensive user agent management, consider using specialized libraries:

Using fake-useragent Library

# Install first: pip install fake-useragent
import mechanicalsoup
from fake_useragent import UserAgent

class FakeUserAgentBrowser:
    def __init__(self):
        self.ua = UserAgent()
        self.browser = mechanicalsoup.StatefulBrowser()

    def open_with_fake_ua(self, url, browser_type=None):
        if browser_type:
            user_agent = getattr(self.ua, browser_type)
        else:
            user_agent = self.ua.random

        self.browser.session.headers.update({'User-Agent': user_agent})
        return self.browser.open(url)

# Usage
fake_ua_browser = FakeUserAgentBrowser()

# Use random user agent
response1 = fake_ua_browser.open_with_fake_ua("https://example.com")

# Use specific browser type
response2 = fake_ua_browser.open_with_fake_ua("https://example.com", "chrome")
response3 = fake_ua_browser.open_with_fake_ua("https://example.com", "firefox")

Best Practices for User Agent Management

1. Keep User Agents Updated

Regularly update your user agent strings to match current browser versions:

import random
from datetime import datetime, timedelta

import mechanicalsoup

class UpdatedUserAgentBrowser:
    def __init__(self):
        self.browser = mechanicalsoup.StatefulBrowser()
        self.last_update = datetime.now()
        self.update_interval = timedelta(days=7)  # Update weekly
        self.current_user_agents = self.get_current_user_agents()

    def get_current_user_agents(self):
        # In practice, you might fetch these from an API or update manually
        return [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/121.0'
        ]

    def should_update_user_agents(self):
        return datetime.now() - self.last_update > self.update_interval

    def make_request(self, url):
        if self.should_update_user_agents():
            self.current_user_agents = self.get_current_user_agents()
            self.last_update = datetime.now()

        user_agent = random.choice(self.current_user_agents)
        self.browser.session.headers.update({'User-Agent': user_agent})
        return self.browser.open(url)

2. Match User Agent with Other Headers

Ensure your user agent is consistent with other browser headers:

import mechanicalsoup

def create_realistic_browser():
    browser = mechanicalsoup.StatefulBrowser()

    # Set comprehensive headers that match the user agent
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Cache-Control': 'max-age=0'
    }

    browser.session.headers.update(headers)
    return browser

# Usage
realistic_browser = create_realistic_browser()
response = realistic_browser.open("https://example.com")

3. Testing User Agent Effectiveness

Always test your user agents to ensure they're working correctly:

import mechanicalsoup

def test_user_agent(user_agent):
    browser = mechanicalsoup.StatefulBrowser(user_agent=user_agent)

    # Test with httpbin.org which echoes back the user agent
    response = browser.open("https://httpbin.org/user-agent")

    if response.status_code == 200:
        returned_ua = response.json().get('user-agent', '')
        print(f"Set UA: {user_agent}")
        print(f"Returned UA: {returned_ua}")
        print(f"Match: {user_agent == returned_ua}")
        return user_agent == returned_ua
    return False

# Test different user agents
test_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15'
]

for agent in test_agents:
    success = test_user_agent(agent)
    print(f"User agent test {'passed' if success else 'failed'}\n")

Integration with Session Management

When working with websites that require login or session management, it's important to keep the user agent consistent throughout the session; changing it mid-session can invalidate cookies or trip anti-bot checks:

import mechanicalsoup

class SessionAwareBrowser:
    def __init__(self, user_agent=None):
        self.browser = mechanicalsoup.StatefulBrowser()
        if user_agent:
            self.browser.session.headers.update({'User-Agent': user_agent})
        self.logged_in = False

    def login(self, login_url, username, password):
        # Open login page
        login_page = self.browser.open(login_url)

        # Fill and submit login form
        login_form = self.browser.select_form('form[action*="login"]')
        login_form.set("username", username)
        login_form.set("password", password)

        # Submit form
        response = self.browser.submit_selected()

        if "dashboard" in response.url or "welcome" in response.text.lower():
            self.logged_in = True
            print("Login successful")

        return self.logged_in

    def scrape_protected_content(self, url):
        if not self.logged_in:
            raise RuntimeError("Must be logged in to access protected content")

        return self.browser.open(url)

# Usage
session_browser = SessionAwareBrowser(
    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
)

# Login and maintain session with consistent user agent
if session_browser.login("https://example.com/login", "username", "password"):
    protected_content = session_browser.scrape_protected_content("https://example.com/dashboard")

Common Issues and Troubleshooting

Issue 1: User Agent Not Being Set

import mechanicalsoup

# Problem: User agent not properly set
browser = mechanicalsoup.StatefulBrowser()
browser.user_agent = "Custom User Agent"  # This won't work!

# Solution: Use session headers
browser = mechanicalsoup.StatefulBrowser()
browser.session.headers.update({'User-Agent': 'Custom User Agent'})
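
MechanicalSoup also provides a set_user_agent() method on the browser, which updates the same session header for you:

# Alternative: the built-in helper method
browser = mechanicalsoup.StatefulBrowser()
browser.set_user_agent('Custom User Agent')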

Issue 2: User Agent Verification

Always verify that your user agent is being sent correctly:

import mechanicalsoup

def verify_user_agent(browser, expected_ua):
    response = browser.open("https://httpbin.org/headers")
    headers = response.json().get('headers', {})
    actual_ua = headers.get('User-Agent', '')

    print(f"Expected: {expected_ua}")
    print(f"Actual: {actual_ua}")

    return expected_ua in actual_ua

# Usage
browser = mechanicalsoup.StatefulBrowser()
custom_ua = "Mozilla/5.0 (Custom Browser)"
browser.session.headers.update({'User-Agent': custom_ua})

if verify_user_agent(browser, custom_ua):
    print("User agent set correctly")
else:
    print("User agent not set properly")

Conclusion

Effective user agent management in MechanicalSoup is essential for successful web scraping. By implementing proper user agent rotation, keeping your strings updated, and ensuring consistency with other browser headers, you can significantly improve your scraping success rate while maintaining ethical scraping practices.

Remember to always respect robots.txt files, implement appropriate delays between requests, and consider the legal and ethical implications of your scraping activities. When combined with proper session and cookie management, user agent management becomes a powerful tool in your web scraping toolkit.
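
As a starting point for those delays, here is a minimal sketch that combines user agent rotation with a randomized pause between requests; the URLs and the 1-3 second range are placeholders to tune per site:

import random
import time

import mechanicalsoup

# Polite-crawling sketch: rotate user agents and pause between requests
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/121.0',
]

browser = mechanicalsoup.StatefulBrowser()
for url in ['https://example.com/page1', 'https://example.com/page2']:
    browser.session.headers.update({'User-Agent': random.choice(user_agents)})
    browser.open(url)
    time.sleep(random.uniform(1, 3))  # randomized 1-3 second delay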

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
