How to Set Custom HTTP User Agents for Web Scraping

Setting custom HTTP user agents is a fundamental technique in web scraping that helps your requests appear more legitimate and reduces the likelihood of being blocked by websites. A user agent string identifies the browser, operating system, and device making the request, and many websites use this information to determine how to respond to requests.

Understanding User Agent Strings

A user agent string is the value of the User-Agent HTTP request header, which describes the client making the request. Here are some common examples:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
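
To make the structure of these strings concrete, here is a minimal sketch that splits a user agent into its platform details and its product/version tokens. The helper name `parse_user_agent` and both regular expressions are illustrative assumptions, not a standard parser, and real-world strings have more edge cases than this handles.

```python
import re

# Matches "Product/Version" tokens such as "Chrome/91.0.4472.124".
PRODUCT_RE = re.compile(r"([A-Za-z][\w.-]*)/([\w.]+)")
# Matches a parenthesized comment such as "(Windows NT 10.0; Win64; x64)".
COMMENT_RE = re.compile(r"\(([^)]*)\)")


def parse_user_agent(ua: str) -> dict:
    """Split a user agent string into platform details and product tokens.

    Hypothetical helper for illustration only; it is not a full parser.
    """
    # The first parenthesized comment usually carries OS and device details.
    comment = COMMENT_RE.search(ua)
    platform = [part.strip() for part in comment.group(1).split(";")] if comment else []
    # Drop all comments, then collect the remaining Product/Version pairs.
    without_comments = COMMENT_RE.sub("", ua)
    products = dict(PRODUCT_RE.findall(without_comments))
    return {"platform": platform, "products": products}


chrome_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)
parsed = parse_user_agent(chrome_ua)
print(parsed["platform"])
print(parsed["products"])
```

Note that nearly every browser starts with the Mozilla/5.0 token for historical compatibility, so the tokens that actually distinguish browsers appear later in the string.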

Setting User Agents in Python

Using Requests Library

The most common way to set user agents in Python is using the requests library:

import requests

# Define a custom user agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Make a request with custom user agent
response = requests.get('https://example.com', headers=headers)
print(response.text)

Using urllib (Built-in Library)

import urllib.request

# Create a request with custom user agent
req = urllib.request.Request(
    'https://example.com',
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15'
    }
)

# Execute the request; the context manager closes the connection when done
with urllib.request.urlopen(req) as response:
    print(response.read().decode('utf-8'))
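
If every request in a script should carry the same user agent, urllib also lets you set it once on an opener instead of on each Request. The sketch below assumes that is what you want; the `fetch` helper and the `USER_AGENT` constant are illustrative names, while `build_opener`, `addheaders`, and `install_opener` are part of the standard library.

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/14.1.1 Safari/605.1.15"
)

# Headers in addheaders are sent with every request made through this opener,
# unless the individual request already sets the same header.
opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", USER_AGENT)]

# Optional: make this opener the default used by urllib.request.urlopen().
urllib.request.install_opener(opener)


def fetch(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL through the configured opener and return the decoded body."""
    with opener.open(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

Calling `fetch('https://example.com')` would then send the Safari string above without any per-request header setup.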

Using Selenium WebDriver

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Configure Chrome options
chrome_options = Options()
chrome_options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36')

# Create driver with custom user agent
driver = webdriver.Chrome(options=chrome_options)
try:
    driver.get('https://example.com')
finally:
    driver.quit()  # Always release the browser process

Random User Agent Rotation

For more sophisticated scraping, rotate between multiple user agents:

import requests
import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0'
]

def get_random_user_agent():
    return random.choice(user_agents)

# Use random user agent for each request
headers = {'User-Agent': get_random_user_agent()}
response = requests.get('https://example.com', headers=headers)
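
Uniform random selection treats every browser as equally likely. A possible variation is to weight the choice so that more common browsers are picked more often. The sketch below uses `random.choices` for this; the weights are made-up placeholders rather than measured market shares, and `pick_user_agent` and `build_headers` are hypothetical helper names.

```python
import random
from typing import Optional

# (user agent, relative weight) pairs. The weights are illustrative
# assumptions, not real browser market-share figures.
WEIGHTED_USER_AGENTS = [
    ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36", 5),
    ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15", 2),
    ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36", 1),
    ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0", 2),
]


def pick_user_agent(rng: Optional[random.Random] = None) -> str:
    """Pick one user agent, favoring entries with a higher weight."""
    rng = rng or random.Random()
    agents = [agent for agent, _ in WEIGHTED_USER_AGENTS]
    weights = [weight for _, weight in WEIGHTED_USER_AGENTS]
    return rng.choices(agents, weights=weights, k=1)[0]


def build_headers(rng: Optional[random.Random] = None) -> dict:
    """Build a headers dict that can be passed to an HTTP client."""
    return {"User-Agent": pick_user_agent(rng)}
```

Passing a seeded `random.Random` instance makes the selection reproducible, which helps when debugging a scraper run.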

Setting User Agents in JavaScript

Using Node.js with Axios

const axios = require('axios');

// Set custom user agent
const headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
};

// Make request with custom user agent
axios.get('https://example.com', { headers })
    .then(response => {
        console.log(response.data);
    })
    .catch(error => {
        console.error('Error:', error);
    });

Using Node.js with Fetch

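// node-fetch v2 supports require(); v3 is ESM-only and must be loaded with import.
// Node.js 18+ also ships a built-in global fetch.
const fetch = require('node-fetch');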

// Make request with custom user agent
fetch('https://example.com', {
    headers: {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15'
    }
})
.then(response => response.text())
.then(html => {
    console.log(html);
})
.catch(error => {
    console.error('Error:', error);
});

Using Puppeteer

When working with headless browser automation tools like Puppeteer, you can set the user agent to simulate different browsers and devices. This is particularly useful when you need to handle browser sessions in Puppeteer or perform complex interactions:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Set custom user agent
    await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36');

    await page.goto('https://example.com');
    const content = await page.content();
    console.log(content);

    await browser.close();
})();

Setting User Agents in Other Languages

PHP with cURL

<?php
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, 'https://example.com');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$response = curl_exec($ch);
curl_close($ch);

echo $response;
?>

Go with net/http

package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    client := &http.Client{}
    req, err := http.NewRequest("GET", "https://example.com", nil)
    if err != nil {
        panic(err)
    }

    req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")

    resp, err := client.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // io.ReadAll replaces ioutil.ReadAll, which is deprecated since Go 1.16
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }

    fmt.Println(string(body))
}

Best Practices for User Agent Management

1. Use Realistic User Agents

Always use real browser user agent strings. Generic or obviously fake user agents such as "MyBot/1.0", and library defaults such as "python-requests/2.x", are easily detected.

2. Rotate User Agents

Implement user agent rotation to avoid patterns that might trigger anti-bot measures:

import itertools
import requests

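user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
]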

# Create a cycling iterator
ua_cycle = itertools.cycle(user_agents)

def make_request(url):
    headers = {'User-Agent': next(ua_cycle)}
    return requests.get(url, headers=headers)
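
If requests are issued from several threads, a shared iterator is easier to reason about when access to it is guarded by a lock. The following is a minimal sketch under that assumption; the `UserAgentRotator` class is a hypothetical wrapper, not part of any library.

```python
import itertools
import threading


class UserAgentRotator:
    """Round-robin user agent rotation that is safe to share between threads."""

    def __init__(self, user_agents):
        agents = list(user_agents)
        if not agents:
            raise ValueError("user_agents must not be empty")
        self._cycle = itertools.cycle(agents)
        self._lock = threading.Lock()

    def next_headers(self) -> dict:
        """Return a headers dict carrying the next user agent in the cycle."""
        # The lock ensures two threads never advance the iterator at once.
        with self._lock:
            return {"User-Agent": next(self._cycle)}
```

Each worker thread would call `rotator.next_headers()` and pass the result to its HTTP client, in the same way `make_request` does above.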

3. Match User Agent with Other Headers

Ensure your user agent is consistent with other headers like Accept, Accept-Language, and Accept-Encoding:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',  # 'br' needs a Brotli decoder on the client (for requests, the brotli package)
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}

4. Consider Mobile User Agents

For mobile-specific content, use mobile user agents:

mobile_user_agents = [
    'Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Mobile/15E148 Safari/604.1',
    'Mozilla/5.0 (Android 11; Mobile; rv:88.0) Gecko/88.0 Firefox/88.0',
    'Mozilla/5.0 (Linux; Android 11; SM-G991B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.120 Mobile Safari/537.36'
]
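
When desktop and mobile strings live in one pool, it helps to be able to separate them before choosing. A common heuristic is to look for the substring "Mobi" in the user agent. The sketch below relies on that heuristic alone, so it can misclassify tablets and unusual clients; `is_mobile_user_agent` and `split_by_device` are hypothetical helper names.

```python
def is_mobile_user_agent(ua: str) -> bool:
    """Heuristic check: most mobile browsers include 'Mobi' in their user agent."""
    return "Mobi" in ua


def split_by_device(user_agents) -> dict:
    """Group user agents into 'mobile' and 'desktop' lists, preserving order."""
    groups = {"mobile": [], "desktop": []}
    for ua in user_agents:
        key = "mobile" if is_mobile_user_agent(ua) else "desktop"
        groups[key].append(ua)
    return groups


DESKTOP_CHROME = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)
ANDROID_CHROME = (
    "Mozilla/5.0 (Linux; Android 11; SM-G991B) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.120 Mobile Safari/537.36"
)

groups = split_by_device([DESKTOP_CHROME, ANDROID_CHROME])
print(len(groups["mobile"]), len(groups["desktop"]))
```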

Advanced User Agent Techniques

Using Third-Party Libraries

For Python, consider using the fake-useragent library:

pip install fake-useragent

from fake_useragent import UserAgent

ua = UserAgent()

# Get random user agent
headers = {'User-Agent': ua.random}

# Get specific browser user agent
headers_chrome = {'User-Agent': ua.chrome}
headers_firefox = {'User-Agent': ua.firefox}

Database of User Agents

Maintain a database of current user agents and update them regularly:

import sqlite3

class UserAgentManager:
    def __init__(self, db_path='user_agents.db'):
        self.conn = sqlite3.connect(db_path)
        self.create_table()

    def create_table(self):
        self.conn.execute('''
            CREATE TABLE IF NOT EXISTS user_agents (
                id INTEGER PRIMARY KEY,
                user_agent TEXT UNIQUE,
                browser TEXT,
                last_used TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        ''')

    def add_user_agent(self, user_agent, browser):
        try:
            self.conn.execute(
                'INSERT INTO user_agents (user_agent, browser) VALUES (?, ?)',
                (user_agent, browser)
            )
            self.conn.commit()
        except sqlite3.IntegrityError:
            pass  # User agent already exists

    def get_random_user_agent(self):
        cursor = self.conn.execute('SELECT user_agent FROM user_agents ORDER BY RANDOM() LIMIT 1')
        result = cursor.fetchone()
        return result[0] if result else None
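
The class above records a last_used timestamp but never updates it. One possible extension is to pick the least-used entry instead of a random one, so that usage spreads evenly across the pool. The sketch below is a standalone variation, not the class above: it adds a hypothetical `use_count` column, uses an in-memory database by default, and the function names are illustrative.

```python
import sqlite3


def open_store(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a user agent store with a use counter per entry."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS user_agents (
            id INTEGER PRIMARY KEY,
            user_agent TEXT UNIQUE,
            browser TEXT,
            use_count INTEGER NOT NULL DEFAULT 0,
            last_used TIMESTAMP
        )
    """)
    return conn


def add_user_agent(conn: sqlite3.Connection, user_agent: str, browser: str) -> None:
    """Insert a user agent; duplicates are skipped by INSERT OR IGNORE."""
    conn.execute(
        "INSERT OR IGNORE INTO user_agents (user_agent, browser) VALUES (?, ?)",
        (user_agent, browser),
    )
    conn.commit()


def get_least_used_user_agent(conn: sqlite3.Connection):
    """Return the least-used user agent and record the use, or None if empty."""
    row = conn.execute(
        "SELECT id, user_agent FROM user_agents "
        "ORDER BY use_count ASC, id ASC LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE user_agents SET use_count = use_count + 1, "
        "last_used = CURRENT_TIMESTAMP WHERE id = ?",
        (row[0],),
    )
    conn.commit()
    return row[1]
```

Ordering by `use_count` and then `id` keeps the selection deterministic, which makes the behavior easy to test; swapping in `ORDER BY RANDOM()` among the least-used rows would be a reasonable alternative.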

Testing User Agent Configuration

Verify Your User Agent

Test if your custom user agent is being sent correctly:

import requests

headers = {'User-Agent': 'Your Custom User Agent'}
response = requests.get('https://httpbin.org/headers', headers=headers)
print(response.json())
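
The same check can be run without depending on an external service. The sketch below starts a throwaway HTTP server on localhost that reports the User-Agent header it received, then sends one request to it with urllib. It assumes local sockets are available; `EchoUserAgentHandler` and `echoed_user_agent` are hypothetical names used only for this illustration.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoUserAgentHandler(BaseHTTPRequestHandler):
    """Responds with the received User-Agent header as JSON."""

    def do_GET(self):
        # Header lookup is case-insensitive, which matters because urllib
        # sends this header name as "User-agent".
        payload = {"user_agent": self.headers.get("User-Agent", "")}
        body = json.dumps(payload).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Silence per-request logging


def echoed_user_agent(user_agent: str) -> str:
    """Send one request to a local echo server and return the user agent it saw."""
    server = HTTPServer(("127.0.0.1", 0), EchoUserAgentHandler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_address[1]}/"
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        # An empty ProxyHandler bypasses any proxy configured in the environment.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=5) as response:
            return json.load(response)["user_agent"]
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(echoed_user_agent("Your Custom User Agent"))
```

Because the server runs in the same process, this kind of check fits well into an automated test suite for a scraper's HTTP layer.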

Check for Consistency

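When a scraper moves across several pages, for example when navigating to different pages with Puppeteer, make sure the user agent stays the same for every request and page navigation within a session. A user agent that changes partway through a session is itself a signal that anti-bot systems can detect.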

Common Pitfalls and Solutions

1. Inconsistent Headers

Problem: Using a Chrome user agent with Firefox-specific headers.

Solution: Create header sets that match specific browsers:

chrome_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'sec-ch-ua': '"Chromium";v="91", " Not;A Brand";v="99"',
    'sec-ch-ua-mobile': '?0'
}
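
A header set can also be checked for this kind of contradiction before it is used. The sketch below encodes two rules as assumptions: sec-ch-ua client hints are sent by Chromium-based browsers and not by Firefox, and sec-ch-ua-mobile should agree with whether the user agent contains "Mobile". The function name `find_header_mismatches` is hypothetical, and a real checker would need more rules than these.

```python
def find_header_mismatches(headers: dict) -> list:
    """Return warnings for header combinations that contradict the user agent."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    ua = normalized.get("user-agent", "")
    warnings = []

    # Rule 1 (assumption): Firefox does not send sec-ch-ua client hints.
    has_client_hints = any(name.startswith("sec-ch-ua") for name in normalized)
    if "Firefox/" in ua and has_client_hints:
        warnings.append("Firefox user agent sent with sec-ch-ua client hints")

    # Rule 2 (assumption): the mobile hint should match the user agent.
    mobile_hint = normalized.get("sec-ch-ua-mobile")
    if mobile_hint is not None:
        expected = "?1" if "Mobile" in ua else "?0"
        if mobile_hint != expected:
            warnings.append(
                f"sec-ch-ua-mobile is {mobile_hint} but the user agent implies {expected}"
            )
    return warnings
```

Running this over every header profile at startup is a cheap way to catch copy-and-paste mistakes between browser profiles.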

2. Outdated User Agents

Problem: Using old browser versions that are easily detected.

Solution: Regularly update your user agent database with current browser versions.
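
One way to spot stale entries is to extract the browser's major version and compare it with a threshold you choose. The sketch below handles only Chrome-style strings, and `minimum_major` is a value you supply yourself; the code has no knowledge of the current Chrome release. Both function names are hypothetical.

```python
import re

CHROME_VERSION_RE = re.compile(r"Chrome/(\d+)\.")


def chrome_major_version(ua: str):
    """Return the Chrome major version as an int, or None for non-Chrome strings."""
    match = CHROME_VERSION_RE.search(ua)
    return int(match.group(1)) if match else None


def find_outdated(user_agents, minimum_major: int) -> list:
    """Return the Chrome user agents whose major version is below minimum_major."""
    outdated = []
    for ua in user_agents:
        major = chrome_major_version(ua)
        # Strings without a Chrome token are skipped rather than flagged.
        if major is not None and major < minimum_major:
            outdated.append(ua)
    return outdated
```

A scheduled job could run `find_outdated` against the stored pool and report or remove the flagged entries.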

3. Static User Agents

Problem: Using the same user agent for all requests.

Solution: Implement rotation and use different user agents for different sessions.
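
Rotation per session, rather than per request, can be sketched by deriving the user agent from a session identifier, so that one session always presents the same browser while different sessions differ. The example below assumes you already have some string that identifies a session; `user_agent_for_session` and the `USER_AGENTS` pool are illustrative names.

```python
import hashlib

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0",
]


def user_agent_for_session(session_id: str, pool=USER_AGENTS) -> str:
    """Map a session identifier to a stable user agent from the pool."""
    # hashlib gives the same digest on every run, unlike the built-in hash(),
    # which is randomized per process for strings.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]
```

The mapping changes if the pool is reordered or resized, so long-lived sessions should store the chosen string rather than recompute it.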

Console Commands and Testing

Testing User Agent in Terminal

Use curl to test different user agents from the command line:

# Test with Chrome user agent
curl -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" https://httpbin.org/headers

# Test with Firefox user agent
curl -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0" https://httpbin.org/headers

Verify Response Differences

Some websites serve different content based on user agents:

# Get mobile version
curl -H "User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Mobile/15E148 Safari/604.1" https://example.com

# Get desktop version
curl -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" https://example.com

Integration with Web Scraping APIs

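When using professional web scraping services, you can often specify custom user agents through API parameters. For example, a request to WebScraping.AI's API might look like this (check the service's API reference for the exact parameter names and authentication):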

curl -X GET "https://api.webscraping.ai/html" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
  }'

Such services can also manage user agents for you, and typically provide additional features like proxy rotation and CAPTCHA solving.

Conclusion

Setting custom HTTP user agents is essential for successful web scraping. By implementing proper user agent management with rotation, realistic browser strings, and consistent headers, you can significantly improve your scraping success rate while maintaining ethical scraping practices. Remember to always respect robots.txt files and website terms of service when implementing these techniques.

For more advanced scenarios involving complex page interactions and session management, such as handling authentication in Puppeteer, consider browser automation tools that give you additional control over request headers and user agent configuration.
