How to Set Custom HTTP User Agents for Web Scraping
Setting custom HTTP user agents is a fundamental technique in web scraping that helps your requests appear more legitimate and reduces the likelihood of being blocked by websites. A user agent string identifies the browser, operating system, and device making the request, and many websites use this information to determine how to respond to requests.
Understanding User Agent Strings
A user agent string is the value of the User-Agent HTTP request header, which describes the client making the request. Here are some common examples:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
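To make the structure of these strings more concrete, here is a minimal sketch that pulls a browser family and version out of a user agent with regular expressions. The function name describe_user_agent and the pattern list are illustrative assumptions that cover only strings shaped like the examples above; this is not a complete parser.

```python
import re

# Order matters: Chrome strings also contain "Safari/", so Chrome is checked
# before Safari. These patterns are illustrative, not exhaustive.
_BROWSER_PATTERNS = [
    ("Firefox", re.compile(r"Firefox/(\d+(?:\.\d+)*)")),
    ("Chrome", re.compile(r"Chrome/(\d+(?:\.\d+)*)")),
    ("Safari", re.compile(r"Version/(\d+(?:\.\d+)*) .*Safari/")),
]


def describe_user_agent(user_agent):
    """Return (browser, version) for a user agent string, or (None, None)."""
    for name, pattern in _BROWSER_PATTERNS:
        match = pattern.search(user_agent)
        if match:
            return name, match.group(1)
    return None, None


if __name__ == "__main__":
    chrome = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
    print(describe_user_agent(chrome))
```

Servers that inspect the header do something similar, which is why a string that does not follow a real browser's layout is easy to spot.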
Setting User Agents in Python
Using Requests Library
The most common way to set user agents in Python is using the requests library:
import requests
# Define a custom user agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
# Make a request with custom user agent
response = requests.get('https://example.com', headers=headers)
print(response.text)
Using urllib (Built-in Library)
import urllib.request
# Create a request with custom user agent
req = urllib.request.Request(
    'https://example.com',
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15'
    }
)
# Execute the request
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))
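When many urllib calls should share one user agent, the header can be set once on an opener instead of on every Request. The following is a minimal sketch using the standard library's build_opener and install_opener; the USER_AGENT constant name is just an illustrative choice.

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/14.1.1 Safari/605.1.15"
)

# build_opener() returns an OpenerDirector. Its addheaders list is added to
# every request made through that opener; assigning a new list replaces the
# default "Python-urllib/x.y" user agent.
opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", USER_AGENT)]

# install_opener() makes urllib.request.urlopen() use this opener by default.
urllib.request.install_opener(opener)
```

After this, a plain urllib.request.urlopen('https://example.com') call sends the custom header without a per-request headers argument.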
Using Selenium WebDriver
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# Configure Chrome options
chrome_options = Options()
chrome_options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36')
# Create driver with custom user agent
driver = webdriver.Chrome(options=chrome_options)
driver.get('https://example.com')
# Close the browser when finished
driver.quit()
Random User Agent Rotation
For more sophisticated scraping, rotate between multiple user agents:
import requests
import random
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0'
]
def get_random_user_agent():
    return random.choice(user_agents)
# Use random user agent for each request
headers = {'User-Agent': get_random_user_agent()}
response = requests.get('https://example.com', headers=headers)
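A plain random choice can return the same string several times in a row. If that matters for your use case, one possible refinement is a small rotator that remembers its last pick. The class name UserAgentRotator and its seed parameter are hypothetical names introduced for this sketch.

```python
import random


class UserAgentRotator:
    """Pick a random user agent, never the same one twice in a row."""

    def __init__(self, user_agents, seed=None):
        # Remove duplicates while keeping order, so "different from the last
        # pick" always leaves at least one candidate.
        self._user_agents = list(dict.fromkeys(user_agents))
        if len(self._user_agents) < 2:
            raise ValueError("need at least two distinct user agents to rotate")
        self._random = random.Random(seed)
        self._last = None

    def next(self):
        """Return a user agent that differs from the previous return value."""
        candidates = [ua for ua in self._user_agents if ua != self._last]
        self._last = self._random.choice(candidates)
        return self._last
```

With a rotator created once, each request can then build its headers as {'User-Agent': rotator.next()}.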
Setting User Agents in JavaScript
Using Node.js with Axios
const axios = require('axios');
// Set custom user agent
const headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
};
// Make request with custom user agent
axios.get('https://example.com', { headers })
  .then(response => {
    console.log(response.data);
  })
  .catch(error => {
    console.error('Error:', error);
  });
Using Node.js with Fetch
// node-fetch v2 supports require(); v3 is ESM-only, and Node.js 18+ provides a built-in global fetch
const fetch = require('node-fetch');
// Make request with custom user agent
fetch('https://example.com', {
  headers: {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15'
  }
})
  .then(response => response.text())
  .then(html => {
    console.log(html);
  })
  .catch(error => {
    console.error('Error:', error);
  });
Using Puppeteer
When working with headless browsers like Puppeteer, you can set user agents to simulate different browsers and devices. This is particularly useful when you need to handle browser sessions in Puppeteer or perform complex interactions:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Set custom user agent
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36');
  await page.goto('https://example.com');
  const content = await page.content();
  console.log(content);
  await browser.close();
})();
Setting User Agents in Other Languages
PHP with cURL
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://example.com');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);
echo $response;
?>
Go with net/http
package main
import (
	"fmt"
	"io"
	"net/http"
)
func main() {
	client := &http.Client{}
	req, err := http.NewRequest("GET", "https://example.com", nil)
	if err != nil {
		panic(err)
	}
	// Set the custom user agent on the request
	req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	// io.ReadAll replaces ioutil.ReadAll, which is deprecated since Go 1.16
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
Best Practices for User Agent Management
1. Use Realistic User Agents
Always use real browser user agent strings. Avoid generic or obviously fake user agents like "MyBot/1.0" as these are easily detected.
2. Rotate User Agents
Implement user agent rotation to avoid patterns that might trigger anti-bot measures:
import itertools
import requests
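# Use complete browser strings; truncated ones are easy to flag as fake
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
]
# Create a cycling iterator
ua_cycle = itertools.cycle(user_agents)
def make_request(url):
    headers = {'User-Agent': next(ua_cycle)}
    return requests.get(url, headers=headers)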
3. Match User Agent with Other Headers
Ensure your user agent is consistent with other headers like Accept, Accept-Language, and Accept-Encoding:
headers = {
    # The Accept and Accept-Language values below match Firefox's defaults,
    # so they are paired with a Firefox user agent
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}
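One way to keep headers and user agent in step is to store them together as a profile and always select the whole profile at once. The sketch below assumes two hypothetical profiles; the exact header values are illustrative and should be checked against what a real browser of that version sends.

```python
import random

# Hypothetical profiles: each bundles a user agent with headers that the same
# browser would plausibly send. The values are illustrative assumptions.
BROWSER_PROFILES = {
    "chrome": {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
                  "image/avif,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "sec-ch-ua": '"Chromium";v="91", " Not;A Brand";v="99"',
        "sec-ch-ua-mobile": "?0",
    },
    "firefox": {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) "
                      "Gecko/20100101 Firefox/89.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
                  "image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
    },
}


def build_headers(profile=None):
    """Return a copy of one complete profile; pick one at random if none is named."""
    name = profile or random.choice(sorted(BROWSER_PROFILES))
    # Copy so that callers can modify the result without changing the profile.
    return dict(BROWSER_PROFILES[name])
```

Because the user agent never travels without its companion headers, a Chrome string cannot end up next to Firefox-only values by accident.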
4. Consider Mobile User Agents
For mobile-specific content, use mobile user agents:
mobile_user_agents = [
    # Version numbers inside each string are kept consistent with each other
    'Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Mobile/15E148 Safari/604.1',
    'Mozilla/5.0 (Android 11; Mobile; rv:88.0) Gecko/88.0 Firefox/88.0',
    'Mozilla/5.0 (Linux; Android 11; SM-G991B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.120 Mobile Safari/537.36'
]
Advanced User Agent Techniques
Using Third-Party Libraries
For Python, consider using the fake-useragent library:
pip install fake-useragent
from fake_useragent import UserAgent
ua = UserAgent()
# Get random user agent
headers = {'User-Agent': ua.random}
# Get specific browser user agent
headers_chrome = {'User-Agent': ua.chrome}
headers_firefox = {'User-Agent': ua.firefox}
Database of User Agents
Maintain a database of current user agents and update them regularly:
import sqlite3
import requests
import random
class UserAgentManager:
    def __init__(self, db_path='user_agents.db'):
        self.conn = sqlite3.connect(db_path)
        self.create_table()
    def create_table(self):
        self.conn.execute('''
            CREATE TABLE IF NOT EXISTS user_agents (
                id INTEGER PRIMARY KEY,
                user_agent TEXT UNIQUE,
                browser TEXT,
                last_used TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        ''')
    def add_user_agent(self, user_agent, browser):
        try:
            self.conn.execute(
                'INSERT INTO user_agents (user_agent, browser) VALUES (?, ?)',
                (user_agent, browser)
            )
            self.conn.commit()
        except sqlite3.IntegrityError:
            pass  # User agent already exists
    def get_random_user_agent(self):
        cursor = self.conn.execute('SELECT user_agent FROM user_agents ORDER BY RANDOM() LIMIT 1')
        result = cursor.fetchone()
        return result[0] if result else None
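The table above records when an entry was last used but never reads that column. One possible use is to hand out the least recently used entry instead of a random one. The sketch below is self-contained and makes one loud assumption: to keep the ordering deterministic even for calls within the same second, it stores an increasing integer in a hypothetical last_used_seq column rather than a timestamp. The function names are also illustrative.

```python
import sqlite3


def open_store(db_path=":memory:"):
    """Open (or create) a store; last_used_seq is a hypothetical counter column."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS user_agents (
            id INTEGER PRIMARY KEY,
            user_agent TEXT UNIQUE,
            last_used_seq INTEGER NOT NULL DEFAULT 0
        )
    """)
    return conn


def add_user_agent(conn, user_agent):
    # INSERT OR IGNORE skips strings that are already stored.
    conn.execute(
        "INSERT OR IGNORE INTO user_agents (user_agent) VALUES (?)",
        (user_agent,),
    )
    conn.commit()


def take_least_recently_used(conn):
    """Return the entry with the lowest counter and move it to the back."""
    row = conn.execute(
        "SELECT id, user_agent FROM user_agents "
        "ORDER BY last_used_seq, id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    ua_id, user_agent = row
    # Give the chosen row a counter one higher than any other row.
    conn.execute(
        "UPDATE user_agents SET last_used_seq = "
        "(SELECT MAX(last_used_seq) + 1 FROM user_agents) WHERE id = ?",
        (ua_id,),
    )
    conn.commit()
    return user_agent
```

Compared with ORDER BY RANDOM(), this spreads usage evenly across all stored strings, at the cost of a predictable order.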
Testing User Agent Configuration
Verify Your User Agent
Test if your custom user agent is being sent correctly:
import requests
headers = {'User-Agent': 'Your Custom User Agent'}
response = requests.get('https://httpbin.org/headers', headers=headers)
print(response.json())
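If you would rather not depend on an external service, the same check can be run against a throwaway local server. This is a minimal sketch using only the standard library; the handler and function names are illustrative, and the server simply echoes back the request headers it received.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHeadersHandler(BaseHTTPRequestHandler):
    """Reply to GET with the received request headers as JSON."""

    def do_GET(self):
        # Lowercase the names so the lookup does not depend on header casing.
        received = {name.lower(): value for name, value in self.headers.items()}
        body = json.dumps({"headers": received}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_echoed_user_agent(user_agent):
    """Start a local server, send one request, return the User-Agent it saw."""
    server = HTTPServer(("127.0.0.1", 0), EchoHeadersHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/headers"
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        # An empty ProxyHandler keeps the request off any configured proxy.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=5) as response:
            echoed = json.loads(response.read().decode("utf-8"))
        return echoed["headers"].get("user-agent")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_echoed_user_agent("Your Custom User Agent"))
```

The value returned is what actually went over the wire, so it also catches cases where a library silently overrides the header.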
Check for Consistency
When navigating between pages with a browser automation tool such as Puppeteer, make sure the user agent stays the same across all requests and page navigations within a session.
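One way to keep the user agent stable within a session while still varying it between sessions is to derive it from a session identifier. In this sketch, session_id is a hypothetical label you assign to each scraping session, and the function name is illustrative.

```python
import hashlib


def user_agent_for_session(session_id, user_agents):
    """Map a session id to one user agent, the same one on every call."""
    if not user_agents:
        raise ValueError("user_agents must not be empty")
    # A hash of the id gives a stable index that does not depend on call order
    # or on Python's per-process string hash randomization.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(user_agents)
    return user_agents[index]
```

Every request in a given session then reuses user_agent_for_session(session_id, user_agents). Two different sessions can still land on the same string, which is harmless for consistency.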
Common Pitfalls and Solutions
1. Inconsistent Headers
Problem: Using a Chrome user agent with Firefox-specific headers.
Solution: Create header sets that match specific browsers:
chrome_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'sec-ch-ua': '"Chromium";v="91", " Not;A Brand";v="99"',
    'sec-ch-ua-mobile': '?0'
}
2. Outdated User Agents
Problem: Using old browser versions that are easily detected.
Solution: Regularly update your user agent database with current browser versions.
3. Static User Agents
Problem: Using the same user agent for all requests.
Solution: Implement rotation and use different user agents for different sessions.
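For the outdated user agent pitfall, a pool can be filtered by major browser version before it is used. This is a rough sketch: the minimum versions passed in are hypothetical thresholds that you would set and update yourself, and the function names are illustrative.

```python
import re


def is_current(user_agent, minimum_versions):
    """Return True if the first matching browser token meets its minimum major version.

    minimum_versions maps a product token such as "Chrome" or "Firefox" to a
    minimum major version. Tokens are tried in dictionary order, so list
    "Chrome" before "Safari" if both are present. Strings that match no token
    are treated as not current.
    """
    for browser, minimum in minimum_versions.items():
        match = re.search(rf"{re.escape(browser)}/(\d+)", user_agent)
        if match:
            return int(match.group(1)) >= minimum
    return False


def drop_outdated(user_agents, minimum_versions):
    """Keep only the user agents that is_current() accepts, preserving order."""
    return [ua for ua in user_agents if is_current(ua, minimum_versions)]
```

Running drop_outdated over the pool whenever the thresholds change keeps old strings from lingering in rotation.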
Console Commands and Testing
Testing User Agent in Terminal
Use curl to test different user agents from the command line:
# Test with Chrome user agent
curl -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" https://httpbin.org/headers
# Test with Firefox user agent
curl -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0" https://httpbin.org/headers
Verify Response Differences
Some websites serve different content based on user agents:
# Get mobile version
curl -H "User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15" https://example.com
# Get desktop version
curl -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" https://example.com
Integration with Web Scraping APIs
When using professional web scraping services, you can often specify custom user agents through API parameters. For example, with WebScraping.AI's API:
curl -X GET "https://api.webscraping.ai/html" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}'
This approach handles user agent management automatically while providing additional features like proxy rotation and CAPTCHA solving.
Conclusion
Setting custom HTTP user agents is essential for successful web scraping. By implementing proper user agent management with rotation, realistic browser strings, and consistent headers, you can significantly improve your scraping success rate while maintaining ethical scraping practices. Remember to always respect robots.txt files and website terms of service when implementing these techniques.
For more advanced scenarios involving complex page interactions and session management, consider exploring browser automation tools that provide additional control over request headers and user agent configuration, such as handling authentication in Puppeteer.
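The robots.txt reminder above can be turned into a quick check with Python's standard urllib.robotparser module. This is a minimal sketch that parses rules from a string so it runs without network access; the helper name is_allowed and the sample rules are illustrative.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    sample_rules = "User-agent: *\nDisallow: /private/\n"
    print(is_allowed(sample_rules, "MyScraper", "https://example.com/public/page"))
    print(is_allowed(sample_rules, "MyScraper", "https://example.com/private/page"))
```

Against a live site, the parser's set_url and read methods can load the file directly instead of passing the text in.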