How can I scrape websites that require User-Agent headers in Rust?
Many websites check the User-Agent header to identify the browser or client making HTTP requests. Some sites block requests without proper User-Agent headers or those from known bots and scrapers. In Rust, you can easily set custom User-Agent headers using popular HTTP client libraries like reqwest, ureq, or surf.
Understanding User-Agent Headers
The User-Agent header identifies the client application, operating system, and browser making the request. Websites use this information for:
- Browser compatibility checks
- Analytics and tracking
- Bot detection and blocking
- Content customization
A typical User-Agent string looks like:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
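Each token in that string carries a piece of platform or engine information. Here is a rough breakdown as Rust comments (approximate; exact tokens vary by browser and version):

// Anatomy of a typical Chrome User-Agent string:
const CHROME_UA: &str = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36";
// Mozilla/5.0             - legacy compatibility token sent by all major browsers
// (Windows NT 10.0; ...)  - operating system and architecture
// AppleWebKit/537.36      - rendering engine ("KHTML, like Gecko" is a compatibility hint)
// Chrome/91.0.4472.124    - browser name and version
// Safari/537.36           - another historical compatibility token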
Setting User-Agent Headers with Reqwest
reqwest is the most popular HTTP client library in Rust. Here's how to set User-Agent headers:
Basic User-Agent Setup
use reqwest;
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let client = reqwest::Client::builder()
        .user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
        .build()?;

    let response = client
        .get("https://httpbin.org/user-agent")
        .send()
        .await?;

    let text = response.text().await?;
    println!("Response: {}", text);

    Ok(())
}
Per-Request User-Agent Headers
You can also set User-Agent headers on individual requests:
use reqwest;
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let client = reqwest::Client::new();

    let response = client
        .get("https://example.com")
        .header("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36")
        .send()
        .await?;

    // Read the status before consuming the response body: text() takes
    // ownership of the response.
    let status = response.status();
    let html = response.text().await?;
    println!("Status: {}", status);
    println!("Fetched {} bytes", html.len());

    Ok(())
}
Setting Default Headers with a HeaderMap
use reqwest::header::{HeaderMap, HeaderValue, USER_AGENT};
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let mut headers = HeaderMap::new();
    headers.insert(
        USER_AGENT,
        HeaderValue::from_static("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"),
    );

    let client = reqwest::Client::builder()
        .default_headers(headers)
        .build()?;

    let response = client
        .get("https://example.com")
        .send()
        .await?;

    println!("Response status: {}", response.status());
    Ok(())
}
Setting User-Agent Headers with Ureq
ureq is a synchronous HTTP client that's simpler for basic use cases:
use ureq;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    let agent = ureq::AgentBuilder::new()
        .user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
        .build();

    let response = agent
        .get("https://httpbin.org/user-agent")
        .call()?;

    let text = response.into_string()?;
    println!("Response: {}", text);
    Ok(())
}
Per-Request Headers with Ureq
use ureq;

fn scrape_with_custom_agent() -> ureq::Result<String> {
    let response = ureq::get("https://example.com")
        .set("User-Agent", "Mozilla/5.0 (iPhone; CPU iPhone OS 14_7_1 like Mac OS X)")
        .call()?;

    response.into_string()
}
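A minimal caller for this helper might look like the following, propagating the ureq error to main:

fn main() -> ureq::Result<()> {
    // Fetch the page with the custom User-Agent and report the body size.
    let html = scrape_with_custom_agent()?;
    println!("Fetched {} bytes", html.len());
    Ok(())
}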
User-Agent Rotation Strategy
To avoid detection, rotate between multiple User-Agent strings:
use rand::seq::SliceRandom;
use reqwest;
use std::error::Error;

const USER_AGENTS: &[&str] = &[
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0",
];

struct RotatingUserAgent {
    client: reqwest::Client,
    agents: Vec<String>,
}

impl RotatingUserAgent {
    fn new() -> Self {
        Self {
            client: reqwest::Client::new(),
            agents: USER_AGENTS.iter().map(|s| s.to_string()).collect(),
        }
    }

    fn get_random_agent(&self) -> &str {
        let mut rng = rand::thread_rng();
        self.agents.choose(&mut rng).unwrap()
    }

    async fn fetch(&self, url: &str) -> Result<String, Box<dyn Error>> {
        let user_agent = self.get_random_agent();
        let response = self.client
            .get(url)
            .header("User-Agent", user_agent)
            .send()
            .await?;
        Ok(response.text().await?)
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let scraper = RotatingUserAgent::new();

    for i in 1..=5 {
        let html = scraper.fetch("https://httpbin.org/user-agent").await?;
        println!("Request {}: {}", i, html);
        // Add a delay between requests
        tokio::time::sleep(tokio::time::Duration::from_millis(1000)).await;
    }

    Ok(())
}
Advanced Header Management
For sophisticated scraping operations, implement a comprehensive header management system:
use reqwest::header::{HeaderMap, HeaderValue};
use std::collections::HashMap;
use std::error::Error;

struct HeaderManager {
    browser_profiles: HashMap<String, BrowserProfile>,
}

struct BrowserProfile {
    user_agent: String,
    accept: String,
    accept_language: String,
    accept_encoding: String,
}

impl HeaderManager {
    fn new() -> Self {
        let mut profiles = HashMap::new();

        profiles.insert("chrome_windows".to_string(), BrowserProfile {
            user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36".to_string(),
            accept: "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8".to_string(),
            accept_language: "en-US,en;q=0.5".to_string(),
            accept_encoding: "gzip, deflate, br".to_string(),
        });

        profiles.insert("firefox_linux".to_string(), BrowserProfile {
            user_agent: "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0".to_string(),
            accept: "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8".to_string(),
            accept_language: "en-US,en;q=0.5".to_string(),
            accept_encoding: "gzip, deflate".to_string(),
        });

        Self { browser_profiles: profiles }
    }

    fn get_headers(&self, profile_name: &str) -> Result<HeaderMap, Box<dyn Error>> {
        let profile = self.browser_profiles
            .get(profile_name)
            .ok_or("Profile not found")?;

        let mut headers = HeaderMap::new();
        headers.insert("User-Agent", HeaderValue::from_str(&profile.user_agent)?);
        headers.insert("Accept", HeaderValue::from_str(&profile.accept)?);
        headers.insert("Accept-Language", HeaderValue::from_str(&profile.accept_language)?);
        headers.insert("Accept-Encoding", HeaderValue::from_str(&profile.accept_encoding)?);
        Ok(headers)
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let header_manager = HeaderManager::new();
    let headers = header_manager.get_headers("chrome_windows")?;

    let client = reqwest::Client::builder()
        .default_headers(headers)
        .build()?;

    let response = client
        .get("https://httpbin.org/headers")
        .send()
        .await?;

    println!("Response: {}", response.text().await?);
    Ok(())
}
Handling Mobile User-Agents
Some websites serve different content to mobile devices. Here's how to use mobile User-Agent strings:
use reqwest;
use std::error::Error;

const MOBILE_USER_AGENTS: &[&str] = &[
    "Mozilla/5.0 (iPhone; CPU iPhone OS 14_7_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Mobile/15E148 Safari/604.1",
    "Mozilla/5.0 (Linux; Android 11; SM-G991B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.120 Mobile Safari/537.36",
    "Mozilla/5.0 (iPad; CPU OS 14_7_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Mobile/15E148 Safari/604.1",
];

async fn scrape_mobile_version(url: &str) -> Result<String, Box<dyn Error>> {
    let client = reqwest::Client::builder()
        .user_agent(MOBILE_USER_AGENTS[0])
        .build()?;

    let response = client
        .get(url)
        .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
        .header("Accept-Language", "en-US,en;q=0.5")
        .send()
        .await?;

    Ok(response.text().await?)
}
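A short driver for this function; httpbin's /user-agent endpoint simply echoes the header back, which makes it handy for checking which string was sent:

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    // The URL here is just an example target.
    let body = scrape_mobile_version("https://httpbin.org/user-agent").await?;
    println!("Mobile response: {}", body);
    Ok(())
}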
Error Handling and Retry Logic
Implement robust error handling when dealing with User-Agent sensitive websites:
use reqwest;
use std::error::Error;
use std::time::Duration;

async fn fetch_with_retry(
    url: &str,
    max_retries: u32,
) -> Result<String, Box<dyn Error>> {
    let user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    ];

    for attempt in 0..max_retries {
        let user_agent = user_agents[attempt as usize % user_agents.len()];
        let client = reqwest::Client::builder()
            .user_agent(user_agent)
            .timeout(Duration::from_secs(30))
            .build()?;

        match client.get(url).send().await {
            Ok(response) => {
                if response.status().is_success() {
                    return Ok(response.text().await?);
                } else if response.status() == 403 || response.status() == 429 {
                    println!("Request blocked, trying different User-Agent...");
                    tokio::time::sleep(Duration::from_secs(2)).await;
                    continue;
                }
            }
            Err(e) => {
                println!("Request failed: {}, retrying...", e);
                tokio::time::sleep(Duration::from_secs(1)).await;
            }
        }
    }

    Err("Max retries exceeded".into())
}
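Calling it is straightforward; a minimal driver might look like this (three attempts is an arbitrary choice here):

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let html = fetch_with_retry("https://httpbin.org/user-agent", 3).await?;
    println!("Fetched {} bytes", html.len());
    Ok(())
}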
Working with Cargo Dependencies
To implement the examples above, add these dependencies to your Cargo.toml:
[dependencies]
reqwest = { version = "0.11", features = ["json"] }
tokio = { version = "1", features = ["full"] }
ureq = "2.0"
rand = "0.8"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
Testing User-Agent Headers
You can verify that your User-Agent headers are being sent correctly:
use reqwest;
use serde_json::Value;
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let client = reqwest::Client::builder()
        .user_agent("MyCustomBot/1.0")
        .build()?;

    let response = client
        .get("https://httpbin.org/headers")
        .send()
        .await?;

    let json: Value = response.json().await?;
    println!("Headers sent: {:#}", json);
    Ok(())
}
Best Practices for User-Agent Headers
- Use realistic User-Agent strings: Copy them from real, current browsers rather than inventing your own
- Rotate headers: Don't use the same User-Agent for every request
- Match browser behavior: Include the other headers real browsers send (Accept, Accept-Language, and so on)
- Respect rate limits: Add delays between requests to avoid triggering anti-bot measures (see the sketch after this list)
- Handle errors gracefully: Implement retry logic that switches User-Agents on failure
- Keep headers up to date: Browser User-Agent strings change over time
- Test your headers: Use tools like httpbin.org to verify that the headers you expect are actually sent
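As a sketch of the rate-limit point above (the one-second pause would be an arbitrary choice; tune it to the target site), assuming the reqwest and tokio dependencies from the Cargo section earlier:

use std::time::Duration;

// Hypothetical helper: fetch URLs sequentially with a fixed pause between
// requests so the target host is never hammered.
async fn fetch_politely(
    client: &reqwest::Client,
    urls: &[&str],
    pause: Duration,
) -> Result<Vec<String>, reqwest::Error> {
    let mut bodies = Vec::with_capacity(urls.len());
    for url in urls {
        bodies.push(client.get(*url).send().await?.text().await?);
        tokio::time::sleep(pause).await; // polite delay before the next request
    }
    Ok(bodies)
}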
Common User-Agent Patterns
Desktop Browsers
const DESKTOP_AGENTS: &[&str] = &[
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0",
];
Mobile Browsers
const MOBILE_AGENTS: &[&str] = &[
    "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1",
    "Mozilla/5.0 (Linux; Android 13; SM-G991B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36",
];
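These pools plug into the rotation approach from earlier. A minimal sketch that draws from either list, assuming both constants are in scope along with the rand crate from the dependency section:

use rand::seq::SliceRandom;

// Pick a random agent from the desktop or mobile pool.
fn random_agent(mobile: bool) -> &'static str {
    let pool = if mobile { MOBILE_AGENTS } else { DESKTOP_AGENTS };
    pool.choose(&mut rand::thread_rng()).copied().unwrap() // pools are non-empty
}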
Conclusion
Setting User-Agent headers in Rust is straightforward with libraries like reqwest and ureq. The key to successful web scraping is mimicking real browser behavior: use authentic User-Agent strings and rotate them appropriately. Always respect robots.txt files and website terms of service when scraping.
For more advanced scenarios involving JavaScript-heavy sites, consider browser automation tools that can render dynamically loaded content, and implement authentication handling when dealing with protected resources.