How can I scrape websites that require User-Agent headers in Rust?

Many websites check the User-Agent header to identify the browser or client making HTTP requests. Some sites block requests without proper User-Agent headers or those from known bots and scrapers. In Rust, you can easily set custom User-Agent headers using popular HTTP client libraries like reqwest, ureq, or surf.

Understanding User-Agent Headers

The User-Agent header identifies the client application, operating system, and browser making the request. Websites use this information for:

  • Browser compatibility checks
  • Analytics and tracking
  • Bot detection and blocking
  • Content customization

A typical User-Agent string looks like: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36

Setting User-Agent Headers with Reqwest

reqwest is the most popular HTTP client library in Rust. Here's how to set User-Agent headers:

Basic User-Agent Setup

use reqwest;
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let client = reqwest::Client::builder()
        .user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
        .build()?;

    let response = client
        .get("https://httpbin.org/user-agent")
        .send()
        .await?;

    let text = response.text().await?;
    println!("Response: {}", text);

    Ok(())
}

Per-Request User-Agent Headers

You can also set User-Agent headers on individual requests:

use reqwest;
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let client = reqwest::Client::new();

    let response = client
        .get("https://example.com")
        .header("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36")
        .send()
        .await?;

    println!("Status: {}", response.status());

    let html = response.text().await?;
    println!("Fetched {} bytes of HTML", html.len());

    Ok(())
}

Setting Default Headers with a HeaderMap

use reqwest::header::{HeaderMap, HeaderValue, USER_AGENT};
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let mut headers = HeaderMap::new();
    headers.insert(
        USER_AGENT,
        HeaderValue::from_static("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36")
    );

    let client = reqwest::Client::builder()
        .default_headers(headers)
        .build()?;

    let response = client
        .get("https://example.com")
        .send()
        .await?;

    println!("Response status: {}", response.status());

    Ok(())
}

Setting User-Agent Headers with Ureq

ureq is a synchronous HTTP client that's simpler for basic use cases:

use ureq;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    let agent = ureq::AgentBuilder::new()
        .user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
        .build();

    let response = agent
        .get("https://httpbin.org/user-agent")
        .call()?;

    let text = response.into_string()?;
    println!("Response: {}", text);

    Ok(())
}

Per-Request Headers with Ureq

use ureq;
use std::error::Error;

fn scrape_with_custom_agent() -> Result<String, Box<dyn Error>> {
    let response = ureq::get("https://example.com")
        .set("User-Agent", "Mozilla/5.0 (iPhone; CPU iPhone OS 14_7_1 like Mac OS X)")
        .call()?;

    // into_string() returns io::Result<String>, so the error is boxed here
    Ok(response.into_string()?)
}
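
A minimal caller for this helper, assuming the function above is in scope (only the body length is printed to keep output short):

fn main() {
    match scrape_with_custom_agent() {
        Ok(html) => println!("Fetched {} bytes", html.len()),
        Err(e) => eprintln!("Request failed: {}", e),
    }
}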

User-Agent Rotation Strategy

To avoid detection, rotate between multiple User-Agent strings:

use reqwest;
use rand::seq::SliceRandom;
use std::error::Error;

const USER_AGENTS: &[&str] = &[
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0",
];

struct RotatingUserAgent {
    client: reqwest::Client,
    agents: Vec<String>,
}

impl RotatingUserAgent {
    fn new() -> Self {
        Self {
            client: reqwest::Client::new(),
            agents: USER_AGENTS.iter().map(|s| s.to_string()).collect(),
        }
    }

    fn get_random_agent(&self) -> &str {
        let mut rng = rand::thread_rng();
        self.agents.choose(&mut rng).unwrap()
    }

    async fn fetch(&self, url: &str) -> Result<String, Box<dyn Error>> {
        let user_agent = self.get_random_agent();

        let response = self.client
            .get(url)
            .header("User-Agent", user_agent)
            .send()
            .await?;

        Ok(response.text().await?)
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let scraper = RotatingUserAgent::new();

    for i in 1..=5 {
        let html = scraper.fetch("https://httpbin.org/user-agent").await?;
        println!("Request {}: {}", i, html);

        // Add delay between requests
        tokio::time::sleep(tokio::time::Duration::from_millis(1000)).await;
    }

    Ok(())
}

Advanced Header Management

For sophisticated scraping operations, implement a comprehensive header management system:

use reqwest::header::{HeaderMap, HeaderValue};
use std::collections::HashMap;
use std::error::Error;

struct HeaderManager {
    browser_profiles: HashMap<String, BrowserProfile>,
}

struct BrowserProfile {
    user_agent: String,
    accept: String,
    accept_language: String,
    accept_encoding: String,
}

impl HeaderManager {
    fn new() -> Self {
        let mut profiles = HashMap::new();

        profiles.insert("chrome_windows".to_string(), BrowserProfile {
            user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36".to_string(),
            accept: "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8".to_string(),
            accept_language: "en-US,en;q=0.5".to_string(),
            accept_encoding: "gzip, deflate, br".to_string(),
        });

        profiles.insert("firefox_linux".to_string(), BrowserProfile {
            user_agent: "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0".to_string(),
            accept: "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8".to_string(),
            accept_language: "en-US,en;q=0.5".to_string(),
            accept_encoding: "gzip, deflate".to_string(),
        });

        Self { browser_profiles: profiles }
    }

    fn get_headers(&self, profile_name: &str) -> Result<HeaderMap, Box<dyn Error>> {
        let profile = self.browser_profiles
            .get(profile_name)
            .ok_or("Profile not found")?;

        let mut headers = HeaderMap::new();
        headers.insert("User-Agent", HeaderValue::from_str(&profile.user_agent)?);
        headers.insert("Accept", HeaderValue::from_str(&profile.accept)?);
        headers.insert("Accept-Language", HeaderValue::from_str(&profile.accept_language)?);
        headers.insert("Accept-Encoding", HeaderValue::from_str(&profile.accept_encoding)?);

        Ok(headers)
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let header_manager = HeaderManager::new();
    let headers = header_manager.get_headers("chrome_windows")?;

    let client = reqwest::Client::builder()
        .default_headers(headers)
        .build()?;

    let response = client
        .get("https://httpbin.org/headers")
        .send()
        .await?;

    println!("Response: {}", response.text().await?);

    Ok(())
}

Handling Mobile User-Agents

Some websites serve different content to mobile devices. Here's how to use mobile User-Agent strings:

use reqwest;
use std::error::Error;

const MOBILE_USER_AGENTS: &[&str] = &[
    "Mozilla/5.0 (iPhone; CPU iPhone OS 14_7_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Mobile/15E148 Safari/604.1",
    "Mozilla/5.0 (Linux; Android 11; SM-G991B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.120 Mobile Safari/537.36",
    "Mozilla/5.0 (iPad; CPU OS 14_7_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Mobile/15E148 Safari/604.1",
];

async fn scrape_mobile_version(url: &str) -> Result<String, Box<dyn Error>> {
    let client = reqwest::Client::builder()
        .user_agent(MOBILE_USER_AGENTS[0])
        .build()?;

    let response = client
        .get(url)
        .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
        .header("Accept-Language", "en-US,en;q=0.5")
        .send()
        .await?;

    Ok(response.text().await?)
}
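
A short usage sketch, assuming a Tokio runtime as in the earlier examples; httpbin.org/user-agent simply echoes the header back so you can confirm what was sent:

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = scrape_mobile_version("https://httpbin.org/user-agent").await?;
    println!("{}", html);
    Ok(())
}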

Error Handling and Retry Logic

Implement robust error handling when dealing with User-Agent-sensitive websites:

use reqwest;
use std::error::Error;
use std::time::Duration;

async fn fetch_with_retry(
    url: &str,
    max_retries: u32,
) -> Result<String, Box<dyn Error>> {
    let user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    ];

    for attempt in 0..max_retries {
        let user_agent = user_agents[attempt as usize % user_agents.len()];

        let client = reqwest::Client::builder()
            .user_agent(user_agent)
            .timeout(Duration::from_secs(30))
            .build()?;

        match client.get(url).send().await {
            Ok(response) => {
                if response.status().is_success() {
                    return Ok(response.text().await?);
                } else if response.status() == 403 || response.status() == 429 {
                    println!("Request blocked, trying different User-Agent...");
                    tokio::time::sleep(Duration::from_secs(2)).await;
                    continue;
                }
            }
            Err(e) => {
                println!("Request failed: {}, retrying...", e);
                tokio::time::sleep(Duration::from_secs(1)).await;
            }
        }
    }

    Err("Max retries exceeded".into())
}
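
Calling the retry helper could look like this (a sketch reusing the Tokio setup from the earlier examples):

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Try up to three times, switching User-Agents between attempts.
    let html = fetch_with_retry("https://httpbin.org/user-agent", 3).await?;
    println!("Fetched {} bytes", html.len());
    Ok(())
}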

Working with Cargo Dependencies

To implement the examples above, add these dependencies to your Cargo.toml:

[dependencies]
reqwest = { version = "0.11", features = ["json"] }
tokio = { version = "1", features = ["full"] }
ureq = "2.0"
rand = "0.8"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"

Testing User-Agent Headers

You can verify that your User-Agent headers are being sent correctly:

use reqwest;
use serde_json::Value;
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let client = reqwest::Client::builder()
        .user_agent("MyCustomBot/1.0")
        .build()?;

    let response = client
        .get("https://httpbin.org/headers")
        .send()
        .await?;

    let json: Value = response.json().await?;
    println!("Headers sent: {:#}", json);

    Ok(())
}

Best Practices for User-Agent Headers

  1. Use realistic User-Agent strings: Copy them from real, current browsers rather than inventing your own
  2. Rotate headers: Don't reuse the same User-Agent for every request
  3. Match browser behavior: Include the other headers real browsers send, such as Accept and Accept-Language
  4. Respect rate limits: Add delays between requests to avoid triggering anti-bot measures (see the jitter sketch after this list)
  5. Handle errors gracefully: Implement retry logic that switches User-Agents on blocked responses
  6. Keep headers up to date: Browser User-Agent strings change over time
  7. Test your headers: Use a service like httpbin.org to verify what you're actually sending
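
As referenced in point 4, here is a minimal delay-with-jitter sketch for spacing out requests. It assumes the rand and tokio crates from the earlier examples; the 500-1500 ms range is an arbitrary example value, not a recommendation:

use rand::Rng;
use std::time::Duration;

// Sleep for a random interval so request timing doesn't form an
// obvious fixed-rate pattern.
async fn polite_delay() {
    let jitter_ms = rand::thread_rng().gen_range(500..1500);
    tokio::time::sleep(Duration::from_millis(jitter_ms)).await;
}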

Common User-Agent Patterns

Desktop Browsers

const DESKTOP_AGENTS: &[&str] = &[
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0",
];

Mobile Browsers

const MOBILE_AGENTS: &[&str] = &[
    "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1",
    "Mozilla/5.0 (Linux; Android 13; SM-G991B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36",
];
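
A small helper, assuming the rand crate and the two constant arrays above, to pick from either pool at random:

use rand::seq::SliceRandom;

// Pick a random User-Agent from the mobile or desktop pool.
// Both pools above are non-empty, so unwrap() is safe here.
fn random_agent(mobile: bool) -> &'static str {
    let pool = if mobile { MOBILE_AGENTS } else { DESKTOP_AGENTS };
    pool.choose(&mut rand::thread_rng()).copied().unwrap()
}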

Conclusion

Setting User-Agent headers in Rust is straightforward with libraries like reqwest and ureq. The key to successful web scraping is mimicking real browser behavior by using authentic User-Agent strings and rotating them appropriately. Remember to always respect robots.txt files and website terms of service when scraping.

For more advanced scraping scenarios involving JavaScript-heavy sites, consider browser automation tools that can render content loaded dynamically after the initial page load, and add authentication handling when dealing with protected resources.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
