What are the best Rust crates for web scraping?

Rust has emerged as a powerful language for web scraping, offering excellent performance, memory safety, and a growing ecosystem of specialized crates. This comprehensive guide covers the best Rust crates for web scraping, from HTTP clients to HTML parsing and headless browser automation.

Core HTTP Client Crates

1. reqwest - The Go-To HTTP Client

reqwest is the most popular HTTP client for Rust, offering both synchronous and asynchronous APIs with excellent ergonomics.

use reqwest;
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let client = reqwest::Client::new();

    let response = client
        .get("https://httpbin.org/json")
        .header("User-Agent", "Mozilla/5.0 (compatible; RustBot/1.0)")
        .send()
        .await?;

    let body = response.text().await?;
    println!("Response: {}", body);

    Ok(())
}

Key Features:

  • Async/await support with tokio
  • Built-in JSON serialization/deserialization
  • Cookie jar support
  • Proxy configuration (see the sketch after the installation snippet below)
  • TLS/SSL support
  • Request/response middleware

Installation:

[dependencies]
reqwest = { version = "0.11", features = ["json", "cookies"] }
tokio = { version = "1", features = ["full"] }
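
Several of the features above are configured on the client builder. As a minimal sketch of proxy support (the proxy address below is a placeholder for your own endpoint):

use reqwest::{Client, Proxy};
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    // Placeholder proxy address; substitute your own HTTP(S) proxy.
    let proxy = Proxy::all("http://127.0.0.1:8080")?;

    let client = Client::builder()
        .proxy(proxy)
        .user_agent("Mozilla/5.0 (compatible; RustBot/1.0)")
        .build()?;

    let status = client.get("https://httpbin.org/ip").send().await?.status();
    println!("Status via proxy: {}", status);

    Ok(())
}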

2. ureq - Lightweight Synchronous Client

ureq is a simple, synchronous HTTP client that's perfect for straightforward scraping tasks without async complexity.

use ureq;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let response = ureq::get("https://httpbin.org/json")
        .set("User-Agent", "ureq-scraper/1.0")
        .call()?;

    let json: serde_json::Value = response.into_json()?;
    println!("Data: {}", json);

    Ok(())
}
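
Installation (ureq's json feature, enabled by default in the 2.x series, provides into_json; serde_json supplies the Value type):

[dependencies]
ureq = { version = "2", features = ["json"] }
serde_json = "1"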

HTML Parsing and CSS Selectors

3. scraper - Powerful HTML Parsing

scraper pairs the html5ever parser from the Servo project with CSS-selector-based querying, giving you fast, standards-compliant parsing and an intuitive extraction API.

use scraper::{Html, Selector};

fn main() {
    let html = r#"
        <html>
            <body>
                <div class="article">
                    <h1>Title 1</h1>
                    <p>Content 1</p>
                </div>
                <div class="article">
                    <h1>Title 2</h1>
                    <p>Content 2</p>
                </div>
            </body>
        </html>
    "#;

    let document = Html::parse_document(html);
    let article_selector = Selector::parse("div.article").unwrap();
    let title_selector = Selector::parse("h1").unwrap();
    let content_selector = Selector::parse("p").unwrap();

    for article in document.select(&article_selector) {
        let title = article.select(&title_selector).next().unwrap()
            .text().collect::<Vec<_>>().join("");
        let content = article.select(&content_selector).next().unwrap()
            .text().collect::<Vec<_>>().join("");

        println!("Title: {}, Content: {}", title, content);
    }
}
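
Beyond text content, scraper exposes element attributes through ElementRef::value(), which is handy for collecting links. A short sketch extracting every href from a fragment of HTML:

use scraper::{Html, Selector};

fn main() {
    let html = r#"<ul><li><a href="/page1">One</a></li><li><a href="/page2">Two</a></li></ul>"#;

    let document = Html::parse_fragment(html);
    let link_selector = Selector::parse("a").unwrap();

    for link in document.select(&link_selector) {
        // attr() returns Option<&str>, since an element may lack the attribute.
        if let Some(href) = link.value().attr("href") {
            println!("{} -> {}", link.text().collect::<String>(), href);
        }
    }
}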

4. select - Alternative CSS Selector Library

select takes a different approach to HTML parsing, querying documents through composable predicates such as Name, Class, and Attr rather than CSS selector strings.

use select::document::Document;
use select::predicate::{Class, Name};

fn main() {
    let html = "<div class='post'><h2>Title</h2><p>Content</p></div>";
    let document = Document::from(html);

    for node in document.find(Class("post")) {
        let title = node.find(Name("h2")).next().unwrap().text();
        let content = node.find(Name("p")).next().unwrap().text();
        println!("Title: {}, Content: {}", title, content);
    }
}

Headless Browser Automation

5. headless_chrome - Chrome DevTools Protocol

headless_chrome provides high-level automation of Chrome and Chromium over the DevTools Protocol, filling a role similar to Puppeteer in the Node.js ecosystem.

use headless_chrome::{Browser, LaunchOptionsBuilder};
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    let browser = Browser::new(
        LaunchOptionsBuilder::default()
            .headless(true)
            .build()
            .unwrap()
    )?;

    let tab = browser.wait_for_initial_tab()?;

    tab.navigate_to("https://example.com")?;
    tab.wait_until_navigated()?;

    let title = tab.get_title()?;
    println!("Page title: {}", title);

    // Extract text content
    let content = tab.evaluate("document.body.innerText", false)?;
    println!("Page content: {:?}", content);

    Ok(())
}
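
For pages that render their content with JavaScript, a common pattern is to wait for a specific element and then hand the rendered HTML to scraper. A sketch combining the two crates (the URL and selector are placeholders):

use headless_chrome::{Browser, LaunchOptionsBuilder};
use scraper::{Html, Selector};
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    let browser = Browser::new(
        LaunchOptionsBuilder::default()
            .headless(true)
            .build()
            .unwrap()
    )?;
    let tab = browser.wait_for_initial_tab()?;

    tab.navigate_to("https://example.com")?;
    // Block until the element appears, so JavaScript-rendered content is present.
    tab.wait_for_element("h1")?;

    // Grab the rendered HTML and parse it with scraper.
    let html = tab.get_content()?;
    let document = Html::parse_document(&html);
    let selector = Selector::parse("h1").unwrap();

    for heading in document.select(&selector) {
        println!("Heading: {}", heading.text().collect::<String>());
    }

    Ok(())
}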

6. thirtyfour - WebDriver Protocol

thirtyfour implements the WebDriver protocol for browser automation and works with Selenium-compatible drivers such as chromedriver and geckodriver. The example below assumes a chromedriver instance is already running on port 9515 (its default).

use thirtyfour::prelude::*;

#[tokio::main]
async fn main() -> WebDriverResult<()> {
    let caps = DesiredCapabilities::chrome();
    let driver = WebDriver::new("http://localhost:9515", caps).await?;

    driver.get("https://example.com").await?;

    let title = driver.title().await?;
    println!("Title: {}", title);

    let element = driver.find(By::Tag("h1")).await?;
    let text = element.text().await?;
    println!("H1 text: {}", text);

    driver.quit().await?;
    Ok(())
}

Complete Web Scraping Examples

Basic Web Scraper with reqwest and scraper

use reqwest;
use scraper::{Html, Selector};
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let client = reqwest::Client::builder()
        .user_agent("Mozilla/5.0 (compatible; RustScraper/1.0)")
        .build()?;

    let response = client
        .get("https://quotes.toscrape.com/")
        .send()
        .await?;

    let html = response.text().await?;
    let document = Html::parse_document(&html);

    let quote_selector = Selector::parse("div.quote").unwrap();
    let text_selector = Selector::parse("span.text").unwrap();
    let author_selector = Selector::parse("small.author").unwrap();

    for quote_element in document.select(&quote_selector) {
        let text = quote_element
            .select(&text_selector)
            .next()
            .unwrap()
            .text()
            .collect::<Vec<_>>()
            .join("");

        let author = quote_element
            .select(&author_selector)
            .next()
            .unwrap()
            .text()
            .collect::<Vec<_>>()
            .join("");

        println!("Quote: {}", text);
        println!("Author: {}", author);
        println!("---");
    }

    Ok(())
}

Handling Forms and Sessions

use reqwest::{Client, cookie::Jar};
use std::sync::Arc;
use url::Url;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let jar = Arc::new(Jar::default());
    let client = Client::builder()
        .cookie_provider(jar.clone())
        .build()?;

    // Login form submission
    let login_url = "https://httpbin.org/post";
    let params = [
        ("username", "testuser"),
        ("password", "testpass"),
    ];

    let response = client
        .post(login_url)
        .form(&params)
        .send()
        .await?;

    println!("Login status: {}", response.status());

    // Access protected content with session
    let protected_url = "https://httpbin.org/cookies";
    let protected_response = client
        .get(protected_url)
        .send()
        .await?;

    let body = protected_response.text().await?;
    println!("Protected content: {}", body);

    Ok(())
}

Advanced Features and Utilities

7. tokio - Async Runtime

Most modern Rust web scrapers leverage tokio for asynchronous operations, enabling concurrent scraping of multiple pages.

use reqwest::Client;
use tokio;
use futures::future::join_all;

async fn scrape_url(client: &Client, url: &str) -> Result<String, reqwest::Error> {
    let response = client.get(url).send().await?;
    response.text().await
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    let urls = vec![
        "https://httpbin.org/delay/1",
        "https://httpbin.org/delay/2",
        "https://httpbin.org/delay/3",
    ];

    let futures = urls.iter().map(|url| scrape_url(&client, url));
    let results = join_all(futures).await;

    for (i, result) in results.iter().enumerate() {
        match result {
            Ok(content) => println!("URL {}: {} chars", i, content.len()),
            Err(e) => println!("URL {}: Error - {}", i, e),
        }
    }

    Ok(())
}
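
join_all fires every request at once, which can overwhelm a target server. The futures crate's buffer_unordered caps how many requests are in flight at any time; a sketch limiting concurrency to two requests:

use futures::stream::{self, StreamExt};
use reqwest::Client;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    let urls = vec![
        "https://httpbin.org/delay/1",
        "https://httpbin.org/delay/2",
        "https://httpbin.org/delay/3",
    ];

    let results: Vec<_> = stream::iter(urls)
        .map(|url| {
            let client = client.clone();
            async move {
                let response = client.get(url).send().await?;
                response.text().await
            }
        })
        // At most two requests run concurrently.
        .buffer_unordered(2)
        .collect()
        .await;

    for result in results {
        match result {
            Ok(body) => println!("{} chars", body.len()),
            Err(e) => println!("Error: {}", e),
        }
    }

    Ok(())
}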

8. serde - JSON and Data Serialization

serde is essential for handling JSON APIs and structured data extraction.

use reqwest;
use serde::{Deserialize, Serialize};

#[derive(Deserialize, Debug)]
struct ApiResponse {
    title: String,
    body: String,
    #[serde(rename = "userId")]
    user_id: u32,
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();
    let response: ApiResponse = client
        .get("https://jsonplaceholder.typicode.com/posts/1")
        .send()
        .await?
        .json()
        .await?;

    println!("Title: {}", response.title);
    println!("Body: {}", response.body);
    println!("User ID: {}", response.user_id);

    Ok(())
}

Error Handling and Resilience

Implementing Retry Logic

use reqwest;
use tokio::time::{sleep, Duration};
use std::error::Error;

async fn scrape_with_retry(url: &str, max_retries: u32) -> Result<String, Box<dyn Error>> {
    let client = reqwest::Client::new();
    let mut attempts = 0;

    loop {
        match client.get(url).send().await {
            Ok(response) => {
                if response.status().is_success() {
                    return Ok(response.text().await?);
                } else if attempts >= max_retries {
                    return Err(format!("Failed after {} attempts: {}", max_retries, response.status()).into());
                }
            }
            Err(e) if attempts >= max_retries => {
                return Err(e.into());
            }
            Err(_) => {
                // Continue to retry
            }
        }

        attempts += 1;
        let delay = Duration::from_secs(2_u64.pow(attempts.min(5))); // Exponential backoff
        sleep(delay).await;
    }
}

Performance Optimization Tips

  1. Connection Pooling: Use a single reqwest::Client instance across requests
  2. Concurrent Processing: Leverage tokio's async capabilities for parallel scraping
  3. Rate Limiting: Implement delays between requests to respect target servers
  4. Memory Management: Stream large responses instead of loading everything into memory (a combined sketch of tips 3 and 4 follows this list)
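
A combined sketch of tips 3 and 4, assuming reqwest's "stream" feature and the futures-util crate are added as dependencies (the URLs are placeholders):

use futures_util::StreamExt;
use reqwest::Client;
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    let urls = vec![
        "https://httpbin.org/bytes/1024",
        "https://httpbin.org/bytes/2048",
    ];

    for url in urls {
        // Stream the body chunk by chunk instead of buffering it all in memory.
        let mut stream = client.get(url).send().await?.bytes_stream();
        let mut total = 0usize;
        while let Some(chunk) = stream.next().await {
            total += chunk?.len();
        }
        println!("{}: {} bytes", url, total);

        // Crude rate limiting: pause before the next request.
        sleep(Duration::from_millis(500)).await;
    }

    Ok(())
}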

Crate Comparison Summary

| Crate | Use Case | Async Support | Complexity |
|-------|----------|---------------|------------|
| reqwest | General HTTP client | Yes | Medium |
| ureq | Simple synchronous requests | No | Low |
| scraper | HTML parsing with CSS selectors | N/A | Low |
| select | Alternative HTML parsing | N/A | Low |
| headless_chrome | Browser automation | No | High |
| thirtyfour | WebDriver automation | Yes | High |

For most web scraping projects, combining reqwest for HTTP requests with scraper for HTML parsing provides an excellent foundation. When you need to handle JavaScript-heavy sites or complex user interactions, reach for headless_chrome or thirtyfour to drive a real browser that renders the dynamic content for you.

The Rust ecosystem for web scraping continues to evolve, offering developers powerful tools that combine performance with safety. Whether you're building a simple data extraction tool or a complex crawling system, these crates provide the building blocks for robust web scraping solutions.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
