How to Parse HTML Content Using the Scraper Crate in Rust?

The scraper crate is one of the most popular HTML parsing libraries for Rust, providing a fast and ergonomic way to extract data from HTML documents. Built on top of the html5ever parser, it offers CSS selector support and a jQuery-like API that makes web scraping tasks straightforward and efficient.

What is the Scraper Crate?

The scraper crate is a Rust library that provides HTML parsing capabilities with CSS selector support. It's designed to be fast, memory-efficient, and easy to use, making it an excellent choice for web scraping, HTML processing, and data extraction tasks in Rust applications.

Key features of the scraper crate include:

  • CSS selector support for precise element targeting
  • Fast HTML5 parsing with html5ever
  • Memory-efficient document representation
  • Iterator-based element traversal
  • Text extraction and attribute access

Installation and Setup

To start using the scraper crate, add it to your Cargo.toml file:

[dependencies]
scraper = "0.18"
reqwest = { version = "0.11", features = ["blocking"] }
tokio = { version = "1", features = ["full"] }

The reqwest crate is included for making HTTP requests to fetch HTML content, and tokio provides the async runtime that reqwest's default async API requires; the blocking feature also enables a synchronous client if you prefer to avoid async.

Basic HTML Parsing

Here's a simple example of parsing HTML content using the scraper crate:

use scraper::{Html, Selector};

fn main() {
    let html = r#"
        <html>
            <head><title>Sample Page</title></head>
            <body>
                <div class="container">
                    <h1>Welcome</h1>
                    <p class="description">This is a sample paragraph.</p>
                    <ul>
                        <li>Item 1</li>
                        <li>Item 2</li>
                        <li>Item 3</li>
                    </ul>
                </div>
            </body>
        </html>
    "#;

    // Parse the HTML document
    let document = Html::parse_document(html);

    // Create CSS selectors
    let title_selector = Selector::parse("title").unwrap();
    let h1_selector = Selector::parse("h1").unwrap();

    // Extract elements
    for element in document.select(&title_selector) {
        println!("Title: {}", element.text().collect::<String>());
    }

    for element in document.select(&h1_selector) {
        println!("Heading: {}", element.text().collect::<String>());
    }
}
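
In addition to Html::parse_document, the crate exposes Html::parse_fragment for snippets that are not complete documents. A minimal sketch of the difference (the sample markup is illustrative):

use scraper::{Html, Selector};

fn main() {
    // parse_document normalizes input into a full document (adding html/head/body),
    // while parse_fragment is intended for partial snippets like these list items.
    let fragment = Html::parse_fragment("<li>Item A</li><li>Item B</li>");

    let li_selector = Selector::parse("li").unwrap();
    for li in fragment.select(&li_selector) {
        println!("{}", li.text().collect::<String>());
    }
}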

CSS Selectors in Scraper

The scraper crate supports a wide range of CSS selectors, making it easy to target specific elements:

use scraper::{Html, Selector};

fn demonstrate_selectors() {
    let html = r#"
        <div class="content">
            <article id="main-article" class="post featured">
                <h2>Article Title</h2>
                <p class="meta">By <span class="author">John Doe</span></p>
                <div class="content-body">
                    <p>First paragraph</p>
                    <p>Second paragraph</p>
                </div>
            </article>
        </div>
    "#;

    let document = Html::parse_document(html);

    // Various selector examples
    let selectors = vec![
        ("article", "Select by tag name"),
        (".post", "Select by class"),
        ("#main-article", "Select by ID"),
        ("article.featured", "Select by tag and class"),
        (".content > article", "Direct child selector"),
        ("p + p", "Adjacent sibling selector"),
        ("[class*='content']", "Attribute contains selector"),
        ("article h2", "Descendant selector"),
    ];

    for (selector_str, description) in selectors {
        let selector = Selector::parse(selector_str).unwrap();
        let count = document.select(&selector).count();
        println!("{}: {} elements found", description, count);
    }
}

Extracting Text and Attributes

The scraper crate provides multiple ways to extract text content and attributes from elements:

use scraper::{Html, Selector};

fn extract_data() {
    let html = r#"
        <div class="product" data-id="12345">
            <h3 class="title">Laptop Computer</h3>
            <span class="price" data-currency="USD">$999.99</span>
            <img src="/images/laptop.jpg" alt="Laptop" width="300" height="200">
            <a href="/products/laptop" class="view-details">View Details</a>
        </div>
    "#;

    let document = Html::parse_document(html);

    // Extract text content
    let title_selector = Selector::parse(".title").unwrap();
    if let Some(title) = document.select(&title_selector).next() {
        println!("Product title: {}", title.text().collect::<String>());
    }

    // Extract attributes
    let product_selector = Selector::parse(".product").unwrap();
    if let Some(product) = document.select(&product_selector).next() {
        if let Some(id) = product.value().attr("data-id") {
            println!("Product ID: {}", id);
        }
    }

    // Extract image attributes
    let img_selector = Selector::parse("img").unwrap();
    if let Some(img) = document.select(&img_selector).next() {
        println!("Image src: {}", img.value().attr("src").unwrap_or(""));
        println!("Image alt: {}", img.value().attr("alt").unwrap_or(""));
        println!("Image width: {}", img.value().attr("width").unwrap_or(""));
    }

    // Extract link href
    let link_selector = Selector::parse("a").unwrap();
    if let Some(link) = document.select(&link_selector).next() {
        println!("Link URL: {}", link.value().attr("href").unwrap_or(""));
        println!("Link text: {}", link.text().collect::<String>());
    }
}
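
When an element contains nested markup, text() yields one string slice per text node, so it is often worth trimming and joining the pieces. A small sketch of that normalization (the clean_text helper is an illustrative name, not part of the scraper API):

use scraper::{ElementRef, Html, Selector};

// Trim each text node and join the non-empty fragments with single spaces.
fn clean_text(element: ElementRef) -> String {
    element
        .text()
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let html = r#"<p class="meta">By <span class="author">John Doe</span></p>"#;
    let fragment = Html::parse_fragment(html);
    let selector = Selector::parse("p").unwrap();
    if let Some(p) = fragment.select(&selector).next() {
        println!("{}", clean_text(p)); // "By John Doe"
    }
}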

Working with Tables

Parsing HTML tables is a common requirement in web scraping. Here's how to handle tables with the scraper crate:

use scraper::{Html, Selector};

fn parse_table() {
    let html = r#"
        <table class="data-table">
            <thead>
                <tr>
                    <th>Name</th>
                    <th>Age</th>
                    <th>City</th>
                </tr>
            </thead>
            <tbody>
                <tr>
                    <td>Alice</td>
                    <td>25</td>
                    <td>New York</td>
                </tr>
                <tr>
                    <td>Bob</td>
                    <td>30</td>
                    <td>San Francisco</td>
                </tr>
            </tbody>
        </table>
    "#;

    let document = Html::parse_document(html);

    // Extract table headers
    let header_selector = Selector::parse("th").unwrap();
    let headers: Vec<String> = document
        .select(&header_selector)
        .map(|th| th.text().collect::<String>())
        .collect();
    println!("Headers: {:?}", headers);

    // Extract table rows
    let row_selector = Selector::parse("tbody tr").unwrap();
    let cell_selector = Selector::parse("td").unwrap();

    for row in document.select(&row_selector) {
        let cells: Vec<String> = row
            .select(&cell_selector)
            .map(|td| td.text().collect::<String>())
            .collect();
        println!("Row data: {:?}", cells);
    }
}
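
To turn the same table into structured records, you can pair each cell with its column header. A hedged sketch under the assumption that the table follows the thead/tbody layout shown above (table_to_records is an illustrative name):

use scraper::{Html, Selector};
use std::collections::HashMap;

fn table_to_records(html: &str) -> Vec<HashMap<String, String>> {
    let document = Html::parse_document(html);
    let header_selector = Selector::parse("thead th").unwrap();
    let row_selector = Selector::parse("tbody tr").unwrap();
    let cell_selector = Selector::parse("td").unwrap();

    // Collect the column names once.
    let headers: Vec<String> = document
        .select(&header_selector)
        .map(|th| th.text().collect::<String>())
        .collect();

    // Zip each row's cells with the headers to build one map per row.
    document
        .select(&row_selector)
        .map(|row| {
            headers
                .iter()
                .cloned()
                .zip(row.select(&cell_selector).map(|td| td.text().collect::<String>()))
                .collect()
        })
        .collect()
}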

Fetching and Parsing Web Pages

Combine the scraper crate with HTTP clients like reqwest to fetch and parse web pages:

use reqwest;
use scraper::{Html, Selector};
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    // Fetch HTML content from a web page
    let url = "https://httpbin.org/html";
    let response = reqwest::get(url).await?;
    let body = response.text().await?;

    // Parse the HTML
    let document = Html::parse_document(&body);

    // Extract specific elements
    let h1_selector = Selector::parse("h1").unwrap();
    for element in document.select(&h1_selector) {
        println!("Found heading: {}", element.text().collect::<String>());
    }

    // Extract all links
    let link_selector = Selector::parse("a[href]").unwrap();
    for link in document.select(&link_selector) {
        let href = link.value().attr("href").unwrap_or("");
        let text = link.text().collect::<String>();
        println!("Link: {} -> {}", text.trim(), href);
    }

    Ok(())
}
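
Since the Cargo.toml above enables reqwest's blocking feature, a synchronous variant is also possible. A sketch of that approach (note that the blocking client must not be called from inside an async runtime):

use scraper::{Html, Selector};
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // reqwest::blocking::get performs the request synchronously,
    // so no #[tokio::main] attribute is needed here.
    let body = reqwest::blocking::get("https://httpbin.org/html")?.text()?;
    let document = Html::parse_document(&body);

    let h1_selector = Selector::parse("h1").unwrap();
    for element in document.select(&h1_selector) {
        println!("Found heading: {}", element.text().collect::<String>());
    }

    Ok(())
}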

Advanced Parsing Techniques

Handling Forms

Extract form data and input fields:

use scraper::{Html, Selector};

fn parse_forms() {
    let html = r#"
        <form action="/submit" method="post">
            <input type="text" name="username" placeholder="Username" required>
            <input type="password" name="password" placeholder="Password">
            <select name="country">
                <option value="us">United States</option>
                <option value="ca" selected>Canada</option>
            </select>
            <input type="submit" value="Login">
        </form>
    "#;

    let document = Html::parse_document(html);

    // Extract form attributes
    let form_selector = Selector::parse("form").unwrap();
    if let Some(form) = document.select(&form_selector).next() {
        println!("Form action: {}", form.value().attr("action").unwrap_or(""));
        println!("Form method: {}", form.value().attr("method").unwrap_or(""));
    }

    // Extract input fields
    let input_selector = Selector::parse("input").unwrap();
    for input in document.select(&input_selector) {
        let name = input.value().attr("name").unwrap_or("");
        let input_type = input.value().attr("type").unwrap_or("");
        let placeholder = input.value().attr("placeholder").unwrap_or("");
        println!("Input: {} (type: {}, placeholder: {})", name, input_type, placeholder);
    }

    // Extract selected option
    let selected_option_selector = Selector::parse("option[selected]").unwrap();
    for option in document.select(&selected_option_selector) {
        let value = option.value().attr("value").unwrap_or("");
        let text = option.text().collect::<String>();
        println!("Selected option: {} (value: {})", text, value);
    }
}
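
Building on the form above, a hedged sketch that collects each named field's default value into a map, which can be handy when replaying a submission (default_form_data is an illustrative name):

use scraper::{Html, Selector};
use std::collections::HashMap;

fn default_form_data(html: &str) -> HashMap<String, String> {
    let document = Html::parse_document(html);
    let mut data = HashMap::new();

    // Named inputs contribute their "value" attribute (empty string if absent).
    let input_selector = Selector::parse("input[name]").unwrap();
    for input in document.select(&input_selector) {
        let name = input.value().attr("name").unwrap_or("").to_string();
        let value = input.value().attr("value").unwrap_or("").to_string();
        data.insert(name, value);
    }

    // For each named <select>, take the pre-selected option's value.
    let select_selector = Selector::parse("select[name]").unwrap();
    let selected_option = Selector::parse("option[selected]").unwrap();
    for select in document.select(&select_selector) {
        if let Some(option) = select.select(&selected_option).next() {
            let name = select.value().attr("name").unwrap_or("").to_string();
            let value = option.value().attr("value").unwrap_or("").to_string();
            data.insert(name, value);
        }
    }

    data
}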

Processing Lists and Navigation

Extract structured data from lists and navigation elements:

use scraper::{Html, Selector};

fn parse_navigation() {
    let html = r#"
        <nav class="main-nav">
            <ul>
                <li><a href="/">Home</a></li>
                <li><a href="/about">About</a></li>
                <li class="dropdown">
                    <a href="/services">Services</a>
                    <ul class="submenu">
                        <li><a href="/web-design">Web Design</a></li>
                        <li><a href="/development">Development</a></li>
                    </ul>
                </li>
            </ul>
        </nav>
    "#;

    let document = Html::parse_document(html);

    // Extract main navigation items
    let nav_selector = Selector::parse("nav > ul > li > a").unwrap();
    for link in document.select(&nav_selector) {
        let href = link.value().attr("href").unwrap_or("");
        let text = link.text().collect::<String>();
        println!("Main nav: {} -> {}", text, href);
    }

    // Extract submenu items
    let submenu_selector = Selector::parse(".submenu a").unwrap();
    for link in document.select(&submenu_selector) {
        let href = link.value().attr("href").unwrap_or("");
        let text = link.text().collect::<String>();
        println!("Submenu: {} -> {}", text, href);
    }
}
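
The selectors above run against the whole document, but ElementRef::select scopes a search to one element's descendants, which preserves parent/child relationships. A sketch that keeps each submenu attached to its top-level item, using the same nav markup:

use scraper::{Html, Selector};

fn parse_nav_hierarchy(html: &str) {
    let document = Html::parse_document(html);
    let item_selector = Selector::parse("nav > ul > li").unwrap();
    let link_selector = Selector::parse("a").unwrap();
    let submenu_selector = Selector::parse(".submenu a").unwrap();

    for item in document.select(&item_selector) {
        // select() on an ElementRef only matches descendants of that element.
        if let Some(link) = item.select(&link_selector).next() {
            println!("Section: {}", link.text().collect::<String>());
        }
        for sub in item.select(&submenu_selector) {
            println!("  Child: {}", sub.text().collect::<String>());
        }
    }
}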

Error Handling and Best Practices

Implement proper error handling when parsing HTML:

use scraper::{Html, Selector};
use std::error::Error;
use std::fmt;

#[derive(Debug)]
struct ParseError {
    message: String,
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "Parse error: {}", self.message)
    }
}

impl Error for ParseError {}

fn safe_parse_html(html: &str, selector_str: &str) -> Result<Vec<String>, Box<dyn Error>> {
    // Parse the HTML document
    let document = Html::parse_document(html);

    // Create selector with error handling
    let selector = Selector::parse(selector_str)
        .map_err(|e| ParseError {
            message: format!("Invalid CSS selector '{}': {:?}", selector_str, e),
        })?;

    // Extract text from matching elements
    let results: Vec<String> = document
        .select(&selector)
        .map(|element| element.text().collect::<String>())
        .collect();

    if results.is_empty() {
        return Err(Box::new(ParseError {
            message: format!("No elements found for selector '{}'", selector_str),
        }));
    }

    Ok(results)
}

fn main() {
    let html = "<div class='content'><p>Hello World</p></div>";

    match safe_parse_html(html, "p") {
        Ok(results) => {
            for result in results {
                println!("Found: {}", result);
            }
        }
        Err(e) => {
            eprintln!("Error: {}", e);
        }
    }
}
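
To exercise the error path as well, a malformed selector (such as an unclosed attribute bracket) makes the helper return Err; this snippet assumes it is added inside main above:

    // Hypothetical extra check inside main(): an unclosed bracket fails to parse.
    if let Err(e) = safe_parse_html("<div><p>Hello World</p></div>", "div[") {
        eprintln!("Error: {}", e);
    }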

Performance Considerations

The scraper crate is designed for performance, but there are still ways to optimize your HTML parsing, such as compiling selectors once and reusing them:

use scraper::{Html, Selector};
use std::collections::HashMap;

// Pre-compile selectors for better performance
struct HtmlParser {
    selectors: HashMap<String, Selector>,
}

impl HtmlParser {
    fn new() -> Self {
        let mut selectors = HashMap::new();

        // Pre-compile commonly used selectors
        selectors.insert("title".to_string(), Selector::parse("title").unwrap());
        selectors.insert("links".to_string(), Selector::parse("a[href]").unwrap());
        selectors.insert("images".to_string(), Selector::parse("img").unwrap());
        selectors.insert("headings".to_string(), Selector::parse("h1, h2, h3, h4, h5, h6").unwrap());

        HtmlParser { selectors }
    }

    fn parse_document(&self, html: &str) -> ParseResult {
        let document = Html::parse_document(html);
        let mut result = ParseResult::default();

        // Extract title
        if let Some(selector) = self.selectors.get("title") {
            if let Some(title_element) = document.select(selector).next() {
                result.title = title_element.text().collect::<String>();
            }
        }

        // Extract links
        if let Some(selector) = self.selectors.get("links") {
            for link in document.select(selector) {
                let href = link.value().attr("href").unwrap_or("").to_string();
                let text = link.text().collect::<String>();
                result.links.push((text, href));
            }
        }

        result
    }
}

#[derive(Default)]
struct ParseResult {
    title: String,
    links: Vec<(String, String)>,
}
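
A brief usage sketch for the struct above, assuming it lives in the same module as HtmlParser and ParseResult (the sample pages are illustrative):

fn main() {
    let parser = HtmlParser::new();

    // The pre-compiled selectors are reused for every document parsed.
    let pages = [
        r#"<html><head><title>Page One</title></head><body><a href="/a">A</a></body></html>"#,
        r#"<html><head><title>Page Two</title></head><body><a href="/b">B</a></body></html>"#,
    ];

    for page in pages {
        let result = parser.parse_document(page);
        println!("{}: {} links", result.title, result.links.len());
    }
}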

Integration with Other Tools

While the scraper crate handles static HTML parsing excellently, dynamic content that requires JavaScript execution calls for a different approach: pair it with browser automation tools such as Puppeteer, let the browser handle AJAX requests and single-page applications, and then feed the rendered HTML to scraper.

Conclusion

The scraper crate provides a powerful and efficient solution for HTML parsing in Rust applications. Its CSS selector support, combined with Rust's performance characteristics, makes it an excellent choice for web scraping projects. By following the patterns and examples shown in this guide, you can build robust HTML parsing applications that handle various document structures and extraction requirements effectively.

Whether you're building a web scraper, processing HTML documents, or extracting structured data from web pages, the scraper crate offers the tools and flexibility needed to accomplish your goals efficiently in Rust.
