How does Scraper (Rust) parse HTML documents?

Scraper is a Rust crate that provides an easy-to-use interface for parsing HTML documents and extracting information from them. It is built on top of html5ever, which is an HTML parsing library that closely follows the HTML specification.

To parse HTML documents with Scraper, you first need to add the scraper crate to your Cargo.toml file:

[dependencies]
scraper = "0.12.0" # Check for the latest version on crates.io

Once you've added the dependency, you can start using Scraper in your Rust code. Here's a step-by-step guide on how to parse an HTML document using Scraper:

Step 1: Create a scraper::Html instance

You'll need to create an instance of the Html struct by passing a string slice that contains the HTML document you want to parse.

use scraper::Html;

fn main() {
    let html_content = r#"
        <!DOCTYPE html>
        <html>
        <head>
            <title>Example HTML</title>
        </head>
        <body>
            <h1>Hello, World!</h1>
            <p>This is an example HTML document.</p>
        </body>
        </html>
    "#;

    // Parse the HTML document
    let document = Html::parse_document(html_content);
}

Step 2: Select elements using CSS selectors

Scraper allows you to select elements within the HTML document using CSS selectors. You can use the select method on the Html instance to obtain an iterator over the matching elements.

use scraper::{Html, Selector};

fn main() {
    // ... (previous code)

    // Create a Selector instance for the elements you want to extract
    let selector = Selector::parse("h1").unwrap();

    // Iterate over the selected elements
    for element in document.select(&selector) {
        // Do something with each element, e.g., extract its text
        let text = element.text().collect::<String>();
        println!("Found heading: {}", text);
    }
}

Step 3: Extract data from elements

Once you have selected the elements, you can extract data from them, like text or attribute values.

use scraper::{Html, Selector};

fn main() {
    // ... (previous code)

    // Extract text
    let p_selector = Selector::parse("p").unwrap();
    for element in document.select(&p_selector) {
        let text = element.text().collect::<Vec<_>>().join(" ");
        println!("Paragraph text: {}", text);
    }

    // Extract attributes
    let a_selector = Selector::parse("a").unwrap();
    for element in document.select(&a_selector) {
        if let Some(href) = element.value().attr("href") {
            println!("Found link: {}", href);
        }
    }
}

The above example demonstrates how to parse an HTML document, select elements using CSS selectors, and extract text and attribute values from those elements. Scraper provides a straightforward way to perform web scraping tasks in Rust by leveraging the power of Rust's type system and the robust parsing capabilities of html5ever.
