Scraper is a Rust crate that provides an easy-to-use interface for parsing HTML documents and extracting information from them. It is built on top of html5ever, an HTML parsing library that closely follows the HTML specification.
To parse HTML documents with Scraper, you first need to add the scraper crate to your Cargo.toml file:
[dependencies]
scraper = "0.12.0" # Check for the latest version on crates.io
Once you've added the dependency, you can start using Scraper in your Rust code. Here's a step-by-step guide on how to parse an HTML document using Scraper:
Step 1: Create a scraper::Html instance
You'll need to create an instance of the Html struct by passing a string slice that contains the HTML document you want to parse.
use scraper::Html;

fn main() {
    let html_content = r#"
        <!DOCTYPE html>
        <html>
        <head>
            <title>Example HTML</title>
        </head>
        <body>
            <h1>Hello, World!</h1>
            <p>This is an example HTML document.</p>
        </body>
        </html>
    "#;

    // Parse the HTML document (the result is used in the following steps)
    let document = Html::parse_document(html_content);
}
Step 2: Select elements using CSS selectors
Scraper allows you to select elements within the HTML document using CSS selectors. You can use the select method on the Html instance to obtain an iterator over the matching elements.
use scraper::{Html, Selector};

fn main() {
    // ... (previous code)

    // Create a Selector instance for the elements you want to extract
    let selector = Selector::parse("h1").unwrap();

    // Iterate over the selected elements
    for element in document.select(&selector) {
        // Do something with each element, e.g., extract its text
        let text = element.text().collect::<Vec<_>>().join("");
        println!("Found heading: {}", text);
    }
}
Step 3: Extract data from elements
Once you have selected the elements, you can extract data from them, like text or attribute values.
// ... (previous code)

fn main() {
    // ... (previous code)

    // Extract text
    let p_selector = Selector::parse("p").unwrap();
    for element in document.select(&p_selector) {
        let text = element.text().collect::<Vec<_>>().join(" ");
        println!("Paragraph text: {}", text);
    }

    // Extract attributes (the Step 1 document contains no <a> elements,
    // so this loop prints nothing for that input)
    let a_selector = Selector::parse("a").unwrap();
    for element in document.select(&a_selector) {
        if let Some(href) = element.value().attr("href") {
            println!("Found link: {}", href);
        }
    }
}
The above example demonstrates how to parse an HTML document, select elements using CSS selectors, and extract text and attribute values from those elements. Scraper provides a straightforward way to perform web scraping tasks in Rust by leveraging the power of Rust's type system and the robust parsing capabilities of html5ever.