How to Parse HTML Content Using the Scraper Crate in Rust?
The scraper crate is one of the most popular HTML parsing libraries for Rust, providing a fast and ergonomic way to extract data from HTML documents. Built on top of the html5ever parser, it offers CSS selector support and a jQuery-like API that makes web scraping tasks straightforward and efficient.
What is the Scraper Crate?
The scraper crate is a Rust library that provides HTML parsing capabilities with CSS selector support. It's designed to be fast, memory-efficient, and easy to use, making it an excellent choice for web scraping, HTML processing, and data extraction tasks in Rust applications.
Key features of the scraper crate include:
- CSS selector support for precise element targeting
- Fast HTML5 parsing with html5ever
- Memory-efficient document representation
- Iterator-based element traversal
- Text extraction and attribute access
Installation and Setup
To start using the scraper crate, add it to your Cargo.toml file:
[dependencies]
scraper = "0.18"
reqwest = { version = "0.11", features = ["blocking"] }
tokio = { version = "1", features = ["full"] }
The reqwest crate is included for making HTTP requests to fetch HTML content, and tokio provides async runtime support.
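Because the blocking feature is enabled above, reqwest can also fetch pages synchronously when you don't need an async runtime. Here's a minimal sketch (the URL is only a placeholder):
use scraper::{Html, Selector};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch a page with reqwest's blocking client (no async runtime required)
    let body = reqwest::blocking::get("https://example.com")?.text()?;

    // Parse the response body and count the links on the page
    let document = Html::parse_document(&body);
    let link_selector = Selector::parse("a").unwrap();
    println!("Found {} links", document.select(&link_selector).count());

    Ok(())
}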
Basic HTML Parsing
Here's a simple example of parsing HTML content using the scraper crate:
use scraper::{Html, Selector};

fn main() {
    let html = r#"
        <html>
            <head><title>Sample Page</title></head>
            <body>
                <div class="container">
                    <h1>Welcome</h1>
                    <p class="description">This is a sample paragraph.</p>
                    <ul>
                        <li>Item 1</li>
                        <li>Item 2</li>
                        <li>Item 3</li>
                    </ul>
                </div>
            </body>
        </html>
    "#;

    // Parse the HTML document
    let document = Html::parse_document(html);

    // Create CSS selectors
    let title_selector = Selector::parse("title").unwrap();
    let h1_selector = Selector::parse("h1").unwrap();

    // Extract elements
    for element in document.select(&title_selector) {
        println!("Title: {}", element.text().collect::<String>());
    }

    for element in document.select(&h1_selector) {
        println!("Heading: {}", element.text().collect::<String>());
    }
}
CSS Selectors in Scraper
The scraper crate supports a wide range of CSS selectors, making it easy to target specific elements:
use scraper::{Html, Selector};

fn demonstrate_selectors() {
    let html = r#"
        <div class="content">
            <article id="main-article" class="post featured">
                <h2>Article Title</h2>
                <p class="meta">By <span class="author">John Doe</span></p>
                <div class="content-body">
                    <p>First paragraph</p>
                    <p>Second paragraph</p>
                </div>
            </article>
        </div>
    "#;

    let document = Html::parse_document(html);

    // Various selector examples
    let selectors = vec![
        ("article", "Select by tag name"),
        (".post", "Select by class"),
        ("#main-article", "Select by ID"),
        ("article.featured", "Select by tag and class"),
        (".content > article", "Direct child selector"),
        ("p + p", "Adjacent sibling selector"),
        ("[class*='content']", "Attribute contains selector"),
        ("article h2", "Descendant selector"),
    ];

    for (selector_str, description) in selectors {
        let selector = Selector::parse(selector_str).unwrap();
        let count = document.select(&selector).count();
        println!("{}: {} elements found", description, count);
    }
}
Extracting Text and Attributes
The scraper crate provides multiple ways to extract text content and attributes from elements:
use scraper::{Html, Selector};

fn extract_data() {
    let html = r#"
        <div class="product" data-id="12345">
            <h3 class="title">Laptop Computer</h3>
            <span class="price" data-currency="USD">$999.99</span>
            <img src="/images/laptop.jpg" alt="Laptop" width="300" height="200">
            <a href="/products/laptop" class="view-details">View Details</a>
        </div>
    "#;

    let document = Html::parse_document(html);

    // Extract text content
    let title_selector = Selector::parse(".title").unwrap();
    if let Some(title) = document.select(&title_selector).next() {
        println!("Product title: {}", title.text().collect::<String>());
    }

    // Extract attributes
    let product_selector = Selector::parse(".product").unwrap();
    if let Some(product) = document.select(&product_selector).next() {
        if let Some(id) = product.value().attr("data-id") {
            println!("Product ID: {}", id);
        }
    }

    // Extract image attributes
    let img_selector = Selector::parse("img").unwrap();
    if let Some(img) = document.select(&img_selector).next() {
        println!("Image src: {}", img.value().attr("src").unwrap_or(""));
        println!("Image alt: {}", img.value().attr("alt").unwrap_or(""));
        println!("Image width: {}", img.value().attr("width").unwrap_or(""));
    }

    // Extract link href
    let link_selector = Selector::parse("a").unwrap();
    if let Some(link) = document.select(&link_selector).next() {
        println!("Link URL: {}", link.value().attr("href").unwrap_or(""));
        println!("Link text: {}", link.text().collect::<String>());
    }
}
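Beyond text() and attr(), scraper's ElementRef also exposes html() and inner_html() for cases where you want the raw markup of an element rather than its text nodes. A short sketch:
use scraper::{Html, Selector};

fn extract_markup() {
    let html = r#"<div class="card"><p>Hello <strong>world</strong></p></div>"#;
    let document = Html::parse_document(html);
    let card_selector = Selector::parse(".card").unwrap();

    if let Some(card) = document.select(&card_selector).next() {
        // html() returns the element's outer HTML, including its own tag
        println!("Outer: {}", card.html());
        // inner_html() returns only the markup between the opening and closing tags
        println!("Inner: {}", card.inner_html());
    }
}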
Working with Tables
Parsing HTML tables is a common requirement in web scraping. Here's how to handle tables with the scraper crate:
use scraper::{Html, Selector};

fn parse_table() {
    let html = r#"
        <table class="data-table">
            <thead>
                <tr>
                    <th>Name</th>
                    <th>Age</th>
                    <th>City</th>
                </tr>
            </thead>
            <tbody>
                <tr>
                    <td>Alice</td>
                    <td>25</td>
                    <td>New York</td>
                </tr>
                <tr>
                    <td>Bob</td>
                    <td>30</td>
                    <td>San Francisco</td>
                </tr>
            </tbody>
        </table>
    "#;

    let document = Html::parse_document(html);

    // Extract table headers
    let header_selector = Selector::parse("th").unwrap();
    let headers: Vec<String> = document
        .select(&header_selector)
        .map(|th| th.text().collect::<String>())
        .collect();
    println!("Headers: {:?}", headers);

    // Extract table rows
    let row_selector = Selector::parse("tbody tr").unwrap();
    let cell_selector = Selector::parse("td").unwrap();

    for row in document.select(&row_selector) {
        let cells: Vec<String> = row
            .select(&cell_selector)
            .map(|td| td.text().collect::<String>())
            .collect();
        println!("Row data: {:?}", cells);
    }
}
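Building on the example above, it is often convenient to pair each header with the cell in the same column so that every row becomes a keyed record. A small sketch of that pattern:
use scraper::{Html, Selector};
use std::collections::HashMap;

fn table_to_records(html: &str) -> Vec<HashMap<String, String>> {
    let document = Html::parse_document(html);
    let header_selector = Selector::parse("th").unwrap();
    let row_selector = Selector::parse("tbody tr").unwrap();
    let cell_selector = Selector::parse("td").unwrap();

    let headers: Vec<String> = document
        .select(&header_selector)
        .map(|th| th.text().collect::<String>())
        .collect();

    document
        .select(&row_selector)
        .map(|row| {
            // Zip each header with the cell in the same column position
            headers
                .iter()
                .cloned()
                .zip(row.select(&cell_selector).map(|td| td.text().collect::<String>()))
                .collect()
        })
        .collect()
}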
Fetching and Parsing Web Pages
Combine the scraper crate with HTTP clients like reqwest to fetch and parse web pages:
use reqwest;
use scraper::{Html, Selector};
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    // Fetch HTML content from a web page
    let url = "https://httpbin.org/html";
    let response = reqwest::get(url).await?;
    let body = response.text().await?;

    // Parse the HTML
    let document = Html::parse_document(&body);

    // Extract specific elements
    let h1_selector = Selector::parse("h1").unwrap();
    for element in document.select(&h1_selector) {
        println!("Found heading: {}", element.text().collect::<String>());
    }

    // Extract all links
    let link_selector = Selector::parse("a[href]").unwrap();
    for link in document.select(&link_selector) {
        let href = link.value().attr("href").unwrap_or("");
        let text = link.text().collect::<String>();
        println!("Link: {} -> {}", text.trim(), href);
    }

    Ok(())
}
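Many sites reject requests that arrive without a User-Agent header, so for real-world scraping it is usually worth building a configured reqwest::Client instead of calling reqwest::get directly. A small variation (the header value is just a placeholder):
use reqwest::Client;
use scraper::{Html, Selector};
use std::error::Error;

async fn fetch_headings(url: &str) -> Result<Vec<String>, Box<dyn Error>> {
    // Build a client that identifies itself with a custom User-Agent
    let client = Client::builder()
        .user_agent("my-scraper/0.1 (contact@example.com)")
        .build()?;

    let body = client.get(url).send().await?.text().await?;

    // Parse the response body and collect the top-level headings
    let document = Html::parse_document(&body);
    let heading_selector = Selector::parse("h1, h2").unwrap();

    Ok(document
        .select(&heading_selector)
        .map(|el| el.text().collect::<String>())
        .collect())
}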
Advanced Parsing Techniques
Handling Forms
Extract form data and input fields:
use scraper::{Html, Selector};

fn parse_forms() {
    let html = r#"
        <form action="/submit" method="post">
            <input type="text" name="username" placeholder="Username" required>
            <input type="password" name="password" placeholder="Password">
            <select name="country">
                <option value="us">United States</option>
                <option value="ca" selected>Canada</option>
            </select>
            <input type="submit" value="Login">
        </form>
    "#;

    let document = Html::parse_document(html);

    // Extract form attributes
    let form_selector = Selector::parse("form").unwrap();
    if let Some(form) = document.select(&form_selector).next() {
        println!("Form action: {}", form.value().attr("action").unwrap_or(""));
        println!("Form method: {}", form.value().attr("method").unwrap_or(""));
    }

    // Extract input fields
    let input_selector = Selector::parse("input").unwrap();
    for input in document.select(&input_selector) {
        let name = input.value().attr("name").unwrap_or("");
        let input_type = input.value().attr("type").unwrap_or("");
        let placeholder = input.value().attr("placeholder").unwrap_or("");
        println!("Input: {} (type: {}, placeholder: {})", name, input_type, placeholder);
    }

    // Extract selected option
    let selected_option_selector = Selector::parse("option[selected]").unwrap();
    for option in document.select(&selected_option_selector) {
        let value = option.value().attr("value").unwrap_or("");
        let text = option.text().collect::<String>();
        println!("Selected option: {} (value: {})", text, value);
    }
}
Processing Lists and Navigation
Extract structured data from lists and navigation elements:
use scraper::{Html, Selector};

fn parse_navigation() {
    let html = r#"
        <nav class="main-nav">
            <ul>
                <li><a href="/">Home</a></li>
                <li><a href="/about">About</a></li>
                <li class="dropdown">
                    <a href="/services">Services</a>
                    <ul class="submenu">
                        <li><a href="/web-design">Web Design</a></li>
                        <li><a href="/development">Development</a></li>
                    </ul>
                </li>
            </ul>
        </nav>
    "#;

    let document = Html::parse_document(html);

    // Extract main navigation items
    let nav_selector = Selector::parse("nav > ul > li > a").unwrap();
    for link in document.select(&nav_selector) {
        let href = link.value().attr("href").unwrap_or("");
        let text = link.text().collect::<String>();
        println!("Main nav: {} -> {}", text, href);
    }

    // Extract submenu items
    let submenu_selector = Selector::parse(".submenu a").unwrap();
    for link in document.select(&submenu_selector) {
        let href = link.value().attr("href").unwrap_or("");
        let text = link.text().collect::<String>();
        println!("Submenu: {} -> {}", text, href);
    }
}
Error Handling and Best Practices
Implement proper error handling when parsing HTML:
use scraper::{Html, Selector};
use std::error::Error;
use std::fmt;

#[derive(Debug)]
struct ParseError {
    message: String,
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "Parse error: {}", self.message)
    }
}

impl Error for ParseError {}

fn safe_parse_html(html: &str, selector_str: &str) -> Result<Vec<String>, Box<dyn Error>> {
    // Parse the HTML document
    let document = Html::parse_document(html);

    // Create selector with error handling
    let selector = Selector::parse(selector_str).map_err(|e| ParseError {
        message: format!("Invalid CSS selector '{}': {:?}", selector_str, e),
    })?;

    // Extract text from matching elements
    let results: Vec<String> = document
        .select(&selector)
        .map(|element| element.text().collect::<String>())
        .collect();

    if results.is_empty() {
        return Err(Box::new(ParseError {
            message: format!("No elements found for selector '{}'", selector_str),
        }));
    }

    Ok(results)
}

fn main() {
    let html = "<div class='content'><p>Hello World</p></div>";

    match safe_parse_html(html, "p") {
        Ok(results) => {
            for result in results {
                println!("Found: {}", result);
            }
        }
        Err(e) => {
            eprintln!("Error: {}", e);
        }
    }
}
Performance Considerations
The scraper crate is designed for performance, but here are some tips to optimize your HTML parsing:
use scraper::{Html, Selector};
use std::collections::HashMap;

// Pre-compile selectors for better performance
struct HtmlParser {
    selectors: HashMap<String, Selector>,
}

impl HtmlParser {
    fn new() -> Self {
        let mut selectors = HashMap::new();

        // Pre-compile commonly used selectors
        selectors.insert("title".to_string(), Selector::parse("title").unwrap());
        selectors.insert("links".to_string(), Selector::parse("a[href]").unwrap());
        selectors.insert("images".to_string(), Selector::parse("img").unwrap());
        selectors.insert(
            "headings".to_string(),
            Selector::parse("h1, h2, h3, h4, h5, h6").unwrap(),
        );

        HtmlParser { selectors }
    }

    fn parse_document(&self, html: &str) -> ParseResult {
        let document = Html::parse_document(html);
        let mut result = ParseResult::default();

        // Extract title
        if let Some(selector) = self.selectors.get("title") {
            if let Some(title_element) = document.select(selector).next() {
                result.title = title_element.text().collect::<String>();
            }
        }

        // Extract links
        if let Some(selector) = self.selectors.get("links") {
            for link in document.select(selector) {
                let href = link.value().attr("href").unwrap_or("").to_string();
                let text = link.text().collect::<String>();
                result.links.push((text, href));
            }
        }

        result
    }
}

#[derive(Default)]
struct ParseResult {
    title: String,
    links: Vec<(String, String)>,
}
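A quick usage sketch showing how the pre-compiled parser above might be called:
fn main() {
    let parser = HtmlParser::new();
    let html = r#"
        <html>
            <head><title>Demo Page</title></head>
            <body><a href="/docs">Docs</a></body>
        </html>
    "#;

    // Reuse the same parser (and its compiled selectors) for every document
    let result = parser.parse_document(html);
    println!("Title: {}", result.title);
    for (text, href) in &result.links {
        println!("Link: {} -> {}", text, href);
    }
}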
Integration with Other Tools
While the scraper crate handles static HTML parsing excellently, content that is rendered by JavaScript never appears in the raw HTML it receives. For those pages you need to pair it with browser automation: a headless browser (driven by tools such as Puppeteer, or by a headless-Chrome client from the Rust ecosystem) loads the page, executes its AJAX requests and single-page-application code, and then hands the fully rendered HTML to scraper for extraction.
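As a rough sketch of that pattern in Rust, a headless-browser crate such as headless_chrome (assumed here; its exact API may differ between versions) can render the page first and pass the resulting HTML to scraper:
use headless_chrome::Browser;
use scraper::{Html, Selector};
use std::error::Error;

fn scrape_rendered_page(url: &str) -> Result<Vec<String>, Box<dyn Error>> {
    // Launch a headless Chrome instance and open a new tab (assumed headless_chrome API)
    let browser = Browser::default()?;
    let tab = browser.new_tab()?;

    // Navigate and wait for the page (including its JavaScript) to settle
    tab.navigate_to(url)?;
    tab.wait_until_navigated()?;

    // Hand the rendered HTML over to scraper as usual
    let html = tab.get_content()?;
    let document = Html::parse_document(&html);
    let heading_selector = Selector::parse("h1, h2").unwrap();

    Ok(document
        .select(&heading_selector)
        .map(|h| h.text().collect::<String>())
        .collect())
}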
Conclusion
The scraper crate provides a powerful and efficient solution for HTML parsing in Rust applications. Its CSS selector support, combined with Rust's performance characteristics, makes it an excellent choice for web scraping projects. By following the patterns and examples shown in this guide, you can build robust HTML parsing applications that handle various document structures and extraction requirements effectively.
Whether you're building a web scraper, processing HTML documents, or extracting structured data from web pages, the scraper crate offers the tools and flexibility needed to accomplish your goals efficiently in Rust.