When it comes to web scraping with Rust, the ecosystem provides several powerful libraries for parsing HTML. Two of the most popular and widely used are `scraper` and `select.rs`. Both are built on top of `html5ever`, Rust's HTML parsing library implementing the HTML5 parsing algorithm.
1. scraper
`scraper` is a high-level web scraping library that provides a simple interface for navigating and querying HTML documents, inspired by Python's BeautifulSoup library.

A typical workflow is to send an HTTP request to retrieve the HTML content (for example with a client library like `reqwest`), parse the HTML with `scraper`, and then navigate and query the document using CSS selectors.

Here's a simple example of how to use `scraper`:
```rust
use scraper::{Html, Selector};

fn main() {
    // HTML content as a &str, usually fetched from a web page.
    let html_content = r#"
        <html>
            <body>
                <p class="message">Hello, world!</p>
            </body>
        </html>
    "#;

    // Parse the HTML document.
    let document = Html::parse_document(html_content);

    // Create a Selector to find elements with the class "message".
    let selector = Selector::parse(".message").unwrap();

    // Iterate over elements matching the selector.
    for element in document.select(&selector) {
        // Collect the text from the element.
        let message_text = element.text().collect::<Vec<_>>().join("");
        println!("Message text: {}", message_text);
    }
}
```
2. select.rs
`select.rs` is another library for parsing HTML, also built on `html5ever`. It provides a jQuery-like interface of composable predicates for selecting and extracting data from HTML documents.

Here's an example of using `select.rs`:
```rust
use select::document::Document;
use select::predicate::{Class, Name, Predicate};

fn main() {
    // HTML content as a &str.
    let html_content = r#"
        <html>
            <body>
                <p class="message">Hello, select.rs!</p>
            </body>
        </html>
    "#;

    // Parse the HTML document.
    let document = Document::from(html_content);

    // Find all <p> tags with the class "message".
    // `find` returns an iterator directly; `and` combines predicates.
    for node in document.find(Name("p").and(Class("message"))) {
        // Print the text from each node.
        println!("{}", node.text());
    }
}
```
Both libraries have their own strengths, and choosing between them often comes down to personal preference regarding their API design and the specific needs of your scraping project.
Additional Libraries and Tools
In addition to the HTML parsing libraries, you might also find the following tools and libraries useful in a Rust-based web scraping project:
- `reqwest`: A high-level HTTP client for making network requests.
- `serde`: A framework for serializing and deserializing Rust data structures, useful for handling JSON APIs.
- `regex`: A regular expression library for Rust, useful for text manipulation and data extraction.
To add any of these libraries to your Rust project, include them in your `Cargo.toml` file under the `[dependencies]` section:
```toml
[dependencies]
scraper = "0.12.0"   # Use the latest version
select = "0.5.0"     # Use the latest version
reqwest = "0.11.6"   # Use the latest version
serde = { version = "1.0", features = ["derive"] }
regex = "1.5.4"      # Use the latest version
```
Always check for the latest versions on crates.io to ensure you have the most up-to-date and secure dependencies.