Scraper is a Rust crate for web scraping. It lets developers parse HTML documents and extract data from them, which is useful for applications such as data mining, information retrieval, and automated testing.
Scraper is built on top of the html5ever and selectors libraries, which are part of the Servo project. The html5ever library provides high-performance parsing of HTML documents, while selectors provides the ability to query for elements using CSS selectors.
Here's how you might use Scraper in a Rust project:
- Add the Scraper dependency: First, add the Scraper crate to your Cargo.toml file (check crates.io for the current version):
    [dependencies]
    scraper = "0.12"
- Parse HTML: Use Scraper to parse an HTML document.
- Select elements: After parsing, use CSS selectors to find elements in the document.
Here's an example of how to use Scraper to extract data from a simple HTML document:
    use scraper::{Html, Selector};

    fn main() {
        // Sample HTML content
        let html_content = r#"
            <html>
                <body>
                    <h1>Welcome to Scraper</h1>
                    <p>Scraper is useful for web scraping.</p>
                    <a href="http://example.com">Link to example.com</a>
                </body>
            </html>
        "#;

        // Parse the HTML document
        let document = Html::parse_document(html_content);

        // Create a Selector for each element you want to scrape
        let h1_selector = Selector::parse("h1").unwrap();
        let p_selector = Selector::parse("p").unwrap();
        let link_selector = Selector::parse("a").unwrap();

        // Use the Selectors to find elements in the document
        for element in document.select(&h1_selector) {
            let text = element.text().collect::<Vec<_>>().join("");
            println!("Heading text: {}", text);
        }
        for element in document.select(&p_selector) {
            let text = element.text().collect::<Vec<_>>().join("");
            println!("Paragraph text: {}", text);
        }
        for element in document.select(&link_selector) {
            let text = element.text().collect::<Vec<_>>().join("");
            let href = element.value().attr("href").unwrap();
            println!("Link text: {}, href: {}", text, href);
        }
    }
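In this example, we parse a string containing HTML and then extract the text of the <h1> and <p> tags, along with the text and href attribute of the <a> tag. Note that calling unwrap() on the href attribute panics if a link has no href, which is acceptable here only because the sample HTML is known in advance.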
In a real-world scenario, you might fetch HTML content from a website using an HTTP client library like reqwest, and then parse and scrape the content with Scraper.
Keep in mind that web scraping must be done ethically and legally. Always check a website's robots.txt file and Terms of Service to ensure that you're allowed to scrape it, and make sure not to overload the server with requests.
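One simple way to avoid overloading a server is to enforce a minimum delay between requests. The sketch below uses only the standard library; the Throttle type, the 500 ms interval, and the example URLs are illustrative assumptions, not part of Scraper or any other crate, and the actual HTTP request is left out.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper that enforces a minimum delay between requests.
struct Throttle {
    min_interval: Duration,
    last_request: Option<Instant>,
}

impl Throttle {
    fn new(min_interval: Duration) -> Self {
        Throttle { min_interval, last_request: None }
    }

    /// Blocks until at least `min_interval` has passed since the previous call.
    /// The first call returns immediately.
    fn wait(&mut self) {
        if let Some(last) = self.last_request {
            let elapsed = last.elapsed();
            if elapsed < self.min_interval {
                thread::sleep(self.min_interval - elapsed);
            }
        }
        self.last_request = Some(Instant::now());
    }
}

fn main() {
    // Illustrative URLs only; substitute the pages you are permitted to scrape.
    let urls = ["http://example.com/a", "http://example.com/b"];
    let mut throttle = Throttle::new(Duration::from_millis(500));

    for url in urls.iter() {
        throttle.wait();
        // Placeholder for the real fetch-and-parse step.
        println!("Fetching {}", url);
    }
}
```

A fixed delay is the bluntest option; if a site publishes a crawl delay or rate limit, that value is a better choice for the interval.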