Yes, you can use the `scraper` crate with Rust's async/await syntax, but you will need an asynchronous HTTP client, because `scraper` itself performs no network operations: it only parses and queries HTML, typically with CSS-style selectors.

To fetch HTML content asynchronously, you can use an async HTTP client such as `reqwest`, which supports async/await. Here is an example of how you might use `scraper` with `reqwest` to scrape a page asynchronously.

First, add the necessary dependencies to your `Cargo.toml`:
```toml
[dependencies]
scraper = "0.12"
reqwest = { version = "0.11", features = ["json"] }
tokio = { version = "1", features = ["full"] }
```
Now you can write an async function that fetches an HTML document and uses `scraper` to parse it:
```rust
use scraper::{Html, Selector};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The URL you want to scrape
    let url = "http://example.com";

    // Fetch the HTML content using reqwest; error_for_status turns
    // HTTP error codes (4xx/5xx) into an Err instead of panicking
    let resp = reqwest::get(url).await?.error_for_status()?;
    let body = resp.text().await?;

    // Parse the HTML using scraper
    let document = Html::parse_document(&body);

    // Create a selector for the elements you want
    let selector = Selector::parse("h1").unwrap();

    // Iterate over elements matching the selector
    for element in document.select(&selector) {
        let text = element.text().collect::<Vec<_>>();
        println!("{:?}", text);
    }

    Ok(())
}
```
In this example:

- We use the `tokio` runtime to execute the async code.
- The `reqwest` crate performs an async GET request to fetch the HTML page.
- We parse the body of the response with `scraper` by creating an `Html` document.
- We create a `Selector` to find `h1` tags in the HTML.
- We iterate over the elements matched by the selector and print their text content.
Remember to keep the version numbers in your `Cargo.toml` up to date and mutually compatible, so that you get the latest features and fixes.