To extract links from a webpage using Reqwest, you first need to fetch the webpage's HTML content and then parse it to extract the links. Reqwest is a Rust library for making HTTP requests, so you will also need an HTML parsing library such as `scraper` to parse the HTML and extract the links.
Here is a step-by-step guide to extracting links from a webpage using Reqwest in Rust:
- Add the dependencies to your `Cargo.toml`:

```toml
[dependencies]
reqwest = { version = "0.11", features = ["blocking"] }
scraper = "0.12"
```
- Write the Rust code to perform the following actions:
  - Make an HTTP GET request to the webpage using `reqwest`.
  - Parse the response body as a string.
  - Use the `scraper` crate to parse the HTML and select the anchor (`<a>`) elements.
  - Extract the `href` attribute from each anchor element to get the links.
Here's an example code snippet:
```rust
use reqwest;
use scraper::{Html, Selector};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The URL of the webpage to scrape
    let url = "https://example.com";

    // Make a GET request to the URL
    let response_body = reqwest::blocking::get(url)?.text()?;

    // Parse the response body as HTML
    let document = Html::parse_document(&response_body);

    // Create a selector to find all anchor elements
    let selector = Selector::parse("a").unwrap();

    // Iterate over elements matching the selector
    for element in document.select(&selector) {
        // Try to get the href attribute
        if let Some(href) = element.value().attr("href") {
            println!("Found link: {}", href);
        }
    }

    Ok(())
}
```
Make sure to handle errors appropriately in a production application rather than calling `unwrap()` as the example above does.
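For instance, here is a minimal sketch of propagating the selector-parse error instead of panicking; the `extract_links` helper is illustrative, and the error is converted to an owned string because the parse error borrows the selector text:

```rust
use scraper::{Html, Selector};

// Illustrative helper: collect all href values from an HTML document,
// returning an error instead of panicking if the selector fails to parse.
fn extract_links(html: &str) -> Result<Vec<String>, Box<dyn std::error::Error>> {
    let document = Html::parse_document(html);
    let selector = Selector::parse("a")
        .map_err(|e| format!("invalid selector: {:?}", e))?;

    Ok(document
        .select(&selector)
        .filter_map(|element| element.value().attr("href"))
        .map(str::to_owned)
        .collect())
}
```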
The example above uses the `blocking` feature of Reqwest, which is suitable for simple scripts or synchronous applications. If you are building an asynchronous application, use the asynchronous API that Reqwest provides, as sketched below.
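For reference, here is a minimal sketch of the same logic using the asynchronous API, assuming the Tokio runtime; you would add `tokio = { version = "1", features = ["full"] }` to `Cargo.toml` and can drop the `blocking` feature from `reqwest`:

```rust
use scraper::{Html, Selector};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = "https://example.com";

    // Await the request and the response body instead of blocking the thread
    let response_body = reqwest::get(url).await?.text().await?;

    let document = Html::parse_document(&response_body);
    let selector = Selector::parse("a").unwrap();

    for element in document.select(&selector) {
        if let Some(href) = element.value().attr("href") {
            println!("Found link: {}", href);
        }
    }

    Ok(())
}
```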
Remember that web scraping should be done responsibly and ethically. Always check the website’s `robots.txt` file and terms of service to ensure you are allowed to scrape it, and do not overload the website with a high volume of requests.
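For example, one simple way to avoid overloading a site when fetching several pages is to pause between consecutive requests; the `fetch_politely` helper and the one-second delay below are purely illustrative:

```rust
use std::{thread, time::Duration};

// Illustrative sketch: fetch a list of URLs sequentially with a fixed pause
// between requests so the target server is not hammered.
fn fetch_politely(urls: &[&str]) -> Result<(), Box<dyn std::error::Error>> {
    for url in urls {
        let body = reqwest::blocking::get(*url)?.text()?;
        println!("fetched {} ({} bytes)", url, body.len());
        thread::sleep(Duration::from_secs(1)); // arbitrary pacing delay
    }
    Ok(())
}
```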