When scraping websites using the Scraper crate in Rust, handling redirects may not be immediately straightforward because Scraper itself is primarily focused on parsing HTML and does not handle HTTP requests directly. Instead, it works with HTML text that you obtain separately, typically via an HTTP client.
To handle redirects, you'll need to use an HTTP client that supports following redirects. One of the most popular HTTP clients in Rust is `reqwest`, which follows redirects by default.

Below is an example of how you can handle redirects when scraping with Scraper by using the `reqwest` crate:
- Add the necessary dependencies in your `Cargo.toml`:

```toml
[dependencies]
scraper = "0.12"
reqwest = "0.11"
tokio = { version = "1", features = ["full"] }
```
- Use `reqwest` to make an HTTP GET request. By default, `reqwest` will follow up to 10 redirects. After getting the final response, use Scraper to parse and extract data from the HTML.

Here's a simple example that demonstrates making a request to a URL and then using Scraper to find all the `h1` tags:
```rust
use scraper::{Html, Selector};
use reqwest::Error;

#[tokio::main]
async fn main() -> Result<(), Error> {
    // The URL you want to scrape
    let url = "http://example.com";

    // Use reqwest to perform the GET request; redirects are followed automatically
    let res = reqwest::get(url).await?;

    // Check if the request was successful and get the response text
    if res.status().is_success() {
        let body = res.text().await?;

        // Parse the body text using Scraper
        let document = Html::parse_document(&body);
        let selector = Selector::parse("h1").unwrap();

        // Iterate over elements matching our selector
        for element in document.select(&selector) {
            let h1_text = element.text().collect::<Vec<_>>();
            println!("H1 Tag: {:?}", h1_text);
        }
    }
    Ok(())
}
```
In the above code:

- We use `#[tokio::main]` to mark the `main` function as asynchronous.
- We make a GET request to our specified URL using `reqwest::get`.
- If the response is successful, we get the text of the response.
- We parse the HTML response using Scraper's `Html::parse_document`.
- We create a `Selector` to find all `h1` elements.
- We iterate over all the `h1` elements and print their text content.
If you need more control over redirect behavior, you can configure the `reqwest` client by creating an instance of `reqwest::Client` with custom settings. For example, you can set a custom redirect policy, disable redirects, or handle them manually.
Here's how you can create a `reqwest` client that does not follow redirects:
```rust
use reqwest::{Client, Error};

#[tokio::main]
async fn main() -> Result<(), Error> {
    let client = Client::builder()
        .redirect(reqwest::redirect::Policy::none())
        .build()?;

    let res = client.get("http://example.com").send().await?;

    // Handle the response and parsing with Scraper as shown before
    // ...
    Ok(())
}
```
In this setup, you can inspect the response and manually handle redirects by checking the `Location` header and making a new request to the URL specified there. This gives you complete control over how redirects are processed.