scraper is a Rust crate for parsing HTML documents and querying elements within them, similar to libraries like BeautifulSoup in Python or Nokogiri in Ruby. It is built on top of the html5ever and selectors crates, which are part of the Servo project. The scraper crate provides a simple yet powerful interface for selecting and inspecting HTML elements using CSS selectors.
Here are some of the methods that scraper provides for selecting elements:
- Selecting elements with select: The select method is called on an Html document or an ElementRef and takes a reference to a parsed Selector. It returns an iterator over all descendant elements that match the CSS selector.
use scraper::{Html, Selector};

fn main() {
    let html = r#"<div><p>Foo</p><p>Bar</p></div>"#;
    let document = Html::parse_document(html);
    let selector = Selector::parse("p").unwrap();
    for element in document.select(&selector) {
        println!("{}", element.inner_html());
    }
}
- Selecting a single element with select(...).next(): select returns an iterator, so to get only the first matching element, call next on it. The result is an Option<ElementRef>, since there may be no matching element.
// Assuming document and selector are defined as in the previous example
let first_p = document.select(&selector).next();
if let Some(element) = first_p {
    println!("The first <p> element is: {}", element.inner_html());
}
- Getting an element's text with text: To get the text content of an element, use the text method, which returns an iterator over the contents of its descendant text nodes as string slices.
// Assuming element is an ElementRef as obtained in previous examples
for text_node in element.text() {
    println!("{}", text_node);
}
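- Navigating with parent, next_sibling, and prev_sibling: To navigate the HTML tree, you can use the parent, next_sibling, and prev_sibling methods. ElementRef exposes them by dereferencing to the underlying tree node (a NodeRef from the ego-tree crate), so each returns an Option of a node that may be a text or comment node rather than an element. Use ElementRef::wrap to convert such a node back into an ElementRef when it is an element.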
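// Assuming element is an ElementRef as obtained in previous examples,
// and that scraper::ElementRef is in scope.
// parent() and the sibling methods return tree nodes, which have no html()
// method; ElementRef::wrap yields Some only when the node is an element.
if let Some(parent) = element.parent().and_then(ElementRef::wrap) {
    println!("Parent HTML: {}", parent.html());
}
if let Some(next_sibling) = element.next_sibling().and_then(ElementRef::wrap) {
    println!("Next sibling HTML: {}", next_sibling.html());
}
if let Some(prev_sibling) = element.prev_sibling().and_then(ElementRef::wrap) {
    println!("Previous sibling HTML: {}", prev_sibling.html());
}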
- Accessing element attributes with value: The value method on an ElementRef returns the underlying Element, whose attr method returns the attribute's value as an Option<&str>.
// Assuming element is an ElementRef as obtained in previous examples
if let Some(class_attr) = element.value().attr("class") {
    println!("Class attribute: {}", class_attr);
}
These are some of the core methods provided by the scraper crate for selecting and navigating HTML elements in Rust. The crate also provides methods for creating and manipulating Html and ElementRef objects, as well as for serializing them back to HTML strings, among other utilities.