`scraper` is a Rust crate for HTML parsing and querying, similar in many ways to Python's Beautiful Soup or JavaScript's Cheerio. It is built on top of the `html5ever` library, which is part of the Servo project. `html5ever` is designed to be compliant with the HTML5 specification, which includes the rules for handling character encodings.
In HTML5, the character encoding can be specified in a few different ways:

- An HTTP `Content-Type` header with a `charset` parameter (see the sketch after this list).
- A `meta` tag within the HTML itself.
- The encoding-sniffing rules defined in the HTML5 specification, if neither of the above is present.
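Extracting that `charset` parameter from a header value is straightforward. Here's a deliberately naive sketch (the helper name is my own); real code should use a proper MIME parser such as the `mime` crate, since header parameters can be quoted and are case-insensitive, which this simple version ignores:

```rust
/// Naive sketch: pull the `charset` parameter out of a Content-Type
/// header value. Real code should handle case and quoting.
fn charset_from_content_type(header: &str) -> Option<&str> {
    header
        .split(';')
        .map(str::trim)
        .find_map(|part| part.strip_prefix("charset="))
}

fn main() {
    let header = "text/html; charset=windows-1252";
    println!("{:?}", charset_from_content_type(header)); // Some("windows-1252")
}
```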
When `scraper` processes an HTML document, the actual parsing is handled by the `html5ever` engine. One thing to note is that `scraper`'s `Html::parse_document` takes a `&str`, and Rust strings are always valid UTF-8, so any decoding from raw bytes has to happen before the document reaches the parser. If the source specifies an encoding, you can decode accordingly; if it doesn't, the usual default is UTF-8, which is the recommended encoding for HTML5 documents.
Here's a simple example of how you might use `scraper` to load and parse an HTML document. Because the input here is a Rust string literal, it is already valid UTF-8 and you don't need to think about encoding at all; the parsing itself is managed internally by the `html5ever` engine.
```rust
use scraper::{Html, Selector};

fn main() {
    // This is a simple string literal, but you could just as well load
    // HTML content fetched from a web page. Assume it is valid UTF-8.
    let html = r#"
        <!DOCTYPE html>
        <html>
        <head>
            <meta charset="UTF-8">
            <title>Example HTML</title>
        </head>
        <body>
            <h1>Hello World!</h1>
        </body>
        </html>
    "#;

    // Parse the HTML document.
    let document = Html::parse_document(html);

    // Use a CSS selector to find the <h1> tag.
    let selector = Selector::parse("h1").unwrap();

    // Iterate over elements matching our selector.
    for element in document.select(&selector) {
        // Collect the text from the selected node.
        let text = element.text().collect::<String>();
        println!("{}", text);
    }
}
```
In the above example, the HTML document includes a `<meta>` tag that declares UTF-8 as the character encoding. When the `Html::parse_document` method is called, the document is parsed as a UTF-8 encoded string; the `meta` tag documents the encoding, but since the input is already a Rust string, it is necessarily valid UTF-8.
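If you want to inspect what encoding a document declares about itself, you can query that `meta` tag with `scraper` too. This is a minimal sketch; note that it only works on a string you have already decoded somehow, so a real byte-level pipeline would pre-scan the raw bytes for the declaration instead:

```rust
use scraper::{Html, Selector};

fn main() {
    let html = r#"<html><head><meta charset="windows-1252"></head><body></body></html>"#;
    let document = Html::parse_document(html);

    // Look for <meta charset="..."> and read the attribute value.
    let selector = Selector::parse("meta[charset]").unwrap();
    if let Some(meta) = document.select(&selector).next() {
        if let Some(charset) = meta.value().attr("charset") {
            println!("declared charset: {}", charset);
        }
    }
}
```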
If you're dealing with a scenario where the document is not UTF-8, you will need to perform the encoding detection and conversion manually before passing the HTML content to `scraper`. In such a case, you can use the `encoding_rs` crate, the character encoding library used by Firefox, to convert the document to UTF-8 before parsing.
Here's a basic example of how to use `encoding_rs` to decode a byte string in an arbitrary known encoding into UTF-8:
```rust
use encoding_rs::WINDOWS_1252;

fn main() {
    // Some bytes in a non-UTF-8 encoding; in Windows-1252, 0x93 and 0x94
    // are the left and right curly double quotation marks.
    let windows_1252_bytes = b"Hello, world! \x93\x94";

    // Decode the Windows-1252 bytes into a UTF-8 Rust string.
    let (cow, _encoding_used, had_errors) = WINDOWS_1252.decode(windows_1252_bytes);
    assert!(!had_errors);

    // Print the converted string.
    println!("{}", cow);
}
```
In this example, the `decode` method converts a byte slice that is assumed to be in the Windows-1252 encoding into a `Cow<str>`, which can be used like any UTF-8 `String` in Rust.
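The example above hardcodes `WINDOWS_1252`, but in practice the encoding is often only known at runtime, as a label pulled from a header or a `meta` tag. `encoding_rs::Encoding::for_label` resolves such a label to an encoding; a small sketch:

```rust
use encoding_rs::{Encoding, UTF_8};

fn main() {
    // Resolve an encoding from a WHATWG label, e.g. one pulled from a
    // Content-Type header; fall back to UTF-8 if the label is unknown.
    let encoding = Encoding::for_label(b"windows-1252").unwrap_or(UTF_8);
    println!("resolved encoding: {}", encoding.name());
}
```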
When dealing with web pages, you would typically fetch the byte content of the page, detect the encoding using the HTTP headers or HTML `meta` tags, and then use `encoding_rs` to convert it to UTF-8 before parsing it with `scraper`.
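Putting the pieces together, here is a rough end-to-end sketch. The `parse_bytes` helper is hypothetical, the raw bytes and charset label are hardcoded stand-ins for what an HTTP client would give you, and the `meta`-tag pre-scan is omitted for brevity:

```rust
use encoding_rs::{Encoding, UTF_8};
use scraper::{Html, Selector};

// Hypothetical helper: decode raw bytes using a charset label (e.g. taken
// from a Content-Type header), falling back to UTF-8 when the label is
// missing or unknown, then hand the result to scraper.
fn parse_bytes(bytes: &[u8], charset_label: Option<&str>) -> Html {
    let encoding = charset_label
        .and_then(|label| Encoding::for_label(label.as_bytes()))
        .unwrap_or(UTF_8);
    let (text, _actual_encoding, _had_errors) = encoding.decode(bytes);
    Html::parse_document(&text)
}

fn main() {
    // Stand-ins for an HTTP response body and its declared charset.
    // In Windows-1252, 0xE9 is "é".
    let body: &[u8] = b"<html><body><h1>Caf\xe9</h1></body></html>";
    let charset = Some("windows-1252");

    let document = parse_bytes(body, charset);
    let selector = Selector::parse("h1").unwrap();
    for element in document.select(&selector) {
        println!("{}", element.text().collect::<String>()); // Café
    }
}
```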