How to Implement Data Validation and Sanitization in Rust Web Scraping?
Data validation and sanitization are crucial components of robust web scraping applications. When building web scrapers in Rust, you need to ensure that the extracted data is clean, properly formatted, and secure before processing or storing it. This guide covers comprehensive techniques for implementing data validation and sanitization in Rust web scraping projects.
Understanding Data Validation vs. Sanitization
Data Validation verifies that scraped data meets specific criteria and constraints, while Data Sanitization involves cleaning and transforming data to remove unwanted characters, normalize formats, and prevent security vulnerabilities.
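For instance (a toy sketch; the helper functions and the 255-character rule are arbitrary examples, not from any library):

// Sanitization: transform raw input into a clean, normalized form
fn sanitize_name(raw: &str) -> String {
    raw.trim().split_whitespace().collect::<Vec<_>>().join(" ")
}

// Validation: accept or reject the already-cleaned value against a rule
fn validate_name(name: &str) -> Result<(), String> {
    if name.is_empty() || name.len() > 255 {
        Err("name must be 1-255 characters".to_string())
    } else {
        Ok(())
    }
}

fn main() {
    let raw = "   Widget\t 3000  ";
    let clean = sanitize_name(raw); // "Widget 3000"
    assert!(validate_name(&clean).is_ok());
}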
Essential Rust Dependencies
First, add these essential crates to your Cargo.toml:
[dependencies]
# tokio provides the async runtime required by reqwest and #[tokio::main]
tokio = { version = "1", features = ["full"] }
reqwest = { version = "0.11", features = ["json"] }
scraper = "0.17"
regex = "1.7"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
validator = { version = "0.16", features = ["derive"] }
ammonia = "3.3"
url = "2.3"
chrono = { version = "0.4", features = ["serde"] }
Basic Data Validation Structure
Create a foundation for validating scraped data using Rust's type system and the validator crate:
use serde::{Deserialize, Serialize};
use validator::{Validate, ValidationError};
use regex::Regex;

#[derive(Debug, Serialize, Deserialize, Validate)]
pub struct ScrapedProduct {
    #[validate(length(min = 1, max = 255))]
    pub name: String,
    #[validate(range(min = 0.01, max = 999999.99))]
    pub price: f64,
    #[validate(email)]
    pub contact_email: Option<String>,
    #[validate(url)]
    pub product_url: String,
    #[validate(custom = "validate_phone")]
    pub phone: Option<String>,
    #[validate(length(min = 10, max = 2000))]
    pub description: String,
}

fn validate_phone(phone: &str) -> Result<(), ValidationError> {
    let phone_regex = Regex::new(r"^\+?[\d\s\-\(\)]{10,15}$").unwrap();
    if phone_regex.is_match(phone) {
        Ok(())
    } else {
        Err(ValidationError::new("invalid_phone_format"))
    }
}
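With the struct in place, a single validate() call from the Validate trait checks every annotated field at once. A quick sketch with made-up field values:

fn check_product() {
    let product = ScrapedProduct {
        name: "Example Widget".to_string(),
        price: 19.99,
        contact_email: Some("sales@example.com".to_string()),
        product_url: "https://example.com/widget".to_string(),
        phone: Some("+1 555-123-4567".to_string()),
        description: "A sturdy example widget for demonstration purposes.".to_string(),
    };
    // Collects every failing rule rather than stopping at the first one
    match product.validate() {
        Ok(()) => println!("product passed validation"),
        Err(errors) => eprintln!("validation errors: {:?}", errors),
    }
}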
HTML Content Sanitization
Use the ammonia crate to sanitize HTML content and prevent XSS attacks:
use ammonia::Builder;
use scraper::Html;
use std::collections::{HashMap, HashSet};

pub struct HtmlSanitizer {
    cleaner: Builder<'static>,
}

impl HtmlSanitizer {
    pub fn new() -> Self {
        let mut cleaner = Builder::default();
        // Allowlist approach: only these tags survive; ammonia strips
        // everything else, including <script> elements and their contents
        cleaner
            .tags(HashSet::from(["p", "br", "strong", "em", "ul", "ol", "li"]))
            .tag_attributes(HashMap::new())
            .url_schemes(HashSet::from(["https"]))
            .link_rel(Some("noopener noreferrer"));
        Self { cleaner }
    }

    pub fn sanitize_html(&self, html: &str) -> String {
        self.cleaner.clean(html).to_string()
    }

    pub fn extract_clean_text(&self, html: &str) -> String {
        // Collect text once from the root element; selecting "*" would visit
        // every element and duplicate text from nested nodes
        let document = Html::parse_document(html);
        let text = document.root_element().text().collect::<Vec<_>>().join(" ");
        self.sanitize_text(&text)
    }

    pub fn sanitize_text(&self, text: &str) -> String {
        text.trim()
            .chars()
            .filter(|c| c.is_alphanumeric() || c.is_whitespace() || ".,!?-()[]{}".contains(*c))
            .collect::<String>()
            .split_whitespace()
            .collect::<Vec<_>>()
            .join(" ")
    }
}
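A quick usage sketch (the HTML input is a made-up example):

fn demo_sanitizer() {
    let sanitizer = HtmlSanitizer::new();
    let html = r#"<p onclick="steal()">Great <script>alert(1)</script>product</p>"#;
    // The onclick handler and the <script> element are stripped
    let safe = sanitizer.sanitize_html(html);
    assert!(!safe.contains("onclick") && !safe.contains("script"));
    // Reduce the sanitized markup to normalized plain text
    assert_eq!(sanitizer.extract_clean_text(&safe), "Great product");
}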
Advanced Data Validation Patterns
Implement comprehensive validation for different data types commonly encountered in web scraping:
use chrono::NaiveDate;
use regex::Regex;

pub struct DataValidator {
    email_regex: Regex,
    url_regex: Regex,
    phone_regex: Regex,
    price_regex: Regex,
}

impl DataValidator {
    pub fn new() -> Self {
        Self {
            email_regex: Regex::new(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$").unwrap(),
            url_regex: Regex::new(r"^https?://[^\s/$.?#].[^\s]*$").unwrap(),
            phone_regex: Regex::new(r"^\+?[\d\s\-\(\)]{10,15}$").unwrap(),
            price_regex: Regex::new(r"^\$?(\d{1,3}(,\d{3})*|\d+)(\.\d{2})?$").unwrap(),
        }
    }

    pub fn validate_email(&self, email: &str) -> Result<String, String> {
        let cleaned = email.trim().to_lowercase();
        if self.email_regex.is_match(&cleaned) {
            Ok(cleaned)
        } else {
            Err("Invalid email format".to_string())
        }
    }

    pub fn validate_url(&self, url: &str) -> Result<String, String> {
        let cleaned = url.trim();
        if self.url_regex.is_match(cleaned) {
            Ok(cleaned.to_string())
        } else {
            Err("Invalid URL format".to_string())
        }
    }

    pub fn validate_phone(&self, phone: &str) -> Result<String, String> {
        let cleaned = phone.trim().to_string();
        if self.phone_regex.is_match(&cleaned) {
            Ok(cleaned)
        } else {
            Err("Invalid phone format".to_string())
        }
    }

    pub fn validate_price(&self, price_str: &str) -> Result<f64, String> {
        // Match against the trimmed input, then strip currency formatting before parsing
        let trimmed = price_str.trim();
        if !self.price_regex.is_match(trimmed) {
            return Err("Invalid price format".to_string());
        }
        trimmed
            .replace('$', "")
            .replace(',', "")
            .parse::<f64>()
            .map_err(|_| "Failed to parse price".to_string())
    }

    pub fn validate_date(&self, date_str: &str) -> Result<NaiveDate, String> {
        // Try multiple date formats, most common first
        let formats = vec![
            "%Y-%m-%d",
            "%m/%d/%Y",
            "%d/%m/%Y",
            "%B %d, %Y",
            "%d %B %Y",
        ];
        for format in formats {
            if let Ok(date) = NaiveDate::parse_from_str(date_str.trim(), format) {
                return Ok(date);
            }
        }
        Err("Invalid date format".to_string())
    }

    pub fn normalize_text(&self, text: &str) -> String {
        text.trim()
            .split_whitespace()
            .collect::<Vec<_>>()
            .join(" ")
            .chars()
            .filter(|c| !c.is_control())
            .collect()
    }
}
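Typical usage against the kind of messy strings scrapers encounter (all inputs are made-up examples):

fn demo_validator() {
    let v = DataValidator::new();
    // Email addresses are trimmed and lowercased before matching
    assert_eq!(v.validate_email("  Sales@Example.COM ").unwrap(), "sales@example.com");
    // Currency symbols and thousands separators are stripped before parsing
    assert_eq!(v.validate_price(" $1,299.99 ").unwrap(), 1299.99);
    assert!(v.validate_url("not a url").is_err());
    assert!(v.validate_phone("+1 555-123-4567").is_ok());
    // Matches the "%B %d, %Y" format
    assert!(v.validate_date("March 15, 2024").is_ok());
    assert_eq!(v.normalize_text("  too   many\n spaces "), "too many spaces");
}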
Complete Web Scraping Implementation
Here's a comprehensive example that combines scraping with validation and sanitization:
use reqwest::Client;
use scraper::{Html, Selector};
use std::error::Error;
use validator::Validate;

pub struct WebScraperValidator {
    client: Client,
    validator: DataValidator,
    sanitizer: HtmlSanitizer,
}

impl WebScraperValidator {
    pub fn new() -> Self {
        Self {
            client: Client::new(),
            validator: DataValidator::new(),
            sanitizer: HtmlSanitizer::new(),
        }
    }

    pub async fn scrape_and_validate_product(
        &self,
        url: &str,
    ) -> Result<ScrapedProduct, Box<dyn Error>> {
        // Fetch the page
        let response = self.client.get(url).send().await?;
        let html = response.text().await?;
        let document = Html::parse_document(&html);

        // Extract data with selectors
        let name = self.extract_and_validate_name(&document)?;
        let price = self.extract_and_validate_price(&document)?;
        let description = self.extract_and_validate_description(&document)?;
        let contact_email = self.extract_and_validate_email(&document);
        let phone = self.extract_and_validate_phone(&document);

        let product = ScrapedProduct {
            name,
            price,
            contact_email,
            product_url: self.validator.validate_url(url)?,
            phone,
            description,
        };

        // Validate the entire struct
        product
            .validate()
            .map_err(|e| format!("Validation failed: {:?}", e))?;

        Ok(product)
    }

    fn extract_and_validate_name(&self, document: &Html) -> Result<String, Box<dyn Error>> {
        let name_selector = Selector::parse("h1, .product-title, [data-testid='product-name']")?;
        let raw_name = document
            .select(&name_selector)
            .next()
            .ok_or("Product name not found")?
            .text()
            .collect::<String>();
        let sanitized_name = self.validator.normalize_text(&raw_name);
        if sanitized_name.is_empty() || sanitized_name.len() > 255 {
            return Err("Product name length is invalid".into());
        }
        Ok(sanitized_name)
    }

    fn extract_and_validate_price(&self, document: &Html) -> Result<f64, Box<dyn Error>> {
        let price_selector = Selector::parse(".price, .cost, [data-testid='price']")?;
        let raw_price = document
            .select(&price_selector)
            .next()
            .ok_or("Price not found")?
            .text()
            .collect::<String>();
        self.validator.validate_price(&raw_price).map_err(|e| e.into())
    }

    fn extract_and_validate_description(&self, document: &Html) -> Result<String, Box<dyn Error>> {
        let desc_selector = Selector::parse(".description, .product-description, .details")?;
        let raw_description = document
            .select(&desc_selector)
            .next()
            .ok_or("Description not found")?
            .inner_html();
        // Sanitize the markup first, then reduce it to plain text
        let sanitized_desc = self.sanitizer.sanitize_html(&raw_description);
        let clean_text = self.sanitizer.extract_clean_text(&sanitized_desc);
        if clean_text.len() < 10 || clean_text.len() > 2000 {
            return Err("Description length is invalid".into());
        }
        Ok(clean_text)
    }

    fn extract_and_validate_email(&self, document: &Html) -> Option<String> {
        let email_selector = Selector::parse("a[href^='mailto:']").ok()?;
        document
            .select(&email_selector)
            .next()?
            .value()
            .attr("href")?
            .strip_prefix("mailto:")
            .and_then(|email| self.validator.validate_email(email).ok())
    }

    fn extract_and_validate_phone(&self, document: &Html) -> Option<String> {
        let phone_selector = Selector::parse(".phone, .contact-phone, a[href^='tel:']").ok()?;
        let raw_phone = document
            .select(&phone_selector)
            .next()?
            .text()
            .collect::<String>();
        self.validator.validate_phone(&raw_phone).ok()
    }
}
Error Handling and Logging
Implement comprehensive error handling for validation failures:
use std::error::Error;
use std::fmt;

// A dedicated error type; the distinct name avoids clashing with
// validator::ValidationError, which is already in scope above
#[derive(Debug)]
pub enum ScraperValidationError {
    InvalidFormat(String),
    OutOfRange(String),
    MissingRequired(String),
    SecurityViolation(String),
}

impl fmt::Display for ScraperValidationError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            ScraperValidationError::InvalidFormat(msg) => write!(f, "Invalid format: {}", msg),
            ScraperValidationError::OutOfRange(msg) => write!(f, "Out of range: {}", msg),
            ScraperValidationError::MissingRequired(msg) => write!(f, "Missing required field: {}", msg),
            ScraperValidationError::SecurityViolation(msg) => write!(f, "Security violation: {}", msg),
        }
    }
}

impl Error for ScraperValidationError {}

// Usage in main function
#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let scraper = WebScraperValidator::new();

    match scraper.scrape_and_validate_product("https://example.com/product/123").await {
        Ok(product) => {
            println!("Successfully scraped and validated: {:#?}", product);
        }
        Err(e) => {
            eprintln!("Scraping failed: {}", e);
        }
    }

    Ok(())
}
Security Best Practices
When implementing data validation and sanitization for web scraping:
- Always sanitize HTML content to prevent XSS attacks
- Validate all input data against expected patterns and ranges
- Use type-safe parsing whenever possible
- Implement rate limiting to avoid overwhelming target servers
- Log validation failures for debugging and monitoring
Similar to how you might handle timeouts in Puppeteer for JavaScript-based scraping, Rust web scrapers should implement robust timeout and error handling mechanisms to ensure data integrity.
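As a minimal sketch of those last points, reqwest's client builder accepts a request timeout, and a simple delay between requests gives you naive rate limiting; the 10-second timeout and 1-second pause below are arbitrary example values:

use std::time::Duration;
use reqwest::Client;

fn build_client() -> Result<Client, reqwest::Error> {
    // Fail fast instead of hanging on slow or unresponsive pages
    Client::builder().timeout(Duration::from_secs(10)).build()
}

async fn polite_fetch(client: &Client, urls: &[&str]) -> Vec<String> {
    let mut pages = Vec::new();
    for url in urls {
        match client.get(*url).send().await.and_then(|r| r.error_for_status()) {
            Ok(response) => {
                if let Ok(body) = response.text().await {
                    pages.push(body);
                }
            }
            // Log failures instead of aborting the whole crawl
            Err(e) => eprintln!("fetch failed for {}: {}", url, e),
        }
        // Naive rate limiting: pause between requests
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
    pages
}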
Testing Validation Logic
Unit tests confirm that the validators accept good data and reject bad data:
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_email_validation() {
        let validator = DataValidator::new();
        assert!(validator.validate_email("test@example.com").is_ok());
        assert!(validator.validate_email("invalid-email").is_err());
        assert!(validator.validate_email("").is_err());
    }

    #[test]
    fn test_price_validation() {
        let validator = DataValidator::new();
        assert_eq!(validator.validate_price("$123.45").unwrap(), 123.45);
        assert_eq!(validator.validate_price("1,234.56").unwrap(), 1234.56);
        assert!(validator.validate_price("invalid").is_err());
    }

    #[test]
    fn test_html_sanitization() {
        let sanitizer = HtmlSanitizer::new();
        let malicious_html = r#"<script>alert('xss')</script><p>Safe content</p>"#;
        let sanitized = sanitizer.sanitize_html(malicious_html);
        assert!(!sanitized.contains("<script>"));
        assert!(sanitized.contains("Safe content"));
    }
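
    // An additional (hypothetical) test covering the multi-format date parser
    #[test]
    fn test_date_validation() {
        let validator = DataValidator::new();
        assert!(validator.validate_date("2024-03-15").is_ok());
        assert!(validator.validate_date("03/15/2024").is_ok());
        assert!(validator.validate_date("not a date").is_err());
    }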
}
Conclusion
Implementing robust data validation and sanitization in Rust web scraping requires a multi-layered approach combining type safety, regex validation, HTML sanitization, and comprehensive error handling. By leveraging Rust's powerful type system and validation libraries, you can build secure and reliable web scrapers that produce clean, validated data.
The techniques covered in this guide provide a solid foundation for handling data validation challenges in production web scraping applications. Remember to always validate data at multiple levels and sanitize any content that might pose security risks.
Just as handling authentication in Puppeteer requires careful implementation for browser-based scraping, Rust web scrapers need equally careful attention to data validation and security practices to ensure reliable operation.