How do I handle pagination when scraping multiple pages with Rust?
Handling pagination is a crucial skill when scraping websites that split their content across multiple pages. Rust offers excellent tools for efficient pagination handling through asynchronous programming and robust HTTP clients. This guide covers various pagination patterns and implementation strategies using popular Rust crates.
Understanding Common Pagination Patterns
Before diving into implementation, it's important to recognize the different types of pagination you'll encounter:
- Numbered pagination - Pages with explicit page numbers (1, 2, 3...)
- Next/Previous buttons - Sequential navigation links
- Offset-based pagination - Using URL parameters like ?offset=20&limit=10
- Cursor-based pagination - Using tokens or IDs for the next page
- Infinite scroll - Dynamic content loading via AJAX requests
Setting Up Your Rust Environment
First, add the necessary dependencies to your Cargo.toml:
[dependencies]
reqwest = { version = "0.11", features = ["json"] }
tokio = { version = "1.0", features = ["full"] }
scraper = "0.17"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
url = "2.4"
futures = "0.3"
Basic Pagination Structure
Here's a foundational structure for handling pagination in Rust:
use reqwest::Client;
use scraper::{Html, Selector};
use std::error::Error;
use tokio::time::{sleep, Duration};
#[derive(Debug)]
pub struct PaginationScraper {
client: Client,
base_url: String,
delay: Duration,
}
impl PaginationScraper {
pub fn new(base_url: String, delay_ms: u64) -> Self {
let client = Client::builder()
.user_agent("Mozilla/5.0 (compatible; RustScraper/1.0)")
.timeout(Duration::from_secs(30))
.build()
.expect("Failed to create HTTP client");
Self {
client,
base_url,
delay: Duration::from_millis(delay_ms),
}
}
pub async fn scrape_paginated_content(&self) -> Result<Vec<String>, Box<dyn Error>> {
let mut all_data = Vec::new();
let mut page = 1;
loop {
println!("Scraping page {}", page);
let url = format!("{}?page={}", self.base_url, page);
let response = self.client.get(&url).send().await?;
if !response.status().is_success() {
break;
}
let html = response.text().await?;
let document = Html::parse_document(&html);
let data = self.extract_page_data(&document);
if data.is_empty() {
break; // No more data, we've reached the end
}
all_data.extend(data);
page += 1;
// Respectful delay between requests
sleep(self.delay).await;
}
Ok(all_data)
}
fn extract_page_data(&self, document: &Html) -> Vec<String> {
let selector = Selector::parse(".item").unwrap();
document
.select(&selector)
.map(|element| element.text().collect::<String>())
.collect()
}
}
Handling Different Pagination Types
1. Numbered Pagination with Maximum Pages
When you know the total number of pages or can detect the last page:
impl PaginationScraper {
pub async fn scrape_with_max_pages(&self, max_pages: usize) -> Result<Vec<String>, Box<dyn Error>> {
let mut all_data = Vec::new();
for page in 1..=max_pages {
let url = format!("{}?page={}", self.base_url, page);
match self.fetch_page_data(&url).await {
Ok(data) => {
if data.is_empty() {
break; // Early termination if no data
}
all_data.extend(data);
}
Err(e) => {
eprintln!("Error fetching page {}: {}", page, e);
continue;
}
}
sleep(self.delay).await;
}
Ok(all_data)
}
async fn fetch_page_data(&self, url: &str) -> Result<Vec<String>, Box<dyn Error>> {
let response = self.client.get(url).send().await?;
let html = response.text().await?;
let document = Html::parse_document(&html);
Ok(self.extract_page_data(&document))
}
}
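If the site advertises its total page count in the pagination widget, you can detect the last page up front instead of hard-coding max_pages. The sketch below assumes numbered links live under a .pagination a selector (a placeholder; adjust it to the markup you're actually targeting) and reuses scrape_with_max_pages from above:
impl PaginationScraper {
    /// Parse the highest page number advertised by the pagination widget.
    fn detect_last_page(&self, document: &Html) -> Option<usize> {
        let selector = Selector::parse(".pagination a").ok()?;
        document
            .select(&selector)
            .filter_map(|link| link.text().collect::<String>().trim().parse::<usize>().ok())
            .max()
    }

    pub async fn scrape_detected_pages(&self) -> Result<Vec<String>, Box<dyn Error>> {
        // Fetch the first page once just to discover how many pages exist.
        let first_url = format!("{}?page=1", self.base_url);
        let html = self.client.get(&first_url).send().await?.text().await?;
        let last_page = {
            let document = Html::parse_document(&html);
            self.detect_last_page(&document).unwrap_or(1)
        };
        // Delegate the actual scraping (including page 1 again) to the bounded loop above.
        self.scrape_with_max_pages(last_page).await
    }
}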
2. Next Button Navigation
For pagination that relies on "Next" buttons or links:
impl PaginationScraper {
pub async fn scrape_with_next_links(&self) -> Result<Vec<String>, Box<dyn Error>> {
let mut all_data = Vec::new();
let mut current_url = self.base_url.clone();
loop {
let response = self.client.get(&current_url).send().await?;
let html = response.text().await?;
let document = Html::parse_document(&html);
// Extract data from current page
let page_data = self.extract_page_data(&document);
if page_data.is_empty() {
break;
}
all_data.extend(page_data);
// Find next page URL
if let Some(next_url) = self.find_next_page_url(&document, &current_url)? {
current_url = next_url;
sleep(self.delay).await;
} else {
break; // No more pages
}
}
Ok(all_data)
}
fn find_next_page_url(&self, document: &Html, base_url: &str) -> Result<Option<String>, Box<dyn Error>> {
let next_selector = Selector::parse("a[rel='next'], .next, .pagination-next").unwrap();
if let Some(next_element) = document.select(&next_selector).next() {
if let Some(href) = next_element.value().attr("href") {
let url = url::Url::parse(base_url)?;
let next_url = url.join(href)?;
return Ok(Some(next_url.to_string()));
}
}
Ok(None)
}
}
3. Offset-Based Pagination
For APIs or sites using offset/limit parameters:
impl PaginationScraper {
pub async fn scrape_with_offset(&self, limit: usize) -> Result<Vec<String>, Box<dyn Error>> {
let mut all_data = Vec::new();
let mut offset = 0;
loop {
let url = format!("{}?offset={}&limit={}", self.base_url, offset, limit);
let page_data = self.fetch_page_data(&url).await?;
if page_data.is_empty() || page_data.len() < limit {
all_data.extend(page_data);
break; // Last page or no more data
}
all_data.extend(page_data);
offset += limit;
sleep(self.delay).await;
}
Ok(all_data)
}
}
Advanced Concurrent Pagination
For better performance, you can process multiple pages concurrently while respecting rate limits:
use futures::stream::{self, StreamExt};
use std::sync::Arc;
use tokio::sync::Semaphore;
impl PaginationScraper {
pub async fn scrape_concurrent_pages(&self, max_pages: usize, concurrency: usize) -> Result<Vec<String>, Box<dyn Error>> {
let semaphore = Arc::new(Semaphore::new(concurrency));
let page_numbers: Vec<usize> = (1..=max_pages).collect();
let results = stream::iter(page_numbers)
.map(|page| {
let client = self.client.clone();
let base_url = self.base_url.clone();
let delay = self.delay;
let semaphore = semaphore.clone();
async move {
let _permit = semaphore.acquire().await.unwrap();
let url = format!("{}?page={}", base_url, page);
sleep(delay).await; // Rate limiting
match client.get(&url).send().await {
Ok(response) => {
match response.text().await {
Ok(html) => {
let document = Html::parse_document(&html);
let selector = Selector::parse(".item").unwrap();
let data: Vec<String> = document
.select(&selector)
.map(|element| element.text().collect::<String>())
.collect();
Ok((page, data))
}
Err(e) => Err(format!("Failed to read response for page {}: {}", page, e))
}
}
Err(e) => Err(format!("Failed to fetch page {}: {}", page, e))
}
}
})
.buffer_unordered(concurrency)
.collect::<Vec<_>>()
.await;
let mut all_data = Vec::new();
for result in results {
match result {
Ok((page, data)) => {
println!("Successfully scraped page {}", page);
all_data.extend(data);
}
Err(e) => eprintln!("Error: {}", e),
}
}
Ok(all_data)
}
}
Handling Dynamic Content and AJAX Pagination
For sites that load content dynamically, you might need to interact with JavaScript-rendered content. While Rust has no native browser automation tool comparable to Puppeteer, you can drive headless Chrome through the chromiumoxide crate:
// Add to Cargo.toml:
// chromiumoxide = "0.5"
use chromiumoxide::{Browser, BrowserConfig};
use futures::StreamExt; // required to poll the browser's event handler stream
pub struct DynamicPaginationScraper {
browser: Browser,
}
impl DynamicPaginationScraper {
pub async fn new() -> Result<Self, Box<dyn Error>> {
let (browser, mut handler) = Browser::launch(BrowserConfig::builder().build()?).await?;
// Spawn the handler
tokio::spawn(async move {
while let Some(h) = handler.next().await {
if h.is_err() {
break;
}
}
});
Ok(Self { browser })
}
pub async fn scrape_dynamic_pagination(&self, base_url: &str) -> Result<Vec<String>, Box<dyn Error>> {
let page = self.browser.new_page("about:blank").await?;
page.goto(base_url).await?;
let mut all_data = Vec::new();
loop {
// Wait for content to load
page.wait_for_selector(".item").await?;
// Extract data
let items = page.evaluate("Array.from(document.querySelectorAll('.item')).map(el => el.textContent)").await?;
let page_data: Vec<String> = items.into_value()?;
if page_data.is_empty() {
break;
}
all_data.extend(page_data);
// Try to click next button
let next_button_exists = page.evaluate("document.querySelector('.next-button, .load-more') !== null").await?;
let has_next: bool = next_button_exists.into_value()?;
if !has_next {
break;
}
page.click(".next-button, .load-more").await?;
// Wait for new content
tokio::time::sleep(Duration::from_millis(2000)).await;
}
Ok(all_data)
}
}
Error Handling and Resilience
Robust pagination scraping requires proper error handling:
use std::time::Duration;
use tokio::time::sleep;
impl PaginationScraper {
pub async fn scrape_with_retry(&self, max_retries: usize) -> Result<Vec<String>, Box<dyn Error>> {
let mut all_data = Vec::new();
let mut page = 1;
loop {
let mut retries = 0;
let url = format!("{}?page={}", self.base_url, page);
loop {
match self.fetch_page_with_timeout(&url).await {
Ok(data) => {
if data.is_empty() {
return Ok(all_data); // End of pagination
}
all_data.extend(data);
break;
}
Err(e) => {
retries += 1;
if retries > max_retries {
eprintln!("Failed to fetch page {} after {} retries: {}", page, max_retries, e);
return Ok(all_data); // Return what we have so far
}
let backoff_duration = Duration::from_millis(1000 * 2_u64.pow(retries as u32));
eprintln!("Retry {} for page {} after {:?}", retries, page, backoff_duration);
sleep(backoff_duration).await;
}
}
}
page += 1;
sleep(self.delay).await;
}
}
async fn fetch_page_with_timeout(&self, url: &str) -> Result<Vec<String>, Box<dyn Error>> {
let response = tokio::time::timeout(
Duration::from_secs(30),
self.client.get(url).send()
).await??;
if !response.status().is_success() {
return Err(format!("HTTP error: {}", response.status()).into());
}
let html = response.text().await?;
let document = Html::parse_document(&html);
Ok(self.extract_page_data(&document))
}
}
Complete Working Example
Here's a complete example that ties everything together:
use reqwest::Client;
use scraper::{Html, Selector};
use std::error::Error;
use tokio::time::{sleep, Duration};
#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
let scraper = PaginationScraper::new(
"https://example.com/products".to_string(),
1000 // 1 second delay
);
println!("Starting pagination scraping...");
let all_data = scraper.scrape_with_retry(3).await?;
println!("Scraped {} items across all pages", all_data.len());
// Process your data
for (index, item) in all_data.iter().enumerate() {
println!("Item {}: {}", index + 1, item);
}
Ok(())
}
Working with APIs and JSON Responses
When scraping APIs that return JSON data with pagination:
use serde::{Deserialize, Serialize};
#[derive(Debug, Deserialize)]
struct ApiResponse {
data: Vec<Item>,
pagination: PaginationInfo,
}
#[derive(Debug, Deserialize)]
struct Item {
id: u64,
title: String,
description: Option<String>,
}
#[derive(Debug, Deserialize)]
struct PaginationInfo {
current_page: u32,
last_page: u32,
next_page_url: Option<String>,
}
impl PaginationScraper {
pub async fn scrape_json_api(&self) -> Result<Vec<Item>, Box<dyn Error>> {
let mut all_items = Vec::new();
let mut current_url = Some(self.base_url.clone());
while let Some(url) = current_url {
let response = self.client.get(&url).send().await?;
let api_response: ApiResponse = response.json().await?;
all_items.extend(api_response.data);
current_url = api_response.pagination.next_page_url;
if current_url.is_some() {
sleep(self.delay).await;
}
}
Ok(all_items)
}
}
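Cursor-based pagination (mentioned earlier) works much the same way, except the next request is built from a token returned in the previous response rather than a full URL. The following sketch assumes a hypothetical next_cursor field; real APIs name it differently (after, page_token, and so on), so adapt the struct to the actual response:
#[derive(Debug, Deserialize)]
struct CursorResponse {
    data: Vec<Item>,
    // Hypothetical field name; check the API documentation for the real one.
    next_cursor: Option<String>,
}

impl PaginationScraper {
    pub async fn scrape_cursor_api(&self) -> Result<Vec<Item>, Box<dyn Error>> {
        let mut all_items = Vec::new();
        let mut cursor: Option<String> = None;
        loop {
            // Only append the cursor parameter after the first request.
            let url = match &cursor {
                Some(c) => format!("{}?cursor={}", self.base_url, c),
                None => self.base_url.clone(),
            };
            let response: CursorResponse = self.client.get(&url).send().await?.json().await?;
            all_items.extend(response.data);
            match response.next_cursor {
                Some(next) => {
                    cursor = Some(next);
                    sleep(self.delay).await;
                }
                None => break, // No cursor returned means we've consumed the last page
            }
        }
        Ok(all_items)
    }
}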
Best Practices and Performance Tips
- Respect Rate Limits: Always implement delays between requests to avoid overwhelming the server
- Handle Errors Gracefully: Implement retry logic with exponential backoff
- Use Connection Pooling: The reqwest::Client automatically handles connection reuse
- Monitor Memory Usage: For large datasets, consider processing pages in batches
- Implement Caching: Store previously scraped pages to avoid re-scraping during development (see the sketch after this list)
- Follow robots.txt: Check the website's robots.txt file for scraping guidelines
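For the caching point above, a simple development-time approach is to write each fetched page to disk keyed by a hash of its URL and read it back on later runs. This is a rough sketch (no expiry or validation), not production-grade caching:
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;

impl PaginationScraper {
    /// Fetch a URL, reading from ./cache when we've already seen it (dev-time convenience).
    async fn fetch_cached(&self, url: &str) -> Result<String, Box<dyn Error>> {
        let mut hasher = DefaultHasher::new();
        url.hash(&mut hasher);
        let cache_path = PathBuf::from("cache").join(format!("{:x}.html", hasher.finish()));

        if let Ok(cached) = tokio::fs::read_to_string(&cache_path).await {
            return Ok(cached); // Cache hit: skip the network entirely
        }

        let html = self.client.get(url).send().await?.text().await?;
        tokio::fs::create_dir_all("cache").await?;
        tokio::fs::write(&cache_path, &html).await?;
        Ok(html)
    }
}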
Debugging and Monitoring
Add logging to track your scraping progress:
use log::{info, warn, error};
impl PaginationScraper {
pub async fn scrape_with_logging(&self) -> Result<Vec<String>, Box<dyn Error>> {
let mut all_data = Vec::new();
let mut page = 1;
info!("Starting pagination scraping from: {}", self.base_url);
loop {
let url = format!("{}?page={}", self.base_url, page);
info!("Fetching page {}: {}", page, url);
match self.fetch_page_data(&url).await {
Ok(data) => {
if data.is_empty() {
info!("No more data found on page {}, stopping", page);
break;
}
info!("Successfully scraped {} items from page {}", data.len(), page);
all_data.extend(data);
}
Err(e) => {
error!("Failed to fetch page {}: {}", page, e);
break;
}
}
page += 1;
sleep(self.delay).await;
}
info!("Scraping completed. Total items: {}", all_data.len());
Ok(all_data)
}
}
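Note that the log macros only emit output once a logger backend is initialized. A common choice is env_logger (an extra dependency beyond the Cargo.toml shown earlier; the version numbers below are assumptions), initialized once at startup:
// Assumed additions to Cargo.toml:
// log = "0.4"
// env_logger = "0.10"

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Respects RUST_LOG, e.g. run with: RUST_LOG=info cargo run
    env_logger::init();

    let scraper = PaginationScraper::new("https://example.com/products".to_string(), 1000);
    let items = scraper.scrape_with_logging().await?;
    log::info!("Done: {} items", items.len());
    Ok(())
}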
Handling Complex Pagination Scenarios
Infinite Scroll with Load More Buttons
Some sites use "Load More" buttons that trigger AJAX requests. For these, you can reuse the headless-browser approach from above: keep clicking the button and compare the rendered item count between clicks to know when the content is exhausted:
impl DynamicPaginationScraper {
pub async fn scrape_infinite_scroll(&self, base_url: &str) -> Result<Vec<String>, Box<dyn Error>> {
let page = self.browser.new_page("about:blank").await?;
page.goto(base_url).await?;
let mut all_data = Vec::new();
let mut previous_count = 0;
loop {
// Wait for items to load
page.wait_for_selector(".item").await?;
// Count current items
let current_count: usize = page.evaluate("document.querySelectorAll('.item').length").await?.into_value()?;
if current_count == previous_count {
// No new items loaded, we're done
break;
}
// Extract new items only
let items_script = format!(
"Array.from(document.querySelectorAll('.item')).slice({}).map(el => el.textContent)",
previous_count
);
let new_items: Vec<String> = page.evaluate(&items_script).await?.into_value()?;
all_data.extend(new_items);
previous_count = current_count;
// Try to load more
if page.evaluate("document.querySelector('.load-more') !== null").await?.into_value()? {
page.click(".load-more").await?;
tokio::time::sleep(Duration::from_millis(2000)).await;
} else {
break;
}
}
Ok(all_data)
}
}
Conclusion
Rust provides excellent tools for handling pagination in web scraping projects. The combination of reqwest for HTTP requests, scraper for HTML parsing, and tokio for asynchronous programming creates a powerful foundation for efficient pagination handling. Whether dealing with simple numbered pages or complex dynamic content, the patterns shown in this guide will help you build robust and performant scraping solutions.
Remember to always scrape responsibly, respect website terms of service, and implement appropriate delays and error handling to maintain good relationships with the sites you're scraping. With Rust's memory safety and performance characteristics, you can build scalable scraping solutions that handle large amounts of paginated data efficiently.