How do I handle pagination when scraping multiple pages with Rust?
Handling pagination is a crucial skill when scraping websites that split their content across multiple pages. Rust offers excellent tools for efficient pagination handling through asynchronous programming and robust HTTP clients. This guide covers various pagination patterns and implementation strategies using popular Rust crates.
Understanding Common Pagination Patterns
Before diving into implementation, it's important to recognize the different types of pagination you'll encounter:
- Numbered pagination - Pages with explicit page numbers (1, 2, 3...)
- Next/Previous buttons - Sequential navigation links
- Offset-based pagination - Using URL parameters like ?offset=20&limit=10
- Cursor-based pagination - Using tokens or IDs for the next page
- Infinite scroll - Dynamic content loading via AJAX requests
Setting Up Your Rust Environment
First, add the necessary dependencies to your Cargo.toml:
[dependencies]
reqwest = { version = "0.11", features = ["json"] }
tokio = { version = "1.0", features = ["full"] }
scraper = "0.17"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
url = "2.4"
futures = "0.3"
Basic Pagination Structure
Here's a foundational structure for handling pagination in Rust:
use reqwest::Client;
use scraper::{Html, Selector};
use std::error::Error;
use tokio::time::{sleep, Duration};
#[derive(Debug)]
pub struct PaginationScraper {
client: Client,
base_url: String,
delay: Duration,
}
impl PaginationScraper {
pub fn new(base_url: String, delay_ms: u64) -> Self {
let client = Client::builder()
.user_agent("Mozilla/5.0 (compatible; RustScraper/1.0)")
.timeout(Duration::from_secs(30))
.build()
.expect("Failed to create HTTP client");
Self {
client,
base_url,
delay: Duration::from_millis(delay_ms),
}
}
pub async fn scrape_paginated_content(&self) -> Result<Vec<String>, Box<dyn Error>> {
let mut all_data = Vec::new();
let mut page = 1;
loop {
println!("Scraping page {}", page);
let url = format!("{}?page={}", self.base_url, page);
let response = self.client.get(&url).send().await?;
if !response.status().is_success() {
break;
}
let html = response.text().await?;
let document = Html::parse_document(&html);
let data = self.extract_page_data(&document);
if data.is_empty() {
break; // No more data, we've reached the end
}
all_data.extend(data);
page += 1;
// Respectful delay between requests
sleep(self.delay).await;
}
Ok(all_data)
}
fn extract_page_data(&self, document: &Html) -> Vec<String> {
let selector = Selector::parse(".item").unwrap();
document
.select(&selector)
.map(|element| element.text().collect::<String>())
.collect()
}
}
Handling Different Pagination Types
1. Numbered Pagination with Maximum Pages
When you know the total number of pages or can detect the last page:
impl PaginationScraper {
pub async fn scrape_with_max_pages(&self, max_pages: usize) -> Result<Vec<String>, Box<dyn Error>> {
let mut all_data = Vec::new();
for page in 1..=max_pages {
let url = format!("{}?page={}", self.base_url, page);
match self.fetch_page_data(&url).await {
Ok(data) => {
if data.is_empty() {
break; // Early termination if no data
}
all_data.extend(data);
}
Err(e) => {
eprintln!("Error fetching page {}: {}", page, e);
continue;
}
}
sleep(self.delay).await;
}
Ok(all_data)
}
async fn fetch_page_data(&self, url: &str) -> Result<Vec<String>, Box<dyn Error>> {
let response = self.client.get(url).send().await?;
let html = response.text().await?;
let document = Html::parse_document(&html);
Ok(self.extract_page_data(&document))
}
}
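If the site advertises its total page count in the pagination widget, you can detect the last page up front instead of hard-coding max_pages. The sketch below assumes numbered links live under a .pagination a selector (a placeholder; adjust it to the markup you're actually targeting) and reuses scrape_with_max_pages from above:
impl PaginationScraper {
    /// Parse the highest page number advertised by the pagination widget.
    fn detect_last_page(&self, document: &Html) -> Option<usize> {
        let selector = Selector::parse(".pagination a").ok()?;
        document
            .select(&selector)
            .filter_map(|link| link.text().collect::<String>().trim().parse::<usize>().ok())
            .max()
    }

    pub async fn scrape_detected_pages(&self) -> Result<Vec<String>, Box<dyn Error>> {
        // Fetch the first page once just to discover how many pages exist.
        let first_url = format!("{}?page=1", self.base_url);
        let html = self.client.get(&first_url).send().await?.text().await?;
        let last_page = {
            let document = Html::parse_document(&html);
            self.detect_last_page(&document).unwrap_or(1)
        };
        // Delegate the actual scraping (including page 1 again) to the bounded loop above.
        self.scrape_with_max_pages(last_page).await
    }
}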
2. Next Button Navigation
For pagination that relies on "Next" buttons or links:
impl PaginationScraper {
pub async fn scrape_with_next_links(&self) -> Result<Vec<String>, Box<dyn Error>> {
let mut all_data = Vec::new();
let mut current_url = self.base_url.clone();
loop {
let response = self.client.get(&current_url).send().await?;
let html = response.text().await?;
let document = Html::parse_document(&html);
// Extract data from current page
let page_data = self.extract_page_data(&document);
if page_data.is_empty() {
break;
}
all_data.extend(page_data);
// Find next page URL
if let Some(next_url) = self.find_next_page_url(&document, &current_url)? {
current_url = next_url;
sleep(self.delay).await;
} else {
break; // No more pages
}
}
Ok(all_data)
}
fn find_next_page_url(&self, document: &Html, base_url: &str) -> Result<Option<String>, Box<dyn Error>> {
let next_selector = Selector::parse("a[rel='next'], .next, .pagination-next").unwrap();
if let Some(next_element) = document.select(&next_selector).next() {
if let Some(href) = next_element.value().attr("href") {
let url = url::Url::parse(base_url)?;
let next_url = url.join(href)?;
return Ok(Some(next_url.to_string()));
}
}
Ok(None)
}
}
3. Offset-Based Pagination
For APIs or sites using offset/limit parameters:
impl PaginationScraper {
pub async fn scrape_with_offset(&self, limit: usize) -> Result<Vec<String>, Box<dyn Error>> {
let mut all_data = Vec::new();
let mut offset = 0;
loop {
let url = format!("{}?offset={}&limit={}", self.base_url, offset, limit);
let page_data = self.fetch_page_data(&url).await?;
if page_data.is_empty() || page_data.len() < limit {
all_data.extend(page_data);
break; // Last page or no more data
}
all_data.extend(page_data);
offset += limit;
sleep(self.delay).await;
}
Ok(all_data)
}
}
Advanced Concurrent Pagination
For better performance, you can process multiple pages concurrently while respecting rate limits:
use futures::stream::{self, StreamExt};
use std::sync::Arc;
use tokio::sync::Semaphore;
impl PaginationScraper {
pub async fn scrape_concurrent_pages(&self, max_pages: usize, concurrency: usize) -> Result<Vec<String>, Box<dyn Error>> {
let semaphore = Arc::new(Semaphore::new(concurrency));
let page_numbers: Vec<usize> = (1..=max_pages).collect();
let results = stream::iter(page_numbers)
.map(|page| {
let client = self.client.clone();
let base_url = self.base_url.clone();
let delay = self.delay;
let semaphore = semaphore.clone();
async move {
let _permit = semaphore.acquire().await.unwrap();
let url = format!("{}?page={}", base_url, page);
sleep(delay).await; // Rate limiting
match client.get(&url).send().await {
Ok(response) => {
match response.text().await {
Ok(html) => {
let document = Html::parse_document(&html);
let selector = Selector::parse(".item").unwrap();
let data: Vec<String> = document
.select(&selector)
.map(|element| element.text().collect::<String>())
.collect();
Ok((page, data))
}
Err(e) => Err(format!("Failed to read response for page {}: {}", page, e))
}
}
Err(e) => Err(format!("Failed to fetch page {}: {}", page, e))
}
}
})
.buffer_unordered(concurrency)
.collect::<Vec<_>>()
.await;
let mut all_data = Vec::new();
for result in results {
match result {
Ok((page, data)) => {
println!("Successfully scraped page {}", page);
all_data.extend(data);
}
Err(e) => eprintln!("Error: {}", e),
}
}
Ok(all_data)
}
}
Handling Dynamic Content and AJAX Pagination
For sites that load content dynamically, you might need to interact with JavaScript-rendered content. While Rust has no native browser automation tool comparable to Puppeteer, you can drive headless Chrome through the chromiumoxide crate:
// Add to Cargo.toml:
// chromiumoxide = "0.5"
use chromiumoxide::{Browser, BrowserConfig};
use futures::StreamExt; // required to poll the browser's event handler stream
pub struct DynamicPaginationScraper {
browser: Browser,
}
impl DynamicPaginationScraper {
pub async fn new() -> Result<Self, Box<dyn Error>> {
let (browser, mut handler) = Browser::launch(BrowserConfig::builder().build()?).await?;
// Spawn the handler
tokio::spawn(async move {
while let Some(h) = handler.next().await {
if h.is_err() {
break;
}
}
});
Ok(Self { browser })
}
pub async fn scrape_dynamic_pagination(&self, base_url: &str) -> Result<Vec<String>, Box<dyn Error>> {
let page = self.browser.new_page("about:blank").await?;
page.goto(base_url).await?;
let mut all_data = Vec::new();
loop {
// Wait for content to load
page.wait_for_selector(".item").await?;
// Extract data
let items = page.evaluate("Array.from(document.querySelectorAll('.item')).map(el => el.textContent)").await?;
let page_data: Vec<String> = items.into_value()?;
if page_data.is_empty() {
break;
}
all_data.extend(page_data);
// Try to click next button
let next_button_exists = page.evaluate("document.querySelector('.next-button, .load-more') !== null").await?;
let has_next: bool = next_button_exists.into_value()?;
if !has_next {
break;
}
page.click(".next-button, .load-more").await?;
// Wait for new content
tokio::time::sleep(Duration::from_millis(2000)).await;
}
Ok(all_data)
}
}
Error Handling and Resilience
Robust pagination scraping requires proper error handling:
use std::time::Duration;
use tokio::time::sleep;
impl PaginationScraper {
pub async fn scrape_with_retry(&self, max_retries: usize) -> Result<Vec<String>, Box<dyn Error>> {
let mut all_data = Vec::new();
let mut page = 1;
loop {
let mut retries = 0;
let url = format!("{}?page={}", self.base_url, page);
loop {
match self.fetch_page_with_timeout(&url).await {
Ok(data) => {
if data.is_empty() {
return Ok(all_data); // End of pagination
}
all_data.extend(data);
break;
}
Err(e) => {
retries += 1;
if retries > max_retries {
eprintln!("Failed to fetch page {} after {} retries: {}", page, max_retries, e);
return Ok(all_data); // Return what we have so far
}
let backoff_duration = Duration::from_millis(1000 * 2_u64.pow(retries as u32));
eprintln!("Retry {} for page {} after {:?}", retries, page, backoff_duration);
sleep(backoff_duration).await;
}
}
}
page += 1;
sleep(self.delay).await;
}
}
async fn fetch_page_with_timeout(&self, url: &str) -> Result<Vec<String>, Box<dyn Error>> {
let response = tokio::time::timeout(
Duration::from_secs(30),
self.client.get(url).send()
).await??;
if !response.status().is_success() {
return Err(format!("HTTP error: {}", response.status()).into());
}
let html = response.text().await?;
let document = Html::parse_document(&html);
Ok(self.extract_page_data(&document))
}
}
Complete Working Example
Here's a complete example that ties everything together:
use reqwest::Client;
use scraper::{Html, Selector};
use std::error::Error;
use tokio::time::{sleep, Duration};
#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
let scraper = PaginationScraper::new(
"https://example.com/products".to_string(),
1000 // 1 second delay
);
println!("Starting pagination scraping...");
let all_data = scraper.scrape_with_retry(3).await?;
println!("Scraped {} items across all pages", all_data.len());
// Process your data
for (index, item) in all_data.iter().enumerate() {
println!("Item {}: {}", index + 1, item);
}
Ok(())
}
Working with APIs and JSON Responses
When scraping APIs that return JSON data with pagination:
use serde::{Deserialize, Serialize};
#[derive(Debug, Deserialize)]
struct ApiResponse {
data: Vec<Item>,
pagination: PaginationInfo,
}
#[derive(Debug, Deserialize)]
struct Item {
id: u64,
title: String,
description: Option<String>,
}
#[derive(Debug, Deserialize)]
struct PaginationInfo {
current_page: u32,
last_page: u32,
next_page_url: Option<String>,
}
impl PaginationScraper {
pub async fn scrape_json_api(&self) -> Result<Vec<Item>, Box<dyn Error>> {
let mut all_items = Vec::new();
let mut current_url = Some(self.base_url.clone());
while let Some(url) = current_url {
let response = self.client.get(&url).send().await?;
let api_response: ApiResponse = response.json().await?;
all_items.extend(api_response.data);
current_url = api_response.pagination.next_page_url;
if current_url.is_some() {
sleep(self.delay).await;
}
}
Ok(all_items)
}
}
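Cursor-based pagination (mentioned earlier) works much the same way, except the next request is built from a token returned in the previous response rather than a full URL. The following sketch assumes a hypothetical next_cursor field; real APIs name it differently (after, page_token, and so on), so adapt the struct to the actual response:
#[derive(Debug, Deserialize)]
struct CursorResponse {
    data: Vec<Item>,
    // Hypothetical field name; check the API documentation for the real one.
    next_cursor: Option<String>,
}

impl PaginationScraper {
    pub async fn scrape_cursor_api(&self) -> Result<Vec<Item>, Box<dyn Error>> {
        let mut all_items = Vec::new();
        let mut cursor: Option<String> = None;
        loop {
            // Only append the cursor parameter after the first request.
            let url = match &cursor {
                Some(c) => format!("{}?cursor={}", self.base_url, c),
                None => self.base_url.clone(),
            };
            let response: CursorResponse = self.client.get(&url).send().await?.json().await?;
            all_items.extend(response.data);
            match response.next_cursor {
                Some(next) => {
                    cursor = Some(next);
                    sleep(self.delay).await;
                }
                None => break, // No cursor returned means we've consumed the last page
            }
        }
        Ok(all_items)
    }
}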
Best Practices and Performance Tips
- Respect Rate Limits: Always implement delays between requests to avoid overwhelming the server
- Handle Errors Gracefully: Implement retry logic with exponential backoff
- Use Connection Pooling: The reqwest::Client automatically handles connection reuse
- Monitor Memory Usage: For large datasets, consider processing pages in batches
- Implement Caching: Store previously scraped pages to avoid re-scraping during development (see the sketch after this list)
- Follow robots.txt: Check the website's robots.txt file for scraping guidelines
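For the caching point above, a simple development-time approach is to write each fetched page to disk keyed by a hash of its URL and read it back on later runs. This is a rough sketch (no expiry or validation), not production-grade caching:
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;

impl PaginationScraper {
    /// Fetch a URL, reading from ./cache when we've already seen it (dev-time convenience).
    async fn fetch_cached(&self, url: &str) -> Result<String, Box<dyn Error>> {
        let mut hasher = DefaultHasher::new();
        url.hash(&mut hasher);
        let cache_path = PathBuf::from("cache").join(format!("{:x}.html", hasher.finish()));

        if let Ok(cached) = tokio::fs::read_to_string(&cache_path).await {
            return Ok(cached); // Cache hit: skip the network entirely
        }

        let html = self.client.get(url).send().await?.text().await?;
        tokio::fs::create_dir_all("cache").await?;
        tokio::fs::write(&cache_path, &html).await?;
        Ok(html)
    }
}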
Debugging and Monitoring
Add logging to track your scraping progress:
use log::{info, warn, error};
impl PaginationScraper {
pub async fn scrape_with_logging(&self) -> Result<Vec<String>, Box<dyn Error>> {
let mut all_data = Vec::new();
let mut page = 1;
info!("Starting pagination scraping from: {}", self.base_url);
loop {
let url = format!("{}?page={}", self.base_url, page);
info!("Fetching page {}: {}", page, url);
match self.fetch_page_data(&url).await {
Ok(data) => {
if data.is_empty() {
info!("No more data found on page {}, stopping", page);
break;
}
info!("Successfully scraped {} items from page {}", data.len(), page);
all_data.extend(data);
}
Err(e) => {
error!("Failed to fetch page {}: {}", page, e);
break;
}
}
page += 1;
sleep(self.delay).await;
}
info!("Scraping completed. Total items: {}", all_data.len());
Ok(all_data)
}
}
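Note that the log macros only emit output once a logger backend is initialized. A common choice is env_logger (an extra dependency beyond the Cargo.toml shown earlier; the version numbers below are assumptions), initialized once at startup:
// Assumed additions to Cargo.toml:
// log = "0.4"
// env_logger = "0.10"

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Respects RUST_LOG, e.g. run with: RUST_LOG=info cargo run
    env_logger::init();

    let scraper = PaginationScraper::new("https://example.com/products".to_string(), 1000);
    let items = scraper.scrape_with_logging().await?;
    log::info!("Done: {} items", items.len());
    Ok(())
}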
Handling Complex Pagination Scenarios
Infinite Scroll with Load More Buttons
Some sites use "Load More" buttons that trigger AJAX requests. For these, you can reuse the headless-browser approach from above: keep clicking the button and compare the rendered item count between clicks to know when the content is exhausted:
impl DynamicPaginationScraper {
pub async fn scrape_infinite_scroll(&self, base_url: &str) -> Result<Vec<String>, Box<dyn Error>> {
let page = self.browser.new_page("about:blank").await?;
page.goto(base_url).await?;
let mut all_data = Vec::new();
let mut previous_count = 0;
loop {
// Wait for items to load
page.wait_for_selector(".item").await?;
// Count current items
let current_count: usize = page.evaluate("document.querySelectorAll('.item').length").await?.into_value()?;
if current_count == previous_count {
// No new items loaded, we're done
break;
}
// Extract new items only
let items_script = format!(
"Array.from(document.querySelectorAll('.item')).slice({}).map(el => el.textContent)",
previous_count
);
let new_items: Vec<String> = page.evaluate(&items_script).await?.into_value()?;
all_data.extend(new_items);
previous_count = current_count;
// Try to load more
if page.evaluate("document.querySelector('.load-more') !== null").await?.into_value()? {
page.click(".load-more").await?;
tokio::time::sleep(Duration::from_millis(2000)).await;
} else {
break;
}
}
Ok(all_data)
}
}
Conclusion
Rust provides excellent tools for handling pagination in web scraping projects. The combination of reqwest for HTTP requests, scraper for HTML parsing, and tokio for asynchronous programming creates a powerful foundation for efficient pagination handling. Whether dealing with simple numbered pages or complex dynamic content, the patterns shown in this guide will help you build robust and performant scraping solutions.
Remember to always scrape responsibly, respect website terms of service, and implement appropriate delays and error handling to maintain good relationships with the sites you're scraping. With Rust's memory safety and performance characteristics, you can build scalable scraping solutions that handle large amounts of paginated data efficiently.