How do I handle pagination when scraping multiple pages with Rust?

Handling pagination is a crucial skill when scraping websites that split their content across multiple pages. Rust offers excellent tools for efficient pagination handling through asynchronous programming and robust HTTP clients. This guide covers various pagination patterns and implementation strategies using popular Rust crates.

Understanding Common Pagination Patterns

Before diving into implementation, it's important to recognize the different types of pagination you'll encounter:

  1. Numbered pagination - Pages with explicit page numbers (1, 2, 3...)
  2. Next/Previous buttons - Sequential navigation links
  3. Offset-based pagination - Using URL parameters like ?offset=20&limit=10
  4. Cursor-based pagination - Using tokens or IDs for the next page
  5. Infinite scroll - Dynamic content loading via AJAX requests

Setting Up Your Rust Environment

First, add the necessary dependencies to your Cargo.toml:

[dependencies]
reqwest = { version = "0.11", features = ["json"] }
tokio = { version = "1.0", features = ["full"] }
scraper = "0.17"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
url = "2.4"
futures = "0.3"

Basic Pagination Structure

Here's a foundational structure for handling pagination in Rust:

use reqwest::Client;
use scraper::{Html, Selector};
use std::error::Error;
use tokio::time::{sleep, Duration};

#[derive(Debug)]
pub struct PaginationScraper {
    client: Client,
    base_url: String,
    delay: Duration,
}

impl PaginationScraper {
    pub fn new(base_url: String, delay_ms: u64) -> Self {
        let client = Client::builder()
            .user_agent("Mozilla/5.0 (compatible; RustScraper/1.0)")
            .timeout(Duration::from_secs(30))
            .build()
            .expect("Failed to create HTTP client");

        Self {
            client,
            base_url,
            delay: Duration::from_millis(delay_ms),
        }
    }

    pub async fn scrape_paginated_content(&self) -> Result<Vec<String>, Box<dyn Error>> {
        let mut all_data = Vec::new();
        let mut page = 1;

        loop {
            println!("Scraping page {}", page);

            let url = format!("{}?page={}", self.base_url, page);
            let response = self.client.get(&url).send().await?;

            if !response.status().is_success() {
                break;
            }

            let html = response.text().await?;
            let document = Html::parse_document(&html);

            let data = self.extract_page_data(&document);

            if data.is_empty() {
                break; // No more data, we've reached the end
            }

            all_data.extend(data);
            page += 1;

            // Respectful delay between requests
            sleep(self.delay).await;
        }

        Ok(all_data)
    }

    fn extract_page_data(&self, document: &Html) -> Vec<String> {
        let selector = Selector::parse(".item").unwrap();
        document
            .select(&selector)
            .map(|element| element.text().collect::<String>())
            .collect()
    }
}

Handling Different Pagination Types

1. Numbered Pagination with Maximum Pages

When you know the total number of pages, or can detect the last page (a detection helper is sketched after this example):

impl PaginationScraper {
    pub async fn scrape_with_max_pages(&self, max_pages: usize) -> Result<Vec<String>, Box<dyn Error>> {
        let mut all_data = Vec::new();

        for page in 1..=max_pages {
            let url = format!("{}?page={}", self.base_url, page);

            match self.fetch_page_data(&url).await {
                Ok(data) => {
                    if data.is_empty() {
                        break; // Early termination if no data
                    }
                    all_data.extend(data);
                }
                Err(e) => {
                    eprintln!("Error fetching page {}: {}", page, e);
                    continue;
                }
            }

            sleep(self.delay).await;
        }

        Ok(all_data)
    }

    async fn fetch_page_data(&self, url: &str) -> Result<Vec<String>, Box<dyn Error>> {
        let response = self.client.get(url).send().await?;
        let html = response.text().await?;
        let document = Html::parse_document(&html);
        Ok(self.extract_page_data(&document))
    }
}
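
If the site prints its page count in a numbered pagination widget, you can read the highest page number up front instead of guessing. Below is a minimal sketch; the .pagination a selector and the assumption that each link's text is a plain page number are placeholders to adjust for your target site:

impl PaginationScraper {
    /// Find the highest page number shown in a numbered pagination widget.
    /// Assumes markup roughly like <div class="pagination"><a>1</a><a>2</a>...</div>.
    fn detect_last_page(&self, document: &Html) -> Option<usize> {
        let selector = Selector::parse(".pagination a").ok()?;
        document
            .select(&selector)
            .filter_map(|link| link.text().collect::<String>().trim().parse::<usize>().ok())
            .max()
    }
}

Fetch the first page once, call detect_last_page on it, and pass the result to scrape_with_max_pages.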

2. Next Button Navigation

For pagination that relies on "Next" buttons or links:

impl PaginationScraper {
    pub async fn scrape_with_next_links(&self) -> Result<Vec<String>, Box<dyn Error>> {
        let mut all_data = Vec::new();
        let mut current_url = self.base_url.clone();

        loop {
            let response = self.client.get(&current_url).send().await?;
            let html = response.text().await?;
            let document = Html::parse_document(&html);

            // Extract data from current page
            let page_data = self.extract_page_data(&document);
            if page_data.is_empty() {
                break;
            }
            all_data.extend(page_data);

            // Find next page URL
            if let Some(next_url) = self.find_next_page_url(&document, &current_url)? {
                current_url = next_url;
                sleep(self.delay).await;
            } else {
                break; // No more pages
            }
        }

        Ok(all_data)
    }

    fn find_next_page_url(&self, document: &Html, base_url: &str) -> Result<Option<String>, Box<dyn Error>> {
        let next_selector = Selector::parse("a[rel='next'], .next, .pagination-next").unwrap();

        if let Some(next_element) = document.select(&next_selector).next() {
            if let Some(href) = next_element.value().attr("href") {
                let url = url::Url::parse(base_url)?;
                let next_url = url.join(href)?;
                return Ok(Some(next_url.to_string()));
            }
        }

        Ok(None)
    }
}

3. Offset-Based Pagination

For APIs or sites using offset/limit parameters:

impl PaginationScraper {
    pub async fn scrape_with_offset(&self, limit: usize) -> Result<Vec<String>, Box<dyn Error>> {
        let mut all_data = Vec::new();
        let mut offset = 0;

        loop {
            let url = format!("{}?offset={}&limit={}", self.base_url, offset, limit);
            let page_data = self.fetch_page_data(&url).await?;

            if page_data.is_empty() || page_data.len() < limit {
                all_data.extend(page_data);
                break; // Last page or no more data
            }

            all_data.extend(page_data);
            offset += limit;
            sleep(self.delay).await;
        }

        Ok(all_data)
    }
}

Advanced Concurrent Pagination

For better performance, you can process multiple pages concurrently while respecting rate limits:

use futures::stream::{self, StreamExt};
use std::sync::Arc;
use tokio::sync::Semaphore;

impl PaginationScraper {
    pub async fn scrape_concurrent_pages(&self, max_pages: usize, concurrency: usize) -> Result<Vec<String>, Box<dyn Error>> {
        let semaphore = Arc::new(Semaphore::new(concurrency));
        let page_numbers: Vec<usize> = (1..=max_pages).collect();

        let results = stream::iter(page_numbers)
            .map(|page| {
                let client = self.client.clone();
                let base_url = self.base_url.clone();
                let delay = self.delay;
                let semaphore = semaphore.clone();

                async move {
                    let _permit = semaphore.acquire().await.unwrap();
                    let url = format!("{}?page={}", base_url, page);

                    sleep(delay).await; // Rate limiting

                    match client.get(&url).send().await {
                        Ok(response) => {
                            match response.text().await {
                                Ok(html) => {
                                    let document = Html::parse_document(&html);
                                    let selector = Selector::parse(".item").unwrap();
                                    let data: Vec<String> = document
                                        .select(&selector)
                                        .map(|element| element.text().collect::<String>())
                                        .collect();
                                    Ok((page, data))
                                }
                                Err(e) => Err(format!("Failed to read response for page {}: {}", page, e))
                            }
                        }
                        Err(e) => Err(format!("Failed to fetch page {}: {}", page, e))
                    }
                }
            })
            .buffer_unordered(concurrency)
            .collect::<Vec<_>>()
            .await;

        let mut all_data = Vec::new();
        for result in results {
            match result {
                Ok((page, data)) => {
                    println!("Successfully scraped page {}", page);
                    all_data.extend(data);
                }
                Err(e) => eprintln!("Error: {}", e),
            }
        }

        Ok(all_data)
    }
}

Handling Dynamic Content and AJAX Pagination

For sites that load content dynamically, you need a real browser to render the JavaScript before you can read the results. While Rust has no first-party equivalent of Puppeteer, you can drive headless Chrome through the chromiumoxide crate:

// Add to Cargo.toml:
// chromiumoxide = "0.5"

use chromiumoxide::{Browser, BrowserConfig};
use futures::StreamExt; // needed so handler.next() is available
use std::error::Error;
use std::time::Duration;

pub struct DynamicPaginationScraper {
    browser: Browser,
}

impl DynamicPaginationScraper {
    pub async fn new() -> Result<Self, Box<dyn Error>> {
        let (browser, mut handler) = Browser::launch(BrowserConfig::builder().build()?).await?;

        // Spawn the handler
        tokio::spawn(async move {
            while let Some(h) = handler.next().await {
                if h.is_err() {
                    break;
                }
            }
        });

        Ok(Self { browser })
    }

    pub async fn scrape_dynamic_pagination(&self, base_url: &str) -> Result<Vec<String>, Box<dyn Error>> {
        let page = self.browser.new_page("about:blank").await?;
        page.goto(base_url).await?;

        let mut all_data = Vec::new();

        loop {
            // Give the dynamically rendered items a moment to appear before querying the DOM
            tokio::time::sleep(Duration::from_millis(1000)).await;

            // Extract data
            let items = page.evaluate("Array.from(document.querySelectorAll('.item')).map(el => el.textContent)").await?;
            let page_data: Vec<String> = items.into_value()?;

            if page_data.is_empty() {
                break;
            }

            all_data.extend(page_data);

            // Try to click next button
            let next_button_exists = page.evaluate("document.querySelector('.next-button, .load-more') !== null").await?;
            let has_next: bool = next_button_exists.into_value()?;

            if !has_next {
                break;
            }

            page.find_element(".next-button, .load-more").await?.click().await?;

            // Wait for new content
            tokio::time::sleep(Duration::from_millis(2000)).await;
        }

        Ok(all_data)
    }
}
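
Usage is analogous to the HTTP-only scraper. A short sketch, assuming a local Chrome/Chromium install and a placeholder URL:

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let scraper = DynamicPaginationScraper::new().await?;
    let items = scraper
        .scrape_dynamic_pagination("https://example.com/feed") // placeholder URL
        .await?;
    println!("Collected {} items from the dynamic pages", items.len());
    Ok(())
}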

Error Handling and Resilience

Robust pagination scraping requires proper error handling:

use std::time::Duration;
use tokio::time::sleep;

impl PaginationScraper {
    pub async fn scrape_with_retry(&self, max_retries: usize) -> Result<Vec<String>, Box<dyn Error>> {
        let mut all_data = Vec::new();
        let mut page = 1;

        loop {
            let mut retries = 0;
            let url = format!("{}?page={}", self.base_url, page);

            loop {
                match self.fetch_page_with_timeout(&url).await {
                    Ok(data) => {
                        if data.is_empty() {
                            return Ok(all_data); // End of pagination
                        }
                        all_data.extend(data);
                        break;
                    }
                    Err(e) => {
                        retries += 1;
                        if retries > max_retries {
                            eprintln!("Failed to fetch page {} after {} retries: {}", page, max_retries, e);
                            return Ok(all_data); // Return what we have so far
                        }

                        let backoff_duration = Duration::from_millis(1000 * 2_u64.pow(retries as u32));
                        eprintln!("Retry {} for page {} after {:?}", retries, page, backoff_duration);
                        sleep(backoff_duration).await;
                    }
                }
            }

            page += 1;
            sleep(self.delay).await;
        }
    }

    async fn fetch_page_with_timeout(&self, url: &str) -> Result<Vec<String>, Box<dyn Error>> {
        let response = tokio::time::timeout(
            Duration::from_secs(30),
            self.client.get(url).send()
        ).await??;

        if !response.status().is_success() {
            return Err(format!("HTTP error: {}", response.status()).into());
        }

        let html = response.text().await?;
        let document = Html::parse_document(&html);
        Ok(self.extract_page_data(&document))
    }
}

Complete Working Example

Here's a complete example that ties everything together (it assumes the PaginationScraper struct and methods from the sections above are defined in the same file):

use reqwest::Client;
use scraper::{Html, Selector};
use std::error::Error;
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let scraper = PaginationScraper::new(
        "https://example.com/products".to_string(),
        1000 // 1 second delay
    );

    println!("Starting pagination scraping...");

    let all_data = scraper.scrape_with_retry(3).await?;

    println!("Scraped {} items across all pages", all_data.len());

    // Process your data
    for (index, item) in all_data.iter().enumerate() {
        println!("Item {}: {}", index + 1, item);
    }

    Ok(())
}

Working with APIs and JSON Responses

When scraping APIs that return JSON data with pagination:

use serde::{Deserialize, Serialize};

#[derive(Debug, Deserialize)]
struct ApiResponse {
    data: Vec<Item>,
    pagination: PaginationInfo,
}

#[derive(Debug, Deserialize)]
struct Item {
    id: u64,
    title: String,
    description: Option<String>,
}

#[derive(Debug, Deserialize)]
struct PaginationInfo {
    current_page: u32,
    last_page: u32,
    next_page_url: Option<String>,
}

impl PaginationScraper {
    pub async fn scrape_json_api(&self) -> Result<Vec<Item>, Box<dyn Error>> {
        let mut all_items = Vec::new();
        let mut current_url = Some(self.base_url.clone());

        while let Some(url) = current_url {
            let response = self.client.get(&url).send().await?;
            let api_response: ApiResponse = response.json().await?;

            all_items.extend(api_response.data);
            current_url = api_response.pagination.next_page_url;

            if current_url.is_some() {
                sleep(self.delay).await;
            }
        }

        Ok(all_items)
    }
}
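
Cursor-based pagination (pattern 4 from the list at the top) hands back a token rather than a full URL for the next page. The sketch below reuses the Item type and serde import from above; the items and next_cursor field names and the cursor query parameter are assumptions to replace with whatever your target API actually returns:

#[derive(Debug, Deserialize)]
struct CursorResponse {
    items: Vec<Item>,            // assumed field name
    next_cursor: Option<String>, // assumed field name; None on the last page
}

impl PaginationScraper {
    pub async fn scrape_cursor_api(&self) -> Result<Vec<Item>, Box<dyn Error>> {
        let mut all_items = Vec::new();
        let mut cursor: Option<String> = None;

        loop {
            // Append the cursor parameter once the API has handed us one
            let url = match &cursor {
                Some(token) => format!("{}?cursor={}", self.base_url, token),
                None => self.base_url.clone(),
            };

            let response: CursorResponse = self.client.get(&url).send().await?.json().await?;
            all_items.extend(response.items);

            match response.next_cursor {
                Some(next) => {
                    cursor = Some(next);
                    sleep(self.delay).await;
                }
                None => break, // no token means we're on the last page
            }
        }

        Ok(all_items)
    }
}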

Best Practices and Performance Tips

  1. Respect Rate Limits: Always implement delays between requests to avoid overwhelming the server
  2. Handle Errors Gracefully: Implement retry logic with exponential backoff
  3. Use Connection Pooling: The reqwest::Client automatically handles connection reuse
  4. Monitor Memory Usage: For large datasets, consider processing pages in batches or streaming results to disk instead of holding everything in memory (see the sketch after this list)
  5. Implement Caching: Store previously scraped pages to avoid re-scraping during development
  6. Follow robots.txt: Check the website's robots.txt file for scraping guidelines
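
For point 4, one low-memory approach is to append each page's items to a file as soon as they are scraped instead of accumulating them in a Vec. A minimal sketch; the output path and the JSON-lines format are arbitrary choices:

use std::io::Write;

impl PaginationScraper {
    pub async fn scrape_to_file(&self, max_pages: usize, path: &str) -> Result<(), Box<dyn Error>> {
        // std::fs is blocking, which is fine for a single sequential scraper
        let mut file = std::fs::File::create(path)?;

        for page in 1..=max_pages {
            let url = format!("{}?page={}", self.base_url, page);
            let data = self.fetch_page_data(&url).await?;
            if data.is_empty() {
                break;
            }

            // One JSON string per line, so memory stays flat no matter how many pages there are
            for item in &data {
                writeln!(file, "{}", serde_json::to_string(item)?)?;
            }

            sleep(self.delay).await;
        }

        Ok(())
    }
}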

Debugging and Monitoring

Add logging to track your scraping progress. The log macros below need a logging backend such as env_logger initialized in main:

use log::{info, warn, error};

impl PaginationScraper {
    pub async fn scrape_with_logging(&self) -> Result<Vec<String>, Box<dyn Error>> {
        let mut all_data = Vec::new();
        let mut page = 1;

        info!("Starting pagination scraping from: {}", self.base_url);

        loop {
            let url = format!("{}?page={}", self.base_url, page);
            info!("Fetching page {}: {}", page, url);

            match self.fetch_page_data(&url).await {
                Ok(data) => {
                    if data.is_empty() {
                        info!("No more data found on page {}, stopping", page);
                        break;
                    }

                    info!("Successfully scraped {} items from page {}", data.len(), page);
                    all_data.extend(data);
                }
                Err(e) => {
                    error!("Failed to fetch page {}: {}", page, e);
                    break;
                }
            }

            page += 1;
            sleep(self.delay).await;
        }

        info!("Scraping completed. Total items: {}", all_data.len());
        Ok(all_data)
    }
}

Handling Complex Pagination Scenarios

Infinite Scroll with Load More Buttons

Some sites use "Load More" buttons that trigger AJAX requests. You can either drive the button with the headless browser from the previous section, or watch the network tab in your browser's DevTools and call the underlying endpoint directly (see the sketch after the example below):

impl DynamicPaginationScraper {
    pub async fn scrape_infinite_scroll(&self, base_url: &str) -> Result<Vec<String>, Box<dyn Error>> {
        let page = self.browser.new_page("about:blank").await?;
        page.goto(base_url).await?;

        let mut all_data = Vec::new();
        let mut previous_count = 0;

        loop {
            // Give newly loaded items a moment to render before counting them
            tokio::time::sleep(Duration::from_millis(1000)).await;

            // Count current items
            let current_count: usize = page.evaluate("document.querySelectorAll('.item').length").await?.into_value()?;

            if current_count == previous_count {
                // No new items loaded, we're done
                break;
            }

            // Extract new items only
            let items_script = format!(
                "Array.from(document.querySelectorAll('.item')).slice({}).map(el => el.textContent)",
                previous_count
            );
            let new_items: Vec<String> = page.evaluate(&items_script).await?.into_value()?;
            all_data.extend(new_items);

            previous_count = current_count;

            // Try to load more
            if page.evaluate("document.querySelector('.load-more') !== null").await?.into_value()? {
                page.find_element(".load-more").await?.click().await?;
                tokio::time::sleep(Duration::from_millis(2000)).await;
            } else {
                break;
            }
        }

        Ok(all_data)
    }
}
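
Often the faster route is to skip the browser entirely: open the network tab in your browser's DevTools, find the JSON request that the "Load More" button fires, and call that endpoint with reqwest. Everything below is a placeholder for what you actually discover, including the /api/items path and the items/has_more response fields:

#[derive(Debug, serde::Deserialize)]
struct LoadMoreResponse {
    items: Vec<String>, // hypothetical field
    has_more: bool,     // hypothetical field
}

impl PaginationScraper {
    pub async fn scrape_load_more_endpoint(&self) -> Result<Vec<String>, Box<dyn Error>> {
        let mut all_items = Vec::new();
        let mut page = 1;

        loop {
            // Hypothetical endpoint discovered in the network tab
            let url = format!("{}/api/items?page={}", self.base_url, page);
            let response: LoadMoreResponse = self.client.get(&url).send().await?.json().await?;

            all_items.extend(response.items);

            if !response.has_more {
                break;
            }

            page += 1;
            sleep(self.delay).await;
        }

        Ok(all_items)
    }
}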

Conclusion

Rust provides excellent tools for handling pagination in web scraping projects. The combination of reqwest for HTTP requests, scraper for HTML parsing, and tokio for asynchronous programming creates a powerful foundation for efficient pagination handling. Whether dealing with simple numbered pages or complex dynamic content, the patterns shown in this guide will help you build robust and performant scraping solutions.

Remember to always scrape responsibly, respect website terms of service, and implement appropriate delays and error handling to maintain good relationships with the sites you're scraping. With Rust's memory safety and performance characteristics, you can build scalable scraping solutions that handle large amounts of paginated data efficiently.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
