How to Scrape Data from Multiple Pages Using Selenium Pagination

Pagination is a common web design pattern that divides content across multiple pages. When web scraping, you'll often need to navigate through paginated results to collect all available data. Selenium provides powerful tools to handle various pagination patterns effectively.

Understanding Pagination Types

Before diving into implementation, it's important to understand the different types of pagination you might encounter (a quick detection sketch follows the list):

1. Sequential Pagination

Traditional numbered pagination with "Next" and "Previous" buttons.

2. Infinite Scroll

Content loads dynamically as the user scrolls down the page.

3. Load More Button

A button that loads additional content when clicked.

4. URL-based Pagination

Pagination controlled through URL parameters (e.g., ?page=2).
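
Which pattern a site uses dictates your scraping strategy, so it is worth probing for it up front. Here is a minimal sketch, assuming the common class names next-page and load-more; these selectors are placeholders you should adapt to your target site's markup.

from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

def detect_pagination_type(driver):
    """Guess the pagination pattern by probing for common markers.
    The selectors are assumptions -- inspect your target site and adjust."""
    probes = [
        ("sequential", (By.CSS_SELECTOR, "a.next-page, .pagination a[rel='next']")),
        ("load_more", (By.CSS_SELECTOR, "button.load-more")),
    ]
    for pattern, locator in probes:
        try:
            driver.find_element(*locator)
            return pattern
        except NoSuchElementException:
            continue
    # No explicit controls found; likely infinite scroll or URL-only pagination
    return "unknown"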

Basic Pagination Scraping Strategy

Here's a comprehensive approach to scraping paginated data using Selenium:

Python Implementation

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import time

class PaginationScraper:
    def __init__(self, base_url):
        self.driver = webdriver.Chrome()
        self.base_url = base_url
        self.wait = WebDriverWait(self.driver, 10)
        self.all_data = []

    def scrape_current_page(self):
        """Extract data from the current page"""
        try:
            # Wait for content to load
            self.wait.until(EC.presence_of_element_located((By.CLASS_NAME, "item")))

            # Extract data from current page
            items = self.driver.find_elements(By.CLASS_NAME, "item")
            page_data = []

            for item in items:
                data = {
                    'title': item.find_element(By.CLASS_NAME, "title").text,
                    'price': item.find_element(By.CLASS_NAME, "price").text,
                    'description': item.find_element(By.CLASS_NAME, "description").text
                }
                page_data.append(data)

            return page_data
        except TimeoutException:
            print("Timeout waiting for page content")
            return []

    def has_next_page(self):
        """Check if there's a next page available"""
        try:
            next_button = self.driver.find_element(By.CLASS_NAME, "next-page")
            return next_button.is_enabled() and next_button.is_displayed()
        except NoSuchElementException:
            return False

    def go_to_next_page(self):
        """Navigate to the next page"""
        try:
            next_button = self.wait.until(
                EC.element_to_be_clickable((By.CLASS_NAME, "next-page"))
            )
            next_button.click()

            # Wait for new page to load
            time.sleep(2)
            return True
        except TimeoutException:
            print("Next button not found or not clickable")
            return False

    def scrape_all_pages(self):
        """Main method to scrape all pages"""
        self.driver.get(self.base_url)
        page_number = 1

        while True:
            print(f"Scraping page {page_number}")

            # Scrape current page
            page_data = self.scrape_current_page()
            self.all_data.extend(page_data)

            print(f"Found {len(page_data)} items on page {page_number}")

            # Check if there's a next page
            if not self.has_next_page():
                print("No more pages to scrape")
                break

            # Go to next page
            if not self.go_to_next_page():
                print("Failed to navigate to next page")
                break

            page_number += 1

            # Add delay to be respectful to the server
            time.sleep(1)

        return self.all_data

    def close(self):
        self.driver.quit()

# Usage example
scraper = PaginationScraper("https://example.com/products")
try:
    all_data = scraper.scrape_all_pages()
    print(f"Total items scraped: {len(all_data)}")
finally:
    scraper.close()

JavaScript Implementation

const { Builder, By, until } = require('selenium-webdriver');

class PaginationScraper {
    constructor(baseUrl) {
        this.driver = new Builder().forBrowser('chrome').build();
        this.baseUrl = baseUrl;
        this.allData = [];
    }

    async scrapeCurrentPage() {
        try {
            // Wait for content to load
            await this.driver.wait(until.elementsLocated(By.className('item')), 10000);

            // Extract data from current page
            const items = await this.driver.findElements(By.className('item'));
            const pageData = [];

            for (let item of items) {
                const title = await item.findElement(By.className('title')).getText();
                const price = await item.findElement(By.className('price')).getText();
                const description = await item.findElement(By.className('description')).getText();

                pageData.push({ title, price, description });
            }

            return pageData;
        } catch (error) {
            console.log('Error scraping current page:', error);
            return [];
        }
    }

    async hasNextPage() {
        try {
            const nextButton = await this.driver.findElement(By.className('next-page'));
            const isEnabled = await nextButton.isEnabled();
            const isDisplayed = await nextButton.isDisplayed();
            return isEnabled && isDisplayed;
        } catch (error) {
            return false;
        }
    }

    async goToNextPage() {
        try {
            const nextButton = await this.driver.wait(
                until.elementIsEnabled(this.driver.findElement(By.className('next-page'))),
                10000
            );
            await nextButton.click();

            // Wait for new page to load
            await this.driver.sleep(2000);
            return true;
        } catch (error) {
            console.log('Failed to navigate to next page:', error);
            return false;
        }
    }

    async scrapeAllPages() {
        await this.driver.get(this.baseUrl);
        let pageNumber = 1;

        while (true) {
            console.log(`Scraping page ${pageNumber}`);

            // Scrape current page
            const pageData = await this.scrapeCurrentPage();
            this.allData.push(...pageData);

            console.log(`Found ${pageData.length} items on page ${pageNumber}`);

            // Check if there's a next page
            if (!(await this.hasNextPage())) {
                console.log('No more pages to scrape');
                break;
            }

            // Go to next page
            if (!(await this.goToNextPage())) {
                console.log('Failed to navigate to next page');
                break;
            }

            pageNumber++;

            // Add delay to be respectful to the server
            await this.driver.sleep(1000);
        }

        return this.allData;
    }

    async close() {
        await this.driver.quit();
    }
}

// Usage example
(async () => {
    const scraper = new PaginationScraper('https://example.com/products');
    try {
        const allData = await scraper.scrapeAllPages();
        console.log(`Total items scraped: ${allData.length}`);
    } finally {
        await scraper.close();
    }
})();

Advanced Pagination Techniques

Handling URL-Based Pagination

For sites that use URL parameters for pagination:

def scrape_url_pagination(base_url, max_pages=None):
    driver = webdriver.Chrome()
    all_data = []
    page = 1

    try:
        while True:
            url = f"{base_url}?page={page}"
            driver.get(url)

            # Check if page has content
            try:
                WebDriverWait(driver, 5).until(
                    EC.presence_of_element_located((By.CLASS_NAME, "item"))
                )
            except TimeoutException:
                print(f"No content found on page {page}")
                break

            # Scrape current page (a standalone helper with the same
            # extraction logic as PaginationScraper.scrape_current_page)
            page_data = scrape_current_page(driver)
            if not page_data:
                break

            all_data.extend(page_data)
            page += 1

            if max_pages and page > max_pages:
                break
    finally:
        # Quit even if an error occurs so the browser doesn't leak
        driver.quit()

    return all_data

Handling Infinite Scroll

For pages with infinite scroll pagination:

def scrape_infinite_scroll(driver, scroll_pause_time=2):
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom to trigger loading
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait for new content to load
        time.sleep(scroll_pause_time)

        # Calculate new scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")

        # Stop once the page height no longer grows
        if new_height == last_height:
            break

        last_height = new_height

    # Scrape once after everything has loaded; scraping inside the loop
    # would collect each item multiple times
    return scrape_current_page(driver)
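
Scraping once at the end assumes the site keeps every loaded item in the DOM. Some infinite-scroll pages discard off-screen items (DOM virtualization); in that case you must collect incrementally and deduplicate. A sketch, assuming each item contains a link whose href is unique:

def scrape_infinite_scroll_dedup(driver, scroll_pause_time=2):
    seen_keys = set()
    all_data = []
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Collect currently visible items, skipping ones already seen
        for item in driver.find_elements(By.CLASS_NAME, "item"):
            # Assumption: each item wraps a link with a unique href
            key = item.find_element(By.TAG_NAME, "a").get_attribute("href")
            if key in seen_keys:
                continue
            seen_keys.add(key)
            all_data.append({'title': item.find_element(By.CLASS_NAME, "title").text})

        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(scroll_pause_time)

        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    return all_data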

Load More Button Pattern

For sites with "Load More" buttons:

def scrape_load_more_pattern(driver):
    while True:
        try:
            # Find and click the load more button
            load_more_button = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.CLASS_NAME, "load-more"))
            )
            load_more_button.click()

            # Wait for new content to load
            time.sleep(3)

        except TimeoutException:
            print("Load more button not found or not clickable")
            break

    # Scrape once after all content is loaded; items accumulate in the
    # DOM, so scraping inside the loop would produce duplicates
    return scrape_current_page(driver)
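
The fixed three-second sleep is the weak point here: too short and you scrape before new items render, too long and you waste time. An alternative sketch that waits until the item count actually grows after each click (same .item class assumption as above):

def click_load_more_and_wait(driver, timeout=10):
    """Click load-more, then wait until additional items appear."""
    count_before = len(driver.find_elements(By.CLASS_NAME, "item"))

    load_more_button = WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.CLASS_NAME, "load-more"))
    )
    load_more_button.click()

    # Block until the item count exceeds the pre-click count
    WebDriverWait(driver, timeout).until(
        lambda d: len(d.find_elements(By.CLASS_NAME, "item")) > count_before
    )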

Best Practices for Pagination Scraping

1. Implement Robust Error Handling

from selenium.common.exceptions import StaleElementReferenceException

def safe_scrape_with_retry(driver, max_retries=3):
    for attempt in range(max_retries):
        try:
            return scrape_current_page(driver)
        except StaleElementReferenceException:
            print(f"Stale element reference, retrying... (attempt {attempt + 1})")
            time.sleep(2)
            continue
        except Exception as e:
            print(f"Error during scraping: {e}")
            if attempt == max_retries - 1:
                raise
    return []

2. Add Progress Tracking

def scrape_with_progress(scraper, estimated_pages=None):
    page_number = 1
    total_items = 0

    while True:
        page_data = scraper.scrape_current_page()
        total_items += len(page_data)

        if estimated_pages:
            progress = (page_number / estimated_pages) * 100
            print(f"Progress: {progress:.1f}% - Page {page_number}/{estimated_pages}")
        else:
            print(f"Scraped page {page_number} - {len(page_data)} items")

        # Scrape before checking for a next page so the last page isn't skipped
        if not scraper.has_next_page():
            break

        scraper.go_to_next_page()
        page_number += 1

    return total_items

3. Handle Rate Limiting

import random

def scrape_with_rate_limiting(scraper, min_delay=1, max_delay=3):
    all_data = []

    while True:
        all_data.extend(scraper.scrape_current_page())

        if not scraper.has_next_page():
            break

        # Random delay between requests to avoid a fixed, bot-like cadence
        delay = random.uniform(min_delay, max_delay)
        print(f"Waiting {delay:.1f} seconds before next page...")
        time.sleep(delay)

        scraper.go_to_next_page()

    return all_data

Common Pagination Challenges and Solutions

Challenge 1: Dynamic Content Loading

Some sites load content dynamically, making it difficult to determine when a page has fully loaded. As with AJAX-heavy pages, the solution is to wait for specific elements rather than relying on fixed delays:

def wait_for_dynamic_content(driver, timeout=10):
    """Wait for dynamic content to load completely"""
    try:
        # Wait for loading spinner to disappear
        WebDriverWait(driver, timeout).until(
            EC.invisibility_of_element_located((By.CLASS_NAME, "loading-spinner"))
        )

        # Wait for content to appear
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, "content-item"))
        )

        return True
    except TimeoutException:
        return False

Challenge 2: Pagination State Management

Keep track of pagination state to handle complex navigation patterns:

import re

class PaginationState:
    def __init__(self):
        self.current_page = 1
        self.total_pages = None
        self.items_per_page = None
        self.total_items = 0

    def update_from_page_info(self, driver):
        """Extract pagination info from page elements"""
        try:
            page_info = driver.find_element(By.CLASS_NAME, "page-info").text
            # Parse something like "Page 1 of 10 (100 items)"
            match = re.search(r'Page (\d+) of (\d+)', page_info)
            if match:
                self.current_page = int(match.group(1))
                self.total_pages = int(match.group(2))
        except NoSuchElementException:
            pass
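
A brief usage sketch: refresh the state after each navigation and stop once the reported total is reached (this assumes the site exposes a page-info element like the one parsed above):

state = PaginationState()
driver.get("https://example.com/products")

while True:
    state.update_from_page_info(driver)
    state.total_items += len(scrape_current_page(driver))

    # Stop when no total was found or the last page has been scraped
    if state.total_pages is None or state.current_page >= state.total_pages:
        break
    driver.find_element(By.CLASS_NAME, "next-page").click()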

Challenge 3: Handling JavaScript-Heavy Pagination

For sites with complex JavaScript pagination, you might need to execute custom JavaScript:

def click_pagination_with_js(driver, page_number):
    """Use JavaScript to navigate to specific page"""
    script = f"""
    var pageButton = document.querySelector('button[data-page="{page_number}"]');
    if (pageButton) {{
        pageButton.click();
        return true;
    }}
    return false;
    """
    return driver.execute_script(script)
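
Since execute_script returns the script's return value, the caller can branch on whether the button was found. A usage sketch (the data-page attribute is an assumption about the site's markup):

if click_pagination_with_js(driver, 3):
    # Wait for the newly requested page's items to render
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "item"))
    )
else:
    print("Page button 3 not found")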

Performance Optimization

Parallel Processing

For large-scale scraping, consider processing multiple pages in parallel:

from concurrent.futures import ThreadPoolExecutor
import threading

class ThreadSafePaginationScraper:
    def __init__(self, base_url, max_workers=3):
        self.base_url = base_url
        self.max_workers = max_workers
        self.data_lock = threading.Lock()
        self.all_data = []

    def scrape_page_range(self, start_page, end_page):
        """Scrape a range of pages in a single thread"""
        driver = webdriver.Chrome()
        thread_data = []

        try:
            for page in range(start_page, end_page + 1):
                url = f"{self.base_url}?page={page}"
                driver.get(url)

                # Reuse the standalone scrape_current_page helper defined earlier
                page_data = scrape_current_page(driver)
                thread_data.extend(page_data)

                time.sleep(1)  # Rate limiting

            with self.data_lock:
                self.all_data.extend(thread_data)

        finally:
            driver.quit()

    def scrape_parallel(self, total_pages):
        """Scrape pages in parallel"""
        pages_per_thread = total_pages // self.max_workers

        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            futures = []

            for i in range(self.max_workers):
                start_page = i * pages_per_thread + 1
                end_page = (i + 1) * pages_per_thread

                if i == self.max_workers - 1:
                    end_page = total_pages

                future = executor.submit(self.scrape_page_range, start_page, end_page)
                futures.append(future)

            # Wait for all threads to complete
            for future in futures:
                future.result()

        return self.all_data
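
A usage sketch, assuming you already know the page count (for example, from PaginationState above). Keep max_workers low: every worker launches its own Chrome instance and multiplies the load you place on the target server:

scraper = ThreadSafePaginationScraper("https://example.com/products", max_workers=3)
all_data = scraper.scrape_parallel(total_pages=30)
print(f"Total items scraped: {len(all_data)}")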

Optimizing Wait Times

Use explicit waits efficiently to minimize scraping time:

def smart_wait_for_next_page(driver, timeout=10):
    """Wait for the next page to load using multiple indicators"""
    try:
        # Grab a reference to old content so staleness signals a page change
        old_element = driver.find_element(By.CLASS_NAME, "page-number")

        # Wait for the old element to go stale (the page changed)
        WebDriverWait(driver, timeout).until(
            EC.staleness_of(old_element)
        )

        # Wait for new content to load
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, "item"))
        )

        return True
    except (TimeoutException, NoSuchElementException):
        return False
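
This condition-based wait can replace the fixed time.sleep(2) in go_to_next_page from the PaginationScraper class earlier. A sketch of the drop-in variant:

def go_to_next_page_smart(driver, timeout=10):
    """Click next, then wait on explicit conditions instead of sleeping."""
    try:
        next_button = WebDriverWait(driver, timeout).until(
            EC.element_to_be_clickable((By.CLASS_NAME, "next-page"))
        )
        next_button.click()
        return smart_wait_for_next_page(driver, timeout=timeout)
    except TimeoutException:
        return False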

Testing Your Pagination Scraper

Unit Testing Example

import unittest
from unittest.mock import Mock, patch

class TestPaginationScraper(unittest.TestCase):
    def setUp(self):
        # Patch Chrome so tests run without launching a real browser;
        # the scraper's driver becomes a Mock configured per test
        with patch('selenium.webdriver.Chrome'):
            self.scraper = PaginationScraper("https://example.com")

    def test_has_next_page_returns_true_when_button_exists(self):
        # Mock the driver to return a next button
        mock_button = Mock()
        mock_button.is_enabled.return_value = True
        mock_button.is_displayed.return_value = True

        with patch.object(self.scraper.driver, 'find_element', return_value=mock_button):
            result = self.scraper.has_next_page()
            self.assertTrue(result)

    def test_has_next_page_returns_false_when_button_disabled(self):
        # Mock the driver to return a disabled button
        mock_button = Mock()
        mock_button.is_enabled.return_value = False
        mock_button.is_displayed.return_value = True

        with patch.object(self.scraper.driver, 'find_element', return_value=mock_button):
            result = self.scraper.has_next_page()
            self.assertFalse(result)

    def tearDown(self):
        self.scraper.close()

if __name__ == '__main__':
    unittest.main()

Conclusion

Scraping data from multiple pages using Selenium pagination requires careful planning and robust error handling. The key is to understand the specific pagination pattern used by your target website and implement appropriate waiting strategies for dynamic content loading.

Remember to always respect the website's robots.txt file and implement reasonable delays between requests to avoid overwhelming the server. For more complex scenarios involving parallel processing, consider using dedicated web scraping solutions that can handle large-scale operations more efficiently.

By following these patterns and best practices, you'll be able to effectively scrape data across multiple pages while maintaining reliability and performance in your web scraping projects.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
