How do I handle pagination when scraping multiple pages with Beautiful Soup?

Pagination is one of the most common challenges when scraping websites that display large datasets across multiple pages. Beautiful Soup, combined with the requests library, provides excellent tools for handling various pagination patterns. This guide covers different pagination strategies and provides practical code examples to help you efficiently scrape paginated content.

Understanding Pagination Patterns

Before diving into the code, it's essential to understand the common pagination patterns you'll encounter:

  1. Numbered pagination - Pages with sequential numbers (1, 2, 3...)
  2. Next/Previous buttons - Links to navigate between pages
  3. Load more buttons - AJAX-based pagination
  4. Offset-based pagination - URL parameters like ?page=1&limit=20 (see the sketch after this list)
  5. Infinite scroll - Content loads as you scroll (requires dynamic scraping tools)
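
For the offset-based pattern, the offset is usually derived from the page number and page size. Here's a minimal sketch; the offset and limit parameter names are assumptions, so check the URLs your target site actually uses:

from urllib.parse import urlencode

def build_offset_url(base_url, page_number, page_size=20):
    """Build an offset-based URL such as ?offset=40&limit=20 (parameter names vary by site)"""
    params = {'offset': (page_number - 1) * page_size, 'limit': page_size}
    separator = '&' if '?' in base_url else '?'
    return f"{base_url}{separator}{urlencode(params)}"

# build_offset_url("https://example.com/products", 3) -> "...?offset=40&limit=20"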

Basic Pagination Setup

Here's a foundational setup for scraping paginated content with Beautiful Soup:

import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urljoin, urlparse

class PaginationScraper:
    def __init__(self, base_url, delay=1):
        self.base_url = base_url
        self.session = requests.Session()
        self.delay = delay

        # Set headers to avoid being blocked
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })

    def get_page(self, url):
        """Fetch and parse a single page"""
        try:
            response = self.session.get(url, timeout=10)  # timeout prevents hanging on unresponsive servers
            response.raise_for_status()
            return BeautifulSoup(response.content, 'html.parser')
        except requests.RequestException as e:
            print(f"Error fetching {url}: {e}")
            return None

    def extract_data(self, soup):
        """Override this method to extract data from each page"""
        return []  # Subclasses should return a list of dicts

    def scrape_all_pages(self):
        """Main method to scrape all pages"""
        pass

Method 1: Numbered Pagination

This is the most straightforward approach when pages follow a sequential numbering pattern:

class NumberedPaginationScraper(PaginationScraper):
    def __init__(self, base_url, start_page=1, max_pages=None, delay=1):
        super().__init__(base_url, delay)
        self.start_page = start_page
        self.max_pages = max_pages

    def extract_data(self, soup):
        """Extract product information from an e-commerce page"""
        products = []
        for product in soup.find_all('div', class_='product-item'):
            name = product.find('h3', class_='product-title')
            price = product.find('span', class_='price')

            if name and price:
                products.append({
                    'name': name.get_text(strip=True),
                    'price': price.get_text(strip=True)
                })
        return products

    def scrape_all_pages(self):
        """Scrape all numbered pages"""
        all_data = []
        page = self.start_page

        while True:
            # Construct URL for current page
            url = f"{self.base_url}?page={page}"
            print(f"Scraping page {page}: {url}")

            soup = self.get_page(url)
            if not soup:
                break

            # Extract data from current page
            page_data = self.extract_data(soup)

            # Stop if no data found (reached end)
            if not page_data:
                print("No more data found. Stopping.")
                break

            all_data.extend(page_data)

            # Check if we've reached max pages
            if self.max_pages and page >= self.max_pages:
                break

            page += 1
            time.sleep(self.delay)  # Be respectful

        return all_data

# Usage example
scraper = NumberedPaginationScraper(
    base_url="https://example-shop.com/products",
    max_pages=10,
    delay=2
)
products = scraper.scrape_all_pages()
print(f"Scraped {len(products)} products")

Method 2: Next Button Pagination

Many websites use "Next" buttons instead of numbered pages. Here's how to handle this pattern:

class NextButtonPaginationScraper(PaginationScraper):
    def extract_data(self, soup):
        """Extract article data from a news website"""
        articles = []
        for article in soup.find_all('article', class_='news-item'):
            title = article.find('h2', class_='article-title')
            author = article.find('span', class_='author')
            date = article.find('time', class_='publish-date')

            if title:
                articles.append({
                    'title': title.get_text(strip=True),
                    'author': author.get_text(strip=True) if author else None,
                    'date': date.get('datetime') if date else None
                })
        return articles

    def get_next_url(self, soup, current_url):
        """Find the next page URL"""
        next_link = soup.find('a', class_='next-page')
        if not next_link:
            # Try alternative selectors (string= replaces the deprecated text= argument)
            next_link = soup.find('a', string='Next') or soup.find('a', string='Next →')

        if next_link and next_link.get('href'):
            # Convert relative URL to absolute
            return urljoin(current_url, next_link['href'])
        return None

    def scrape_all_pages(self):
        """Scrape all pages following next button links"""
        all_data = []
        current_url = self.base_url
        page_count = 0

        while current_url:
            page_count += 1
            print(f"Scraping page {page_count}: {current_url}")

            soup = self.get_page(current_url)
            if not soup:
                break

            # Extract data from current page
            page_data = self.extract_data(soup)
            all_data.extend(page_data)

            # Find next page URL
            next_url = self.get_next_url(soup, current_url)

            if not next_url or next_url == current_url:
                print("No more pages found.")
                break

            current_url = next_url
            time.sleep(self.delay)

        return all_data

# Usage example
scraper = NextButtonPaginationScraper(
    base_url="https://news-site.com/articles",
    delay=1.5
)
articles = scraper.scrape_all_pages()
print(f"Scraped {len(articles)} articles")

Method 3: URL Parameter Pagination

Some websites expose the page number (or an offset) as a query parameter. This approach generalizes Method 1 with a configurable parameter name and an explicit end-of-results check:

class ParameterPaginationScraper(PaginationScraper):
    def __init__(self, base_url, page_param='page', page_size=20, delay=1):
        super().__init__(base_url, delay)
        self.page_param = page_param
        self.page_size = page_size

    def build_url(self, page_number):
        """Build URL with pagination parameters"""
        separator = '&' if '?' in self.base_url else '?'
        return f"{self.base_url}{separator}{self.page_param}={page_number}"

    def has_more_pages(self, soup, current_page):
        """Determine if there are more pages to scrape"""
        # Method 1: Check for pagination info
        pagination_info = soup.find('div', class_='pagination-info')
        if pagination_info:
            text = pagination_info.get_text()
            # Look for patterns like "Page 1 of 50"
            import re
            match = re.search(r'(\d+)\s+of\s+(\d+)', text)
            if match:
                current, total = map(int, match.groups())
                return current < total

        # Method 2: A full page of results usually means another page exists
        items = soup.find_all('div', class_='result-item')
        return len(items) >= self.page_size

    def extract_data(self, soup):
        """Extract search results"""
        results = []
        for item in soup.find_all('div', class_='result-item'):
            title = item.find('h3', class_='result-title')
            description = item.find('p', class_='result-description')

            if title:
                results.append({
                    'title': title.get_text(strip=True),
                    'description': description.get_text(strip=True) if description else None
                })
        return results

    def scrape_all_pages(self):
        """Scrape all pages using parameter-based pagination"""
        all_data = []
        page = 1

        while True:
            url = self.build_url(page)
            print(f"Scraping page {page}: {url}")

            soup = self.get_page(url)
            if not soup:
                break

            page_data = self.extract_data(soup)

            if not page_data:
                print("No data found on this page. Stopping.")
                break

            all_data.extend(page_data)

            # Check if there are more pages
            if not self.has_more_pages(soup, page):
                print("Reached the last page.")
                break

            page += 1
            time.sleep(self.delay)

        return all_data
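
Usage mirrors the earlier examples; the search URL below is hypothetical:

# Usage example
scraper = ParameterPaginationScraper(
    base_url="https://example-search.com/results?q=laptops",
    page_param='page',
    page_size=20,
    delay=1
)
results = scraper.scrape_all_pages()
print(f"Scraped {len(results)} results")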

Advanced Pagination Handling

Handling Dynamic Content

For websites that load content dynamically, Beautiful Soup alone can't see JavaScript-rendered markup. You can either pair it with a headless browser (covered in the JavaScript-Based Pagination section below) or skip the rendered HTML entirely and call the site's underlying JSON API directly.
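
The endpoint behind a "Load more" button is usually visible in your browser's network tab. Here's a minimal sketch, assuming a hypothetical /api/items endpoint that accepts page and per_page parameters and returns an items array:

import requests
import time

def scrape_json_api(api_url, per_page=20, delay=1):
    """Paginate through a JSON API directly instead of parsing rendered HTML"""
    session = requests.Session()
    all_items = []
    page = 1

    while True:
        response = session.get(api_url, params={'page': page, 'per_page': per_page}, timeout=10)
        response.raise_for_status()
        items = response.json().get('items', [])

        if not items:
            break  # An empty page means we've reached the end

        all_items.extend(items)
        page += 1
        time.sleep(delay)  # Be respectful

    return all_items

# Usage (hypothetical endpoint discovered via the browser's network tab)
# items = scrape_json_api("https://example.com/api/items")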

Error Handling and Retry Logic

import random
from time import sleep

class RobustPaginationScraper(PaginationScraper):
    def __init__(self, base_url, max_retries=3, delay=1):
        super().__init__(base_url, delay)
        self.max_retries = max_retries

    def get_page_with_retry(self, url):
        """Fetch page with retry logic and exponential backoff"""
        for attempt in range(self.max_retries):
            # get_page already catches request errors and returns None on failure
            soup = self.get_page(url)
            if soup:
                return soup

            print(f"Attempt {attempt + 1} failed for {url}")
            if attempt < self.max_retries - 1:
                # Exponential backoff with jitter before the next attempt
                backoff = self.delay * (2 ** attempt) + random.uniform(0, 1)
                sleep(backoff)
        return None

    def scrape_with_progress(self, urls):
        """Scrape multiple URLs with progress tracking"""
        all_data = []
        total_urls = len(urls)

        for i, url in enumerate(urls, 1):
            print(f"Progress: {i}/{total_urls} ({i/total_urls*100:.1f}%)")

            soup = self.get_page_with_retry(url)
            if soup:
                data = self.extract_data(soup)
                all_data.extend(data)

            # Random delay to appear more human-like
            sleep(self.delay + random.uniform(0, 0.5))

        return all_data
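
Because extract_data is only a stub in the base class, subclass RobustPaginationScraper before calling scrape_with_progress. The class name, selector, and URL pattern below are illustrative:

# Usage example: override extract_data, then pre-build the page URLs
class ProductScraper(RobustPaginationScraper):
    def extract_data(self, soup):
        return [{'name': tag.get_text(strip=True)}
                for tag in soup.find_all('h3', class_='product-title')]

urls = [f"https://example.com/products?page={page}" for page in range(1, 51)]
scraper = ProductScraper("https://example.com/products", max_retries=3, delay=1)
products = scraper.scrape_with_progress(urls)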

Handling Different Response Formats

import re

def detect_pagination_type(soup):
    """Automatically detect pagination type"""
    # Check for numbered pagination
    if soup.find_all('a', href=re.compile(r'page=\d+')):
        return 'numbered'

    # Check for next/previous buttons
    if soup.find('a', class_=re.compile(r'next|continue')):
        return 'next_button'

    # Check for load more button
    if soup.find('button', class_=re.compile(r'load.?more')):
        return 'load_more'

    return 'unknown'

def smart_pagination_scraper(base_url):
    """Automatically adapt to different pagination types"""
    initial_response = requests.get(base_url, timeout=10)
    soup = BeautifulSoup(initial_response.content, 'html.parser')

    pagination_type = detect_pagination_type(soup)

    if pagination_type == 'numbered':
        return NumberedPaginationScraper(base_url)
    elif pagination_type == 'next_button':
        return NextButtonPaginationScraper(base_url)
    else:
        return ParameterPaginationScraper(base_url)

Best Practices for Pagination Scraping

Rate Limiting and Respectful Scraping

import time
from datetime import datetime, timedelta

class RateLimitedScraper(PaginationScraper):
    def __init__(self, base_url, requests_per_minute=30):
        super().__init__(base_url)
        self.requests_per_minute = requests_per_minute
        self.request_times = []

    def enforce_rate_limit(self):
        """Ensure we don't exceed rate limits"""
        now = datetime.now()

        # Remove requests older than 1 minute
        self.request_times = [
            req_time for req_time in self.request_times 
            if now - req_time < timedelta(minutes=1)
        ]

        # If we're at the limit, wait
        if len(self.request_times) >= self.requests_per_minute:
            sleep_time = 60 - (now - self.request_times[0]).total_seconds()
            if sleep_time > 0:
                print(f"Rate limit reached. Sleeping for {sleep_time:.1f} seconds.")
                time.sleep(sleep_time)

        self.request_times.append(now)

    def get_page(self, url):
        """Fetch a page only after the rate limiter allows it"""
        self.enforce_rate_limit()
        return super().get_page(url)

Session Management and Cookies

class SessionAwareScraper(PaginationScraper):
    def __init__(self, base_url, login_required=False):
        super().__init__(base_url)
        self.login_required = login_required

        if login_required:
            self.login()

    def login(self):
        """Handle login if required"""
        login_url = urljoin(self.base_url, '/login')

        # Get login page to extract CSRF token
        login_page = self.session.get(login_url)
        soup = BeautifulSoup(login_page.content, 'html.parser')

        csrf_token = soup.find('input', {'name': 'csrf_token'})
        token_value = csrf_token.get('value') if csrf_token else None

        # Perform login (replace the placeholder credentials with your own)
        login_data = {
            'username': 'your_username',
            'password': 'your_password',
            'csrf_token': token_value
        }

        response = self.session.post(login_url, data=login_data)
        if response.status_code == 200:
            print("Successfully logged in")
        else:
            print("Login failed")

JavaScript-Based Pagination

For websites that use JavaScript to load paginated content, Beautiful Soup alone isn't sufficient. You'll need to combine it with browser automation tools:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

class JavaScriptPaginationScraper:
    def __init__(self, base_url, headless=True):
        self.base_url = base_url
        options = webdriver.ChromeOptions()
        if headless:
            options.add_argument('--headless')
        self.driver = webdriver.Chrome(options=options)

    def extract_data(self, soup):
        """Override this method to extract data from the rendered page"""
        return []

    def scrape_infinite_scroll(self):
        """Handle infinite scroll pagination"""
        self.driver.get(self.base_url)

        last_height = self.driver.execute_script("return document.body.scrollHeight")

        while True:
            # Scroll to bottom
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

            # Wait for new content to load
            time.sleep(2)

            # Calculate new scroll height
            new_height = self.driver.execute_script("return document.body.scrollHeight")

            if new_height == last_height:
                break  # No more content

            last_height = new_height

        # Parse the fully loaded page once so items aren't collected twice
        soup = BeautifulSoup(self.driver.page_source, 'html.parser')
        all_data = self.extract_data(soup)

        self.driver.quit()
        return all_data

    def scrape_load_more_button(self):
        """Handle 'Load More' button pagination"""
        self.driver.get(self.base_url)

        while True:
            try:
                # Find and click "Load More" button
                load_more_button = WebDriverWait(self.driver, 10).until(
                    EC.element_to_be_clickable((By.CLASS_NAME, "load-more-btn"))
                )
                load_more_button.click()

                # Wait for content to load
                time.sleep(2)

            except TimeoutException:
                print("No more 'Load More' button found.")
                break

        # Parse the fully loaded page once so items aren't collected twice
        soup = BeautifulSoup(self.driver.page_source, 'html.parser')
        all_data = self.extract_data(soup)

        self.driver.quit()
        return all_data
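
Like the requests-based scrapers, this class needs an extract_data override before it returns anything useful. A brief usage sketch with a hypothetical selector and URL:

# Usage example: override extract_data, then pick a strategy
class InfiniteScrollProducts(JavaScriptPaginationScraper):
    def extract_data(self, soup):
        return [{'name': tag.get_text(strip=True)}
                for tag in soup.find_all('div', class_='product-name')]

scraper = InfiniteScrollProducts("https://example.com/feed", headless=True)
products = scraper.scrape_infinite_scroll()
print(f"Scraped {len(products)} products")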

Data Storage and Processing

Saving Scraped Data

import json
import csv
from datetime import datetime

class DataProcessor:
    @staticmethod
    def save_to_json(data, filename=None):
        """Save data to JSON file"""
        if not filename:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"scraped_data_{timestamp}.json"

        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
        print(f"Data saved to {filename}")

    @staticmethod
    def save_to_csv(data, filename=None):
        """Save data to CSV file"""
        if not data:
            return

        if not filename:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"scraped_data_{timestamp}.csv"

        with open(filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=data[0].keys())
            writer.writeheader()
            writer.writerows(data)
        print(f"Data saved to {filename}")

    @staticmethod
    def deduplicate_data(data, key_field='title'):
        """Remove duplicate entries based on a key field"""
        seen = set()
        unique_data = []

        for item in data:
            if item.get(key_field) not in seen:
                seen.add(item.get(key_field))
                unique_data.append(item)

        print(f"Removed {len(data) - len(unique_data)} duplicates")
        return unique_data

# Usage example
scraper = NumberedPaginationScraper("https://example.com/products")
products = scraper.scrape_all_pages()

# Process and save data
processor = DataProcessor()
unique_products = processor.deduplicate_data(products, 'name')
processor.save_to_json(unique_products)
processor.save_to_csv(unique_products)

Monitoring and Logging

import logging
from datetime import datetime

class LoggingPaginationScraper(PaginationScraper):
    def __init__(self, base_url, delay=1):
        super().__init__(base_url, delay)
        self.setup_logging()

    def setup_logging(self):
        """Configure logging for the scraper"""
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s',
            handlers=[
                logging.FileHandler(f'scraper_{datetime.now().strftime("%Y%m%d")}.log'),
                logging.StreamHandler()
            ]
        )
        self.logger = logging.getLogger(__name__)

    def get_page(self, url):
        """Fetch page with logging"""
        self.logger.info(f"Fetching: {url}")
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            self.logger.info(f"Successfully fetched: {url} (Status: {response.status_code})")
            return BeautifulSoup(response.content, 'html.parser')
        except requests.RequestException as e:
            self.logger.error(f"Error fetching {url}: {e}")
            return None

    def scrape_all_pages(self):
        """Scrape with comprehensive logging"""
        start_time = datetime.now()
        self.logger.info(f"Starting pagination scrape of {self.base_url}")

        all_data = []
        page = 1

        try:
            while True:
                url = f"{self.base_url}?page={page}"
                soup = self.get_page(url)

                if not soup:
                    break

                page_data = self.extract_data(soup)

                if not page_data:
                    self.logger.info("No more data found. Stopping pagination.")
                    break

                all_data.extend(page_data)
                self.logger.info(f"Page {page}: Extracted {len(page_data)} items")

                page += 1
                time.sleep(self.delay)

        except KeyboardInterrupt:
            self.logger.info("Scraping interrupted by user")
        except Exception as e:
            self.logger.error(f"Unexpected error during scraping: {e}")

        end_time = datetime.now()
        duration = end_time - start_time

        self.logger.info(f"Scraping completed. Total items: {len(all_data)}, Duration: {duration}")
        return all_data

Conclusion

Handling pagination with Beautiful Soup requires understanding the specific pagination pattern used by your target website and implementing the appropriate scraping strategy. The key principles include:

  1. Identify the pagination pattern before writing your scraper
  2. Implement proper error handling and retry logic
  3. Respect rate limits and add delays between requests
  4. Handle session management when required
  5. Monitor for changes in the website structure (a minimal check is sketched below)
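
On the last point, a cheap safeguard is to fail loudly when the selectors you depend on stop matching anything; the selectors below are examples only:

def check_page_structure(soup, required_selectors=('div.product-item', 'span.price')):
    """Warn if expected selectors no longer match anything (a hint that the site changed)"""
    missing = [selector for selector in required_selectors if not soup.select_one(selector)]
    if missing:
        print(f"Warning: selectors not found, page structure may have changed: {missing}")
    return not missing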

For more complex scenarios involving JavaScript-heavy websites, consider combining Beautiful Soup with headless browsers for comprehensive web scraping solutions. Remember to always check the website's robots.txt file and terms of service before scraping.
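
The standard library's urllib.robotparser can handle the robots.txt check programmatically before you start paginating:

from urllib.robotparser import RobotFileParser
from urllib.parse import urljoin

def is_allowed(base_url, path="/", user_agent="*"):
    """Check robots.txt before scraping a paginated section"""
    parser = RobotFileParser()
    parser.set_url(urljoin(base_url, "/robots.txt"))
    parser.read()
    return parser.can_fetch(user_agent, urljoin(base_url, path))

# e.g. is_allowed("https://example.com", "/products")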

By following these patterns and best practices, you'll be able to efficiently scrape paginated content while maintaining good relationships with the websites you're accessing.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
