How do I handle pagination with MechanicalSoup?
Pagination is a common challenge when scraping websites that display content across multiple pages. MechanicalSoup, a Python library that combines the power of Requests and Beautiful Soup, provides excellent tools for handling various pagination patterns. This guide covers different pagination strategies and implementation techniques.
Understanding Pagination Types
Before diving into MechanicalSoup-specific solutions, it's important to understand the different types of pagination you might encounter:
- Numbered pagination - Traditional page numbers (1, 2, 3...)
- Next/Previous links - Simple navigation buttons
- Load more buttons - AJAX-style pagination
- URL parameter pagination - Pages controlled by URL parameters
Basic Setup
First, ensure you have MechanicalSoup installed (pip install mechanicalsoup) and set up a basic browser instance:
import mechanicalsoup
import time
from urllib.parse import urljoin
# Create a browser instance
browser = mechanicalsoup.StatefulBrowser()
browser.set_user_agent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36')
# Optional: Enable debugging
browser.set_debug(True)
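To confirm the setup works before writing any pagination logic, you can open a page and check the browser's state. This is a quick sanity check only; the URL below is a placeholder:

# Quick sanity check: open a page and inspect the browser state
browser.open("https://example.com")
print(browser.get_url())                  # URL of the currently loaded page
print(browser.get_current_page().title)   # Parsed page is a BeautifulSoup object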
Handling Numbered Pagination
This is the most common pagination pattern where pages are accessed through numbered links or URL parameters.
Method 1: Using Next Page Links
def scrape_with_next_links(base_url):
    browser = mechanicalsoup.StatefulBrowser()
    browser.open(base_url)

    page_number = 1
    all_data = []

    while True:
        print(f"Scraping page {page_number}...")

        # Extract data from current page
        soup = browser.get_current_page()
        data = extract_page_data(soup)
        all_data.extend(data)

        # Look for "Next" button or link
        next_link = soup.find('a', {'class': 'next-page'})  # Adjust selector
        if not next_link or not next_link.get('href'):
            print("No more pages found")
            break

        # Navigate to next page
        try:
            browser.follow_link(next_link)
            page_number += 1
            time.sleep(1)  # Be respectful with delays
        except Exception as e:
            print(f"Error navigating to next page: {e}")
            break

    return all_data
def extract_page_data(soup):
    """Extract data from the current page"""
    data = []

    # Adjust selectors based on your target website
    items = soup.find_all('div', {'class': 'item'})

    for item in items:
        title = item.find('h2')
        description = item.find('p')

        if title and description:
            data.append({
                'title': title.get_text(strip=True),
                'description': description.get_text(strip=True)
            })

    return data
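Note that follow_link() already resolves relative href values against the current URL. If you prefer to build the absolute URL yourself (for logging or retry queues), the urljoin import from the setup section can be used instead. This is an optional variation, not something the next-link approach requires; go_to_next_page is a hypothetical helper name:

def go_to_next_page(browser, soup):
    """Resolve the next-page link manually instead of using follow_link()."""
    next_link = soup.find('a', {'class': 'next-page'})  # Adjust selector
    if not next_link or not next_link.get('href'):
        return False
    # Build an absolute URL from a possibly relative href, then open it
    next_url = urljoin(browser.get_url(), next_link['href'])
    browser.open(next_url)
    return True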
Method 2: URL Parameter Pagination
For sites that use URL parameters like ?page=1, ?page=2:
def scrape_url_pagination(base_url, max_pages=None):
    browser = mechanicalsoup.StatefulBrowser()
    page = 1
    all_data = []

    while True:
        # Construct URL with page parameter
        url = f"{base_url}?page={page}"
        print(f"Scraping {url}...")

        try:
            response = browser.open(url)

            # Check if page exists (status code, content, etc.)
            if response.status_code != 200:
                print(f"Page {page} returned status {response.status_code}")
                break

            soup = browser.get_current_page()

            # Check if page has content
            if not has_content(soup):
                print(f"Page {page} has no content")
                break

            # Extract data
            data = extract_page_data(soup)
            if not data:
                print(f"No data found on page {page}")
                break

            all_data.extend(data)
            page += 1

            # Optional: limit maximum pages
            if max_pages and page > max_pages:
                break

            time.sleep(1)  # Rate limiting

        except Exception as e:
            print(f"Error scraping page {page}: {e}")
            break

    return all_data
def has_content(soup):
    """Check if the page has actual content (not an error page)"""
    # Adjust based on your target site's structure
    items = soup.find_all('div', {'class': 'item'})
    return len(items) > 0
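As a usage sketch, assuming a hypothetical listing at example.com that accepts a page query parameter:

# Example invocation (the URL and page limit are placeholders)
results = scrape_url_pagination("https://example.com/products", max_pages=10)
print(f"Collected {len(results)} items across all pages")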
Handling Form-Based Pagination
Some sites use forms with hidden fields or buttons for pagination:
def scrape_form_pagination(base_url):
    browser = mechanicalsoup.StatefulBrowser()
    browser.open(base_url)

    page_number = 1
    all_data = []

    while True:
        print(f"Scraping page {page_number}...")

        # Extract data from current page
        soup = browser.get_current_page()
        data = extract_page_data(soup)
        all_data.extend(data)

        # Look for the pagination form; select_form() raises
        # LinkNotFoundError if no matching form exists
        try:
            browser.select_form('form[name="pagination"]')  # Adjust selector
        except mechanicalsoup.LinkNotFoundError:
            print("No pagination form found")
            break

        try:
            # Check if there's a next page button
            next_button = soup.find('input', {'name': 'next', 'type': 'submit'})
            if not next_button:
                print("No next button found")
                break

            # Submit the form to go to the next page
            response = browser.submit_selected()
            if response.status_code != 200:
                print(f"Form submission failed with status {response.status_code}")
                break

            page_number += 1
            time.sleep(1)

        except Exception as e:
            print(f"Error with form pagination: {e}")
            break

    return all_data
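Some pagination forms carry the target page in a hidden field rather than a dedicated next button. In that case you can fill the field directly before submitting. The fragment below is a sketch meant to replace the next-button branch inside the loop above, and the field name "page" is hypothetical; adjust it to whatever the site actually uses:

# Fill a hidden "page" field on the selected form, then submit
browser.select_form('form[name="pagination"]')  # Adjust selector
browser["page"] = str(page_number + 1)          # StatefulBrowser fills the selected form
response = browser.submit_selected()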
Advanced Pagination Handling
Detecting Pagination Patterns Automatically
def detect_and_scrape_pagination(base_url):
    browser = mechanicalsoup.StatefulBrowser()
    browser.open(base_url)
    soup = browser.get_current_page()

    # Detect pagination type
    if soup.find('a', string=lambda text: text and 'next' in text.lower()):
        print("Detected next/previous link pagination")
        return scrape_with_next_links(base_url)
    elif soup.find('form', {'name': 'pagination'}):
        print("Detected form-based pagination")
        return scrape_form_pagination(base_url)
    else:
        print("Attempting URL parameter pagination")
        return scrape_url_pagination(base_url)
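Usage is then a single call; the function falls back through the strategies in order (the URL below is a placeholder):

# Let the detector pick a strategy and run the matching scraper
data = detect_and_scrape_pagination("https://example.com/articles")
print(f"Scraped {len(data)} items")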
Handling AJAX Pagination
For sites with AJAX-based "Load More" buttons, you might need to combine MechanicalSoup with other tools or make direct API calls:
from bs4 import BeautifulSoup

def scrape_ajax_pagination(base_url, ajax_endpoint):
    browser = mechanicalsoup.StatefulBrowser()
    browser.open(base_url)

    # Get initial page data
    soup = browser.get_current_page()
    all_data = extract_page_data(soup)

    # Reuse the browser's underlying requests session (cookies, headers)
    session = browser.session
    page = 2

    while True:
        # Make AJAX request for more data
        ajax_data = {
            'page': page,
            'action': 'load_more'  # Adjust based on site requirements
        }

        try:
            response = session.post(ajax_endpoint, data=ajax_data)
            if response.status_code != 200:
                break

            json_data = response.json()

            # Check if there's more data
            if not json_data.get('has_more', False):
                break

            # Process the returned HTML fragment
            if 'html' in json_data:
                ajax_soup = BeautifulSoup(json_data['html'], 'html.parser')
                page_data = extract_page_data(ajax_soup)
                all_data.extend(page_data)

            page += 1
            time.sleep(1)

        except Exception as e:
            print(f"AJAX pagination error: {e}")
            break

    return all_data
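The ajax_endpoint argument is whatever URL the "Load More" button posts to, which you typically find in your browser's network tab. A hypothetical invocation, with both URLs as placeholders:

# Both URLs are placeholders; locate the real endpoint in the network tab
data = scrape_ajax_pagination(
    "https://example.com/blog",
    "https://example.com/wp-admin/admin-ajax.php"  # Common endpoint pattern on WordPress sites
)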
Error Handling and Best Practices
Robust Pagination with Error Recovery
def robust_pagination_scraper(base_url, max_retries=3):
    browser = mechanicalsoup.StatefulBrowser()
    browser.set_user_agent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36')

    page = 1
    all_data = []
    consecutive_failures = 0

    while consecutive_failures < max_retries:
        try:
            url = f"{base_url}?page={page}"
            print(f"Attempting to scrape page {page}...")

            response = browser.open(url)

            if response.status_code == 404:
                print(f"Page {page} not found (404)")
                break
            elif response.status_code != 200:
                raise Exception(f"HTTP {response.status_code}")

            soup = browser.get_current_page()
            data = extract_page_data(soup)

            if not data:
                consecutive_failures += 1
                print(f"No data on page {page} (attempt {consecutive_failures})")
                if consecutive_failures >= max_retries:
                    break
            else:
                all_data.extend(data)
                consecutive_failures = 0  # Reset counter on success

            page += 1
            time.sleep(1)

        except Exception as e:
            consecutive_failures += 1
            print(f"Error on page {page}: {e} (attempt {consecutive_failures})")

            if consecutive_failures >= max_retries:
                print(f"Max retries reached, stopping at page {page}")
                break

            time.sleep(2)  # Wait longer on errors

    return all_data
Implementing Rate Limiting
import random
from time import sleep

def scrape_with_rate_limiting(base_url, min_delay=1, max_delay=3):
    browser = mechanicalsoup.StatefulBrowser()
    page = 1
    all_data = []

    while True:
        try:
            url = f"{base_url}?page={page}"
            browser.open(url)

            soup = browser.get_current_page()
            data = extract_page_data(soup)

            if not data:
                break

            all_data.extend(data)
            page += 1

            # Random delay to appear more human-like
            delay = random.uniform(min_delay, max_delay)
            print(f"Waiting {delay:.2f} seconds before next page...")
            sleep(delay)

        except Exception as e:
            print(f"Error: {e}")
            break

    return all_data
Tips for Successful Pagination
- Always inspect the website structure first to understand the pagination mechanism
- Use appropriate delays between requests to avoid being blocked
- Handle errors gracefully with retry logic and proper exception handling
- Respect robots.txt and website terms of service
- Monitor your scraping to detect when you've reached the end of available content
- Use session management to maintain cookies and authentication across pages (see the login sketch after this list)
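The session-management point deserves a concrete example: because StatefulBrowser wraps a single requests session, cookies set by a login form persist across every paginated request that follows. A minimal sketch, assuming hypothetical login-form selectors and field names (username, password); adjust all of them to the target site:

def scrape_authenticated_pages(login_url, listing_url, user, pwd):
    """Log in once, then paginate with the same cookie-carrying session."""
    browser = mechanicalsoup.StatefulBrowser()

    # Log in (form selector and field names are hypothetical)
    browser.open(login_url)
    browser.select_form('form[action*="login"]')
    browser["username"] = user
    browser["password"] = pwd
    browser.submit_selected()

    # Subsequent requests reuse the same authenticated session and cookies
    all_data = []
    page = 1
    while True:
        browser.open(f"{listing_url}?page={page}")
        data = extract_page_data(browser.get_current_page())
        if not data:
            break
        all_data.extend(data)
        page += 1
        time.sleep(1)

    return all_data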
For more complex pagination scenarios involving JavaScript-heavy sites, you might want to consider using browser automation tools like Puppeteer or similar solutions that can handle dynamic content loading.
Conclusion
MechanicalSoup provides powerful tools for handling pagination in web scraping projects. Whether you're dealing with simple numbered pages, form-based navigation, or more complex pagination patterns, the key is to understand the underlying mechanism and implement robust error handling. Remember to always scrape responsibly and consider the impact on the target website's performance.
For additional guidance on handling complex web scraping scenarios, consider exploring browser automation techniques when dealing with JavaScript-heavy pagination systems.