How to Handle Pagination When Scraping Multiple Pages with Mechanize
Pagination is one of the most common challenges in web scraping, as many websites split their content across multiple pages to improve loading times and user experience. Ruby's Mechanize library provides excellent tools for handling various pagination patterns automatically and efficiently.
Understanding Pagination Types
Before diving into implementation, it's important to understand the different types of pagination you'll encounter:
1. Link-Based Pagination
This is the most common pattern where "Next" or "Page 2" links are provided:
<a href="/page/2">Next</a>
<a href="/products?page=3">Page 3</a>
2. URL Pattern Pagination
Pages follow a predictable URL structure:
https://example.com/products?page=1
https://example.com/products?page=2
https://example.com/products?page=3
3. Form-Based Pagination
Pagination is controlled through form submissions or POST requests.
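For example, a results page might advance only through a submit button (this markup is illustrative):
<form method="post" action="/results">
  <input type="hidden" name="page" value="2">
  <button type="submit" name="direction" value="next">Next</button>
</form>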
Basic Pagination Handling with Mechanize
Here's a basic approach to handling link-based pagination:
require 'mechanize'

agent = Mechanize.new
page = agent.get('https://example.com/products')

loop do
  # Extract data from current page
  page.search('.product').each do |product|
    title = product.at('.title').text.strip
    price = product.at('.price').text.strip
    puts "#{title}: #{price}"
  end

  # Look for next page link
  next_link = page.link_with(text: /next/i) || page.link_with(text: /→/)
  break unless next_link

  puts "Moving to next page..."
  page = next_link.click

  # Add delay to be respectful
  sleep(1)
end
Advanced Pagination Patterns
Handling Multiple Next Link Variations
Different websites use various text patterns for pagination links:
def find_next_link(page)
  # Try common text patterns for "next" links
  patterns = [
    /next/i,
    /more/i,
    /continue/i,
    /→/,
    />/,
    /page\s*\d+/i
  ]

  patterns.each do |pattern|
    link = page.link_with(text: pattern)
    return link if link
  end

  # Fall back to a link whose href points to a higher page number
  page.links.find do |link|
    link.href =~ /page=(\d+)/ && $1.to_i > current_page_number(page)
  end
end

def current_page_number(page)
  # Extract the current page number from the URL query string
  if page.uri.query =~ /page=(\d+)/
    $1.to_i
  else
    1
  end
end
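Both helpers return a Mechanize link (or nil), so they drop straight into the basic loop shown earlier:
loop do
  # ... extract data from the current page ...
  next_link = find_next_link(page)
  break unless next_link
  page = next_link.click
end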
URL Pattern-Based Pagination
When pagination follows a predictable URL pattern:
require 'mechanize'

agent = Mechanize.new
base_url = 'https://example.com/products'
page_num = 1
max_pages = 50 # Set a reasonable limit

loop do
  url = "#{base_url}?page=#{page_num}"

  begin
    page = agent.get(url)

    # Stop when a page comes back empty (often the end of the listing)
    products = page.search('.product')
    break if products.empty?

    puts "Scraping page #{page_num}"

    products.each do |product|
      # Extract product data
      title = product.at('.title')&.text&.strip
      price = product.at('.price')&.text&.strip
      next unless title && price

      puts "#{title}: #{price}"
    end

    page_num += 1
    break if page_num > max_pages

    sleep(1) # Rate limiting
  rescue Mechanize::ResponseCodeError => e
    puts "Page #{page_num} not found: #{e.message}"
    break
  end
end
Form-Based Pagination
Some sites use forms for pagination control:
require 'mechanize'

agent = Mechanize.new
page = agent.get('https://example.com/search')

# Fill initial search form if needed
search_form = page.form_with(name: 'search')
if search_form
  search_form['query'] = 'your search term'
  page = agent.submit(search_form)
end

loop do
  # Extract data from the current page (use a helper such as the
  # extract_data_from_page method defined in the class below)
  extract_data_from_page(page)

  # Look for a form whose submit button reads "Next"
  pagination_form = page.forms.find do |form|
    form.buttons.any? { |btn| btn.value =~ /next/i }
  end
  break unless pagination_form

  # Find and click the next button
  next_button = pagination_form.buttons.find { |btn| btn.value =~ /next/i }
  break unless next_button

  puts "Submitting pagination form..."
  page = agent.submit(pagination_form, next_button)
  sleep(1)
end
Robust Error Handling and Rate Limiting
Professional pagination handling requires proper error management:
require 'mechanize'

class PaginationScraper
  def initialize(start_url)
    @agent = Mechanize.new
    @agent.user_agent_alias = 'Windows Mozilla'
    @start_url = start_url
    @max_retries = 3
    @delay = 1
  end

  def scrape_all_pages
    page = @agent.get(@start_url)
    page_count = 0

    loop do
      page_count += 1
      puts "Processing page #{page_count}"

      begin
        extract_data_from_page(page)

        # Find next page with retry logic
        next_page = find_next_page_with_retry(page)
        break unless next_page

        page = next_page
        rate_limit_delay
      rescue StandardError => e
        puts "Error on page #{page_count}: #{e.message}"
        break
      end

      break if page_count >= 100 # Safety limit against runaway pagination
    end

    puts "Scraped #{page_count} pages total"
  end

  private

  def find_next_page_with_retry(page)
    @max_retries.times do |attempt|
      begin
        return find_next_page(page)
      rescue Mechanize::ResponseCodeError => e
        puts "Attempt #{attempt + 1} failed: #{e.message}"
        sleep(@delay * (attempt + 1)) # Back off a little longer each retry
      end
    end
    nil
  end

  def find_next_page(page)
    # Multiple strategies for finding the next page link
    next_link = page.link_with(text: /next/i) ||
                page.link_with(href: /page=#{current_page_number(page) + 1}/) ||
                page.links.find { |link| link.rel?('next') }
    next_link ? next_link.click : nil
  end

  def extract_data_from_page(page)
    page.search('.item').each do |item|
      data = {
        title: item.at('.title')&.text&.strip,
        price: item.at('.price')&.text&.strip,
        url: item.at('a')&.[]('href')
      }
      # Process data (save to database, CSV, etc.)
      process_item(data) if data[:title]
    end
  end

  def rate_limit_delay
    sleep(@delay + rand(0.5)) # Add randomization
  end

  def current_page_number(page)
    if page.uri.query =~ /page=(\d+)/
      $1.to_i
    else
      1
    end
  end

  def process_item(data)
    puts "#{data[:title]} - #{data[:price]}"
    # Add your data processing logic here
  end
end

# Usage
scraper = PaginationScraper.new('https://example.com/products')
scraper.scrape_all_pages
Handling JavaScript-Heavy Pagination
When a site loads its pagination via JavaScript, Mechanize alone may not be enough, since it never executes scripts. For complex JavaScript pagination scenarios, consider combining it with a headless browser solution such as Puppeteer for handling dynamic content.
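As a minimal sketch of that hand-off in Ruby, assuming the Ferrum gem (headless Chrome) and an illustrative '.product' selector: render the page first, then pass the resulting HTML to Nokogiri for the usual extraction.
require 'ferrum'
require 'nokogiri'

browser = Ferrum::Browser.new
browser.go_to('https://example.com/products')
browser.network.wait_for_idle # let script-loaded results finish rendering

# Hand the rendered HTML to Nokogiri for the usual extraction
doc = Nokogiri::HTML(browser.body)
doc.css('.product').each do |product|
  puts product.at_css('.title')&.text&.strip
end

browser.quit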
Performance Optimization Techniques
Concurrent Page Processing
For better performance, you can fetch multiple pages concurrently. Mechanize agents are not thread-safe, so each worker thread below creates its own:
require 'mechanize'

class ConcurrentPaginationScraper
  def initialize(base_url, max_workers: 5)
    @base_url = base_url
    @max_workers = max_workers
    @queue = Queue.new
    @results = Queue.new
  end

  def scrape_with_workers
    # Discover all page URLs first
    discover_all_pages

    # Create worker threads
    workers = []
    @max_workers.times do
      workers << Thread.new { worker_thread }
    end

    # Wait for completion
    workers.each(&:join)

    # Process results
    process_all_results
  end

  private

  def discover_all_pages
    agent = Mechanize.new
    page = agent.get(@base_url)
    page_urls = [@base_url]

    # Walk the "next" links once to collect all pagination URLs
    while (next_link = page.link_with(text: /next/i))
      page = next_link.click
      page_urls << page.uri.to_s
    end

    page_urls.each { |url| @queue << url }
    @max_workers.times { @queue << :stop } # one stop marker per worker
  end

  def worker_thread
    agent = Mechanize.new
    while (url = @queue.pop) != :stop
      begin
        page = agent.get(url)
        data = extract_data_from_page(page)
        @results << { url: url, data: data }
        sleep(0.5) # Rate limiting per worker
      rescue => e
        @results << { url: url, error: e.message }
      end
    end
  end

  def extract_data_from_page(page)
    # Placeholder extraction; adapt the selectors to your target site
    page.search('.item').map { |item| item.at('.title')&.text&.strip }.compact
  end

  def process_all_results
    # Drain the results queue; swap in your own persistence logic
    until @results.empty?
      result = @results.pop
      if result[:error]
        puts "Failed: #{result[:url]} (#{result[:error]})"
      else
        puts "Scraped #{result[:data].size} items from #{result[:url]}"
      end
    end
  end
end
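Usage follows the same pattern as the single-threaded scraper:
scraper = ConcurrentPaginationScraper.new('https://example.com/products', max_workers: 3)
scraper.scrape_with_workers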
Best Practices for Pagination Scraping
1. Implement Proper Rate Limiting
Always add delays between requests to avoid overwhelming the server:
# Variable delay with randomization
def smart_delay
  base_delay = 1
  random_factor = rand(0.5..1.5)
  sleep(base_delay * random_factor)
end
2. Set Reasonable Limits
Prevent infinite loops with safety mechanisms:
MAX_PAGES = 1000
page_count = 0

loop do
  break if page_count >= MAX_PAGES
  # Pagination logic here
  page_count += 1
end
3. Handle Edge Cases
Account for various pagination implementations:
def safe_pagination(page)
  # Check for disabled next buttons
  next_button = page.at('a.next:not(.disabled)')
  return nil if next_button.nil? || next_button['href'].nil?

  # Resolve the relative href so the comparison with the current
  # (absolute) URL is meaningful
  current_url = page.uri.to_s
  next_url = page.uri.merge(next_button['href']).to_s
  return nil if current_url == next_url

  next_button
end
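Note that page.at returns a raw Nokogiri node rather than a clickable Mechanize link, so resolve its href against the current page when following it:
if (next_node = safe_pagination(page))
  page = agent.get(page.uri.merge(next_node['href']))
end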
Troubleshooting Common Issues
Session Management
Some sites require maintaining sessions across pagination:
agent = Mechanize.new
agent.cookie_jar.clear! # Start fresh

# Log in first if required (the form and field names here are
# hypothetical; inspect the real login page for the actual ones)
login_page = agent.get('https://example.com/login')
login_form = login_page.form_with(name: 'login')
if login_form
  login_form['username'] = 'your_username'
  login_form['password'] = 'your_password'
  agent.submit(login_form)
end

# Pagination now reuses the authenticated session's cookies
page = agent.get('https://example.com/protected/data')
# Continue with pagination...
Handling Dynamic URLs
For sites with complex URL structures:
require 'uri'

def normalize_pagination_url(base_url, page_num)
  uri = URI.parse(base_url)
  params = URI.decode_www_form(uri.query || '')

  # Replace any existing page parameter with the requested one
  params.delete_if { |key, _| key == 'page' }
  params << ['page', page_num.to_s]

  uri.query = URI.encode_www_form(params)
  uri.to_s
end
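Called with an existing query string, the helper keeps unrelated parameters intact:
normalize_pagination_url('https://example.com/products?sort=price', 3)
# => "https://example.com/products?sort=price&page=3"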
Conclusion
Effective pagination handling with Mechanize requires understanding the specific patterns used by your target website and implementing robust error handling and rate limiting. Whether dealing with simple link-based navigation or complex form submissions, the key is to build flexible, maintainable scrapers that can adapt to different pagination implementations.
For sites with heavy JavaScript requirements, consider integrating Mechanize with browser automation tools for dynamic content handling to ensure comprehensive data extraction across all pagination scenarios.