How do I handle pagination when scraping multiple pages with Ruby?
Pagination is one of the most common challenges in web scraping, especially when dealing with large datasets spread across multiple pages. Whether you're scraping e-commerce listings, search results, or blog archives, handling pagination efficiently is crucial for successful data extraction. This comprehensive guide covers various pagination patterns and how to handle them using Ruby.
Understanding Common Pagination Patterns
Before diving into implementation, it's important to understand the different types of pagination you might encounter (a request-level sketch of each follows the list):
- Number-based pagination - Pages numbered 1, 2, 3, etc.
- Offset-based pagination - Using parameters like offset and limit
- Cursor-based pagination - Using unique identifiers to navigate
- Next/Previous link pagination - Following "Next" buttons or links
- Infinite scroll pagination - Content loaded dynamically via AJAX
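To make these patterns concrete, here is a rough sketch of what each one typically looks like at the request level. The URLs, parameter names, and cursor token below are illustrative assumptions, not any particular site's API:
# Hypothetical request shapes for each pagination pattern
PAGINATION_EXAMPLES = {
  number_based:    'https://example.com/products?page=3',
  offset_based:    'https://example.com/products?offset=40&limit=20',
  cursor_based:    'https://example.com/products?cursor=eyJpZCI6MTIzfQ',  # opaque token from the previous response
  next_link:       '<a class="next" href="/products?page=4">Next</a>',    # link found in the HTML itself
  infinite_scroll: 'https://example.com/api/products?offset=40&limit=20'  # JSON endpoint called via AJAX while scrolling
}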
Basic Pagination with HTTParty and Nokogiri
Let's start with a simple example using HTTParty for HTTP requests and Nokogiri for HTML parsing:
require 'httparty'
require 'nokogiri'
class PaginationScraper
def initialize(base_url)
@base_url = base_url
@current_page = 1
@scraped_data = []
end
def scrape_all_pages
loop do
page_url = build_page_url(@current_page)
response = HTTParty.get(page_url)
break unless response.success?
doc = Nokogiri::HTML(response.body)
# Extract data from current page
page_data = extract_page_data(doc)
break if page_data.empty?
@scraped_data.concat(page_data)
# Check if there's a next page
break unless has_next_page?(doc)
@current_page += 1
# Be respectful - add delay between requests
sleep(1)
end
@scraped_data
end
private
def build_page_url(page_number)
"#{@base_url}?page=#{page_number}"
end
def extract_page_data(doc)
# Extract your specific data here
doc.css('.item').map do |item|
{
title: item.css('.title').text.strip,
price: item.css('.price').text.strip,
url: item.at_css('a')&.[]('href')
}
end
end
def has_next_page?(doc)
# Check if next page link exists
doc.css('.pagination .next').any?
end
end
# Usage
scraper = PaginationScraper.new('https://example.com/products')
all_data = scraper.scrape_all_pages
puts "Scraped #{all_data.length} items"
Advanced Pagination Handling with Mechanize
For more complex scenarios involving forms or session management, Mechanize provides a more robust solution:
require 'mechanize'
class AdvancedPaginationScraper
def initialize
@agent = Mechanize.new
@agent.user_agent_alias = 'Windows Chrome'
@scraped_items = []
end
def scrape_with_form_pagination(start_url)
page = @agent.get(start_url)
loop do
# Extract data from current page
items = extract_items_from_page(page)
break if items.empty?
@scraped_items.concat(items)
puts "Scraped page with #{items.length} items"
# Look for next page link or button
next_link = page.link_with(text: /next/i) ||
page.link_with(text: /more/i)
break unless next_link
# Follow the next page link
page = next_link.click
# Random delay to avoid being blocked
sleep(rand(1..3))
end
@scraped_items
end
def scrape_with_post_pagination(start_url, form_data = {})
page = @agent.get(start_url)
page_num = 1
loop do
# Extract data from current page
items = extract_items_from_page(page)
break if items.empty?
@scraped_items.concat(items)
# Find pagination form
form = page.form_with(action: /search/) || page.forms.first
break unless form
# Update form data for next page
form_data.each { |key, value| form[key] = value }
form['page'] = (page_num + 1).to_s # assumes the results form exposes a 'page' field
# Submit form to get next page
begin
page = @agent.submit(form)
page_num += 1
rescue Mechanize::ResponseCodeError => e
puts "Error: #{e.message}"
break
end
sleep(rand(2..4))
end
@scraped_items
end
private
def extract_items_from_page(page)
page.search('.product-item').map do |item|
{
name: item.at('.product-name')&.text&.strip,
price: item.at('.price')&.text&.strip,
image: item.at('img')&.[]('src'),
link: item.at('a')&.[]('href')
}.compact
end
end
end
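Usage mirrors the basic scraper: point the class at a listing page and let it follow the "Next" links. The URL below is a placeholder for your own target:
# Example usage (placeholder URL)
scraper = AdvancedPaginationScraper.new
items = scraper.scrape_with_form_pagination('https://example.com/catalog')
puts "Collected #{items.length} items"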
Handling AJAX Pagination
Many modern websites use AJAX for pagination. Here's how to handle it in Ruby by calling the underlying JSON endpoints directly:
require 'httparty'
require 'json'
class AjaxPaginationScraper
include HTTParty
def initialize(base_url)
@base_url = base_url
@headers = {
'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Accept' => 'application/json, text/javascript, */*; q=0.01',
'X-Requested-With' => 'XMLHttpRequest'
}
end
def scrape_ajax_pagination(endpoint, initial_params = {})
all_data = []
page = 1
loop do
params = initial_params.merge(page: page, per_page: 20)
response = self.class.get(
"#{@base_url}/#{endpoint}",
query: params,
headers: @headers
)
break unless response.success?
data = JSON.parse(response.body)
items = data['items'] || data['results'] || []
break if items.empty?
all_data.concat(items)
# Check if we've reached the last page
if data['pagination']
total_pages = data['pagination']['total_pages']
break if page >= total_pages
else
# If no pagination info, check if items < per_page
break if items.length < 20
end
page += 1
sleep(1)
end
all_data
end
def scrape_infinite_scroll(endpoint, max_pages = nil)
all_data = []
offset = 0
limit = 50
pages_scraped = 0
loop do
break if max_pages && pages_scraped >= max_pages
params = { offset: offset, limit: limit }
response = self.class.get(
"#{@base_url}/#{endpoint}",
query: params,
headers: @headers
)
break unless response.success?
data = JSON.parse(response.body)
items = data['items'] || []
break if items.empty?
all_data.concat(items)
offset += limit
pages_scraped += 1
puts "Scraped page #{pages_scraped}, total items: #{all_data.length}"
sleep(rand(1..2))
end
all_data
end
end
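A hypothetical invocation, assuming the target site serves its listings from a JSON endpoint such as api/products (both the domain and the endpoint name are assumptions):
# Example usage (placeholder URL and endpoint)
scraper = AjaxPaginationScraper.new('https://example.com')
products = scraper.scrape_ajax_pagination('api/products', category: 'books')
puts "Collected #{products.length} products"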
Robust Error Handling and Retry Logic
When scraping paginated content, it's crucial to implement proper error handling:
require 'httparty'
require 'nokogiri'
class RobustPaginationScraper
MAX_RETRIES = 3
RETRY_DELAY = 2
def initialize(base_url)
@base_url = base_url
end
def scrape_with_retries(max_pages = nil)
all_data = []
page = 1
consecutive_failures = 0
max_consecutive_failures = 3
loop do
break if max_pages && page > max_pages
break if consecutive_failures >= max_consecutive_failures
begin
page_data = scrape_single_page(page)
if page_data.empty?
puts "No data found on page #{page}, stopping"
break
end
all_data.concat(page_data)
consecutive_failures = 0
puts "Successfully scraped page #{page}: #{page_data.length} items"
rescue StandardError => e
consecutive_failures += 1
puts "Error on page #{page}: #{e.message}"
if consecutive_failures < max_consecutive_failures
puts "Retrying page #{page} (attempt #{consecutive_failures})"
sleep(RETRY_DELAY * consecutive_failures)
next
else
puts "Max consecutive failures reached, stopping"
break
end
end
page += 1
sleep(rand(1..3))
end
all_data
end
private
def scrape_single_page(page_number)
retries = 0
begin
url = "#{@base_url}?page=#{page_number}"
response = HTTParty.get(url, timeout: 30)
raise "HTTP Error: #{response.code}" unless response.success?
doc = Nokogiri::HTML(response.body)
extract_data(doc)
rescue StandardError => e
retries += 1
if retries <= MAX_RETRIES
sleep(RETRY_DELAY * retries)
retry
else
raise e
end
end
end
def extract_data(doc)
# Your data extraction logic here
doc.css('.item').map do |item|
{
title: item.css('.title').text.strip,
description: item.css('.description').text.strip
}
end
end
end
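Usage is again a one-liner; here the optional max_pages argument caps the run, and the URL is a placeholder:
# Example usage (placeholder URL, capped at 100 pages)
scraper = RobustPaginationScraper.new('https://example.com/listings')
data = scraper.scrape_with_retries(100)
puts "Collected #{data.length} records"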
Performance Optimization with Concurrent Processing
For large-scale scraping, you can implement concurrent processing:
require 'concurrent'
require 'httparty'
require 'nokogiri'
class ConcurrentPaginationScraper
def initialize(base_url, max_workers: 5)
@base_url = base_url
@max_workers = max_workers
end
def scrape_pages_concurrently(page_range)
executor = Concurrent::ThreadPoolExecutor.new(
min_threads: 1,
max_threads: @max_workers,
max_queue: 100
)
futures = page_range.map do |page_num|
Concurrent::Future.execute(executor: executor) do
scrape_page(page_num)
end
end
# Wait for all futures to complete and collect results
results = futures.map(&:value!).compact.flatten
executor.shutdown
executor.wait_for_termination
results
end
private
def scrape_page(page_number)
begin
url = "#{@base_url}?page=#{page_number}"
response = HTTParty.get(url, timeout: 15)
return [] unless response.success?
doc = Nokogiri::HTML(response.body)
items = extract_items(doc)
puts "Page #{page_number}: #{items.length} items"
items
rescue StandardError => e
puts "Error scraping page #{page_number}: #{e.message}"
[]
end
end
def extract_items(doc)
# Your extraction logic
doc.css('.item').map { |item| { text: item.text.strip } }
end
end
# Usage
scraper = ConcurrentPaginationScraper.new('https://example.com/data')
results = scraper.scrape_pages_concurrently(1..50)
Working with API Pagination
Many modern websites provide APIs with built-in pagination. Here's how to handle the two most common schemes, cursor-based and offset-based:
require 'httparty'
require 'json'
class ApiPaginationScraper
include HTTParty
def initialize(api_key = nil)
@headers = {
'Content-Type' => 'application/json',
'User-Agent' => 'Ruby Web Scraper'
}
@headers['Authorization'] = "Bearer #{api_key}" if api_key
end
def scrape_cursor_based_api(base_url, initial_cursor = nil)
all_data = []
cursor = initial_cursor
loop do
params = cursor ? { cursor: cursor, limit: 100 } : { limit: 100 }
response = self.class.get(base_url, query: params, headers: @headers)
break unless response.success?
data = JSON.parse(response.body)
items = data['data'] || data['results'] || []
break if items.empty?
all_data.concat(items)
# Get next cursor
cursor = data.dig('pagination', 'next_cursor') || data['next_cursor']
break unless cursor
puts "Fetched #{items.length} items, total: #{all_data.length}"
sleep(0.5) # Rate limiting
end
all_data
end
def scrape_offset_based_api(base_url, limit = 100)
all_data = []
offset = 0
loop do
params = { offset: offset, limit: limit }
response = self.class.get(base_url, query: params, headers: @headers)
break unless response.success?
data = JSON.parse(response.body)
items = data['data'] || data['results'] || []
break if items.empty?
all_data.concat(items)
# Check if we've reached the end
total = data['total'] || data['count']
if total && (offset + limit) >= total
break
end
offset += limit
puts "Fetched #{items.length} items, total: #{all_data.length}"
sleep(0.5)
end
all_data
end
end
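A sketch of how this class might be called, assuming a hypothetical API that accepts a bearer token read from the API_KEY environment variable:
# Example usage (hypothetical API URL and credentials)
scraper = ApiPaginationScraper.new(ENV['API_KEY'])
records = scraper.scrape_offset_based_api('https://api.example.com/v1/records')
puts "Fetched #{records.length} records in total"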
Best Practices for Pagination Scraping
1. Respect Rate Limits
Always include delays between requests to avoid overwhelming the server:
# Random delays to appear more human-like
sleep(rand(1.0..3.0))
# Exponential backoff for errors
def exponential_backoff(attempt)
sleep(2 ** attempt)
end
2. Handle Different Response Formats
Paginated endpoints may return different content types, so detect the format before parsing:
def parse_response(response)
content_type = response.headers['content-type']
case content_type
when /json/
JSON.parse(response.body)
when /xml/
Nokogiri::XML(response.body)
else
Nokogiri::HTML(response.body)
end
end
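A quick, hypothetical example of using this helper inside a pagination loop: fetch a page, parse whatever comes back, and branch on the parsed type (the URL is a placeholder):
# Hypothetical usage: JSON becomes a Hash or Array, HTML/XML becomes a Nokogiri document
response = HTTParty.get('https://example.com/products?page=2')
parsed = parse_response(response)
items = case parsed
        when Hash  then parsed['items'] || []
        when Array then parsed
        else            parsed.css('.item') # Nokogiri document
        end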
3. Implement Checkpointing
For long-running scrapes, save progress periodically:
require 'json'
def scrape_with_checkpoint(checkpoint_file = 'scraping_progress.json')
progress = load_checkpoint(checkpoint_file)
start_page = (progress['last_page'] || 0) + 1 # resume after the last checkpointed page
# total_pages and scrape_page are placeholders for your own logic; a save_data sketch follows below
(start_page..total_pages).each do |page|
data = scrape_page(page)
save_data(data)
# Update checkpoint every 10 pages
if page % 10 == 0
save_checkpoint(checkpoint_file, page)
end
end
end
def load_checkpoint(file)
return {} unless File.exist?(file)
JSON.parse(File.read(file))
rescue JSON::ParserError
{}
end
def save_checkpoint(file, page)
checkpoint = { last_page: page, timestamp: Time.now.to_i }
File.write(file, JSON.pretty_generate(checkpoint))
end
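The save_data call referenced in the checkpointing loop is left to you; a minimal sketch that appends each page's records to a JSON Lines file could look like this (the items.jsonl filename is an arbitrary choice for the example):
# Minimal save_data sketch: appends each record as one JSON line
def save_data(records, output_file = 'items.jsonl')
  File.open(output_file, 'a') do |f|
    records.each { |record| f.puts(JSON.generate(record)) }
  end
end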
4. Monitor and Log Progress
require 'logger'
class LoggedPaginationScraper
def initialize(base_url)
@base_url = base_url
@logger = Logger.new('scraping.log')
@logger.level = Logger::INFO
end
def scrape_with_logging
@logger.info("Starting pagination scraping for #{@base_url}")
page = 1
total_items = 0
loop do
@logger.info("Processing page #{page}")
begin
items = scrape_page(page) # scrape_page is a placeholder for your own fetch-and-parse logic
break if items.empty?
total_items += items.length
@logger.info("Page #{page}: #{items.length} items (total: #{total_items})")
rescue StandardError => e
@logger.error("Error on page #{page}: #{e.message}")
break
end
page += 1
sleep(rand(1..2))
end
@logger.info("Scraping completed. Total items: #{total_items}")
total_items
end
end
Conclusion
Handling pagination in Ruby web scraping requires understanding the specific pagination mechanism used by your target website and implementing appropriate strategies for navigation, error handling, and performance optimization. The examples provided cover the most common scenarios you'll encounter, from simple numbered pagination to complex AJAX-based systems and API pagination.
Key takeaways for successful pagination handling:
- Always implement proper error handling and retry logic
- Respect rate limits with appropriate delays between requests
- Use checkpointing for long-running scrapes
- Consider concurrent processing for better performance
- Log your progress for debugging and monitoring
Remember to always respect the website's robots.txt file, implement proper rate limiting, and consider using web scraping APIs when dealing with complex, JavaScript-heavy sites that require more sophisticated handling than traditional HTTP clients can provide.
For scenarios involving heavy JavaScript rendering or complex user interactions during pagination, you might want to consider browser automation solutions that can handle dynamic content more effectively than traditional HTTP scraping methods.