How do I handle pagination when scraping multiple pages with HTTParty?
Pagination is one of the most common challenges when scraping websites that spread data across multiple pages. HTTParty, a popular Ruby HTTP client library, handles the request side cleanly, and paired with a parser like Nokogiri it covers most pagination patterns. This guide walks through the common pagination strategies and implementation techniques for efficient multi-page scraping.
Understanding Pagination Patterns
Before diving into implementation, it's important to understand the common pagination patterns you'll encounter; each is sketched briefly after the list:
- URL-based pagination - Page numbers or offsets in the URL
- Next/Previous links - HTML links to navigate between pages
- API-based pagination - JSON responses with pagination metadata
- Infinite scroll - Dynamic content loading (requires JavaScript execution)
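As a rough illustration (the URLs and payloads below are invented, not real endpoints), these patterns typically look like this:
# URL-based:       https://example.com/products?page=3
#                  https://example.com/products?limit=20&offset=40
# Next/Prev links: <a rel="next" href="/products?page=4">Next</a>
# API metadata:    { "data": [...], "pagination": { "current_page": 3, "total_pages": 12 } }
# Cursor-based:    { "data": [...], "pagination": { "next_cursor": "abc123" } }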
Basic Pagination Setup
First, let's establish a basic HTTParty class structure for pagination, including a shared extract_data helper that the later scrapers reuse:
require 'httparty'
require 'nokogiri'
require 'json'

class PaginatedScraper
  include HTTParty
  base_uri 'https://example.com'
  headers 'User-Agent' => 'Mozilla/5.0 (compatible; Ruby HTTParty scraper)'

  def initialize
    @results = []
    @delay = 1 # Respectful delay between requests
  end

  private

  # Shared by the scrapers below; adapt the selectors to your target markup
  def extract_data(response)
    doc = Nokogiri::HTML(response.body)
    doc.css('.item').map do |item|
      { title: item.css('.title').text.strip, url: item.at_css('a')&.[]('href') }
    end
  end

  def sleep_between_requests
    sleep(@delay)
  end
end
URL-Based Pagination
Sequential Page Numbers
The most straightforward pagination pattern uses page numbers in the URL:
class SequentialPagination < PaginatedScraper
  def scrape_all_pages(base_url, max_pages = 50)
    page = 1

    loop do
      url = "#{base_url}?page=#{page}"
      response = self.class.get(url)
      break unless response.success?

      data = extract_data(response)
      break if data.empty? # No more data

      @results.concat(data)
      puts "Scraped page #{page}: #{data.length} items"

      page += 1
      break if page > max_pages

      sleep_between_requests
    end

    @results
  end
end
# Usage
scraper = SequentialPagination.new
results = scraper.scrape_all_pages('https://example.com/products')
Offset-Based Pagination
Some websites use offset and limit parameters:
class OffsetPagination < PaginatedScraper
  def scrape_with_offset(base_url, limit = 20, max_items = 1000)
    offset = 0

    loop do
      url = "#{base_url}?limit=#{limit}&offset=#{offset}"
      response = self.class.get(url)
      break unless response.success?

      data = extract_data(response)
      break if data.empty?

      @results.concat(data)
      puts "Scraped #{@results.length} total items"

      offset += limit
      break if @results.length >= max_items

      sleep_between_requests
    end

    @results.first(max_items)
  end
end
Following Next Links
Many websites provide "Next" links in their HTML. This approach is more reliable than URL manipulation:
class NextLinkPagination < PaginatedScraper
  def scrape_following_links(start_url)
    current_url = start_url

    loop do
      response = self.class.get(current_url)
      break unless response.success?

      doc = Nokogiri::HTML(response.body)

      # Extract data from current page
      data = extract_data(response)
      break if data.empty?

      @results.concat(data)
      puts "Scraped page: #{@results.length} total items"

      # Find next page link
      next_link = doc.css('a[rel="next"]').first ||
                  doc.css('.pagination .next').first ||
                  doc.css('a:contains("Next")').first
      break unless next_link && next_link['href']

      current_url = resolve_url(next_link['href'], current_url)
      sleep_between_requests
    end

    @results
  end

  private

  def resolve_url(href, base_url)
    if href.start_with?('http')
      href
    else
      URI.join(base_url, href).to_s
    end
  end
end
API Pagination with JSON Responses
When scraping APIs that return JSON with pagination metadata:
class APIPagination < PaginatedScraper
  def scrape_api_pages(api_endpoint, params = {})
    page = 1

    loop do
      current_params = params.merge(page: page)
      response = self.class.get(api_endpoint, query: current_params)
      break unless response.success?

      json_data = JSON.parse(response.body)

      # Extract items from API response
      items = json_data['data'] || json_data['items'] || []
      break if items.empty?

      @results.concat(items)
      puts "API page #{page}: #{items.length} items"

      # Check pagination metadata and stop when no more pages are reported
      pagination = json_data['pagination'] || json_data['meta'] || {}
      more_by_count = pagination['current_page'] && pagination['total_pages'] &&
                      pagination['current_page'] < pagination['total_pages']
      break unless pagination['has_more'] || more_by_count

      page += 1
      sleep_between_requests
    end

    @results
  end
end
# Usage with query parameters
scraper = APIPagination.new
results = scraper.scrape_api_pages(
  'https://api.example.com/products',
  { category: 'electronics', per_page: 50 }
)
Advanced Pagination Techniques
Cursor-Based Pagination
Some modern APIs use cursor-based pagination for better performance:
class CursorPagination < PaginatedScraper
  def scrape_with_cursor(api_endpoint, params = {})
    cursor = nil

    loop do
      current_params = params.dup
      current_params[:cursor] = cursor if cursor

      response = self.class.get(api_endpoint, query: current_params)
      break unless response.success?

      json_data = JSON.parse(response.body)
      items = json_data['data'] || []
      break if items.empty?

      @results.concat(items)

      # Get next cursor
      cursor = json_data.dig('pagination', 'next_cursor')
      break unless cursor

      sleep_between_requests
    end

    @results
  end
end
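Usage follows the same pattern as the other scrapers; the endpoint and parameters here are placeholders:
# Usage (hypothetical endpoint and parameters)
scraper = CursorPagination.new
results = scraper.scrape_with_cursor(
  'https://api.example.com/feed',
  { per_page: 100 }
)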
Handling Dynamic Parameters
Sometimes pagination URLs contain dynamic tokens or require form data:
class DynamicPagination < PaginatedScraper
  def scrape_with_dynamic_params(start_url)
    # Get initial page to extract pagination parameters
    response = self.class.get(start_url)
    doc = Nokogiri::HTML(response.body)

    # Extract dynamic tokens (CSRF tokens, session IDs, etc.)
    csrf_token = doc.at_css('input[name="csrf_token"]')&.[]('value')
    session_id = doc.at_css('input[name="session_id"]')&.[]('value')

    page = 1

    loop do
      form_data = {
        page: page,
        csrf_token: csrf_token,
        session_id: session_id
      }

      response = self.class.post('/search', body: form_data)
      break unless response.success?

      data = extract_data(response)
      break if data.empty?

      @results.concat(data)
      page += 1
      sleep_between_requests
    end

    @results
  end
end
Error Handling and Resilience
Robust pagination scraping requires proper error handling:
class ResilientPagination < PaginatedScraper
  MAX_RETRIES = 3
  RETRY_DELAY = 5

  def scrape_with_retry(urls)
    urls.each_with_index do |url, index|
      retries = 0

      begin
        response = self.class.get(url)

        if response.success?
          data = extract_data(response)
          @results.concat(data)
          puts "Processed #{index + 1}/#{urls.length}: #{data.length} items"
        else
          raise "HTTP #{response.code}: #{response.message}"
        end
      rescue => e
        retries += 1

        if retries <= MAX_RETRIES
          puts "Error on #{url}: #{e.message}. Retry #{retries}/#{MAX_RETRIES}"
          sleep(RETRY_DELAY * retries) # Back off a little longer on each retry
          retry
        else
          puts "Failed permanently: #{url} - #{e.message}"
        end
      end

      sleep_between_requests
    end

    @results
  end
end
Performance Optimization
Concurrent Requests
For faster scraping, you can process multiple pages concurrently:
require 'concurrent'

class ConcurrentPagination < PaginatedScraper
  def scrape_concurrently(urls, max_threads = 5)
    results = Concurrent::Array.new

    # Create thread pool
    pool = Concurrent::ThreadPoolExecutor.new(
      min_threads: 2,
      max_threads: max_threads,
      max_queue: urls.length
    )

    futures = urls.map do |url|
      Concurrent::Future.execute(executor: pool) do
        response = self.class.get(url)

        if response.success?
          extract_data(response)
        else
          []
        end
      end
    end

    # Wait for all requests to complete
    futures.each { |future| results.concat(future.value || []) }

    pool.shutdown
    pool.wait_for_termination

    results.to_a
  end
end
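Unlike the sequential scrapers, this variant needs the full list of page URLs up front and bypasses the per-request delay, so keep the thread count modest. A hypothetical usage sketch:
# Usage (hypothetical, pre-built URL list)
urls = (1..10).map { |page| "https://example.com/products?page=#{page}" }
scraper = ConcurrentPagination.new
results = scraper.scrape_concurrently(urls, 3)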
Monitoring and Rate Limiting
When scraping multiple pages, it's crucial to implement proper rate limiting and monitoring. For more complex scenarios involving dynamic content or JavaScript-heavy pagination, consider using browser automation tools like Puppeteer, which can handle pagination that only appears after JavaScript runs.
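On the monitoring side, even a tiny stats helper goes a long way. The RequestStats class below is an illustrative sketch, not part of HTTParty; it simply counts requests and errors and reports throughput:
# Hypothetical helper for basic scrape monitoring (not provided by HTTParty)
class RequestStats
  def initialize
    @requests = 0
    @errors = 0
    @started_at = Time.now
  end

  # Call once per HTTP request with the response's success flag
  def record(success)
    @requests += 1
    @errors += 1 unless success
  end

  def report
    elapsed = Time.now - @started_at
    format('%d requests, %d errors, %.2f req/s', @requests, @errors, @requests / elapsed)
  end
end

# Inside any of the pagination loops above:
#   stats.record(response.success?)
#   puts stats.report if page % 10 == 0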
Rate Limiting Implementation
class RateLimitedPagination < PaginatedScraper
  def initialize(requests_per_second = 1)
    super()
    # One permit is released per interval, so requests can proceed at most that often
    @semaphore = Concurrent::Semaphore.new(1)
    @rate_limiter = Concurrent::TimerTask.new(execution_interval: 1.0 / requests_per_second) do
      @semaphore.release if @semaphore.available_permits == 0
    end
    @rate_limiter.execute
  end

  def scrape_with_rate_limit(urls)
    urls.each do |url|
      @semaphore.acquire # Blocks until the timer releases a permit

      response = self.class.get(url)

      if response.success?
        data = extract_data(response)
        @results.concat(data)
      end
    end

    @rate_limiter.shutdown
    @results
  end
end
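Usage mirrors the other scrapers, except the request rate is set in the constructor; the URL list below is again hypothetical:
# Usage: allow at most 2 requests per second (hypothetical URL list)
urls = (1..20).map { |page| "https://example.com/products?page=#{page}" }
scraper = RateLimitedPagination.new(2)
results = scraper.scrape_with_rate_limit(urls)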
Best Practices for Pagination Scraping
- Always check robots.txt before scraping
- Implement respectful delays between requests
- Use appropriate User-Agent headers
- Handle errors gracefully with retry logic
- Monitor memory usage for large datasets
- Save progress periodically for long-running scrapes (see the checkpoint sketch after this list)
- Validate data integrity across pages
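To cover the last few points, the sketch below periodically writes accumulated results to a JSON file so a long-running scrape can be resumed; the module name, file name, and checkpoint interval are arbitrary choices for illustration:
require 'json'

# Hypothetical checkpointing mixin: call maybe_save_progress after each page
module ProgressCheckpoint
  CHECKPOINT_EVERY = 10 # pages
  CHECKPOINT_FILE = 'scrape_progress.json'.freeze

  def maybe_save_progress(page)
    return unless (page % CHECKPOINT_EVERY).zero?

    File.write(CHECKPOINT_FILE, JSON.pretty_generate(@results))
    puts "Checkpoint saved at page #{page} (#{@results.length} items)"
  end

  def load_progress
    return [] unless File.exist?(CHECKPOINT_FILE)

    JSON.parse(File.read(CHECKPOINT_FILE))
  end
end

# Example: include ProgressCheckpoint in a scraper class and call
# maybe_save_progress(page) at the end of each pagination loop iteration.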
For scenarios where pagination involves complex user interactions or AJAX requests, you might need browser automation tools that can handle dynamic content loading.
Testing Your Pagination Logic
Stub the class-level get method (the scrapers call self.class.get) and assert on the aggregated results:
require 'rspec'

RSpec.describe SequentialPagination do
  let(:scraper) { described_class.new }

  it 'handles empty pages gracefully' do
    allow(described_class).to receive(:get).and_return(
      double(success?: true, body: '<html></html>')
    )

    results = scraper.scrape_all_pages('http://test.com')
    expect(results).to be_empty
  end

  it 'stops on HTTP errors' do
    allow(described_class).to receive(:get).and_return(
      double(success?: false, code: 404)
    )

    results = scraper.scrape_all_pages('http://test.com')
    expect(results).to be_empty
  end
end
HTTParty provides excellent flexibility for handling various pagination patterns. The key is identifying the specific pagination mechanism used by your target website and implementing appropriate logic with proper error handling and rate limiting. Remember to always respect the website's terms of service and implement responsible scraping practices.