How to Use Nokogiri to Scrape Data from Paginated Websites
Paginated websites split content across multiple pages to improve performance and user experience. When scraping such sites with Nokogiri, you need strategies to navigate through all pages systematically. This comprehensive guide shows you how to handle different pagination patterns using Ruby and Nokogiri.
Understanding Pagination Patterns
Before diving into code, it's essential to understand the common pagination patterns (a short markup example follows the list):
- Numbered pagination - Pages with sequential numbers (1, 2, 3...)
- Next/Previous buttons - "Next" and "Previous" navigation links
- Load more buttons - AJAX-based infinite scroll or load more functionality
- Cursor-based pagination - Using tokens or cursors for navigation
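To make these patterns concrete, here is a minimal sketch of how the first two typically appear in markup, using made-up HTML (real sites will use different class names and URLs), and how Nokogiri can select them:
require 'nokogiri'

# Hypothetical markup combining numbered page links and a rel="next" link
html = <<~HTML
  <ul class="pagination">
    <li><a href="?page=1">1</a></li>
    <li><a href="?page=2">2</a></li>
    <li><a rel="next" href="?page=2">Next</a></li>
  </ul>
HTML

doc = Nokogiri::HTML(html)
puts doc.css('.pagination a[href*="page="]').map { |a| a['href'] }.inspect
puts doc.at_css('a[rel="next"]')['href']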
Basic Setup for Nokogiri Pagination Scraping
First, let's set up the required dependencies for our pagination scraping project:
require 'nokogiri'
require 'net/http'
require 'uri'

class PaginationScraper
  def initialize(base_url)
    @base_url = base_url
    @scraped_data = []
  end

  private

  # Fetch a URL and return a parsed Nokogiri document, or nil on failure
  def fetch_page(url)
    uri = URI.parse(url)
    response = Net::HTTP.get_response(uri)

    if response.code == '200'
      Nokogiri::HTML(response.body)
    else
      puts "Error fetching #{url}: #{response.code}"
      nil
    end
  rescue => e
    puts "Exception fetching #{url}: #{e.message}"
    nil
  end
end
Method 1: Numbered Pagination
This is the most straightforward pagination pattern where pages are numbered sequentially.
class NumberedPaginationScraper < PaginationScraper
  def scrape_all_pages(max_pages = 100)
    page_number = 1

    loop do
      break if page_number > max_pages

      url = build_page_url(page_number)
      doc = fetch_page(url)
      break unless doc

      # Extract data from current page
      page_data = extract_data_from_page(doc)

      # Stop if no data found (reached end)
      break if page_data.empty?

      @scraped_data.concat(page_data)
      puts "Scraped page #{page_number} - found #{page_data.length} items"

      page_number += 1

      # Add delay to be respectful
      sleep(1)
    end

    @scraped_data
  end

  private

  def build_page_url(page_number)
    "#{@base_url}?page=#{page_number}"
  end

  def extract_data_from_page(doc)
    doc.css('.item').map do |item|
      {
        title: item.css('.title').text.strip,
        description: item.css('.description').text.strip,
        link: item.at_css('a')&.[]('href')
      }
    end
  end
end
# Usage
scraper = NumberedPaginationScraper.new('https://example.com/products')
all_data = scraper.scrape_all_pages
puts "Total items scraped: #{all_data.length}"
Method 2: Next Button Navigation
Many websites use "Next" buttons instead of numbered pages. Here's how to handle this pattern:
class NextButtonScraper < PaginationScraper
  def scrape_with_next_button
    current_url = @base_url

    loop do
      doc = fetch_page(current_url)
      break unless doc

      # Extract data from current page
      page_data = extract_data_from_page(doc)
      break if page_data.empty?

      @scraped_data.concat(page_data)
      puts "Scraped page - found #{page_data.length} items"

      # Find next page URL
      next_link = find_next_page_link(doc)
      break unless next_link

      current_url = resolve_url(next_link)
      sleep(1)
    end

    @scraped_data
  end

  private

  def find_next_page_link(doc)
    # Common selectors for next buttons
    next_selectors = [
      'a[rel="next"]',
      '.pagination .next a',
      '.pager-next a',
      'a:contains("Next")',
      'a:contains("→")'
    ]

    next_selectors.each do |selector|
      next_link = doc.css(selector).first
      return next_link['href'] if next_link
    end

    nil
  end

  def resolve_url(relative_url)
    return relative_url if relative_url.start_with?('http')

    # Resolve relative links against the base URL
    URI.join(@base_url, relative_url).to_s
  end

  def extract_data_from_page(doc)
    # Implementation similar to numbered pagination
    doc.css('.product-item').map do |item|
      {
        name: item.css('.product-name').text.strip,
        price: item.css('.price').text.strip,
        image: item.at_css('img')&.[]('src')
      }
    end
  end
end
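Usage mirrors the numbered example. The URL below is a placeholder, and the selectors in extract_data_from_page (.product-item, .product-name, .price) are assumptions you would adapt to the target site:
# Usage (catalog URL and selectors are placeholders)
scraper = NextButtonScraper.new('https://example.com/catalog')
products = scraper.scrape_with_next_button
puts "Total products scraped: #{products.length}"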
Method 3: Advanced Pagination with Session Management
For websites requiring authentication or session management:
require 'net/http'
require 'http/cookie' # provided by the http-cookie gem

class SessionBasedScraper
  def initialize(base_url)
    @base_url = base_url
    @cookie_jar = HTTP::CookieJar.new
    @scraped_data = []
  end

  def login(username, password)
    login_url = "#{@base_url}/login"

    # Get login form and extract the CSRF token
    doc = fetch_page_with_session(login_url)
    return false unless doc

    csrf_token = doc.at_css('input[name="csrf_token"]')&.[]('value')

    # Submit login form
    login_data = {
      'username' => username,
      'password' => password,
      'csrf_token' => csrf_token
    }
    post_form(login_url, login_data)
  end

  def scrape_paginated_content
    current_page = 1

    loop do
      url = "#{@base_url}/protected-content?page=#{current_page}"
      doc = fetch_page_with_session(url)
      break unless doc

      page_data = extract_protected_data(doc)
      break if page_data.empty?

      @scraped_data.concat(page_data)
      puts "Scraped protected page #{current_page}"

      current_page += 1
      sleep(2) # Longer delay for authenticated requests
    end

    @scraped_data
  end

  private

  def fetch_page_with_session(url)
    uri = URI.parse(url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true if uri.scheme == 'https'

    request = Net::HTTP::Get.new(uri.request_uri)
    request['Cookie'] = HTTP::Cookie.cookie_value(@cookie_jar.cookies(uri))
    request['User-Agent'] = 'Mozilla/5.0 (compatible; Ruby scraper)'

    response = http.request(request)

    # Store cookies
    response.get_fields('Set-Cookie')&.each do |cookie|
      @cookie_jar.parse(cookie, uri)
    end

    return Nokogiri::HTML(response.body) if response.code == '200'

    nil
  end

  def post_form(url, data)
    uri = URI.parse(url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true if uri.scheme == 'https'

    request = Net::HTTP::Post.new(uri.request_uri)
    request.set_form_data(data)
    request['Cookie'] = HTTP::Cookie.cookie_value(@cookie_jar.cookies(uri))

    response = http.request(request)

    # Store cookies from login response
    response.get_fields('Set-Cookie')&.each do |cookie|
      @cookie_jar.parse(cookie, uri)
    end

    response
  end

  def extract_protected_data(doc)
    # Extract data specific to protected pages
    doc.css('.private-content .item').map do |item|
      {
        title: item.css('.title').text.strip,
        content: item.css('.content').text.strip,
        date: item.css('.date').text.strip
      }
    end
  end
end
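Here is a sketch of how the class might be driven, assuming the site exposes a /login form with a csrf_token field as in the example above; the URL and environment variables are placeholders:
# Usage sketch - URL, credentials, and form fields are assumptions
scraper = SessionBasedScraper.new('https://example.com')
scraper.login(ENV['SCRAPER_USER'], ENV['SCRAPER_PASS'])
records = scraper.scrape_paginated_content
puts "Scraped #{records.length} protected items"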
Handling Dynamic Pagination Detection
Sometimes pagination patterns aren't consistent. Here's a robust approach that detects pagination automatically:
class SmartPaginationScraper < PaginationScraper
  # Note: assumes an extract_data_from_page method with site-specific
  # selectors, as in the earlier examples.
  def auto_detect_and_scrape
    first_page = fetch_page(@base_url)
    return [] unless first_page

    pagination_type = detect_pagination_type(first_page)

    case pagination_type
    when :numbered
      scrape_numbered_pages
    when :next_button
      scrape_with_next_buttons
    when :infinite_scroll
      puts "Infinite scroll detected - consider using browser automation"
      []
    else
      puts "No pagination detected, scraping single page"
      extract_data_from_page(first_page)
    end
  end

  private

  def detect_pagination_type(doc)
    # Check for numbered pagination
    if doc.css('.pagination a[href*="page="]').any? ||
       doc.css('a').any? { |link| link['href']&.match(/page=\d+/) }
      return :numbered
    end

    # Check for next button
    next_selectors = ['a[rel="next"]', 'a:contains("Next")', '.next a']
    if next_selectors.any? { |selector| doc.css(selector).any? }
      return :next_button
    end

    # Check for infinite scroll indicators
    if doc.css('[data-infinite-scroll]').any? ||
       doc.css('.load-more').any?
      return :infinite_scroll
    end

    :single_page
  end

  def scrape_numbered_pages
    # Extract the highest page number from pagination links
    first_page = fetch_page(@base_url)
    page_links = first_page.css('.pagination a[href*="page="]')

    max_page = page_links.map do |link|
      link['href'].match(/page=(\d+)/)&.[](1)&.to_i
    end.compact.max || 1

    (1..max_page).each do |page_num|
      separator = @base_url.include?('?') ? '&' : '?'
      url = "#{@base_url}#{separator}page=#{page_num}"

      doc = fetch_page(url)
      next unless doc

      page_data = extract_data_from_page(doc)
      @scraped_data.concat(page_data)

      puts "Scraped page #{page_num}/#{max_page}"
      sleep(1)
    end

    @scraped_data
  end

  def scrape_with_next_buttons
    # Implementation similar to NextButtonScraper
    current_url = @base_url
    page_count = 0

    loop do
      doc = fetch_page(current_url)
      break unless doc

      page_data = extract_data_from_page(doc)
      break if page_data.empty?

      @scraped_data.concat(page_data)
      page_count += 1
      puts "Scraped page #{page_count}"

      next_link = doc.css('a[rel="next"]').first
      break unless next_link

      current_url = resolve_absolute_url(next_link['href'])
      sleep(1)
    end

    @scraped_data
  end

  def resolve_absolute_url(url)
    return url if url.start_with?('http')

    # Resolve relative links against the base URL
    URI.join(@base_url, url).to_s
  end
end
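Because the smart scraper delegates item extraction to extract_data_from_page, one way to use it is through a small subclass that supplies site-specific selectors. A minimal sketch, with assumed class names (.article, .headline) and URL:
# Sketch: supply the extraction logic the smart scraper expects
class ArticleScraper < SmartPaginationScraper
  private

  def extract_data_from_page(doc)
    doc.css('.article').map do |article|
      {
        headline: article.css('.headline').text.strip,
        url: article.at_css('a')&.[]('href')
      }
    end
  end
end

data = ArticleScraper.new('https://example.com/news').auto_detect_and_scrape
puts "Collected #{data.length} articles"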
Best Practices and Error Handling
When scraping paginated websites, follow these best practices:
require 'json'

class RobustPaginationScraper < PaginationScraper
  MAX_RETRIES = 3
  RETRY_DELAY = 5

  def scrape_with_retry(url)
    retries = 0

    begin
      doc = fetch_page(url)
      return doc if doc

      raise "Failed to fetch page"
    rescue => e
      retries += 1
      if retries <= MAX_RETRIES
        puts "Retry #{retries}/#{MAX_RETRIES} for #{url}: #{e.message}"
        sleep(RETRY_DELAY * (2**(retries - 1))) # Exponential backoff
        retry
      else
        puts "Max retries exceeded for #{url}"
        nil
      end
    end
  end

  def respectful_delay
    # Random delay between 1 and 3 seconds
    sleep(rand(1.0..3.0))
  end

  def save_progress(data, filename = 'scraping_progress.json')
    File.write(filename, JSON.pretty_generate(data))
  end

  def load_progress(filename = 'scraping_progress.json')
    return [] unless File.exist?(filename)

    JSON.parse(File.read(filename))
  rescue JSON::ParserError
    []
  end
end
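These helpers can be combined into a resumable run that saves progress after every page, so an interrupted job can pick up where it left off. A minimal sketch, assuming the listing URL, the .item/.title selectors, and 20 items per page (all placeholders):
# Resumable run using the helpers above (URL, selectors, page size assumed)
scraper = RobustPaginationScraper.new('https://example.com/listings')
items = scraper.load_progress
page = (items.length / 20) + 1 # assumes 20 items per page

loop do
  doc = scraper.scrape_with_retry("https://example.com/listings?page=#{page}")
  break unless doc

  batch = doc.css('.item').map { |i| { 'title' => i.css('.title').text.strip } }
  break if batch.empty?

  items.concat(batch)
  scraper.save_progress(items)
  scraper.respectful_delay
  page += 1
end

puts "Total items saved: #{items.length}"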
Limitations and Alternatives
While Nokogiri is excellent for server-side rendered content, it has limitations with JavaScript-heavy pagination:
When to Use Nokogiri
- Server-side rendered pagination
- Simple AJAX pagination with accessible URLs
- Static HTML pagination patterns
When to Consider Alternatives
- JavaScript-rendered pagination
- Infinite scroll implementations
- Complex SPA pagination
For JavaScript-heavy sites, consider using browser automation tools like Puppeteer, which can handle dynamic content that loads after page navigation and manage complex browser sessions.
Conclusion
Nokogiri provides powerful tools for scraping paginated websites when combined with proper navigation logic and error handling. The key is identifying the pagination pattern, implementing robust navigation, and being respectful with request timing. Start with simple numbered pagination, then adapt your approach based on the specific patterns you encounter.
Remember to always check a website's robots.txt file and terms of service before scraping, and implement appropriate delays to avoid overwhelming the server.