How do I handle pagination when scraping multiple pages with Ruby?

Pagination is one of the most common challenges in web scraping, especially when dealing with large datasets spread across multiple pages. Whether you're scraping e-commerce listings, search results, or blog archives, handling pagination efficiently is crucial for successful data extraction. This comprehensive guide covers various pagination patterns and how to handle them using Ruby.

Understanding Common Pagination Patterns

Before diving into implementation, it's important to understand the different types of pagination you might encounter:

  1. Number-based pagination - Pages numbered 1, 2, 3, etc.
  2. Offset-based pagination - Using parameters like offset and limit
  3. Cursor-based pagination - Using unique identifiers to navigate
  4. Next/Previous link pagination - Following "Next" buttons or links
  5. Infinite scroll pagination - Content loaded dynamically via AJAX
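
As a rough sketch (parameter and endpoint names vary from site to site), these patterns typically surface in requests like this:

# Illustrative request shapes only; real parameter names differ per site.
base = 'https://example.com/products'

number_based = "#{base}?page=3"                # 1. numbered pages
offset_based = "#{base}?offset=40&limit=20"    # 2. offset and limit
cursor_based = "#{base}?cursor=abc123"         # 3. opaque cursor token
# 4. next/previous pagination is discovered in the HTML, e.g. <a rel="next" href="...">
# 5. infinite scroll usually calls a JSON endpoint behind the scenes,
#    e.g. "#{base}/items.json?offset=40&limit=20"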

Basic Pagination with HTTParty and Nokogiri

Let's start with a simple example using HTTParty for HTTP requests and Nokogiri for HTML parsing:

require 'httparty'
require 'nokogiri'

class PaginationScraper
  def initialize(base_url)
    @base_url = base_url
    @current_page = 1
    @scraped_data = []
  end

  def scrape_all_pages
    loop do
      page_url = build_page_url(@current_page)
      response = HTTParty.get(page_url)

      break unless response.success?

      doc = Nokogiri::HTML(response.body)

      # Extract data from current page
      page_data = extract_page_data(doc)
      break if page_data.empty?

      @scraped_data.concat(page_data)

      # Check if there's a next page
      break unless has_next_page?(doc)

      @current_page += 1

      # Be respectful - add delay between requests
      sleep(1)
    end

    @scraped_data
  end

  private

  def build_page_url(page_number)
    "#{@base_url}?page=#{page_number}"
  end

  def extract_page_data(doc)
    # Extract your specific data here
    doc.css('.item').map do |item|
      {
        title: item.css('.title').text.strip,
        price: item.css('.price').text.strip,
        url: item.at_css('a')&.[]('href')
      }
    end
  end

  def has_next_page?(doc)
    # Check if next page link exists
    doc.css('.pagination .next').any?
  end
end

# Usage
scraper = PaginationScraper.new('https://example.com/products')
all_data = scraper.scrape_all_pages
puts "Scraped #{all_data.length} items"

Advanced Pagination Handling with Mechanize

For more complex scenarios involving forms or session management, Mechanize provides a more robust solution:

require 'mechanize'

class AdvancedPaginationScraper
  def initialize
    @agent = Mechanize.new
    @agent.user_agent_alias = 'Windows Chrome'
    @scraped_items = []
  end

  def scrape_with_form_pagination(start_url)
    page = @agent.get(start_url)

    loop do
      # Extract data from current page
      items = extract_items_from_page(page)
      break if items.empty?

      @scraped_items.concat(items)
      puts "Scraped page with #{items.length} items"

      # Look for next page link or button
      next_link = page.link_with(text: /next/i) || 
                  page.link_with(text: /more/i)

      break unless next_link

      # Follow the next page link
      page = next_link.click

      # Random delay to avoid being blocked
      sleep(rand(1..3))
    end

    @scraped_items
  end

  def scrape_with_post_pagination(start_url, form_data = {})
    page = @agent.get(start_url)
    page_num = 1

    loop do
      # Extract data from current page
      items = extract_items_from_page(page)
      break if items.empty?

      @scraped_items.concat(items)

      # Find pagination form
      form = page.form_with(action: /search/) || page.forms.first
      break unless form

      # Update form data for next page
      form_data.each { |key, value| form[key] = value }
      form['page'] = page_num + 1

      # Submit form to get next page
      begin
        page = @agent.submit(form)
        page_num += 1
      rescue Mechanize::ResponseCodeError => e
        puts "Error: #{e.message}"
        break
      end

      sleep(rand(2..4))
    end

    @scraped_items
  end

  private

  def extract_items_from_page(page)
    page.search('.product-item').map do |item|
      {
        name: item.at('.product-name')&.text&.strip,
        price: item.at('.price')&.text&.strip,
        image: item.at('img')&.[]('src'),
        link: item.at('a')&.[]('href')
      }.compact
    end
  end
end
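
A quick usage sketch (the URL and form field names are placeholders); the POST variant works the same way but also takes a hash of form fields:

scraper = AdvancedPaginationScraper.new
items = scraper.scrape_with_form_pagination('https://example.com/catalog')
# or, for form/POST-driven pagination:
# items = scraper.scrape_with_post_pagination('https://example.com/search', query: 'ruby')
puts "Collected #{items.length} items"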

Handling AJAX Pagination

Many modern websites use AJAX for pagination. Here's how to handle it in Ruby by requesting the underlying JSON endpoints directly:

require 'httparty'
require 'json'

class AjaxPaginationScraper
  include HTTParty

  def initialize(base_url)
    @base_url = base_url
    @headers = {
      'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'Accept' => 'application/json, text/javascript, */*; q=0.01',
      'X-Requested-With' => 'XMLHttpRequest'
    }
  end

  def scrape_ajax_pagination(endpoint, initial_params = {})
    all_data = []
    page = 1

    loop do
      params = initial_params.merge(page: page, per_page: 20)

      response = self.class.get(
        "#{@base_url}/#{endpoint}",
        query: params,
        headers: @headers
      )

      break unless response.success?

      data = JSON.parse(response.body)
      items = data['items'] || data['results'] || []

      break if items.empty?

      all_data.concat(items)

      # Check if we've reached the last page
      if data['pagination']
        total_pages = data['pagination']['total_pages']
        break if page >= total_pages
      else
        # If no pagination info, check if items < per_page
        break if items.length < 20
      end

      page += 1
      sleep(1)
    end

    all_data
  end

  def scrape_infinite_scroll(endpoint, max_pages = nil)
    all_data = []
    offset = 0
    limit = 50
    pages_scraped = 0

    loop do
      break if max_pages && pages_scraped >= max_pages

      params = { offset: offset, limit: limit }

      response = self.class.get(
        "#{@base_url}/#{endpoint}",
        query: params,
        headers: @headers
      )

      break unless response.success?

      data = JSON.parse(response.body)
      items = data['items'] || []

      break if items.empty?

      all_data.concat(items)
      offset += limit
      pages_scraped += 1

      puts "Scraped page #{pages_scraped}, total items: #{all_data.length}"

      sleep(rand(1..2))
    end

    all_data
  end
end
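
Usage might look like this (the endpoint names and query parameters are hypothetical; in practice you'd copy them from the XHR requests visible in your browser's network tab):

scraper = AjaxPaginationScraper.new('https://example.com/api')
products = scraper.scrape_ajax_pagination('products', category: 'books')
feed = scraper.scrape_infinite_scroll('feed', 10)   # cap at 10 batches
puts "Fetched #{products.length + feed.length} records"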

Robust Error Handling and Retry Logic

When scraping paginated content, it's crucial to implement proper error handling:

require 'httparty'
require 'nokogiri'

class RobustPaginationScraper
  MAX_RETRIES = 3
  RETRY_DELAY = 2

  def initialize(base_url)
    @base_url = base_url
  end

  def scrape_with_retries(max_pages = nil)
    all_data = []
    page = 1
    consecutive_failures = 0
    max_consecutive_failures = 3

    loop do
      break if max_pages && page > max_pages
      break if consecutive_failures >= max_consecutive_failures

      begin
        page_data = scrape_single_page(page)

        if page_data.empty?
          puts "No data found on page #{page}, stopping"
          break
        end

        all_data.concat(page_data)
        consecutive_failures = 0
        puts "Successfully scraped page #{page}: #{page_data.length} items"

      rescue StandardError => e
        consecutive_failures += 1
        puts "Error on page #{page}: #{e.message}"

        if consecutive_failures < max_consecutive_failures
          puts "Retrying page #{page} (attempt #{consecutive_failures})"
          sleep(RETRY_DELAY * consecutive_failures)
          next
        else
          puts "Max consecutive failures reached, stopping"
          break
        end
      end

      page += 1
      sleep(rand(1..3))
    end

    all_data
  end

  private

  def scrape_single_page(page_number)
    retries = 0

    begin
      url = "#{@base_url}?page=#{page_number}"
      response = HTTParty.get(url, timeout: 30)

      raise "HTTP Error: #{response.code}" unless response.success?

      doc = Nokogiri::HTML(response.body)
      extract_data(doc)

    rescue StandardError => e
      retries += 1
      if retries <= MAX_RETRIES
        sleep(RETRY_DELAY * retries)
        retry
      else
        raise e
      end
    end
  end

  def extract_data(doc)
    # Your data extraction logic here
    doc.css('.item').map do |item|
      {
        title: item.css('.title').text.strip,
        description: item.css('.description').text.strip
      }
    end
  end
end
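
Usage, with the page-count cap as an optional safety limit (placeholder URL):

scraper = RobustPaginationScraper.new('https://example.com/listings')
data = scraper.scrape_with_retries(100)   # stop after at most 100 pages
puts "Collected #{data.length} items"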

Performance Optimization with Concurrent Processing

For large-scale scraping, you can fetch pages concurrently; keep the worker count modest so you don't overwhelm the target site:

require 'concurrent'  # provided by the concurrent-ruby gem
require 'httparty'
require 'nokogiri'

class ConcurrentPaginationScraper
  def initialize(base_url, max_workers: 5)
    @base_url = base_url
    @max_workers = max_workers
  end

  def scrape_pages_concurrently(page_range)
    executor = Concurrent::ThreadPoolExecutor.new(
      min_threads: 1,
      max_threads: @max_workers,
      max_queue: 100
    )

    futures = page_range.map do |page_num|
      Concurrent::Future.execute(executor: executor) do
        scrape_page(page_num)
      end
    end

    # Wait for all futures to complete and collect results
    results = futures.map(&:value!).compact.flatten

    executor.shutdown
    executor.wait_for_termination

    results
  end

  private

  def scrape_page(page_number)
    begin
      url = "#{@base_url}?page=#{page_number}"
      response = HTTParty.get(url, timeout: 15)
      return [] unless response.success?

      doc = Nokogiri::HTML(response.body)
      items = extract_items(doc)

      puts "Page #{page_number}: #{items.length} items"
      items

    rescue StandardError => e
      puts "Error scraping page #{page_number}: #{e.message}"
      []
    end
  end

  def extract_items(doc)
    # Your extraction logic
    doc.css('.item').map { |item| { text: item.text.strip } }
  end
end

# Usage
scraper = ConcurrentPaginationScraper.new('https://example.com/data')
results = scraper.scrape_pages_concurrently(1..50)

Working with API Pagination

Many modern websites provide APIs with built-in pagination. Here's how to handle cursor-based and offset-based API pagination:

require 'httparty'
require 'json'

class ApiPaginationScraper
  include HTTParty

  def initialize(api_key = nil)
    @headers = {
      'Content-Type' => 'application/json',
      'User-Agent' => 'Ruby Web Scraper'
    }
    @headers['Authorization'] = "Bearer #{api_key}" if api_key
  end

  def scrape_cursor_based_api(base_url, initial_cursor = nil)
    all_data = []
    cursor = initial_cursor

    loop do
      params = cursor ? { cursor: cursor, limit: 100 } : { limit: 100 }

      response = self.class.get(base_url, query: params, headers: @headers)
      break unless response.success?

      data = JSON.parse(response.body)
      items = data['data'] || data['results'] || []

      break if items.empty?

      all_data.concat(items)

      # Get next cursor
      cursor = data.dig('pagination', 'next_cursor') || data['next_cursor']
      break unless cursor

      puts "Fetched #{items.length} items, total: #{all_data.length}"
      sleep(0.5) # Rate limiting
    end

    all_data
  end

  def scrape_offset_based_api(base_url, limit = 100)
    all_data = []
    offset = 0

    loop do
      params = { offset: offset, limit: limit }

      response = self.class.get(base_url, query: params, headers: @headers)
      break unless response.success?

      data = JSON.parse(response.body)
      items = data['data'] || data['results'] || []

      break if items.empty?

      all_data.concat(items)

      # Check if we've reached the end
      total = data['total'] || data['count']
      if total && (offset + limit) >= total
        break
      end

      offset += limit
      puts "Fetched #{items.length} items, total: #{all_data.length}"
      sleep(0.5)
    end

    all_data
  end
end
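
Usage for both strategies (the API URLs and the API_KEY environment variable are placeholders):

scraper = ApiPaginationScraper.new(ENV['API_KEY'])
cursor_items = scraper.scrape_cursor_based_api('https://api.example.com/v1/items')
offset_items = scraper.scrape_offset_based_api('https://api.example.com/v1/items', 50)
puts "Cursor: #{cursor_items.length}, Offset: #{offset_items.length}"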

Best Practices for Pagination Scraping

1. Respect Rate Limits

Always include delays between requests to avoid overwhelming the server:

# Random delays to appear more human-like
sleep(rand(1.0..3.0))

# Exponential backoff for errors
def exponential_backoff(attempt)
  sleep(2 ** attempt)
end
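
A minimal sketch of wiring that backoff into a retry loop (fetch_with_backoff is a hypothetical helper, not part of any library):

require 'httparty'

def fetch_with_backoff(url, max_attempts = 4)
  attempt = 0
  begin
    response = HTTParty.get(url, timeout: 30)
    raise "HTTP #{response.code}" unless response.success?
    response
  rescue StandardError
    attempt += 1
    raise if attempt >= max_attempts
    exponential_backoff(attempt)   # waits 2, 4, 8... seconds between attempts
    retry
  end
end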

2. Handle Different Response Formats

Some paginated endpoints return different content types, so detect the format before parsing:

def parse_response(response)
  content_type = response.headers['content-type']

  case content_type
  when /json/
    JSON.parse(response.body)
  when /xml/
    Nokogiri::XML(response.body)
  else
    Nokogiri::HTML(response.body)
  end
end

3. Implement Checkpointing

For long-running scrapes, save progress periodically:

require 'json'

# Assumes scrape_page, save_data and total_pages are defined elsewhere.
def scrape_with_checkpoint(checkpoint_file = 'scraping_progress.json')
  progress = load_checkpoint(checkpoint_file)
  # Resume after the last page that was completed and checkpointed
  start_page = (progress['last_page'] || 0) + 1

  (start_page..total_pages).each do |page|
    data = scrape_page(page)
    save_data(data)

    # Update checkpoint every 10 pages
    if page % 10 == 0
      save_checkpoint(checkpoint_file, page)
    end
  end
end

def load_checkpoint(file)
  return {} unless File.exist?(file)
  JSON.parse(File.read(file))
rescue JSON::ParserError
  {}
end

def save_checkpoint(file, page)
  checkpoint = { last_page: page, timestamp: Time.now.to_i }
  File.write(file, JSON.pretty_generate(checkpoint))
end

4. Monitor and Log Progress

require 'logger'

class LoggedPaginationScraper
  def initialize(base_url)
    @base_url = base_url
    @logger = Logger.new('scraping.log')
    @logger.level = Logger::INFO
  end

  def scrape_with_logging
    @logger.info("Starting pagination scraping for #{@base_url}")

    page = 1
    total_items = 0

    loop do
      @logger.info("Processing page #{page}")

      begin
        # scrape_page is assumed to be defined elsewhere (see earlier examples)
        items = scrape_page(page)
        break if items.empty?

        total_items += items.length
        @logger.info("Page #{page}: #{items.length} items (total: #{total_items})")

      rescue StandardError => e
        @logger.error("Error on page #{page}: #{e.message}")
        break
      end

      page += 1
      sleep(rand(1..2))
    end

    @logger.info("Scraping completed. Total items: #{total_items}")
    total_items
  end
end

Conclusion

Handling pagination in Ruby web scraping requires understanding the specific pagination mechanism used by your target website and implementing appropriate strategies for navigation, error handling, and performance optimization. The examples provided cover the most common scenarios you'll encounter, from simple numbered pagination to complex AJAX-based systems and API pagination.

Key takeaways for successful pagination handling:

  • Always implement proper error handling and retry logic
  • Respect rate limits with appropriate delays between requests
  • Use checkpointing for long-running scrapes
  • Consider concurrent processing for better performance
  • Log your progress for debugging and monitoring

Remember to always respect the website's robots.txt file, implement proper rate limiting, and consider using web scraping APIs when dealing with complex, JavaScript-heavy sites that require more sophisticated handling than traditional HTTP clients can provide.
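
As a minimal sketch, you could check a URL against the site's robots.txt yourself before scraping it (this only honours simple Disallow rules for the matching or wildcard user agent):

require 'httparty'
require 'uri'

def allowed_by_robots?(url, user_agent = '*')
  uri = URI.parse(url)
  response = HTTParty.get("#{uri.scheme}://#{uri.host}/robots.txt")
  return true unless response.success?   # no robots.txt means no stated restrictions

  applies = false
  disallowed = []
  response.body.each_line do |line|
    line = line.strip
    if line =~ /\AUser-agent:\s*(.+)\z/i
      agent = Regexp.last_match(1).strip
      applies = (agent == '*' || agent.casecmp?(user_agent))
    elsif applies && line =~ /\ADisallow:\s*(\S+)/i
      disallowed << Regexp.last_match(1)
    end
  end

  disallowed.none? { |path| uri.path.start_with?(path) }
end

puts allowed_by_robots?('https://example.com/products?page=2')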

For scenarios involving heavy JavaScript rendering or complex user interactions during pagination, you might want to consider browser automation solutions that can handle dynamic content more effectively than traditional HTTP scraping methods.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
