How do I handle pagination when scraping multiple pages with HTTParty?
Pagination is one of the most common challenges when scraping websites that spread data across multiple pages. HTTParty, a popular Ruby HTTP client library, handles the request side cleanly, and paired with a parser like Nokogiri it covers most pagination patterns. This guide walks through the common pagination strategies and implementation techniques for efficient multi-page scraping.
Understanding Pagination Patterns
Before diving into implementation, it's important to understand the common pagination patterns you'll encounter; each is sketched briefly after the list:
- URL-based pagination - Page numbers or offsets in the URL
- Next/Previous links - HTML links to navigate between pages
- API-based pagination - JSON responses with pagination metadata
- Infinite scroll - Dynamic content loading (requires JavaScript execution)
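As a rough illustration (the URLs and payloads below are invented, not real endpoints), these patterns typically look like this:
# URL-based:       https://example.com/products?page=3
#                  https://example.com/products?limit=20&offset=40
# Next/Prev links: <a rel="next" href="/products?page=4">Next</a>
# API metadata:    { "data": [...], "pagination": { "current_page": 3, "total_pages": 12 } }
# Cursor-based:    { "data": [...], "pagination": { "next_cursor": "abc123" } }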
Basic Pagination Setup
First, let's establish a basic HTTParty class structure for pagination, including a shared extract_data helper that the later scrapers reuse:
require 'httparty'
require 'nokogiri'
require 'json'

class PaginatedScraper
  include HTTParty
  base_uri 'https://example.com'
  headers 'User-Agent' => 'Mozilla/5.0 (compatible; Ruby HTTParty scraper)'

  def initialize
    @results = []
    @delay = 1 # Respectful delay between requests
  end

  private

  # Shared by the scrapers below; adapt the selectors to your target markup
  def extract_data(response)
    doc = Nokogiri::HTML(response.body)
    doc.css('.item').map do |item|
      { title: item.css('.title').text.strip, url: item.at_css('a')&.[]('href') }
    end
  end

  def sleep_between_requests
    sleep(@delay)
  end
end
URL-Based Pagination
Sequential Page Numbers
The most straightforward pagination pattern uses page numbers in the URL:
class SequentialPagination < PaginatedScraper
  def scrape_all_pages(base_url, max_pages = 50)
    page = 1

    loop do
      url = "#{base_url}?page=#{page}"
      response = self.class.get(url)
      break unless response.success?

      data = extract_data(response)
      break if data.empty? # No more data

      @results.concat(data)
      puts "Scraped page #{page}: #{data.length} items"

      page += 1
      break if page > max_pages

      sleep_between_requests
    end

    @results
  end
end
# Usage
scraper = SequentialPagination.new
results = scraper.scrape_all_pages('https://example.com/products')
Offset-Based Pagination
Some websites use offset and limit parameters:
class OffsetPagination < PaginatedScraper
  def scrape_with_offset(base_url, limit = 20, max_items = 1000)
    offset = 0

    loop do
      url = "#{base_url}?limit=#{limit}&offset=#{offset}"
      response = self.class.get(url)
      break unless response.success?

      data = extract_data(response)
      break if data.empty?

      @results.concat(data)
      puts "Scraped #{@results.length} total items"

      offset += limit
      break if @results.length >= max_items

      sleep_between_requests
    end

    @results.first(max_items)
  end
end
Following Next Links
Many websites provide "Next" links in their HTML. This approach is more reliable than URL manipulation:
class NextLinkPagination < PaginatedScraper
  def scrape_following_links(start_url)
    current_url = start_url

    loop do
      response = self.class.get(current_url)
      break unless response.success?

      doc = Nokogiri::HTML(response.body)

      # Extract data from current page
      data = extract_data(response)
      break if data.empty?

      @results.concat(data)
      puts "Scraped page: #{@results.length} total items"

      # Find next page link
      next_link = doc.css('a[rel="next"]').first ||
                  doc.css('.pagination .next').first ||
                  doc.css('a:contains("Next")').first
      break unless next_link && next_link['href']

      current_url = resolve_url(next_link['href'], current_url)
      sleep_between_requests
    end

    @results
  end

  private

  def resolve_url(href, base_url)
    if href.start_with?('http')
      href
    else
      URI.join(base_url, href).to_s
    end
  end
end
API Pagination with JSON Responses
When scraping APIs that return JSON with pagination metadata:
class APIPagination < PaginatedScraper
  def scrape_api_pages(api_endpoint, params = {})
    page = 1

    loop do
      current_params = params.merge(page: page)
      response = self.class.get(api_endpoint, query: current_params)
      break unless response.success?

      json_data = JSON.parse(response.body)

      # Extract items from API response
      items = json_data['data'] || json_data['items'] || []
      break if items.empty?

      @results.concat(items)
      puts "API page #{page}: #{items.length} items"

      # Check pagination metadata and stop when no more pages are reported
      pagination = json_data['pagination'] || json_data['meta'] || {}
      more_by_count = pagination['current_page'] && pagination['total_pages'] &&
                      pagination['current_page'] < pagination['total_pages']
      break unless pagination['has_more'] || more_by_count

      page += 1
      sleep_between_requests
    end

    @results
  end
end
# Usage with query parameters
scraper = APIPagination.new
results = scraper.scrape_api_pages(
  'https://api.example.com/products',
  { category: 'electronics', per_page: 50 }
)
Advanced Pagination Techniques
Cursor-Based Pagination
Some modern APIs use cursor-based pagination for better performance:
class CursorPagination < PaginatedScraper
  def scrape_with_cursor(api_endpoint, params = {})
    cursor = nil

    loop do
      current_params = params.dup
      current_params[:cursor] = cursor if cursor

      response = self.class.get(api_endpoint, query: current_params)
      break unless response.success?

      json_data = JSON.parse(response.body)
      items = json_data['data'] || []
      break if items.empty?

      @results.concat(items)

      # Get next cursor
      cursor = json_data.dig('pagination', 'next_cursor')
      break unless cursor

      sleep_between_requests
    end

    @results
  end
end
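Usage follows the same pattern as the other scrapers; the endpoint and parameters here are placeholders:
# Usage (hypothetical endpoint and parameters)
scraper = CursorPagination.new
results = scraper.scrape_with_cursor(
  'https://api.example.com/feed',
  { per_page: 100 }
)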
Handling Dynamic Parameters
Sometimes pagination URLs contain dynamic tokens or require form data:
class DynamicPagination < PaginatedScraper
  def scrape_with_dynamic_params(start_url)
    # Get initial page to extract pagination parameters
    response = self.class.get(start_url)
    doc = Nokogiri::HTML(response.body)

    # Extract dynamic tokens (CSRF tokens, session IDs, etc.)
    csrf_token = doc.at_css('input[name="csrf_token"]')&.[]('value')
    session_id = doc.at_css('input[name="session_id"]')&.[]('value')

    page = 1

    loop do
      form_data = {
        page: page,
        csrf_token: csrf_token,
        session_id: session_id
      }

      response = self.class.post('/search', body: form_data)
      break unless response.success?

      data = extract_data(response)
      break if data.empty?

      @results.concat(data)
      page += 1
      sleep_between_requests
    end

    @results
  end
end
Error Handling and Resilience
Robust pagination scraping requires proper error handling:
class ResilientPagination < PaginatedScraper
  MAX_RETRIES = 3
  RETRY_DELAY = 5

  def scrape_with_retry(urls)
    urls.each_with_index do |url, index|
      retries = 0

      begin
        response = self.class.get(url)

        if response.success?
          data = extract_data(response)
          @results.concat(data)
          puts "Processed #{index + 1}/#{urls.length}: #{data.length} items"
        else
          raise "HTTP #{response.code}: #{response.message}"
        end
      rescue => e
        retries += 1

        if retries <= MAX_RETRIES
          puts "Error on #{url}: #{e.message}. Retry #{retries}/#{MAX_RETRIES}"
          sleep(RETRY_DELAY * retries) # Back off a little longer on each retry
          retry
        else
          puts "Failed permanently: #{url} - #{e.message}"
        end
      end

      sleep_between_requests
    end

    @results
  end
end
Performance Optimization
Concurrent Requests
For faster scraping, you can process multiple pages concurrently:
require 'concurrent'

class ConcurrentPagination < PaginatedScraper
  def scrape_concurrently(urls, max_threads = 5)
    results = Concurrent::Array.new

    # Create thread pool
    pool = Concurrent::ThreadPoolExecutor.new(
      min_threads: 2,
      max_threads: max_threads,
      max_queue: urls.length
    )

    futures = urls.map do |url|
      Concurrent::Future.execute(executor: pool) do
        response = self.class.get(url)

        if response.success?
          extract_data(response)
        else
          []
        end
      end
    end

    # Wait for all requests to complete
    futures.each { |future| results.concat(future.value || []) }

    pool.shutdown
    pool.wait_for_termination

    results.to_a
  end
end
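Unlike the sequential scrapers, this variant needs the full list of page URLs up front and bypasses the per-request delay, so keep the thread count modest. A hypothetical usage sketch:
# Usage (hypothetical, pre-built URL list)
urls = (1..10).map { |page| "https://example.com/products?page=#{page}" }
scraper = ConcurrentPagination.new
results = scraper.scrape_concurrently(urls, 3)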
Monitoring and Rate Limiting
When scraping multiple pages, it's crucial to implement proper rate limiting and monitoring. For more complex scenarios involving dynamic content or JavaScript-heavy pagination, consider using browser automation tools like Puppeteer, which can handle pagination that only appears after JavaScript runs.
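On the monitoring side, even a tiny stats helper goes a long way. The RequestStats class below is an illustrative sketch, not part of HTTParty; it simply counts requests and errors and reports throughput:
# Hypothetical helper for basic scrape monitoring (not provided by HTTParty)
class RequestStats
  def initialize
    @requests = 0
    @errors = 0
    @started_at = Time.now
  end

  # Call once per HTTP request with the response's success flag
  def record(success)
    @requests += 1
    @errors += 1 unless success
  end

  def report
    elapsed = Time.now - @started_at
    format('%d requests, %d errors, %.2f req/s', @requests, @errors, @requests / elapsed)
  end
end

# Inside any of the pagination loops above:
#   stats.record(response.success?)
#   puts stats.report if page % 10 == 0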
Rate Limiting Implementation
class RateLimitedPagination < PaginatedScraper
  def initialize(requests_per_second = 1)
    super()
    # One permit is released per interval, so requests can proceed at most that often
    @semaphore = Concurrent::Semaphore.new(1)
    @rate_limiter = Concurrent::TimerTask.new(execution_interval: 1.0 / requests_per_second) do
      @semaphore.release if @semaphore.available_permits == 0
    end
    @rate_limiter.execute
  end

  def scrape_with_rate_limit(urls)
    urls.each do |url|
      @semaphore.acquire # Blocks until the timer releases a permit

      response = self.class.get(url)

      if response.success?
        data = extract_data(response)
        @results.concat(data)
      end
    end

    @rate_limiter.shutdown
    @results
  end
end
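Usage mirrors the other scrapers, except the request rate is set in the constructor; the URL list below is again hypothetical:
# Usage: allow at most 2 requests per second (hypothetical URL list)
urls = (1..20).map { |page| "https://example.com/products?page=#{page}" }
scraper = RateLimitedPagination.new(2)
results = scraper.scrape_with_rate_limit(urls)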
Best Practices for Pagination Scraping
- Always check robots.txt before scraping
- Implement respectful delays between requests
- Use appropriate User-Agent headers
- Handle errors gracefully with retry logic
- Monitor memory usage for large datasets
- Save progress periodically for long-running scrapes (see the checkpoint sketch after this list)
- Validate data integrity across pages
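To cover the last few points, the sketch below periodically writes accumulated results to a JSON file so a long-running scrape can be resumed; the module name, file name, and checkpoint interval are arbitrary choices for illustration:
require 'json'

# Hypothetical checkpointing mixin: call maybe_save_progress after each page
module ProgressCheckpoint
  CHECKPOINT_EVERY = 10 # pages
  CHECKPOINT_FILE = 'scrape_progress.json'.freeze

  def maybe_save_progress(page)
    return unless (page % CHECKPOINT_EVERY).zero?

    File.write(CHECKPOINT_FILE, JSON.pretty_generate(@results))
    puts "Checkpoint saved at page #{page} (#{@results.length} items)"
  end

  def load_progress
    return [] unless File.exist?(CHECKPOINT_FILE)

    JSON.parse(File.read(CHECKPOINT_FILE))
  end
end

# Example: include ProgressCheckpoint in a scraper class and call
# maybe_save_progress(page) at the end of each pagination loop iteration.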
For scenarios where pagination involves complex user interactions or AJAX requests, you might need browser automation tools that can handle dynamic content loading.
Testing Your Pagination Logic
Stub the class-level get method (the scrapers call self.class.get) and assert on the aggregated results:
require 'rspec'

RSpec.describe SequentialPagination do
  let(:scraper) { described_class.new }

  it 'handles empty pages gracefully' do
    allow(described_class).to receive(:get).and_return(
      double(success?: true, body: '<html></html>')
    )

    results = scraper.scrape_all_pages('http://test.com')
    expect(results).to be_empty
  end

  it 'stops on HTTP errors' do
    allow(described_class).to receive(:get).and_return(
      double(success?: false, code: 404)
    )

    results = scraper.scrape_all_pages('http://test.com')
    expect(results).to be_empty
  end
end
HTTParty provides excellent flexibility for handling various pagination patterns. The key is identifying the specific pagination mechanism used by your target website and implementing appropriate logic with proper error handling and rate limiting. Remember to always respect the website's terms of service and implement responsible scraping practices.