How do I handle pagination when scraping multiple pages with HTTParty?

Pagination is one of the most common challenges when scraping websites that spread data across multiple pages. HTTParty, a popular Ruby HTTP client, combined with Nokogiri for HTML parsing, can handle all of the common pagination patterns with very little code. This guide covers the main strategies and implementation techniques for efficient multi-page scraping.

Understanding Pagination Patterns

Before diving into implementation, it's important to understand the common pagination patterns you'll encounter:

  1. URL-based pagination - Page numbers or offsets in the URL
  2. Next/Previous links - HTML links to navigate between pages
  3. API-based pagination - JSON responses with pagination metadata
  4. Infinite scroll - Dynamic content loading (requires JavaScript execution)

Basic Pagination Setup

First, let's establish a basic HTTParty class structure for pagination. It includes a shared extract_data helper (reused by the examples below) and a small delay between requests:

require 'httparty'
require 'nokogiri'

class PaginatedScraper
  include HTTParty

  base_uri 'https://example.com'
  headers 'User-Agent' => 'Mozilla/5.0 (compatible; Ruby HTTParty scraper)'

  def initialize
    @results = []
    @delay = 1 # Respectful delay between requests
  end

  private

  # Shared HTML extraction used by the scrapers in this guide.
  # Adjust the CSS selectors to match your target markup.
  def extract_data(response)
    doc = Nokogiri::HTML(response.body)
    doc.css('.item').map do |item|
      {
        title: item.css('.title').text.strip,
        url: item.at_css('a')&.[]('href')
      }
    end
  end

  def sleep_between_requests
    sleep(@delay)
  end
end

URL-Based Pagination

Sequential Page Numbers

The most straightforward pagination pattern uses page numbers in the URL:

class SequentialPagination < PaginatedScraper
  def scrape_all_pages(base_url, max_pages = 50)
    page = 1

    loop do
      url = "#{base_url}?page=#{page}"
      response = self.class.get(url)

      break unless response.success?

      data = extract_data(response)
      break if data.empty? # No more data

      @results.concat(data)
      puts "Scraped page #{page}: #{data.length} items"

      page += 1
      break if page > max_pages

      sleep_between_requests
    end

    @results
  end
end

# Usage
scraper = SequentialPagination.new
results = scraper.scrape_all_pages('https://example.com/products')

Offset-Based Pagination

Some websites use offset and limit parameters:

class OffsetPagination < PaginatedScraper
  def scrape_with_offset(base_url, limit = 20, max_items = 1000)
    offset = 0

    loop do
      url = "#{base_url}?limit=#{limit}&offset=#{offset}"
      response = self.class.get(url)

      break unless response.success?

      data = extract_data(response)
      break if data.empty?

      @results.concat(data)
      puts "Scraped #{@results.length} total items"

      offset += limit
      break if @results.length >= max_items

      sleep_between_requests
    end

    @results.first(max_items)
  end
end

Following Next Links

Many websites provide "Next" links in their HTML. This approach is more reliable than URL manipulation:

class NextLinkPagination < PaginatedScraper
  def scrape_following_links(start_url)
    current_url = start_url

    loop do
      response = self.class.get(current_url)
      break unless response.success?

      doc = Nokogiri::HTML(response.body)

      # Extract data from current page
      data = extract_data(response)
      break if data.empty?

      @results.concat(data)
      puts "Scraped page: #{@results.length} total items"

      # Find the next page link (XPath handles the text match, which plain CSS can't express)
      next_link = doc.at_css('a[rel="next"]') ||
                  doc.at_css('.pagination .next') ||
                  doc.at_xpath('//a[contains(text(), "Next")]')

      break unless next_link

      current_url = resolve_url(next_link['href'], current_url)
      sleep_between_requests
    end

    @results
  end

  private

  def resolve_url(href, base_url)
    if href.start_with?('http')
      href
    else
      URI.join(base_url, href).to_s
    end
  end
end

API Pagination with JSON Responses

When scraping APIs that return JSON with pagination metadata:

class APIPagination < PaginatedScraper
  def scrape_api_pages(api_endpoint, params = {})
    page = 1

    loop do
      current_params = params.merge(page: page)
      response = self.class.get(api_endpoint, query: current_params)

      break unless response.success?

      json_data = JSON.parse(response.body)

      # Extract items from API response
      items = json_data['data'] || json_data['items'] || []
      break if items.empty?

      @results.concat(items)
      puts "API page #{page}: #{items.length} items"

      # Check pagination metadata to decide whether more pages exist;
      # guard against APIs that omit some of these fields
      pagination = json_data['pagination'] || json_data['meta'] || {}
      more_pages = pagination['has_more'] ||
                   (pagination['current_page'] && pagination['total_pages'] &&
                    pagination['current_page'] < pagination['total_pages'])
      break unless more_pages

      page += 1
      sleep_between_requests
    end

    @results
  end
end

# Usage with query parameters
scraper = APIPagination.new
results = scraper.scrape_api_pages(
  'https://api.example.com/products',
  { category: 'electronics', per_page: 50 }
)

Advanced Pagination Techniques

Cursor-Based Pagination

Some modern APIs use cursor-based pagination for better performance:

class CursorPagination < PaginatedScraper
  def scrape_with_cursor(api_endpoint, params = {})
    cursor = nil

    loop do
      current_params = params.dup
      current_params[:cursor] = cursor if cursor

      response = self.class.get(api_endpoint, query: current_params)
      break unless response.success?

      json_data = JSON.parse(response.body)
      items = json_data['data'] || []
      break if items.empty?

      @results.concat(items)

      # Get next cursor
      cursor = json_data.dig('pagination', 'next_cursor')
      break unless cursor

      sleep_between_requests
    end

    @results
  end
end
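
A usage sketch (the endpoint and per_page parameter are placeholders; the cursor itself is managed inside the loop):

# Usage
scraper = CursorPagination.new
results = scraper.scrape_with_cursor('https://api.example.com/feed', { per_page: 100 })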

Handling Dynamic Parameters

Sometimes pagination URLs contain dynamic tokens or require form data:

class DynamicPagination < PaginatedScraper
  def scrape_with_dynamic_params(start_url)
    # Get initial page to extract pagination parameters
    response = self.class.get(start_url)
    doc = Nokogiri::HTML(response.body)

    # Extract dynamic tokens (CSRF tokens, session IDs, etc.)
    csrf_token = doc.css('input[name="csrf_token"]').first&.[]('value')
    session_id = doc.css('input[name="session_id"]').first&.[]('value')

    page = 1

    loop do
      form_data = {
        page: page,
        csrf_token: csrf_token,
        session_id: session_id
      }

      response = self.class.post('/search', body: form_data)
      break unless response.success?

      data = extract_data(response)
      break if data.empty?

      @results.concat(data)
      page += 1

      sleep_between_requests
    end

    @results
  end
end

Error Handling and Resilience

Robust pagination scraping requires proper error handling:

class ResilientPagination < PaginatedScraper
  MAX_RETRIES = 3
  RETRY_DELAY = 5

  def scrape_with_retry(urls)
    urls.each_with_index do |url, index|
      retries = 0

      begin
        response = self.class.get(url)

        if response.success?
          data = extract_data(response)
          @results.concat(data)
          puts "Processed #{index + 1}/#{urls.length}: #{data.length} items"
        else
          raise "HTTP #{response.code}: #{response.message}"
        end

      rescue => e
        retries += 1

        if retries <= MAX_RETRIES
          puts "Error on #{url}: #{e.message}. Retry #{retries}/#{MAX_RETRIES}"
          sleep(RETRY_DELAY * (2**(retries - 1))) # Exponential backoff: 5s, 10s, 20s
          retry
        else
          puts "Failed permanently: #{url} - #{e.message}"
        end
      end

      sleep_between_requests
    end

    @results
  end
end
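
A usage sketch, assuming the page URLs can be generated up front (the URL pattern is a placeholder):

# Usage: pre-build the page URLs; failed requests are retried with growing delays
scraper = ResilientPagination.new
urls = (1..20).map { |page| "https://example.com/products?page=#{page}" }
results = scraper.scrape_with_retry(urls)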

Performance Optimization

Concurrent Requests

For faster scraping, you can process multiple pages concurrently:

require 'concurrent' # provided by the concurrent-ruby gem

class ConcurrentPagination < PaginatedScraper
  def scrape_concurrently(urls, max_threads = 5)
    results = Concurrent::Array.new

    # Create thread pool
    pool = Concurrent::ThreadPoolExecutor.new(
      min_threads: 2,
      max_threads: max_threads,
      max_queue: urls.length
    )

    futures = urls.map do |url|
      Concurrent::Future.execute(executor: pool) do
        response = self.class.get(url)
        if response.success?
          extract_data(response)
        else
          []
        end
      end
    end

    # Wait for all requests to complete
    futures.each { |future| results.concat(future.value || []) }

    pool.shutdown
    pool.wait_for_termination

    results.to_a
  end
end
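
Usage follows the same pattern as the resilient scraper above; keep max_threads modest to stay polite (the thread count here is just an illustration):

# Usage: fetch a known set of page URLs with up to 5 parallel requests
scraper = ConcurrentPagination.new
urls = (1..20).map { |page| "https://example.com/products?page=#{page}" }
results = scraper.scrape_concurrently(urls, 5)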

Monitoring and Rate Limiting

When scraping multiple pages, it's crucial to implement proper rate limiting and monitoring. For JavaScript-heavy pagination or infinite scroll, where content is loaded dynamically, consider a browser automation tool such as Puppeteer instead.
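
On the monitoring side, it can be as simple as counting successes and failures as requests complete and reporting throughput at the end. A minimal sketch (the ScrapeStats class and its output format are illustrative, not part of HTTParty):

class ScrapeStats
  def initialize
    @started_at = Time.now
    @success = 0
    @failure = 0
  end

  # Call this with each HTTParty response as it comes back
  def record(response)
    response.success? ? @success += 1 : @failure += 1
  end

  # Print a one-line summary: counts plus average request rate
  def report
    elapsed = Time.now - @started_at
    total = @success + @failure
    puts format('%d ok, %d failed, %.2f requests/sec', @success, @failure, total / [elapsed, 0.001].max)
  end
end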

Rate Limiting Implementation

class RateLimitedPagination < PaginatedScraper
  def initialize(requests_per_second = 1)
    super()
    @semaphore = Concurrent::Semaphore.new(1)

    # Release one permit per interval so requests never exceed the target rate
    @rate_limiter = Concurrent::TimerTask.new(execution_interval: 1.0 / requests_per_second) do
      @semaphore.release if @semaphore.available_permits.zero?
    end
    @rate_limiter.execute
  end

  def scrape_with_rate_limit(urls)
    urls.each do |url|
      @semaphore.acquire # Blocks until the timer releases the next permit

      response = self.class.get(url)
      if response.success?
        data = extract_data(response)
        @results.concat(data)
      end
    end

    @rate_limiter.shutdown
    @results
  end
end

Best Practices for Pagination Scraping

  1. Always check robots.txt before scraping
  2. Implement respectful delays between requests
  3. Use appropriate User-Agent headers
  4. Handle errors gracefully with retry logic
  5. Monitor memory usage for large datasets
  6. Save progress periodically for long-running scrapes (see the checkpoint sketch after this list)
  7. Validate data integrity across pages
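
For point 6, a lightweight way to save progress is to write the collected results and the last completed page to a JSON file after each page, so an interrupted run can resume where it left off. A minimal sketch (the file name and checkpoint structure are assumptions, not a standard):

require 'json'

CHECKPOINT_FILE = 'scrape_checkpoint.json'

# Persist results and the last completed page after each page is scraped
def save_checkpoint(results, last_page)
  File.write(CHECKPOINT_FILE, JSON.pretty_generate('last_page' => last_page, 'results' => results))
end

# Load a previous checkpoint, or start fresh if none exists
def load_checkpoint
  return { 'last_page' => 0, 'results' => [] } unless File.exist?(CHECKPOINT_FILE)

  JSON.parse(File.read(CHECKPOINT_FILE))
end

# Resume from the page after the last one that completed
checkpoint = load_checkpoint
start_page = checkpoint['last_page'] + 1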

For scenarios where pagination involves complex user interactions or AJAX requests, you might need browser automation tools that can handle dynamic content loading.

Testing Your Pagination Logic

require 'rspec'

RSpec.describe SequentialPagination do
  let(:scraper) { described_class.new }

  it 'handles empty pages gracefully' do
    # Stub the scraper class itself: the code calls self.class.get, not HTTParty.get
    allow(described_class).to receive(:get).and_return(
      double(success?: true, body: '<html></html>')
    )

    results = scraper.scrape_all_pages('http://test.com')
    expect(results).to be_empty
  end

  it 'stops on HTTP errors' do
    allow(described_class).to receive(:get).and_return(
      double(success?: false, code: 404)
    )

    results = scraper.scrape_all_pages('http://test.com')
    expect(results).to be_empty
  end
end

HTTParty provides excellent flexibility for handling various pagination patterns. The key is identifying the specific pagination mechanism used by your target website and implementing appropriate logic with proper error handling and rate limiting. Remember to always respect the website's terms of service and implement responsible scraping practices.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
