How to Handle Pagination When Scraping Multiple Pages with Mechanize
Pagination is one of the most common challenges in web scraping, as many websites split their content across multiple pages to improve loading times and user experience. Ruby's Mechanize library provides excellent tools for handling various pagination patterns automatically and efficiently.
Understanding Pagination Types
Before diving into implementation, it's important to understand the different types of pagination you'll encounter:
1. Link-Based Pagination
This is the most common pattern where "Next" or "Page 2" links are provided:
<a href="/page/2">Next</a>
<a href="/products?page=3">Page 3</a>
2. URL Pattern Pagination
Pages follow a predictable URL structure:
https://example.com/products?page=1
https://example.com/products?page=2
https://example.com/products?page=3
3. Form-Based Pagination
Pagination is controlled through form submissions or POST requests.
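For example, a results page might advance only through a submit button (this markup is illustrative):
<form method="post" action="/results">
  <input type="hidden" name="page" value="2">
  <button type="submit" name="direction" value="next">Next</button>
</form>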
Basic Pagination Handling with Mechanize
Here's a basic approach to handling link-based pagination:
require 'mechanize'

agent = Mechanize.new
page = agent.get('https://example.com/products')

loop do
  # Extract data from current page
  page.search('.product').each do |product|
    title = product.at('.title').text.strip
    price = product.at('.price').text.strip
    puts "#{title}: #{price}"
  end

  # Look for next page link
  next_link = page.link_with(text: /next/i) || page.link_with(text: /→/)
  break unless next_link

  puts "Moving to next page..."
  page = next_link.click

  # Add delay to be respectful
  sleep(1)
end
Advanced Pagination Patterns
Handling Multiple Next Link Variations
Different websites use various text patterns for pagination links:
def find_next_link(page)
  # Try common text patterns for "next" links
  patterns = [
    /next/i,
    /more/i,
    /continue/i,
    /→/,
    />/,
    /page\s*\d+/i
  ]

  patterns.each do |pattern|
    link = page.link_with(text: pattern)
    return link if link
  end

  # Fall back to a link whose href points to a higher page number
  page.links.find do |link|
    link.href =~ /page=(\d+)/ && $1.to_i > current_page_number(page)
  end
end

def current_page_number(page)
  # Extract the current page number from the URL query string
  if page.uri.query =~ /page=(\d+)/
    $1.to_i
  else
    1
  end
end
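Both helpers return a Mechanize link (or nil), so they drop straight into the basic loop shown earlier:
loop do
  # ... extract data from the current page ...
  next_link = find_next_link(page)
  break unless next_link
  page = next_link.click
end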
URL Pattern-Based Pagination
When pagination follows a predictable URL pattern:
require 'mechanize'

agent = Mechanize.new
base_url = 'https://example.com/products'
page_num = 1
max_pages = 50 # Set a reasonable limit

loop do
  url = "#{base_url}?page=#{page_num}"

  begin
    page = agent.get(url)

    # Stop when a page comes back empty (often the end of the listing)
    products = page.search('.product')
    break if products.empty?

    puts "Scraping page #{page_num}"

    products.each do |product|
      # Extract product data
      title = product.at('.title')&.text&.strip
      price = product.at('.price')&.text&.strip
      next unless title && price

      puts "#{title}: #{price}"
    end

    page_num += 1
    break if page_num > max_pages

    sleep(1) # Rate limiting
  rescue Mechanize::ResponseCodeError => e
    puts "Page #{page_num} not found: #{e.message}"
    break
  end
end
Form-Based Pagination
Some sites use forms for pagination control:
require 'mechanize'

agent = Mechanize.new
page = agent.get('https://example.com/search')

# Fill initial search form if needed
search_form = page.form_with(name: 'search')
if search_form
  search_form['query'] = 'your search term'
  page = agent.submit(search_form)
end

loop do
  # Extract data from the current page (use a helper such as the
  # extract_data_from_page method defined in the class below)
  extract_data_from_page(page)

  # Look for a form whose submit button reads "Next"
  pagination_form = page.forms.find do |form|
    form.buttons.any? { |btn| btn.value =~ /next/i }
  end
  break unless pagination_form

  # Find and click the next button
  next_button = pagination_form.buttons.find { |btn| btn.value =~ /next/i }
  break unless next_button

  puts "Submitting pagination form..."
  page = agent.submit(pagination_form, next_button)
  sleep(1)
end
Robust Error Handling and Rate Limiting
Professional pagination handling requires proper error management:
require 'mechanize'

class PaginationScraper
  def initialize(start_url)
    @agent = Mechanize.new
    @agent.user_agent_alias = 'Windows Mozilla'
    @start_url = start_url
    @max_retries = 3
    @delay = 1
  end

  def scrape_all_pages
    page = @agent.get(@start_url)
    page_count = 0

    loop do
      page_count += 1
      puts "Processing page #{page_count}"

      begin
        extract_data_from_page(page)

        # Find next page with retry logic
        next_page = find_next_page_with_retry(page)
        break unless next_page

        page = next_page
        rate_limit_delay
      rescue StandardError => e
        puts "Error on page #{page_count}: #{e.message}"
        break
      end

      break if page_count >= 100 # Safety limit against runaway pagination
    end

    puts "Scraped #{page_count} pages total"
  end

  private

  def find_next_page_with_retry(page)
    @max_retries.times do |attempt|
      begin
        return find_next_page(page)
      rescue Mechanize::ResponseCodeError => e
        puts "Attempt #{attempt + 1} failed: #{e.message}"
        sleep(@delay * (attempt + 1)) # Back off a little longer each retry
      end
    end
    nil
  end

  def find_next_page(page)
    # Multiple strategies for finding the next page link
    next_link = page.link_with(text: /next/i) ||
                page.link_with(href: /page=#{current_page_number(page) + 1}/) ||
                page.links.find { |link| link.rel?('next') }
    next_link ? next_link.click : nil
  end

  def extract_data_from_page(page)
    page.search('.item').each do |item|
      data = {
        title: item.at('.title')&.text&.strip,
        price: item.at('.price')&.text&.strip,
        url: item.at('a')&.[]('href')
      }
      # Process data (save to database, CSV, etc.)
      process_item(data) if data[:title]
    end
  end

  def rate_limit_delay
    sleep(@delay + rand(0.5)) # Add randomization
  end

  def current_page_number(page)
    if page.uri.query =~ /page=(\d+)/
      $1.to_i
    else
      1
    end
  end

  def process_item(data)
    puts "#{data[:title]} - #{data[:price]}"
    # Add your data processing logic here
  end
end

# Usage
scraper = PaginationScraper.new('https://example.com/products')
scraper.scrape_all_pages
Handling JavaScript-Heavy Pagination
When a site loads its pagination via JavaScript, Mechanize alone may not be enough, since it never executes scripts. For complex JavaScript pagination scenarios, consider combining it with a headless browser solution such as Puppeteer for handling dynamic content.
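As a minimal sketch of that hand-off in Ruby, assuming the Ferrum gem (headless Chrome) and an illustrative '.product' selector: render the page first, then pass the resulting HTML to Nokogiri for the usual extraction.
require 'ferrum'
require 'nokogiri'

browser = Ferrum::Browser.new
browser.go_to('https://example.com/products')
browser.network.wait_for_idle # let script-loaded results finish rendering

# Hand the rendered HTML to Nokogiri for the usual extraction
doc = Nokogiri::HTML(browser.body)
doc.css('.product').each do |product|
  puts product.at_css('.title')&.text&.strip
end

browser.quit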
Performance Optimization Techniques
Concurrent Page Processing
For better performance, you can fetch multiple pages concurrently. Mechanize agents are not thread-safe, so each worker thread below creates its own:
require 'mechanize'

class ConcurrentPaginationScraper
  def initialize(base_url, max_workers: 5)
    @base_url = base_url
    @max_workers = max_workers
    @queue = Queue.new
    @results = Queue.new
  end

  def scrape_with_workers
    # Discover all page URLs first
    discover_all_pages

    # Create worker threads
    workers = []
    @max_workers.times do
      workers << Thread.new { worker_thread }
    end

    # Wait for completion
    workers.each(&:join)

    # Process results
    process_all_results
  end

  private

  def discover_all_pages
    agent = Mechanize.new
    page = agent.get(@base_url)
    page_urls = [@base_url]

    # Walk the "next" links once to collect all pagination URLs
    while (next_link = page.link_with(text: /next/i))
      page = next_link.click
      page_urls << page.uri.to_s
    end

    page_urls.each { |url| @queue << url }
    @max_workers.times { @queue << :stop } # one stop marker per worker
  end

  def worker_thread
    agent = Mechanize.new
    while (url = @queue.pop) != :stop
      begin
        page = agent.get(url)
        data = extract_data_from_page(page)
        @results << { url: url, data: data }
        sleep(0.5) # Rate limiting per worker
      rescue => e
        @results << { url: url, error: e.message }
      end
    end
  end

  def extract_data_from_page(page)
    # Placeholder extraction; adapt the selectors to your target site
    page.search('.item').map { |item| item.at('.title')&.text&.strip }.compact
  end

  def process_all_results
    # Drain the results queue; swap in your own persistence logic
    until @results.empty?
      result = @results.pop
      if result[:error]
        puts "Failed: #{result[:url]} (#{result[:error]})"
      else
        puts "Scraped #{result[:data].size} items from #{result[:url]}"
      end
    end
  end
end
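Usage follows the same pattern as the single-threaded scraper:
scraper = ConcurrentPaginationScraper.new('https://example.com/products', max_workers: 3)
scraper.scrape_with_workers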
Best Practices for Pagination Scraping
1. Implement Proper Rate Limiting
Always add delays between requests to avoid overwhelming the server:
# Variable delay with randomization
def smart_delay
  base_delay = 1
  random_factor = rand(0.5..1.5)
  sleep(base_delay * random_factor)
end
2. Set Reasonable Limits
Prevent infinite loops with safety mechanisms:
MAX_PAGES = 1000
page_count = 0

loop do
  break if page_count >= MAX_PAGES
  # Pagination logic here
  page_count += 1
end
3. Handle Edge Cases
Account for various pagination implementations:
def safe_pagination(page)
  # Check for disabled next buttons
  next_button = page.at('a.next:not(.disabled)')
  return nil if next_button.nil? || next_button['href'].nil?

  # Resolve the relative href so the comparison with the current
  # (absolute) URL is meaningful
  current_url = page.uri.to_s
  next_url = page.uri.merge(next_button['href']).to_s
  return nil if current_url == next_url

  next_button
end
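Note that page.at returns a raw Nokogiri node rather than a clickable Mechanize link, so resolve its href against the current page when following it:
if (next_node = safe_pagination(page))
  page = agent.get(page.uri.merge(next_node['href']))
end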
Troubleshooting Common Issues
Session Management
Some sites require maintaining sessions across pagination:
agent = Mechanize.new
agent.cookie_jar.clear! # Start fresh

# Log in first if required (the form and field names here are
# hypothetical; inspect the real login page for the actual ones)
login_page = agent.get('https://example.com/login')
login_form = login_page.form_with(name: 'login')
if login_form
  login_form['username'] = 'your_username'
  login_form['password'] = 'your_password'
  agent.submit(login_form)
end

# Pagination now reuses the authenticated session's cookies
page = agent.get('https://example.com/protected/data')
# Continue with pagination...
Handling Dynamic URLs
For sites with complex URL structures:
require 'uri'

def normalize_pagination_url(base_url, page_num)
  uri = URI.parse(base_url)
  params = URI.decode_www_form(uri.query || '')

  # Replace any existing page parameter with the requested one
  params.delete_if { |key, _| key == 'page' }
  params << ['page', page_num.to_s]

  uri.query = URI.encode_www_form(params)
  uri.to_s
end
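Called with an existing query string, the helper keeps unrelated parameters intact:
normalize_pagination_url('https://example.com/products?sort=price', 3)
# => "https://example.com/products?sort=price&page=3"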
Conclusion
Effective pagination handling with Mechanize requires understanding the specific patterns used by your target website and implementing robust error handling and rate limiting. Whether dealing with simple link-based navigation or complex form submissions, the key is to build flexible, maintainable scrapers that can adapt to different pagination implementations.
For sites with heavy JavaScript requirements, consider integrating Mechanize with browser automation tools for dynamic content handling to ensure comprehensive data extraction across all pagination scenarios.