How do I handle pagination when scraping with Nokogiri?

When scraping a paginated website with Nokogiri, you first need to identify the pattern the site uses to navigate through pages: query parameters in the URL, form submissions, or JavaScript-driven content loading. Here's a step-by-step guide to handling pagination with Nokogiri:

Step 1: Install Nokogiri and HTTP Libraries

First, ensure that you have Nokogiri installed. For HTTP requests you can use open-uri, which ships with Ruby's standard library (no installation needed), or a gem like httparty for more complex interactions:

gem install nokogiri
gem install httparty # if you need more complex HTTP interactions

Step 2: Analyze the Pagination Mechanism

Before you start coding, manually navigate the website and observe how the pagination works. Check for:

  • The URL pattern when you navigate to the next page (e.g., ?page=2, ?offset=20, etc.).
  • The presence of "Next" or "Previous" buttons and their corresponding HTML elements.
  • JavaScript-based pagination which might require you to use a tool like Selenium to interact with the page.

Step 3: Scrape the First Page

Start by scraping the first page to ensure you can extract the desired data successfully.

require 'nokogiri'
require 'open-uri'

url = 'https://example.com/items?page=1'
doc = Nokogiri::HTML(URI.open(url))

# Extract data from the first page
doc.css('.item').each do |item|
  # Process each item here
  puts item.text.strip
end

Step 4: Identify the Pagination Links

Find the CSS selector or XPath that can be used to identify the pagination links or buttons. If there's a "Next" button, you can use its link to move to the next page. If the pages are numbered, you might need to increment a counter to navigate through pages.

# Example of finding the 'Next' button
next_page = doc.at_css('a.next')
next_page_link = next_page['href'] if next_page

Step 5: Loop Through the Pages

Now you can loop through the pages by either following the "Next" link or by incrementing a page number parameter. Here's an example using a page number parameter:

base_url = 'https://example.com/items'
page = 1

loop do
  url = "#{base_url}?page=#{page}"
  doc = Nokogiri::HTML(URI.open(url))

  break if doc.css('.item').empty? # Stop if no items found

  doc.css('.item').each do |item|
    # Process each item here
    puts item.text.strip
  end

  page += 1
end

If you're following a "Next" link, the loop will look slightly different:

base_url = 'https://example.com'
url = "#{base_url}/items?page=1"

loop do
  doc = Nokogiri::HTML(URI.open(url))

  doc.css('.item').each do |item|
    # Process each item here
    puts item.text.strip
  end

  # Check for the "Next" link only after processing the page,
  # so the items on the last page are not skipped
  next_page_link = doc.at_css('a.next')
  break unless next_page_link

  url = URI.join(base_url, next_page_link['href']).to_s
end

Step 6: Handle Rate Limiting and Delays

Websites may have rate limits or may block your IP if you make requests too quickly. You should add delays between requests or handle HTTP errors gracefully:

sleep(1) # Delay between requests
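One way to combine delays with basic error handling is a small retry wrapper around each request. This is a minimal sketch: the `with_retries` helper and its defaults are illustrative choices, not part of any library:

```ruby
require 'open-uri'

# Run a block, retrying on errors with a linearly increasing delay.
# max_retries and delay are illustrative defaults, not library values.
def with_retries(max_retries: 3, delay: 1)
  attempts = 0
  begin
    yield
  rescue StandardError # OpenURI::HTTPError is a StandardError
    attempts += 1
    raise if attempts > max_retries
    sleep(delay * attempts) # wait a little longer after each failure
    retry
  end
end

# Usage inside a scraping loop (assumes network access):
# html = with_retries { URI.open('https://example.com/items?page=1').read }
```

Raising after the final attempt (rather than swallowing the error) keeps failures visible, so a persistent block or ban does not silently truncate your data.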

Conclusion

When working with pagination, ensure that you respect the website's robots.txt and terms of service. Some sites may not allow scraping, or may have specific rules about how you can access their content. Always scrape responsibly and consider using official APIs if they are available.
