When scraping a website with pagination using Nokogiri, you need to identify the pattern that the website uses to navigate through pages. This can be query parameters in the URL, form submissions, or JavaScript-driven content loading. Here's a step-by-step guide on how to handle pagination with Nokogiri:
Step 1: Install Nokogiri and HTTP Libraries
First, ensure that you have Nokogiri installed along with an HTTP library. open-uri (bundled with Ruby's standard library) handles simple GET requests; httparty supports more complex HTTP interactions:
gem install nokogiri
gem install httparty # only if you need more complex HTTP interactions
Step 2: Analyze the Pagination Mechanism
Before you start coding, manually navigate the website and observe how the pagination works. Check for:
- The URL pattern when you navigate to the next page (e.g., ?page=2, ?offset=20, etc.).
- The presence of "Next" or "Previous" buttons and their corresponding HTML elements.
- JavaScript-based pagination which might require you to use a tool like Selenium to interact with the page.
Step 3: Scrape the First Page
Start by scraping the first page to ensure you can extract the desired data successfully.
require 'nokogiri'
require 'open-uri'
url = 'https://example.com/items?page=1'
doc = Nokogiri::HTML(URI.open(url))
# Extract data from the first page
doc.css('.item').each do |item|
  # Process each item here
  puts item.text.strip
end
Step 4: Identify the Pagination Links
Find the CSS selector or XPath that can be used to identify the pagination links or buttons. If there's a "Next" button, you can use its link to move to the next page. If the pages are numbered, you might need to increment a counter to navigate through pages.
# Example of finding the 'Next' button
next_link = doc.at_css('a.next')
next_page_link = next_link['href'] if next_link
Step 5: Loop Through the Pages
Now you can loop through the pages by either following the "Next" link or by incrementing a page number parameter. Here's an example using a page number parameter:
base_url = 'https://example.com/items'
page = 1
loop do
  url = "#{base_url}?page=#{page}"
  doc = Nokogiri::HTML(URI.open(url))
  break if doc.css('.item').empty? # Stop if no items found
  doc.css('.item').each do |item|
    # Process each item here
    puts item.text.strip
  end
  page += 1
end
If you're following a "Next" link, the loop will look slightly different:
base_url = 'https://example.com'
url = "#{base_url}/items?page=1"
loop do
  doc = Nokogiri::HTML(URI.open(url))
  doc.css('.item').each do |item|
    # Process each item here
    puts item.text.strip
  end
  next_page_link = doc.at_css('a.next')
  break unless next_page_link # Stop after processing the last page
  url = URI.join(base_url, next_page_link['href']).to_s
end
Step 6: Handle Rate Limiting and Delays
Websites may have rate limits or may block your IP if you make requests too quickly. You should add delays between requests or handle HTTP errors gracefully:
sleep(1) # Delay between requests
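The sleep above handles pacing; for transient failures you can also wrap each fetch in a small retry helper with backoff. A minimal sketch (the helper name `fetch_with_retry`, the retry count, and the backoff schedule are illustrative, not part of open-uri):

```ruby
require 'open-uri'

# Hypothetical helper: retry a fetch a few times, sleeping longer each attempt.
# max_retries and the linear backoff are illustrative defaults -- tune them
# to the site's rate limits.
def fetch_with_retry(url, max_retries: 3)
  attempts = 0
  begin
    attempts += 1
    yield url # e.g. URI.open(url).read in real use
  rescue OpenURI::HTTPError, Errno::ECONNRESET
    raise if attempts >= max_retries
    sleep(attempts) # back off: 1s, then 2s, ...
    retry
  end
end
```

In the loops above you would then build the document with `Nokogiri::HTML(fetch_with_retry(url) { |u| URI.open(u).read })` instead of calling `URI.open(url)` directly.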
Conclusion
When working with pagination, ensure that you respect the website's robots.txt and terms of service. Some sites may not allow scraping, or may have specific rules about how you can access their content. Always scrape responsibly and consider using official APIs if they are available.