How do I scrape data from websites with infinite scroll using Ruby?
Scraping data from websites with infinite scroll functionality presents unique challenges because content loads dynamically as users scroll down the page. Unlike traditional pagination, infinite scroll websites use JavaScript to continuously fetch and append new content without page refreshes. This comprehensive guide shows you multiple approaches to handling infinite scroll websites using Ruby.
Understanding Infinite Scroll Websites
Infinite scroll websites load content progressively as users reach the bottom of the page. Popular examples include social media feeds (Twitter, Instagram), e-commerce product listings, and news websites. These sites typically use AJAX requests triggered by scroll events to fetch additional content from their APIs.
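Because the data ultimately comes from those API endpoints, it is always worth checking first (via your browser's network tab) whether you can skip the browser entirely and page through the API directly. Below is a minimal sketch, assuming a hypothetical JSON endpoint at https://example.com/api/items?page=N that returns an "items" array; the rest of this guide covers sites where driving a real browser is necessary.

require 'net/http'
require 'json'

# Hypothetical endpoint; substitute the real URL and parameters you observe
# in the browser's network tab
base_url = 'https://example.com/api/items?page=%d'

items = []
(1..5).each do |page|
  body = Net::HTTP.get(URI(format(base_url, page)))
  data = JSON.parse(body)
  page_items = data['items'].to_a # assumed response shape
  break if page_items.empty?
  items.concat(page_items)
end
puts "Fetched #{items.length} items"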
Method 1: Using Selenium WebDriver with Ruby
Selenium WebDriver is the most popular solution for scraping JavaScript-heavy websites, including those with infinite scroll functionality.
Installation and Setup
First, install the required gems:
gem install selenium-webdriver
gem install nokogiri
With selenium-webdriver 4.11+, Selenium Manager downloads a matching ChromeDriver automatically; on older versions, make sure a chromedriver binary is available on your PATH.
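If you manage dependencies with Bundler, the equivalent Gemfile entries are:

# Gemfile
gem 'selenium-webdriver'
gem 'nokogiri'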
Basic Infinite Scroll Implementation
require 'selenium-webdriver'
require 'nokogiri'

class InfiniteScrollScraper
  def initialize(headless: true)
    options = Selenium::WebDriver::Chrome::Options.new
    options.add_argument('--headless') if headless
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    @driver = Selenium::WebDriver.for :chrome, options: options
    @wait = Selenium::WebDriver::Wait.new(timeout: 10)
  end

  def scrape_infinite_scroll(url, max_scrolls: 10)
    @driver.navigate.to url

    # Wait for initial content to load
    @wait.until { @driver.find_element(css: 'body') }

    previous_height = @driver.execute_script("return document.body.scrollHeight")
    scroll_count = 0

    while scroll_count < max_scrolls
      # Scroll to bottom of page
      @driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

      # Wait for new content to load
      sleep(2)

      # Check if new content has loaded
      new_height = @driver.execute_script("return document.body.scrollHeight")
      if new_height == previous_height
        puts "No more content to load or reached end of page"
        break
      end

      previous_height = new_height
      scroll_count += 1
      puts "Completed scroll #{scroll_count}/#{max_scrolls}"
    end

    # Extract data using Nokogiri
    doc = Nokogiri::HTML(@driver.page_source)
    extract_data(doc)
  end

  def extract_data(doc)
    # Customize the selectors based on your target website's structure
    doc.css('.item-selector').map do |item|
      {
        title: item.css('.title').text.strip,
        description: item.css('.description').text.strip,
        link: item.at_css('a')&.[]('href')
      }
    end
  end

  def close
    @driver.quit
  end
end
# Usage example
scraper = InfiniteScrollScraper.new(headless: false)
data = scraper.scrape_infinite_scroll('https://example.com/infinite-scroll-page', max_scrolls: 5)
puts "Scraped #{data.length} items"
scraper.close
Advanced Scroll Detection
For more robust infinite scroll handling, you can detect specific loading indicators:
# These helpers assume they live inside the InfiniteScrollScraper class above,
# so @driver and @wait are already set up
def wait_for_content_load(loading_selector = '.loading-spinner')
  # Wait for loading indicator to appear, then disappear
  @wait.until { @driver.find_element(css: loading_selector) }
  @wait.until { @driver.find_elements(css: loading_selector).empty? }
rescue Selenium::WebDriver::Error::TimeoutError
  # Loading indicator might not be present
  sleep(2)
end

def smart_infinite_scroll(url, target_selector, max_items: 100)
  @driver.navigate.to url

  items_collected = 0
  no_new_content_count = 0

  while items_collected < max_items && no_new_content_count < 3
    current_items = @driver.find_elements(css: target_selector).length

    # Scroll to bottom
    @driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait for potential new content
    wait_for_content_load('.loading-spinner')

    new_items_count = @driver.find_elements(css: target_selector).length
    if new_items_count > current_items
      items_collected = new_items_count
      no_new_content_count = 0
      puts "Loaded #{new_items_count} items total"
    else
      no_new_content_count += 1
      puts "No new content detected (attempt #{no_new_content_count}/3)"
    end
  end

  @driver.page_source
end
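A hypothetical call site (the URL and the .result-card selector are placeholders for your target site):

scraper = InfiniteScrollScraper.new
html = scraper.smart_infinite_scroll('https://example.com/results', '.result-card', max_items: 200)
items = Nokogiri::HTML(html).css('.result-card')
puts "Collected #{items.length} items"
scraper.close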
Method 2: Using Watir
Watir provides a more Ruby-like interface for browser automation:
require 'watir'
require 'nokogiri'
class WatirInfiniteScrollScraper
  def initialize(headless: true)
    # Watir passes Chrome arguments through its options hash
    args = headless ? ['--headless', '--disable-gpu'] : []
    @browser = Watir::Browser.new :chrome, options: { args: args }
  end

  def scrape_with_watir(url, scroll_pause_time: 2, max_scrolls: 10)
    @browser.goto url

    # Wait for page to load
    @browser.wait_until { @browser.body.exists? }

    last_height = @browser.execute_script("return document.body.scrollHeight")
    scrolls = 0

    while scrolls < max_scrolls
      # Scroll down
      @browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

      # Wait for new content
      sleep(scroll_pause_time)

      new_height = @browser.execute_script("return document.body.scrollHeight")
      break if new_height == last_height

      last_height = new_height
      scrolls += 1
    end

    # Parse with Nokogiri; extract_data is the same helper defined
    # in the Selenium example
    doc = Nokogiri::HTML(@browser.html)
    extract_data(doc)
  end

  def close
    @browser.close
  end
end
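Usage mirrors the Selenium class (the URL is a placeholder):

scraper = WatirInfiniteScrollScraper.new(headless: true)
data = scraper.scrape_with_watir('https://example.com/feed', max_scrolls: 5)
puts "Scraped #{data.length} items"
scraper.close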
Method 3: Using Ferrum (Lightweight Chrome API)
Ferrum provides a lightweight alternative to Selenium with better performance:
require 'ferrum'
require 'nokogiri'
class FerrumInfiniteScrollScraper
  def initialize(headless: true)
    @browser = Ferrum::Browser.new(
      headless: headless,
      window_size: [1366, 768],
      timeout: 30
    )
  end

  def scrape_with_ferrum(url, max_scrolls: 10)
    @browser.goto(url)

    # goto blocks until the page loads; grab the body to confirm it rendered
    @browser.at_css('body')

    scroll_count = 0
    while scroll_count < max_scrolls
      # Get current scroll height
      current_height = @browser.evaluate("document.body.scrollHeight")

      # Scroll to bottom
      @browser.evaluate("window.scrollTo(0, document.body.scrollHeight)")

      # Wait for content to load
      sleep(2)

      # Check if new content loaded
      new_height = @browser.evaluate("document.body.scrollHeight")
      break if new_height == current_height

      scroll_count += 1
      puts "Scroll #{scroll_count}: Height changed from #{current_height} to #{new_height}"
    end

    # Extract data; extract_data is the same helper defined
    # in the Selenium example
    doc = Nokogiri::HTML(@browser.body)
    extract_data(doc)
  end

  def close
    @browser.quit
  end
end
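Usage follows the same shape (placeholder URL again):

scraper = FerrumInfiniteScrollScraper.new
data = scraper.scrape_with_ferrum('https://example.com/gallery', max_scrolls: 8)
puts "Scraped #{data.length} items"
scraper.close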
Handling Different Infinite Scroll Patterns
Pattern 1: Scroll-Based Loading
Most common pattern where content loads when reaching the bottom:
def handle_scroll_based_loading
  # Initialize the baseline height before the loop starts
  @previous_height = @driver.execute_script("return document.body.scrollHeight")

  loop do
    # Scroll to bottom
    @driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait and check for new content
    sleep(2)

    new_height = @driver.execute_script("return document.body.scrollHeight")
    break if new_height == @previous_height

    @previous_height = new_height
  end
end
Pattern 2: Load More Button
Some sites use a "Load More" button instead of automatic scrolling:
def handle_load_more_button(button_selector = '.load-more-btn', max_clicks: 50)
  # Cap the clicks so a button that never disappears can't loop forever
  clicks = 0
  while clicks < max_clicks && @driver.find_elements(css: button_selector).any?
    button = @driver.find_element(css: button_selector)

    # Scroll button into view and click
    @driver.execute_script("arguments[0].scrollIntoView();", button)
    button.click
    clicks += 1

    # Wait for content to load
    sleep(3)
  end
end
Pattern 3: Intersection Observer API
Modern sites often use the Intersection Observer API to trigger loading. You can detect this pattern by monitoring the network requests the page fires as you scroll:
def handle_intersection_observer
  # Patch window.fetch to record every request the page makes
  @driver.execute_script(<<~JS)
    window.networkRequests = [];
    const originalFetch = window.fetch;
    window.fetch = function(...args) {
      window.networkRequests.push(args[0]);
      return originalFetch.apply(this, args);
    };
  JS

  # Scroll until scrolling stops triggering new requests
  previous_requests = 0
  loop do
    @driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    sleep(2)

    current_requests = @driver.execute_script("return window.networkRequests.length")
    break if current_requests == previous_requests

    previous_requests = current_requests
  end
end
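Once the loop finishes, you can read the captured URLs back out and look for the underlying endpoint. This sketch assumes the page passes plain URL strings to fetch, and the '/api/' filter is only a placeholder for whatever pattern your target site uses:

captured = @driver.execute_script("return window.networkRequests")
api_calls = captured.map(&:to_s).select { |url| url.include?('/api/') }
puts api_calls.uniq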
Best Practices and Optimization
1. Implement Proper Error Handling
def robust_infinite_scroll(url, max_retries: 3)
  retries = 0
  begin
    scrape_infinite_scroll(url)
  rescue StandardError => e
    retries += 1
    if retries <= max_retries
      puts "Error occurred: #{e.message}. Retrying (#{retries}/#{max_retries})"
      sleep(5)
      retry
    else
      puts "Max retries reached. Failing gracefully."
      raise
    end
  end
end
2. Add Rate Limiting
class RateLimitedScraper
  def initialize(delay: 2)
    @delay = delay
    @last_request_time = Time.at(0) # epoch, so the first scroll never sleeps
  end

  # Assumes @driver is set up as in the earlier Selenium examples
  def rate_limited_scroll
    elapsed = Time.now - @last_request_time
    sleep(@delay - elapsed) if elapsed < @delay

    # Perform scroll action
    @driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    @last_request_time = Time.now
  end
end
3. Memory Management
def memory_efficient_scraping(url, target_items:, batch_size: 50)
  items_processed = 0

  while items_processed < target_items
    # scrape_batch and process_batch are placeholders for your own
    # extraction and storage logic: process and store each batch
    # immediately instead of accumulating everything in memory
    batch_data = scrape_batch(batch_size)
    process_batch(batch_data)

    items_processed += batch_data.length

    # Reload the page periodically to keep the DOM (and memory use) small
    if items_processed.positive? && (items_processed % 200).zero?
      @driver.navigate.refresh
      sleep(5)
    end
  end
end
Troubleshooting Common Issues
Issue 1: Content Not Loading
If content isn't loading properly, try increasing wait times or implementing more targeted waiting strategies, such as waiting for the item count to actually change:
def wait_for_element_count_change(selector, timeout: 30)
  initial_count = @driver.find_elements(css: selector).length

  # Selenium's Wait takes its timeout at construction, not per call
  Selenium::WebDriver::Wait.new(timeout: timeout).until do
    @driver.find_elements(css: selector).length > initial_count
  end
end
Issue 2: Anti-Bot Detection
Implement human-like scrolling patterns:
def human_like_scroll
  # Random scroll amounts
  scroll_amount = rand(300..800)

  # Variable scroll speed
  scroll_steps = rand(3..7)
  step_size = scroll_amount / scroll_steps

  scroll_steps.times do
    @driver.execute_script("window.scrollBy(0, #{step_size});")
    sleep(rand(0.1..0.3))
  end
end
Performance Comparison
| Method   | Memory Usage | Speed  | Ease of Use | Stability |
|----------|--------------|--------|-------------|-----------|
| Selenium | High         | Medium | High        | High      |
| Watir    | High         | Medium | Very High   | High      |
| Ferrum   | Low          | High   | Medium      | Medium    |
Conclusion
Scraping infinite scroll websites in Ruby requires understanding the underlying JavaScript patterns and choosing the right tools. Selenium WebDriver offers the most comprehensive solution with excellent stability, while Ferrum provides better performance for high-volume scraping. Watir strikes a balance with its Ruby-friendly syntax.
Key considerations include implementing proper error handling, respecting rate limits, managing memory efficiently, and adapting your approach based on the specific infinite scroll implementation of your target website.
Remember to always check the website's robots.txt file and terms of service before scraping, and consider using official APIs when available for better performance and reliability.
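As a quick first step, you can at least pull the robots.txt file down before you start (a simplistic sketch; it prints the rules rather than parsing them):

require 'net/http'

# Fetch and display a site's robots.txt before scraping it
robots = Net::HTTP.get(URI('https://example.com/robots.txt'))
puts robots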