How do I scrape data from websites with infinite scroll using Ruby?
Scraping data from websites with infinite scroll functionality presents unique challenges because content loads dynamically as users scroll down the page. Unlike traditional pagination, infinite scroll websites use JavaScript to continuously fetch and append new content without page refreshes. This comprehensive guide shows you multiple approaches to handling infinite scroll websites using Ruby.
Understanding Infinite Scroll Websites
Infinite scroll websites load content progressively as users reach the bottom of the page. Popular examples include social media feeds (Twitter, Instagram), e-commerce product listings, and news websites. These sites typically use AJAX requests triggered by scroll events to fetch additional content from their APIs.
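Because the data ultimately comes from those API endpoints, it is always worth checking first (via your browser's network tab) whether you can skip the browser entirely and page through the API directly. Below is a minimal sketch, assuming a hypothetical JSON endpoint at https://example.com/api/items?page=N that returns an "items" array; the rest of this guide covers sites where driving a real browser is necessary.

require 'net/http'
require 'json'

# Hypothetical endpoint; substitute the real URL and parameters you observe
# in the browser's network tab
base_url = 'https://example.com/api/items?page=%d'

items = []
(1..5).each do |page|
  body = Net::HTTP.get(URI(format(base_url, page)))
  data = JSON.parse(body)
  page_items = data['items'].to_a # assumed response shape
  break if page_items.empty?
  items.concat(page_items)
end
puts "Fetched #{items.length} items"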
Method 1: Using Selenium WebDriver with Ruby
Selenium WebDriver is the most popular solution for scraping JavaScript-heavy websites, including those with infinite scroll functionality.
Installation and Setup
First, install the required gems:
gem install selenium-webdriver
gem install nokogiri
With selenium-webdriver 4.11+, Selenium Manager downloads a matching ChromeDriver automatically; on older versions, make sure a chromedriver binary is available on your PATH.
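If you manage dependencies with Bundler, the equivalent Gemfile entries are:

# Gemfile
gem 'selenium-webdriver'
gem 'nokogiri'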
Basic Infinite Scroll Implementation
require 'selenium-webdriver'
require 'nokogiri'

class InfiniteScrollScraper
  def initialize(headless: true)
    options = Selenium::WebDriver::Chrome::Options.new
    options.add_argument('--headless') if headless
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    @driver = Selenium::WebDriver.for :chrome, options: options
    @wait = Selenium::WebDriver::Wait.new(timeout: 10)
  end

  def scrape_infinite_scroll(url, max_scrolls: 10)
    @driver.navigate.to url

    # Wait for initial content to load
    @wait.until { @driver.find_element(css: 'body') }

    previous_height = @driver.execute_script("return document.body.scrollHeight")
    scroll_count = 0

    while scroll_count < max_scrolls
      # Scroll to bottom of page
      @driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

      # Wait for new content to load
      sleep(2)

      # Check if new content has loaded
      new_height = @driver.execute_script("return document.body.scrollHeight")
      if new_height == previous_height
        puts "No more content to load or reached end of page"
        break
      end

      previous_height = new_height
      scroll_count += 1
      puts "Completed scroll #{scroll_count}/#{max_scrolls}"
    end

    # Extract data using Nokogiri
    doc = Nokogiri::HTML(@driver.page_source)
    extract_data(doc)
  end

  def extract_data(doc)
    # Customize the selectors based on your target website's structure
    doc.css('.item-selector').map do |item|
      {
        title: item.css('.title').text.strip,
        description: item.css('.description').text.strip,
        link: item.at_css('a')&.[]('href')
      }
    end
  end

  def close
    @driver.quit
  end
end
# Usage example
scraper = InfiniteScrollScraper.new(headless: false)
data = scraper.scrape_infinite_scroll('https://example.com/infinite-scroll-page', max_scrolls: 5)
puts "Scraped #{data.length} items"
scraper.close
Advanced Scroll Detection
For more robust infinite scroll handling, you can detect specific loading indicators:
# These helpers assume they live inside the InfiniteScrollScraper class above,
# so @driver and @wait are already set up
def wait_for_content_load(loading_selector = '.loading-spinner')
  # Wait for loading indicator to appear, then disappear
  @wait.until { @driver.find_element(css: loading_selector) }
  @wait.until { @driver.find_elements(css: loading_selector).empty? }
rescue Selenium::WebDriver::Error::TimeoutError
  # Loading indicator might not be present
  sleep(2)
end

def smart_infinite_scroll(url, target_selector, max_items: 100)
  @driver.navigate.to url

  items_collected = 0
  no_new_content_count = 0

  while items_collected < max_items && no_new_content_count < 3
    current_items = @driver.find_elements(css: target_selector).length

    # Scroll to bottom
    @driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait for potential new content
    wait_for_content_load('.loading-spinner')

    new_items_count = @driver.find_elements(css: target_selector).length
    if new_items_count > current_items
      items_collected = new_items_count
      no_new_content_count = 0
      puts "Loaded #{new_items_count} items total"
    else
      no_new_content_count += 1
      puts "No new content detected (attempt #{no_new_content_count}/3)"
    end
  end

  @driver.page_source
end
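A hypothetical call site (the URL and the .result-card selector are placeholders for your target site):

scraper = InfiniteScrollScraper.new
html = scraper.smart_infinite_scroll('https://example.com/results', '.result-card', max_items: 200)
items = Nokogiri::HTML(html).css('.result-card')
puts "Collected #{items.length} items"
scraper.close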
Method 2: Using Watir
Watir provides a more Ruby-like interface for browser automation:
require 'watir'
require 'nokogiri'
class WatirInfiniteScrollScraper
  def initialize(headless: true)
    # Watir passes Chrome arguments through its options hash
    args = headless ? ['--headless', '--disable-gpu'] : []
    @browser = Watir::Browser.new :chrome, options: { args: args }
  end

  def scrape_with_watir(url, scroll_pause_time: 2, max_scrolls: 10)
    @browser.goto url

    # Wait for page to load
    @browser.wait_until { @browser.body.exists? }

    last_height = @browser.execute_script("return document.body.scrollHeight")
    scrolls = 0

    while scrolls < max_scrolls
      # Scroll down
      @browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

      # Wait for new content
      sleep(scroll_pause_time)

      new_height = @browser.execute_script("return document.body.scrollHeight")
      break if new_height == last_height

      last_height = new_height
      scrolls += 1
    end

    # Parse with Nokogiri; extract_data is the same helper defined
    # in the Selenium example
    doc = Nokogiri::HTML(@browser.html)
    extract_data(doc)
  end

  def close
    @browser.close
  end
end
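Usage mirrors the Selenium class (the URL is a placeholder):

scraper = WatirInfiniteScrollScraper.new(headless: true)
data = scraper.scrape_with_watir('https://example.com/feed', max_scrolls: 5)
puts "Scraped #{data.length} items"
scraper.close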
Method 3: Using Ferrum (Lightweight Chrome API)
Ferrum provides a lightweight alternative to Selenium with better performance:
require 'ferrum'
require 'nokogiri'
class FerrumInfiniteScrollScraper
  def initialize(headless: true)
    @browser = Ferrum::Browser.new(
      headless: headless,
      window_size: [1366, 768],
      timeout: 30
    )
  end

  def scrape_with_ferrum(url, max_scrolls: 10)
    @browser.goto(url)

    # goto blocks until the page loads; grab the body to confirm it rendered
    @browser.at_css('body')

    scroll_count = 0
    while scroll_count < max_scrolls
      # Get current scroll height
      current_height = @browser.evaluate("document.body.scrollHeight")

      # Scroll to bottom
      @browser.evaluate("window.scrollTo(0, document.body.scrollHeight)")

      # Wait for content to load
      sleep(2)

      # Check if new content loaded
      new_height = @browser.evaluate("document.body.scrollHeight")
      break if new_height == current_height

      scroll_count += 1
      puts "Scroll #{scroll_count}: Height changed from #{current_height} to #{new_height}"
    end

    # Extract data; extract_data is the same helper defined
    # in the Selenium example
    doc = Nokogiri::HTML(@browser.body)
    extract_data(doc)
  end

  def close
    @browser.quit
  end
end
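Usage follows the same shape (placeholder URL again):

scraper = FerrumInfiniteScrollScraper.new
data = scraper.scrape_with_ferrum('https://example.com/gallery', max_scrolls: 8)
puts "Scraped #{data.length} items"
scraper.close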
Handling Different Infinite Scroll Patterns
Pattern 1: Scroll-Based Loading
Most common pattern where content loads when reaching the bottom:
def handle_scroll_based_loading
  # Initialize the baseline height before the loop starts
  @previous_height = @driver.execute_script("return document.body.scrollHeight")

  loop do
    # Scroll to bottom
    @driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait and check for new content
    sleep(2)

    new_height = @driver.execute_script("return document.body.scrollHeight")
    break if new_height == @previous_height

    @previous_height = new_height
  end
end
Pattern 2: Load More Button
Some sites use a "Load More" button instead of automatic scrolling:
def handle_load_more_button(button_selector = '.load-more-btn', max_clicks: 50)
  # Cap the clicks so a button that never disappears can't loop forever
  clicks = 0
  while clicks < max_clicks && @driver.find_elements(css: button_selector).any?
    button = @driver.find_element(css: button_selector)

    # Scroll button into view and click
    @driver.execute_script("arguments[0].scrollIntoView();", button)
    button.click
    clicks += 1

    # Wait for content to load
    sleep(3)
  end
end
Pattern 3: Intersection Observer API
Modern sites often use the Intersection Observer API to trigger loading. You can detect this pattern by monitoring the network requests the page fires as you scroll:
def handle_intersection_observer
  # Patch window.fetch to record every request the page makes
  @driver.execute_script(<<~JS)
    window.networkRequests = [];
    const originalFetch = window.fetch;
    window.fetch = function(...args) {
      window.networkRequests.push(args[0]);
      return originalFetch.apply(this, args);
    };
  JS

  # Scroll until scrolling stops triggering new requests
  previous_requests = 0
  loop do
    @driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    sleep(2)

    current_requests = @driver.execute_script("return window.networkRequests.length")
    break if current_requests == previous_requests

    previous_requests = current_requests
  end
end
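Once the loop finishes, you can read the captured URLs back out and look for the underlying endpoint. This sketch assumes the page passes plain URL strings to fetch, and the '/api/' filter is only a placeholder for whatever pattern your target site uses:

captured = @driver.execute_script("return window.networkRequests")
api_calls = captured.map(&:to_s).select { |url| url.include?('/api/') }
puts api_calls.uniq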
Best Practices and Optimization
1. Implement Proper Error Handling
def robust_infinite_scroll(url, max_retries: 3)
  retries = 0
  begin
    scrape_infinite_scroll(url)
  rescue StandardError => e
    retries += 1
    if retries <= max_retries
      puts "Error occurred: #{e.message}. Retrying (#{retries}/#{max_retries})"
      sleep(5)
      retry
    else
      puts "Max retries reached. Failing gracefully."
      raise
    end
  end
end
2. Add Rate Limiting
class RateLimitedScraper
  def initialize(delay: 2)
    @delay = delay
    @last_request_time = Time.at(0) # epoch, so the first scroll never sleeps
  end

  # Assumes @driver is set up as in the earlier Selenium examples
  def rate_limited_scroll
    elapsed = Time.now - @last_request_time
    sleep(@delay - elapsed) if elapsed < @delay

    # Perform scroll action
    @driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    @last_request_time = Time.now
  end
end
3. Memory Management
def memory_efficient_scraping(url, target_items:, batch_size: 50)
  items_processed = 0

  while items_processed < target_items
    # scrape_batch and process_batch are placeholders for your own
    # extraction and storage logic: process and store each batch
    # immediately instead of accumulating everything in memory
    batch_data = scrape_batch(batch_size)
    process_batch(batch_data)

    items_processed += batch_data.length

    # Reload the page periodically to keep the DOM (and memory use) small
    if items_processed.positive? && (items_processed % 200).zero?
      @driver.navigate.refresh
      sleep(5)
    end
  end
end
Troubleshooting Common Issues
Issue 1: Content Not Loading
If content isn't loading properly, try increasing wait times or implementing more targeted waiting strategies, such as waiting for the item count to actually change:
def wait_for_element_count_change(selector, timeout: 30)
  initial_count = @driver.find_elements(css: selector).length

  # Selenium's Wait takes its timeout at construction, not per call
  Selenium::WebDriver::Wait.new(timeout: timeout).until do
    @driver.find_elements(css: selector).length > initial_count
  end
end
Issue 2: Anti-Bot Detection
Implement human-like scrolling patterns:
def human_like_scroll
  # Random scroll amounts
  scroll_amount = rand(300..800)

  # Variable scroll speed
  scroll_steps = rand(3..7)
  step_size = scroll_amount / scroll_steps

  scroll_steps.times do
    @driver.execute_script("window.scrollBy(0, #{step_size});")
    sleep(rand(0.1..0.3))
  end
end
Performance Comparison
| Method   | Memory Usage | Speed  | Ease of Use | Stability |
|----------|--------------|--------|-------------|-----------|
| Selenium | High         | Medium | High        | High      |
| Watir    | High         | Medium | Very High   | High      |
| Ferrum   | Low          | High   | Medium      | Medium    |
Conclusion
Scraping infinite scroll websites in Ruby requires understanding the underlying JavaScript patterns and choosing the right tools. Selenium WebDriver offers the most comprehensive solution with excellent stability, while Ferrum provides better performance for high-volume scraping. Watir strikes a balance with its Ruby-friendly syntax.
Key considerations include implementing proper error handling, respecting rate limits, managing memory efficiently, and adapting your approach based on the specific infinite scroll implementation of your target website.
Remember to always check the website's robots.txt file and terms of service before scraping, and consider using official APIs when available for better performance and reliability.
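As a quick first step, you can at least pull the robots.txt file down before you start (a simplistic sketch; it prints the rules rather than parsing them):

require 'net/http'

# Fetch and display a site's robots.txt before scraping it
robots = Net::HTTP.get(URI('https://example.com/robots.txt'))
puts robots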