How to Handle Dynamic Content That Loads After Page Load in Ruby
Modern web applications heavily rely on JavaScript to load content dynamically after the initial page load. This creates a significant challenge for traditional web scraping approaches that only capture the initial HTML. In Ruby, handling dynamic content requires specialized tools and techniques that can execute JavaScript and wait for content to appear.
Understanding Dynamic Content
Dynamic content refers to HTML elements, data, or entire sections of a webpage that are loaded asynchronously through JavaScript, AJAX requests, or other client-side technologies. This content is not present in the initial HTML response and becomes available only after the browser executes JavaScript code.
Common examples include:

- Infinite scroll feeds on social media platforms
- Search results that load via AJAX
- Product listings that appear after filtering
- Comment sections loaded dynamically
- Single Page Applications (SPAs) that render content client-side
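To see why a plain HTTP fetch falls short, consider what a static client receives for such a page. The markup below is a hypothetical sketch: the results container ships empty, and only the browser's JavaScript fills it in.

```ruby
# Hypothetical initial HTML for an AJAX-driven search page: the container
# exists but is empty; the data arrives only after fetch() runs in a browser.
initial_html = <<~HTML
  <div id="results"></div>
  <script>
    fetch('/api/results').then(r => r.json()).then(render);
  </script>
HTML

# A static scraper sees the container but none of the result rows
has_container = initial_html.include?('<div id="results">')
has_results   = initial_html.include?('class="result-row"')

puts "container present: #{has_container}, results present: #{has_results}"
```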
Ruby Solutions for Dynamic Content
1. Using Selenium WebDriver
Selenium WebDriver is the most popular solution for handling dynamic content in Ruby. It controls a real browser instance and can execute JavaScript.
```ruby
require 'selenium-webdriver'

# Configure Chrome driver with headless mode
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

driver = Selenium::WebDriver.for(:chrome, options: options)

begin
  # Navigate to the page
  driver.get('https://example.com/dynamic-content')

  # Wait for a specific element to appear
  wait = Selenium::WebDriver::Wait.new(timeout: 10)
  dynamic_element = wait.until do
    driver.find_element(css: '.dynamic-content')
  end

  # Extract the content
  content = dynamic_element.text
  puts content
ensure
  driver.quit
end
```
2. Advanced Waiting Strategies
Different types of dynamic content require different waiting strategies:
```ruby
require 'selenium-webdriver'

class DynamicContentScraper
  def initialize
    options = Selenium::WebDriver::Chrome::Options.new
    options.add_argument('--headless')
    @driver = Selenium::WebDriver.for(:chrome, options: options)
    @wait = Selenium::WebDriver::Wait.new(timeout: 15)
  end

  def wait_for_element_present(selector)
    @wait.until { @driver.find_element(css: selector) }
  end

  def wait_for_element_visible(selector)
    @wait.until do
      element = @driver.find_element(css: selector)
      element.displayed?
    end
  end

  def wait_for_text_to_appear(selector, expected_text)
    @wait.until do
      element = @driver.find_element(css: selector)
      element.text.include?(expected_text)
    end
  end

  # Only works on pages that load jQuery; other stacks need a different signal
  def wait_for_ajax_completion
    @wait.until do
      @driver.execute_script('return jQuery.active == 0')
    end
  end

  def scrape_infinite_scroll
    @driver.get('https://example.com/infinite-scroll')

    last_height = @driver.execute_script('return document.body.scrollHeight')

    loop do
      # Scroll to the bottom of the page
      @driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

      # Give new content time to load
      sleep(2)

      new_height = @driver.execute_script('return document.body.scrollHeight')
      break if new_height == last_height

      last_height = new_height
    end

    # Extract all loaded content
    items = @driver.find_elements(css: '.scroll-item')
    items.map(&:text)
  end

  def close
    @driver.quit
  end
end
```
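Selenium's Wait class is essentially a poll-until-true loop. The same pattern can be written as a plain Ruby helper, with no browser dependency, and reused with tools that lack a built-in Wait. The `wait_until` helper below is a hypothetical sketch of that idea:

```ruby
# Generic polling helper: yields the block until it returns a truthy value,
# raising if the deadline passes first.
def wait_until(timeout: 10, interval: 0.5)
  deadline = Time.now + timeout
  loop do
    result = yield
    return result if result
    raise "condition not met within #{timeout}s" if Time.now >= deadline
    sleep(interval)
  end
end

# Example: the condition becomes truthy on the third poll
checks = 0
value = wait_until(timeout: 5, interval: 0.01) do
  checks += 1
  checks >= 3 && :ready
end
puts value  # => ready
```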
3. Using Cuprite for Faster Performance
Cuprite is a pure-Ruby Capybara driver built on the Chrome DevTools Protocol (via Ferrum). Because it skips the WebDriver protocol entirely, it is often faster than Selenium. It is used through Capybara, whose finders wait for elements automatically:

```ruby
require 'capybara/cuprite'

Capybara.register_driver(:cuprite) do |app|
  Capybara::Cuprite::Driver.new(
    app,
    window_size: [1200, 800],
    headless: true,
    timeout: 30
  )
end

# Capybara's finders retry automatically for up to this many seconds
Capybara.default_max_wait_time = 10

session = Capybara::Session.new(:cuprite)

begin
  session.visit('https://example.com/spa-application')

  # find blocks until the element appears (or raises after the wait time)
  session.find('.main-content')

  # Extract content once it has rendered
  content = session.find('.dynamic-section').text
  puts content
ensure
  session.quit
end
```
4. Handling AJAX Requests
For applications that load data via AJAX, you can capture the browser's network activity through Chrome's performance log and discover the underlying API calls:

```ruby
require 'selenium-webdriver'
require 'json'

# Enable Chrome's performance log to capture network events
options = Selenium::WebDriver::Chrome::Options.new
options.add_option('goog:loggingPrefs', { browser: 'ALL', performance: 'ALL' })

driver = Selenium::WebDriver.for(:chrome, options: options)
driver.get('https://api-driven-site.com')

# Trigger the AJAX request
search_box = driver.find_element(name: 'search')
search_box.send_keys('ruby scraping')
search_box.submit

# Wait for results
wait = Selenium::WebDriver::Wait.new(timeout: 10)
wait.until { driver.find_elements(css: '.search-result').length > 0 }

# Read the performance log to find API calls
logs = driver.manage.logs.get(:performance)
api_calls = logs.select do |log|
  message = JSON.parse(log.message)
  message['message']['method'] == 'Network.responseReceived'
end

api_calls.each do |call|
  response_data = JSON.parse(call.message)
  url = response_data['message']['params']['response']['url']
  puts "API Call: #{url}" if url.include?('api')
end

driver.quit
```
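The log-filtering step is plain JSON parsing, so it can be checked without a browser. The entries below are a hand-written approximation of Chrome's performance-log format, not real captured output:

```ruby
require 'json'

# Simulated performance-log messages (the real ones come from
# driver.manage.logs.get(:performance), each carrying a JSON `message` string)
sample_messages = [
  '{"message":{"method":"Network.responseReceived","params":{"response":{"url":"https://example.com/api/search?q=ruby"}}}}',
  '{"message":{"method":"Network.requestWillBeSent","params":{}}}',
  '{"message":{"method":"Network.responseReceived","params":{"response":{"url":"https://example.com/logo.png"}}}}'
]

api_urls = sample_messages.filter_map do |raw|
  parsed = JSON.parse(raw)
  next unless parsed.dig('message', 'method') == 'Network.responseReceived'
  url = parsed.dig('message', 'params', 'response', 'url')
  url if url&.include?('/api/')
end

p api_urls  # => ["https://example.com/api/search?q=ruby"]
```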
5. Ruby with Headless Chrome via Ferrum
Ferrum provides a high-level API for Chrome DevTools Protocol:
```ruby
require 'ferrum'

browser = Ferrum::Browser.new(
  headless: true,
  window_size: [1024, 768],
  timeout: 30
)

begin
  browser.visit('https://example.com/react-app')

  # Wait for the app's root element, then for network traffic to settle
  browser.at_css('#root')
  browser.network.wait_for_idle(timeout: 10)

  # Trigger lazy loading by scrolling partway down the page
  browser.execute <<~JS
    window.scrollTo(0, document.body.scrollHeight / 2);
  JS

  # Ferrum's finders do not wait, so poll briefly for lazy-loaded content
  deadline = Time.now + 5
  sleep(0.2) until browser.at_css('.lazy-content') || Time.now > deadline

  # Extract data in one pass via JavaScript evaluated in the page
  items = browser.evaluate <<~JS
    Array.from(document.querySelectorAll('.product-item')).map(el => ({
      title: el.querySelector('.title')?.textContent?.trim(),
      price: el.querySelector('.price')?.textContent?.trim(),
      url: el.querySelector('a')?.href
    }))
  JS

  puts items.inspect
ensure
  browser.quit
end
```
Best Practices for Dynamic Content Scraping
1. Implement Robust Error Handling
```ruby
class RobustScraper
  MAX_RETRIES = 3

  def scrape_with_retry(url)
    retries = 0
    begin
      driver.get(url)
      wait_for_content_load
      extract_data
    rescue Selenium::WebDriver::Error::TimeoutError => e
      retries += 1
      if retries <= MAX_RETRIES
        puts "Timeout error, retrying... (#{retries}/#{MAX_RETRIES})"
        sleep(2)
        retry
      else
        raise "Failed after #{MAX_RETRIES} retries: #{e.message}"
      end
    rescue => e
      puts "Unexpected error: #{e.message}"
      nil
    end
  end

  private

  def wait_for_content_load
    wait = Selenium::WebDriver::Wait.new(timeout: 15)
    wait.until do
      driver.execute_script('return document.readyState') == 'complete' &&
        driver.find_elements(css: '.loading-spinner').empty?
    end
  end
end
```
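The fixed two-second sleep between retries can be replaced with exponential backoff, so repeated failures wait progressively longer before retrying. A small standalone helper (a hypothetical `with_backoff`, independent of Selenium) illustrates the idea:

```ruby
# Retries the block with exponentially growing delays: base, 2*base, 4*base...
def with_backoff(max_retries: 3, base_delay: 0.01)
  attempts = 0
  begin
    yield
  rescue StandardError
    attempts += 1
    raise if attempts > max_retries
    sleep(base_delay * (2**(attempts - 1)))
    retry
  end
end

# Example: fails twice, then succeeds on the third attempt
calls = 0
result = with_backoff(max_retries: 3) do
  calls += 1
  raise 'flaky' if calls < 3
  :ok
end
puts "#{result} after #{calls} calls"  # => ok after 3 calls
```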
2. Optimize Performance
```ruby
# Cache results in memory so repeated requests skip the browser round trip
class CachedScraper
  def initialize
    @cache = {}
    setup_driver
  end

  def scrape_page(url, cache_key = nil)
    cache_key ||= url
    return @cache[cache_key] if @cache[cache_key]

    @driver.get(url)
    wait_for_dynamic_content
    data = extract_data
    @cache[cache_key] = data if data
    data
  end

  private

  def setup_driver
    options = Selenium::WebDriver::Chrome::Options.new
    options.add_argument('--headless')
    # Skip image downloads for faster loading
    options.add_argument('--blink-settings=imagesEnabled=false')
    # If a page needs no JavaScript at all, skip the browser entirely and
    # fetch it with a plain HTTP client instead
    @driver = Selenium::WebDriver.for(:chrome, options: options)
  end
end
```
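The cache above never expires, so a long-running scraper would keep serving stale pages forever. A minimal TTL wrapper (a hypothetical `TtlCache`, sketched here without any browser code) adds an expiry:

```ruby
# Caches block results per key, re-running the block once the TTL elapses
class TtlCache
  def initialize(ttl_seconds)
    @ttl = ttl_seconds
    @store = {}
  end

  def fetch(key)
    entry = @store[key]
    return entry[:value] if entry && (Time.now - entry[:stored_at]) < @ttl

    value = yield
    @store[key] = { value: value, stored_at: Time.now }
    value
  end
end

# Example with a fake "scrape": the second fetch hits the cache
cache = TtlCache.new(60)
scrapes = 0
2.times { cache.fetch('https://example.com') { scrapes += 1; "page-#{scrapes}" } }
puts scrapes  # => 1
```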
3. Monitor Network Activity
When dealing with AJAX requests and dynamic loading, monitoring network activity helps ensure all data has loaded:
```ruby
def wait_for_network_idle(timeout = 10)
  start_time = Time.now

  # Patch window.fetch to count in-flight requests. Note: this only sees
  # fetch() calls made after this point; XMLHttpRequest traffic and requests
  # already in flight are not counted.
  driver.execute_script(<<~JS)
    window.activeRequests = 0;
    (function() {
      var originalFetch = window.fetch;
      window.fetch = function() {
        window.activeRequests++;
        return originalFetch.apply(this, arguments)
          .finally(() => window.activeRequests--);
      };
    })();
  JS

  loop do
    active_requests = driver.execute_script('return window.activeRequests || 0')
    if active_requests == 0
      # No active requests; wait a moment and re-check to be sure
      sleep(1)
      break if driver.execute_script('return window.activeRequests || 0') == 0
    end
    break if Time.now - start_time > timeout
    sleep(0.5)
  end
end
```
Common Challenges and Solutions
Handling Single Page Applications
SPAs require special handling as they often render content entirely through JavaScript:
```ruby
def scrape_spa(url, route_selector = nil)
  driver.get(url)

  # Wait for the app shell to mount
  wait = Selenium::WebDriver::Wait.new(timeout: 20)
  wait.until { driver.find_element(css: '#app, [data-react-root], .vue-app') }

  # Wait for route-specific content if specified
  wait.until { driver.find_element(css: route_selector) } if route_selector

  # Additional wait for async data loading
  sleep(3)

  extract_spa_data
end
```
Dealing with Infinite Scroll
```ruby
def scrape_infinite_scroll(max_scrolls = 10)
  scroll_count = 0
  previous_content_length = 0

  while scroll_count < max_scrolls
    # Stop once a scroll produces no new items
    current_content = driver.find_elements(css: '.content-item')
    break if current_content.length == previous_content_length

    previous_content_length = current_content.length

    # Scroll to the bottom and give new content time to load
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    sleep(2)

    scroll_count += 1
  end

  driver.find_elements(css: '.content-item').map(&:text)
end
```
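The stopping rule here (quit when a scroll adds no items, capped at `max_scrolls`) is worth exercising in isolation. The sketch below replays the same loop against a hypothetical fake feed object instead of a driver:

```ruby
# Fake page: each "scroll" reveals another batch of items until the feed dries up
class FakeFeed
  attr_reader :loaded

  def initialize(batches)
    @batches = batches
    @loaded = @batches.shift || 0
  end

  def scroll!
    @loaded += @batches.shift || 0
  end
end

# Same loop shape as scrape_infinite_scroll, minus the driver
def collect_until_stable(feed, max_scrolls: 10)
  scrolls = 0
  previous = -1
  while scrolls < max_scrolls
    break if feed.loaded == previous
    previous = feed.loaded
    feed.scroll!   # stands in for execute_script + sleep
    scrolls += 1
  end
  previous
end

feed = FakeFeed.new([10, 10, 5])  # 25 items total, then no more
puts collect_until_stable(feed)   # => 25
```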
Working with React and Vue Applications
Modern frontend frameworks often use virtual DOM and require specific waiting strategies:
```ruby
def wait_for_react_app
  # Heuristic only: assumes React is exposed globally and the app sets
  # data-reactroot (dropped in React 16+); adjust selectors for your target
  wait = Selenium::WebDriver::Wait.new(timeout: 15)
  wait.until do
    driver.execute_script(<<~JS)
      return window.React &&
             document.querySelector('[data-reactroot]') &&
             !document.querySelector('.loading, .spinner');
    JS
  end
end

def wait_for_vue_app
  # Heuristic only: __vue__ is set by Vue 2 (Vue 3 uses __vue_app__), and
  # the spinner selector assumes Vuetify
  wait = Selenium::WebDriver::Wait.new(timeout: 15)
  wait.until do
    driver.execute_script(<<~JS)
      return window.Vue &&
             document.querySelector('#app').__vue__ &&
             !document.querySelector('.v-progress-circular');
    JS
  end
end
```
Performance Optimization Techniques
1. Disable Unnecessary Resources
```ruby
def setup_optimized_driver
  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument('--headless')
  options.add_argument('--no-sandbox')
  options.add_argument('--disable-dev-shm-usage')
  options.add_argument('--disable-gpu')
  options.add_argument('--disable-extensions')
  options.add_argument('--window-size=1280,720')

  # Block images and stylesheets via Chrome content settings (2 = block)
  options.add_preference('profile.managed_default_content_settings.images', 2)
  options.add_preference('profile.managed_default_content_settings.stylesheets', 2)

  Selenium::WebDriver.for(:chrome, options: options)
end
```
2. Use Connection Pooling
```ruby
class WebDriverPool
  def initialize(size = 5)
    @pool = Queue.new
    size.times { @pool << create_driver }
  end

  def with_driver
    driver = @pool.pop
    begin
      yield driver
    ensure
      @pool << driver
    end
  end

  private

  def create_driver
    setup_optimized_driver
  end
end

# Usage: combine with threads so the pooled drivers actually work in parallel
pool = WebDriverPool.new(3)
results = Queue.new

threads = urls.map do |url|
  Thread.new do
    pool.with_driver { |driver| results << scrape_page(driver, url) }
  end
end
threads.each(&:join)
```
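Because the pool pattern is driver-agnostic, its checkout/checkin behavior can be verified with plain stub objects before wiring in real (and expensive) browser instances:

```ruby
# Same pool pattern, with symbols standing in for WebDriver instances
class ObjectPool
  def initialize(objects)
    @pool = Queue.new
    objects.each { |o| @pool << o }
  end

  def with_resource
    resource = @pool.pop
    begin
      yield resource
    ensure
      @pool << resource
    end
  end
end

pool = ObjectPool.new([:driver_a, :driver_b])
used = Queue.new

# Six concurrent "scrapes" share the two pooled resources
threads = 6.times.map do
  Thread.new { pool.with_resource { |d| used << d } }
end
threads.each(&:join)

# Count how often each stub driver was handed out
counts = Array.new(6) { used.pop }.tally
puts counts
```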
Error Handling and Debugging
Advanced Error Recovery
```ruby
class ResilientScraper
  MAX_RETRIES = 3

  def initialize
    @driver = setup_driver
  end

  # Ruby's `retry` keyword is only valid inside a rescue clause, so the
  # recovery logic lives directly in the handlers rather than in helper methods
  def scrape_with_recovery(url)
    attempts = 0
    begin
      navigate_and_wait(url)
      extract_content
    rescue Selenium::WebDriver::Error::TimeoutError
      attempts += 1
      raise "Max retries exceeded due to timeouts" if attempts > MAX_RETRIES
      puts "Timeout occurred, retrying... (#{attempts}/#{MAX_RETRIES})"
      sleep(2)
      retry
    rescue Selenium::WebDriver::Error::NoSuchElementError
      # Bail out early on an error page; otherwise wait longer for dynamic content
      raise "Page not found or error page detected" if error_page?
      attempts += 1
      raise "Element never appeared after #{MAX_RETRIES} retries" if attempts > MAX_RETRIES
      sleep(5)
      retry
    rescue Selenium::WebDriver::Error::StaleElementReferenceError
      # The DOM changed under us; re-find elements on the next attempt
      attempts += 1
      sleep(1)
      retry if attempts <= MAX_RETRIES
      raise
    rescue Net::ReadTimeout
      # Restart the driver if network issues persist
      @driver.quit
      @driver = setup_driver
      attempts += 1
      retry if attempts <= MAX_RETRIES
      raise
    end
  end

  private

  def error_page?
    source = @driver.page_source
    source.include?('404') || source.include?('error')
  end
end
```
Conclusion
Handling dynamic content in Ruby requires the right tools and strategies. Selenium WebDriver remains the most versatile option, while newer alternatives like Cuprite and Ferrum offer better performance for specific use cases. The key is to understand the loading patterns of your target website and implement appropriate waiting strategies.
Remember to always respect website terms of service, implement proper error handling, and consider the performance impact of running headless browsers. For complex scenarios involving timeouts and error handling, robust retry mechanisms are essential for reliable scraping operations.
By combining these techniques with proper monitoring and optimization, you can effectively scrape even the most dynamic modern web applications using Ruby.