How to Handle Dynamic Content That Loads After Page Load in Ruby
Modern web applications heavily rely on JavaScript to load content dynamically after the initial page load. This creates a significant challenge for traditional web scraping approaches that only capture the initial HTML. In Ruby, handling dynamic content requires specialized tools and techniques that can execute JavaScript and wait for content to appear.
Understanding Dynamic Content
Dynamic content refers to HTML elements, data, or entire sections of a webpage that are loaded asynchronously through JavaScript, AJAX requests, or other client-side technologies. This content is not present in the initial HTML response and becomes available only after the browser executes JavaScript code.
Common examples include:

- Infinite scroll feeds on social media platforms
- Search results that load via AJAX
- Product listings that appear after filtering
- Comment sections loaded dynamically
- Single Page Applications (SPAs) that render content client-side
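To see why a plain HTTP fetch falls short, consider what a static client receives for such a page. The markup below is a hypothetical sketch: the results container ships empty, and only the browser's JavaScript fills it in.

```ruby
# Hypothetical initial HTML for an AJAX-driven search page: the container
# exists but is empty; the data arrives only after fetch() runs in a browser.
initial_html = <<~HTML
  <div id="results"></div>
  <script>
    fetch('/api/results').then(r => r.json()).then(render);
  </script>
HTML

# A static scraper sees the container but none of the result rows
has_container = initial_html.include?('<div id="results">')
has_results   = initial_html.include?('class="result-row"')

puts "container present: #{has_container}, results present: #{has_results}"
```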
Ruby Solutions for Dynamic Content
1. Using Selenium WebDriver
Selenium WebDriver is the most popular solution for handling dynamic content in Ruby. It controls a real browser instance and can execute JavaScript.
```ruby
require 'selenium-webdriver'

# Configure Chrome driver with headless mode
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

driver = Selenium::WebDriver.for(:chrome, options: options)

begin
  # Navigate to the page
  driver.get('https://example.com/dynamic-content')

  # Wait for a specific element to appear
  wait = Selenium::WebDriver::Wait.new(timeout: 10)
  dynamic_element = wait.until do
    driver.find_element(css: '.dynamic-content')
  end

  # Extract the content
  content = dynamic_element.text
  puts content
ensure
  driver.quit
end
```
2. Advanced Waiting Strategies
Different types of dynamic content require different waiting strategies:
```ruby
require 'selenium-webdriver'

class DynamicContentScraper
  def initialize
    options = Selenium::WebDriver::Chrome::Options.new
    options.add_argument('--headless')
    @driver = Selenium::WebDriver.for(:chrome, options: options)
    @wait = Selenium::WebDriver::Wait.new(timeout: 15)
  end

  def wait_for_element_present(selector)
    @wait.until { @driver.find_element(css: selector) }
  end

  def wait_for_element_visible(selector)
    @wait.until do
      element = @driver.find_element(css: selector)
      element.displayed?
    end
  end

  def wait_for_text_to_appear(selector, expected_text)
    @wait.until do
      element = @driver.find_element(css: selector)
      element.text.include?(expected_text)
    end
  end

  # Only works on pages that load jQuery; other stacks need a different signal
  def wait_for_ajax_completion
    @wait.until do
      @driver.execute_script('return jQuery.active == 0')
    end
  end

  def scrape_infinite_scroll
    @driver.get('https://example.com/infinite-scroll')

    last_height = @driver.execute_script('return document.body.scrollHeight')

    loop do
      # Scroll to the bottom of the page
      @driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

      # Give new content time to load
      sleep(2)

      new_height = @driver.execute_script('return document.body.scrollHeight')
      break if new_height == last_height

      last_height = new_height
    end

    # Extract all loaded content
    items = @driver.find_elements(css: '.scroll-item')
    items.map(&:text)
  end

  def close
    @driver.quit
  end
end
```
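Selenium's Wait class is essentially a poll-until-true loop. The same pattern can be written as a plain Ruby helper, with no browser dependency, and reused with tools that lack a built-in Wait. The `wait_until` helper below is a hypothetical sketch of that idea:

```ruby
# Generic polling helper: yields the block until it returns a truthy value,
# raising if the deadline passes first.
def wait_until(timeout: 10, interval: 0.5)
  deadline = Time.now + timeout
  loop do
    result = yield
    return result if result
    raise "condition not met within #{timeout}s" if Time.now >= deadline
    sleep(interval)
  end
end

# Example: the condition becomes truthy on the third poll
checks = 0
value = wait_until(timeout: 5, interval: 0.01) do
  checks += 1
  checks >= 3 && :ready
end
puts value  # => ready
```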
3. Using Cuprite for Faster Performance
Cuprite is a pure-Ruby Capybara driver built on the Chrome DevTools Protocol (via Ferrum). Because it skips the WebDriver protocol entirely, it is often faster than Selenium. It is used through Capybara, whose finders wait for elements automatically:

```ruby
require 'capybara/cuprite'

Capybara.register_driver(:cuprite) do |app|
  Capybara::Cuprite::Driver.new(
    app,
    window_size: [1200, 800],
    headless: true,
    timeout: 30
  )
end

# Capybara's finders retry automatically for up to this many seconds
Capybara.default_max_wait_time = 10

session = Capybara::Session.new(:cuprite)

begin
  session.visit('https://example.com/spa-application')

  # find blocks until the element appears (or raises after the wait time)
  session.find('.main-content')

  # Extract content once it has rendered
  content = session.find('.dynamic-section').text
  puts content
ensure
  session.quit
end
```
4. Handling AJAX Requests
For applications that load data via AJAX, you can capture the browser's network activity through Chrome's performance log and discover the underlying API calls:

```ruby
require 'selenium-webdriver'
require 'json'

# Enable Chrome's performance log to capture network events
options = Selenium::WebDriver::Chrome::Options.new
options.add_option('goog:loggingPrefs', { browser: 'ALL', performance: 'ALL' })

driver = Selenium::WebDriver.for(:chrome, options: options)
driver.get('https://api-driven-site.com')

# Trigger the AJAX request
search_box = driver.find_element(name: 'search')
search_box.send_keys('ruby scraping')
search_box.submit

# Wait for results
wait = Selenium::WebDriver::Wait.new(timeout: 10)
wait.until { driver.find_elements(css: '.search-result').length > 0 }

# Read the performance log to find API calls
logs = driver.manage.logs.get(:performance)
api_calls = logs.select do |log|
  message = JSON.parse(log.message)
  message['message']['method'] == 'Network.responseReceived'
end

api_calls.each do |call|
  response_data = JSON.parse(call.message)
  url = response_data['message']['params']['response']['url']
  puts "API Call: #{url}" if url.include?('api')
end

driver.quit
```
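The log-filtering step is plain JSON parsing, so it can be checked without a browser. The entries below are a hand-written approximation of Chrome's performance-log format, not real captured output:

```ruby
require 'json'

# Simulated performance-log messages (the real ones come from
# driver.manage.logs.get(:performance), each carrying a JSON `message` string)
sample_messages = [
  '{"message":{"method":"Network.responseReceived","params":{"response":{"url":"https://example.com/api/search?q=ruby"}}}}',
  '{"message":{"method":"Network.requestWillBeSent","params":{}}}',
  '{"message":{"method":"Network.responseReceived","params":{"response":{"url":"https://example.com/logo.png"}}}}'
]

api_urls = sample_messages.filter_map do |raw|
  parsed = JSON.parse(raw)
  next unless parsed.dig('message', 'method') == 'Network.responseReceived'
  url = parsed.dig('message', 'params', 'response', 'url')
  url if url&.include?('/api/')
end

p api_urls  # => ["https://example.com/api/search?q=ruby"]
```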
5. Ruby with Headless Chrome via Ferrum
Ferrum provides a high-level API for Chrome DevTools Protocol:
```ruby
require 'ferrum'

browser = Ferrum::Browser.new(
  headless: true,
  window_size: [1024, 768],
  timeout: 30
)

begin
  browser.visit('https://example.com/react-app')

  # Wait for the app's root element, then for network traffic to settle
  browser.at_css('#root')
  browser.network.wait_for_idle(timeout: 10)

  # Trigger lazy loading by scrolling partway down the page
  browser.execute <<~JS
    window.scrollTo(0, document.body.scrollHeight / 2);
  JS

  # Ferrum's finders do not wait, so poll briefly for lazy-loaded content
  deadline = Time.now + 5
  sleep(0.2) until browser.at_css('.lazy-content') || Time.now > deadline

  # Extract data in one pass via JavaScript evaluated in the page
  items = browser.evaluate <<~JS
    Array.from(document.querySelectorAll('.product-item')).map(el => ({
      title: el.querySelector('.title')?.textContent?.trim(),
      price: el.querySelector('.price')?.textContent?.trim(),
      url: el.querySelector('a')?.href
    }))
  JS

  puts items.inspect
ensure
  browser.quit
end
```
Best Practices for Dynamic Content Scraping
1. Implement Robust Error Handling
```ruby
class RobustScraper
  MAX_RETRIES = 3

  def scrape_with_retry(url)
    retries = 0
    begin
      driver.get(url)
      wait_for_content_load
      extract_data
    rescue Selenium::WebDriver::Error::TimeoutError => e
      retries += 1
      if retries <= MAX_RETRIES
        puts "Timeout error, retrying... (#{retries}/#{MAX_RETRIES})"
        sleep(2)
        retry
      else
        raise "Failed after #{MAX_RETRIES} retries: #{e.message}"
      end
    rescue => e
      puts "Unexpected error: #{e.message}"
      nil
    end
  end

  private

  def wait_for_content_load
    wait = Selenium::WebDriver::Wait.new(timeout: 15)
    wait.until do
      driver.execute_script('return document.readyState') == 'complete' &&
        driver.find_elements(css: '.loading-spinner').empty?
    end
  end
end
```
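The fixed two-second sleep between retries can be replaced with exponential backoff, so repeated failures wait progressively longer before retrying. A small standalone helper (a hypothetical `with_backoff`, independent of Selenium) illustrates the idea:

```ruby
# Retries the block with exponentially growing delays: base, 2*base, 4*base...
def with_backoff(max_retries: 3, base_delay: 0.01)
  attempts = 0
  begin
    yield
  rescue StandardError
    attempts += 1
    raise if attempts > max_retries
    sleep(base_delay * (2**(attempts - 1)))
    retry
  end
end

# Example: fails twice, then succeeds on the third attempt
calls = 0
result = with_backoff(max_retries: 3) do
  calls += 1
  raise 'flaky' if calls < 3
  :ok
end
puts "#{result} after #{calls} calls"  # => ok after 3 calls
```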
2. Optimize Performance
```ruby
# Cache results in memory so repeated requests skip the browser round trip
class CachedScraper
  def initialize
    @cache = {}
    setup_driver
  end

  def scrape_page(url, cache_key = nil)
    cache_key ||= url
    return @cache[cache_key] if @cache[cache_key]

    @driver.get(url)
    wait_for_dynamic_content
    data = extract_data
    @cache[cache_key] = data if data
    data
  end

  private

  def setup_driver
    options = Selenium::WebDriver::Chrome::Options.new
    options.add_argument('--headless')
    # Skip image downloads for faster loading
    options.add_argument('--blink-settings=imagesEnabled=false')
    # If a page needs no JavaScript at all, skip the browser entirely and
    # fetch it with a plain HTTP client instead
    @driver = Selenium::WebDriver.for(:chrome, options: options)
  end
end
```
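The cache above never expires, so a long-running scraper would keep serving stale pages forever. A minimal TTL wrapper (a hypothetical `TtlCache`, sketched here without any browser code) adds an expiry:

```ruby
# Caches block results per key, re-running the block once the TTL elapses
class TtlCache
  def initialize(ttl_seconds)
    @ttl = ttl_seconds
    @store = {}
  end

  def fetch(key)
    entry = @store[key]
    return entry[:value] if entry && (Time.now - entry[:stored_at]) < @ttl

    value = yield
    @store[key] = { value: value, stored_at: Time.now }
    value
  end
end

# Example with a fake "scrape": the second fetch hits the cache
cache = TtlCache.new(60)
scrapes = 0
2.times { cache.fetch('https://example.com') { scrapes += 1; "page-#{scrapes}" } }
puts scrapes  # => 1
```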
3. Monitor Network Activity
When dealing with AJAX requests and dynamic loading, monitoring network activity helps ensure all data has loaded:
```ruby
def wait_for_network_idle(timeout = 10)
  start_time = Time.now

  # Patch window.fetch to count in-flight requests. Note: this only sees
  # fetch() calls made after this point; XMLHttpRequest traffic and requests
  # already in flight are not counted.
  driver.execute_script(<<~JS)
    window.activeRequests = 0;
    (function() {
      var originalFetch = window.fetch;
      window.fetch = function() {
        window.activeRequests++;
        return originalFetch.apply(this, arguments)
          .finally(() => window.activeRequests--);
      };
    })();
  JS

  loop do
    active_requests = driver.execute_script('return window.activeRequests || 0')
    if active_requests == 0
      # No active requests; wait a moment and re-check to be sure
      sleep(1)
      break if driver.execute_script('return window.activeRequests || 0') == 0
    end
    break if Time.now - start_time > timeout
    sleep(0.5)
  end
end
```
Common Challenges and Solutions
Handling Single Page Applications
SPAs require special handling as they often render content entirely through JavaScript:
```ruby
def scrape_spa(url, route_selector = nil)
  driver.get(url)

  # Wait for the app shell to mount
  wait = Selenium::WebDriver::Wait.new(timeout: 20)
  wait.until { driver.find_element(css: '#app, [data-react-root], .vue-app') }

  # Wait for route-specific content if specified
  wait.until { driver.find_element(css: route_selector) } if route_selector

  # Additional wait for async data loading
  sleep(3)

  extract_spa_data
end
```
Dealing with Infinite Scroll
```ruby
def scrape_infinite_scroll(max_scrolls = 10)
  scroll_count = 0
  previous_content_length = 0

  while scroll_count < max_scrolls
    # Stop once a scroll produces no new items
    current_content = driver.find_elements(css: '.content-item')
    break if current_content.length == previous_content_length

    previous_content_length = current_content.length

    # Scroll to the bottom and give new content time to load
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    sleep(2)

    scroll_count += 1
  end

  driver.find_elements(css: '.content-item').map(&:text)
end
```
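The stopping rule here (quit when a scroll adds no items, capped at `max_scrolls`) is worth exercising in isolation. The sketch below replays the same loop against a hypothetical fake feed object instead of a driver:

```ruby
# Fake page: each "scroll" reveals another batch of items until the feed dries up
class FakeFeed
  attr_reader :loaded

  def initialize(batches)
    @batches = batches
    @loaded = @batches.shift || 0
  end

  def scroll!
    @loaded += @batches.shift || 0
  end
end

# Same loop shape as scrape_infinite_scroll, minus the driver
def collect_until_stable(feed, max_scrolls: 10)
  scrolls = 0
  previous = -1
  while scrolls < max_scrolls
    break if feed.loaded == previous
    previous = feed.loaded
    feed.scroll!   # stands in for execute_script + sleep
    scrolls += 1
  end
  previous
end

feed = FakeFeed.new([10, 10, 5])  # 25 items total, then no more
puts collect_until_stable(feed)   # => 25
```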
Working with React and Vue Applications
Modern frontend frameworks often use virtual DOM and require specific waiting strategies:
```ruby
def wait_for_react_app
  # Heuristic only: assumes React is exposed globally and the app sets
  # data-reactroot (dropped in React 16+); adjust selectors for your target
  wait = Selenium::WebDriver::Wait.new(timeout: 15)
  wait.until do
    driver.execute_script(<<~JS)
      return window.React &&
             document.querySelector('[data-reactroot]') &&
             !document.querySelector('.loading, .spinner');
    JS
  end
end

def wait_for_vue_app
  # Heuristic only: __vue__ is set by Vue 2 (Vue 3 uses __vue_app__), and
  # the spinner selector assumes Vuetify
  wait = Selenium::WebDriver::Wait.new(timeout: 15)
  wait.until do
    driver.execute_script(<<~JS)
      return window.Vue &&
             document.querySelector('#app').__vue__ &&
             !document.querySelector('.v-progress-circular');
    JS
  end
end
```
Performance Optimization Techniques
1. Disable Unnecessary Resources
```ruby
def setup_optimized_driver
  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument('--headless')
  options.add_argument('--no-sandbox')
  options.add_argument('--disable-dev-shm-usage')
  options.add_argument('--disable-gpu')
  options.add_argument('--disable-extensions')
  options.add_argument('--window-size=1280,720')

  # Block images and stylesheets via Chrome content settings (2 = block)
  options.add_preference('profile.managed_default_content_settings.images', 2)
  options.add_preference('profile.managed_default_content_settings.stylesheets', 2)

  Selenium::WebDriver.for(:chrome, options: options)
end
```
2. Use Connection Pooling
```ruby
class WebDriverPool
  def initialize(size = 5)
    @pool = Queue.new
    size.times { @pool << create_driver }
  end

  def with_driver
    driver = @pool.pop
    begin
      yield driver
    ensure
      @pool << driver
    end
  end

  private

  def create_driver
    setup_optimized_driver
  end
end

# Usage: combine with threads so the pooled drivers actually work in parallel
pool = WebDriverPool.new(3)
results = Queue.new

threads = urls.map do |url|
  Thread.new do
    pool.with_driver { |driver| results << scrape_page(driver, url) }
  end
end
threads.each(&:join)
```
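Because the pool pattern is driver-agnostic, its checkout/checkin behavior can be verified with plain stub objects before wiring in real (and expensive) browser instances:

```ruby
# Same pool pattern, with symbols standing in for WebDriver instances
class ObjectPool
  def initialize(objects)
    @pool = Queue.new
    objects.each { |o| @pool << o }
  end

  def with_resource
    resource = @pool.pop
    begin
      yield resource
    ensure
      @pool << resource
    end
  end
end

pool = ObjectPool.new([:driver_a, :driver_b])
used = Queue.new

# Six concurrent "scrapes" share the two pooled resources
threads = 6.times.map do
  Thread.new { pool.with_resource { |d| used << d } }
end
threads.each(&:join)

# Count how often each stub driver was handed out
counts = Array.new(6) { used.pop }.tally
puts counts
```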
Error Handling and Debugging
Advanced Error Recovery
```ruby
class ResilientScraper
  MAX_RETRIES = 3

  def initialize
    @driver = setup_driver
  end

  # Ruby's `retry` keyword is only valid inside a rescue clause, so the
  # recovery logic lives directly in the handlers rather than in helper methods
  def scrape_with_recovery(url)
    attempts = 0
    begin
      navigate_and_wait(url)
      extract_content
    rescue Selenium::WebDriver::Error::TimeoutError
      attempts += 1
      raise "Max retries exceeded due to timeouts" if attempts > MAX_RETRIES
      puts "Timeout occurred, retrying... (#{attempts}/#{MAX_RETRIES})"
      sleep(2)
      retry
    rescue Selenium::WebDriver::Error::NoSuchElementError
      # Bail out early on an error page; otherwise wait longer for dynamic content
      raise "Page not found or error page detected" if error_page?
      attempts += 1
      raise "Element never appeared after #{MAX_RETRIES} retries" if attempts > MAX_RETRIES
      sleep(5)
      retry
    rescue Selenium::WebDriver::Error::StaleElementReferenceError
      # The DOM changed under us; re-find elements on the next attempt
      attempts += 1
      sleep(1)
      retry if attempts <= MAX_RETRIES
      raise
    rescue Net::ReadTimeout
      # Restart the driver if network issues persist
      @driver.quit
      @driver = setup_driver
      attempts += 1
      retry if attempts <= MAX_RETRIES
      raise
    end
  end

  private

  def error_page?
    source = @driver.page_source
    source.include?('404') || source.include?('error')
  end
end
```
Conclusion
Handling dynamic content in Ruby requires the right tools and strategies. Selenium WebDriver remains the most versatile option, while newer alternatives like Cuprite and Ferrum offer better performance for specific use cases. The key is to understand the loading patterns of your target website and implement appropriate waiting strategies.
Remember to always respect website terms of service, implement proper error handling, and consider the performance impact of running headless browsers. For complex scenarios involving timeouts and error handling, robust retry mechanisms are essential for reliable scraping operations.
By combining these techniques with proper monitoring and optimization, you can effectively scrape even the most dynamic modern web applications using Ruby.