How do I scrape websites that require JavaScript execution in Ruby?
Many modern websites rely heavily on JavaScript to dynamically load content, making traditional HTTP-based scraping with libraries like Nokogiri insufficient. When websites use AJAX calls, single-page applications (SPAs), or client-side rendering, you need tools that can execute JavaScript to access the fully rendered content.
Understanding JavaScript-Heavy Websites
JavaScript-heavy websites present unique challenges for web scraping:
- Dynamic Content Loading: Content loads after the initial page load through AJAX requests
- Single Page Applications (SPAs): React, Vue, or Angular applications that render content client-side
- Infinite Scrolling: Content loads progressively as users scroll
- Interactive Elements: Buttons, forms, and modals that require user interaction
- Lazy Loading: Images and content that load only when visible
Traditional Ruby scraping libraries like Nokogiri work excellently for static HTML but cannot execute JavaScript, making them inadequate for these scenarios.
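To see the problem concretely, here is a minimal sketch (the URL and the .product-item selector are hypothetical): fetching the raw HTML with Net::HTTP and parsing it with Nokogiri returns nothing for content that is only rendered after JavaScript runs in a browser.

require 'net/http'
require 'nokogiri'

# Fetch the server-rendered HTML only; no JavaScript is executed
html = Net::HTTP.get(URI('https://example.com/spa-page'))
doc = Nokogiri::HTML(html)

# The markup shell is there, but the JavaScript-rendered items are not
puts doc.css('.product-item').size # => 0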
Solution 1: Using Ferrum (Recommended)
Ferrum is a pure Ruby library that drives headless Chrome through the Chrome DevTools Protocol, with no Selenium or WebDriver dependency, making it ideal for scraping JavaScript-heavy websites.
Installation
Add Ferrum to your Gemfile:
gem 'ferrum'
Then run:
bundle install
Basic Ferrum Usage
require 'ferrum'
# Launch a new browser instance
browser = Ferrum::Browser.new
# Navigate to a JavaScript-heavy page
browser.goto('https://example.com/spa-page')
# Wait for JavaScript to load content
browser.network.wait_for_idle
# Extract content after JavaScript execution
page_content = browser.body
title = browser.title
# Use CSS selectors to find elements
products = browser.css('.product-item').map do |element|
{
name: element.text,
price: element.at_css('.price')&.text,
url: element.at_css('a')&.attribute('href')
}
end
# Close the browser
browser.quit
puts products
Advanced Ferrum Features
require 'ferrum'
browser = Ferrum::Browser.new(
headless: true, # Run in headless mode
window_size: [1920, 1080], # Set viewport size
timeout: 30 # Set page load timeout
)
# Handle AJAX-loaded content
browser.goto('https://example.com/dynamic-content')
# Wait for a specific element to appear (Ferrum has no built-in element wait, so poll for it)
started = Time.now
sleep 0.1 until browser.at_css('.content-loaded') || Time.now - started > 10
# Scroll to trigger lazy loading
browser.execute('window.scrollTo(0, document.body.scrollHeight)')
# Wait for network requests to complete
browser.network.wait_for_idle(connections: 0, duration: 1)
# Take screenshots for debugging
browser.screenshot(path: 'debug.png')
# Evaluate custom JavaScript and read the result back (evaluate returns a value, execute does not)
result = browser.evaluate('document.querySelectorAll(".item").length')
browser.quit
Solution 2: Using Watir with Chrome
Watir is a Ruby browser-automation library built on Selenium WebDriver, and it works well with Chrome's headless mode.
Installation
gem 'watir'
gem 'webdrivers' # Downloads and manages chromedriver (optional with Selenium 4.6+, which bundles Selenium Manager)
Basic Watir Usage
require 'watir'
# Launch Chrome in headless mode
browser = Watir::Browser.new :chrome, headless: true
# Navigate to the page
browser.goto 'https://example.com/javascript-heavy-page'
# Wait for elements to load
browser.div(class: 'content').wait_until(&:present?)
# Extract data after JavaScript execution
articles = browser.divs(class: 'article').map do |article|
{
title: article.h2.text,
content: article.p.text,
date: article.span(class: 'date').text
}
end
browser.close
puts articles
Handling Complex Interactions with Watir
require 'watir'
browser = Watir::Browser.new :chrome, headless: true
browser.goto 'https://example.com/interactive-page'
# Click buttons to load more content
load_more_btn = browser.button(text: 'Load More')
while load_more_btn.present?
load_more_btn.click
browser.div(class: 'loading').wait_while(&:present?)
load_more_btn = browser.button(text: 'Load More')
end
# Fill forms if needed
browser.text_field(name: 'search').set('ruby scraping')
browser.button(type: 'submit').click
# Wait for results
browser.div(class: 'results').wait_until(&:present?)
# Extract results
results = browser.divs(class: 'result-item').map(&:text)
browser.close
Solution 3: Using Kimurai Framework
Kimurai is a Ruby scraping framework built on top of Capybara and Nokogiri that can drive headless Chrome, and it is designed specifically for web scraping.
Installation
gem 'kimurai'
Basic Kimurai Spider
require 'kimurai'
class JavaScriptSpider < Kimurai::Base
@name = 'javascript_spider'
  @engine = :selenium_chrome # headless Chrome via Selenium
@start_urls = ['https://example.com/spa-page']
def parse(response, url:, data: {})
    # Wait for JavaScript-rendered content (Capybara's has_css? waits up to the given timeout)
    browser.has_css?('.product', wait: 10)
    response = browser.current_response
# Extract data
response.css('.product').each do |product|
item = {
name: product.css('.name').text,
price: product.css('.price').text,
description: product.css('.description').text
}
      save_to 'products.json', item, format: :pretty_json
end
# Follow pagination
next_page = response.at_css('.pagination .next')
if next_page
request_to(:parse, url: absolute_url(next_page[:href], base: url))
end
end
end
JavaScriptSpider.crawl!
Solution 4: API-First Approach
Before implementing browser automation, check if the website offers APIs that provide the same data:
require 'net/http'
require 'json'
# Many SPAs load data via API calls
uri = URI('https://api.example.com/products')
response = Net::HTTP.get_response(uri)
if response.code == '200'
data = JSON.parse(response.body)
products = data['products'].map do |product|
{
name: product['name'],
price: product['price'],
description: product['description']
}
end
puts products
end
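If it is not obvious which endpoint a SPA calls, one option is to load the page once in Ferrum and inspect the recorded network traffic. A minimal sketch, assuming the example URL above; filtering by a JSON content type is just one way to narrow the list:

require 'ferrum'

browser = Ferrum::Browser.new
browser.goto('https://example.com/spa-page')
browser.network.wait_for_idle

# List requests whose responses came back as JSON - these are likely API calls
json_calls = browser.network.traffic.select do |exchange|
  response = exchange.response
  response && response.headers.any? { |key, value| key.downcase == 'content-type' && value.to_s.include?('json') }
end

json_calls.each { |exchange| puts exchange.request.url }
browser.quit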
Best Practices and Performance Tips
1. Resource Management
require 'ferrum'
class JavaScriptScraper
def initialize
@browser = Ferrum::Browser.new(
headless: true,
window_size: [1920, 1080]
)
end
def scrape_multiple_pages(urls)
results = []
urls.each do |url|
begin
@browser.goto(url)
@browser.network.wait_for_idle
# Extract data
data = extract_page_data
results << data
rescue => e
puts "Error scraping #{url}: #{e.message}"
end
end
results
ensure
@browser&.quit
end
private
def extract_page_data
# Your extraction logic here
end
end
2. Handling Dynamic Content
def wait_for_content_load(browser, timeout: 10)
  # Wait for AJAX requests to complete
  browser.network.wait_for_idle(connections: 0, duration: 2)

  # Ferrum has no built-in element wait, so poll for a custom JavaScript condition
  started = Time.now
  loop do
    loaded = browser.evaluate(
      "window.dataLoaded === true && document.querySelectorAll('.item').length > 0"
    )
    break if loaded
    raise Ferrum::TimeoutError if Time.now - started > timeout
    sleep 0.1
  end
end
3. Error Handling and Retries
def scrape_with_retry(url, max_retries: 3)
retries = 0
begin
browser.goto(url)
wait_for_content_load(browser)
extract_page_data
rescue Ferrum::TimeoutError, Ferrum::NodeNotFoundError => e
retries += 1
if retries <= max_retries
puts "Retry #{retries}/#{max_retries} for #{url}"
sleep(2 ** retries) # Exponential backoff
retry
else
puts "Failed to scrape #{url} after #{max_retries} retries: #{e.message}"
nil
end
end
end
Debugging JavaScript-Heavy Scraping
1. Visual Debugging
browser = Ferrum::Browser.new(headless: false) # Run with GUI for debugging
browser.goto('https://example.com')
# Take screenshots at different stages
browser.screenshot(path: 'before_interaction.png')
# Interact with the page
browser.at_css('.load-more-btn').click
browser.screenshot(path: 'after_interaction.png')
2. Console Logging
# Monitor browser console for errors
browser.on(:console) do |message|
puts "Console #{message.type}: #{message.text}"
end
# Check for errors that the page itself collects (window.errors is page-specific, not a browser API)
errors = browser.evaluate('window.errors || []')
puts "JavaScript errors: #{errors}" if errors.any?
When to Use Each Approach
- Ferrum: Best for most Ruby applications, pure Ruby implementation, good performance
- Watir: Excellent for complex user interactions and testing scenarios
- Kimurai: Ideal for large-scale scraping projects with built-in data processing
- API Approach: Always try this first - it's faster and more reliable when available
Performance Considerations
JavaScript execution adds significant overhead compared to static HTML scraping. Consider these optimizations:
- Disable unnecessary features:
browser = Ferrum::Browser.new(
browser_options: {
'no-sandbox': nil,
'disable-gpu': nil,
'disable-dev-shm-usage': nil,
    'blink-settings': 'imagesEnabled=false' # Disable image loading
}
)
- Reuse browser instances for multiple pages (see the sketch after this list)
- Use connection pooling for concurrent scraping
- Implement caching for repeated requests
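A minimal sketch of the browser-reuse and caching ideas above, assuming Ferrum; CachedScraper and the URLs are hypothetical:

require 'ferrum'

class CachedScraper
  def initialize
    @browser = Ferrum::Browser.new(headless: true)
    @cache = {}
  end

  # Reuse the single browser instance and cache rendered HTML by URL
  def fetch(url)
    @cache[url] ||= begin
      @browser.goto(url)
      @browser.network.wait_for_idle
      @browser.body
    end
  end

  def close
    @browser.quit
  end
end

scraper = CachedScraper.new
first = scraper.fetch('https://example.com/page-1')
again = scraper.fetch('https://example.com/page-1') # served from the in-memory cache
scraper.close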
The general patterns for handling AJAX-heavy pages and managing browser sessions are the same across tools, so techniques described for Puppeteer usually translate directly to Ferrum and Watir.
Conclusion
Scraping JavaScript-heavy websites in Ruby requires browser automation tools rather than simple HTTP clients. Ferrum offers the best balance of performance and ease of use for most Ruby applications, while Watir excels in complex interaction scenarios. Always consider API alternatives first, as they provide better performance and reliability than browser automation when available.