How do I scrape websites that require JavaScript execution in Ruby?

Many modern websites rely heavily on JavaScript to dynamically load content, making traditional HTTP-based scraping with libraries like Nokogiri insufficient. When websites use AJAX calls, single-page applications (SPAs), or client-side rendering, you need tools that can execute JavaScript to access the fully rendered content.

Understanding JavaScript-Heavy Websites

JavaScript-heavy websites present unique challenges for web scraping:

  • Dynamic Content Loading: Content loads after the initial page load through AJAX requests
  • Single Page Applications (SPAs): React, Vue, or Angular applications that render content client-side
  • Infinite Scrolling: Content loads progressively as users scroll
  • Interactive Elements: Buttons, forms, and modals that require user interaction
  • Lazy Loading: Images and content that load only when visible

Traditional Ruby scraping libraries like Nokogiri work well for static HTML but cannot execute JavaScript, so they only ever see the initial server response, as the short example below illustrates.
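
A minimal sketch of the problem, assuming a hypothetical page that renders its .product-item elements client-side: fetching the raw HTML with Net::HTTP and parsing it with Nokogiri finds nothing, because the matching nodes only exist after a browser runs the page's JavaScript.

require 'net/http'
require 'nokogiri'

# Fetch only the initial server response; no JavaScript runs here
uri = URI('https://example.com/spa-page') # hypothetical SPA URL
html = Net::HTTP.get(uri)

doc = Nokogiri::HTML(html)

# Elements rendered client-side are absent from the static HTML
puts doc.css('.product-item').size # => 0 on a client-rendered page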

Solution 1: Using Ferrum (Recommended)

Ferrum is a pure Ruby library that controls headless Chrome browsers, making it ideal for scraping JavaScript-heavy websites.

Installation

Add Ferrum to your Gemfile:

gem 'ferrum'

Then run:

bundle install

Basic Ferrum Usage

require 'ferrum'

# Launch a new browser instance
browser = Ferrum::Browser.new

# Navigate to a JavaScript-heavy page
browser.goto('https://example.com/spa-page')

# Wait for JavaScript to load content
browser.network.wait_for_idle

# Extract content after JavaScript execution
page_content = browser.body
title = browser.title

# Use CSS selectors to find elements
products = browser.css('.product-item').map do |element|
  {
    name: element.text,
    price: element.at_css('.price')&.text,
    url: element.at_css('a')&.attribute('href')
  }
end

# Close the browser
browser.quit

puts products

Advanced Ferrum Features

require 'ferrum'

browser = Ferrum::Browser.new(
  headless: true,           # Run in headless mode
  window_size: [1920, 1080], # Set viewport size
  timeout: 30               # Set page load timeout
)

# Handle AJAX-loaded content
browser.goto('https://example.com/dynamic-content')

# Wait for a specific element to appear (Ferrum has no built-in
# wait-for-selector helper, so poll until the element shows up)
50.times do
  break if browser.at_css('.content-loaded')
  sleep 0.2
end

# Scroll to trigger lazy loading
browser.execute('window.scrollTo(0, document.body.scrollHeight)')

# Wait for network requests to complete
browser.network.wait_for_idle(connections: 0, duration: 1)

# Take screenshots for debugging
browser.screenshot(path: 'debug.png')

# Evaluate custom JavaScript and capture the result
result = browser.evaluate('document.querySelectorAll(".item").length')

browser.quit

Solution 2: Using Watir with Chrome

Watir provides a Ruby interface for browser automation and works well with Chrome's headless mode.

Installation

gem 'watir'
gem 'webdrivers' # Manages chromedriver (optional with Selenium 4.6+, which bundles Selenium Manager)

Basic Watir Usage

require 'watir'

# Launch Chrome in headless mode
browser = Watir::Browser.new :chrome, headless: true

# Navigate to the page
browser.goto 'https://example.com/javascript-heavy-page'

# Wait for elements to load
browser.div(class: 'content').wait_until(&:present?)

# Extract data after JavaScript execution
articles = browser.divs(class: 'article').map do |article|
  {
    title: article.h2.text,
    content: article.p.text,
    date: article.span(class: 'date').text
  }
end

browser.close
puts articles

Handling Complex Interactions with Watir

require 'watir'

browser = Watir::Browser.new :chrome, headless: true
browser.goto 'https://example.com/interactive-page'

# Click buttons to load more content
load_more_btn = browser.button(text: 'Load More')
while load_more_btn.present?
  load_more_btn.click
  browser.div(class: 'loading').wait_while(&:present?)
  load_more_btn = browser.button(text: 'Load More')
end

# Fill forms if needed
browser.text_field(name: 'search').set('ruby scraping')
browser.button(type: 'submit').click

# Wait for results
browser.div(class: 'results').wait_until(&:present?)

# Extract results
results = browser.divs(class: 'result-item').map(&:text)

browser.close

Solution 3: Using Kimurai Framework

Kimurai is a Ruby framework built on top of Capybara and Chrome, specifically designed for web scraping.

Installation

gem 'kimurai'

Basic Kimurai Spider

require 'kimurai'

class JavaScriptSpider < Kimurai::Base
  @name = 'javascript_spider'
  @engine = :selenium_chrome # runs headless by default (set HEADLESS=false to watch the browser)
  @start_urls = ['https://example.com/spa-page']

  def parse(response, url:, data: {})
    # Wait for the JavaScript-rendered products to appear
    # (Capybara's has_css? blocks until the selector matches or times out)
    browser.has_css?('.product')

    # Re-read the rendered page after the wait
    response = browser.current_response

    # Extract data
    response.css('.product').each do |product|
      item = {
        name: product.css('.name').text,
        price: product.css('.price').text,
        description: product.css('.description').text
      }

      save_to "results.json", item, format: :json
    end

    # Follow pagination
    next_page = response.at_css('.pagination .next')
    if next_page
      request_to(:parse, url: absolute_url(next_page[:href], base: url))
    end
  end
end

JavaScriptSpider.crawl!

Solution 4: API-First Approach

Before implementing browser automation, check whether the website loads its data from an API you can call directly; the browser's DevTools Network tab shows the XHR/fetch requests a page makes:

require 'net/http'
require 'json'

# Many SPAs load data via API calls
uri = URI('https://api.example.com/products')
response = Net::HTTP.get_response(uri)

if response.code == '200'
  data = JSON.parse(response.body)
  products = data['products'].map do |product|
    {
      name: product['name'],
      price: product['price'],
      description: product['description']
    }
  end

  puts products
end
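
Some of these internal endpoints only respond when the request carries the same headers the front-end sends. A hedged sketch of copying them over (the endpoint and header values here are illustrative assumptions, not a real site's requirements):

require 'net/http'
require 'json'

uri = URI('https://api.example.com/products?page=2')

request = Net::HTTP::Get.new(uri)
# Headers copied from the browser's DevTools Network tab (example values)
request['Accept'] = 'application/json'
request['X-Requested-With'] = 'XMLHttpRequest'
request['User-Agent'] = 'Mozilla/5.0 (compatible; MyScraper/1.0)'

response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
  http.request(request)
end

puts JSON.parse(response.body) if response.code == '200'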

Best Practices and Performance Tips

1. Resource Management

require 'ferrum'

class JavaScriptScraper
  def initialize
    @browser = Ferrum::Browser.new(
      headless: true,
      window_size: [1920, 1080]
    )
  end

  def scrape_multiple_pages(urls)
    results = []

    urls.each do |url|
      begin
        @browser.goto(url)
        @browser.network.wait_for_idle

        # Extract data
        data = extract_page_data
        results << data

      rescue => e
        puts "Error scraping #{url}: #{e.message}"
      end
    end

    results
  ensure
    @browser&.quit
  end

  private

  def extract_page_data
    # Your extraction logic here
  end
end

2. Handling Dynamic Content

def wait_for_content_load(browser)
  # Wait for a specific element (poll, since Ferrum's at_css does not wait)
  50.times do
    break if browser.at_css('.content-loaded')
    sleep 0.2
  end

  # Wait for AJAX requests to complete
  browser.network.wait_for_idle(connections: 0, duration: 2)

  # Wait for custom JavaScript conditions (evaluate_async passes a callback
  # as arguments[0] and blocks until it is called or the timeout elapses)
  browser.evaluate_async(<<~JS, 10)
    const done = arguments[0];
    const checkCondition = () => {
      if (window.dataLoaded && document.querySelectorAll('.item').length > 0) {
        done(true);
      } else {
        setTimeout(checkCondition, 100);
      }
    };
    checkCondition();
  JS
end

3. Error Handling and Retries

def scrape_with_retry(url, max_retries: 3)
  retries = 0

  begin
    browser.goto(url)
    wait_for_content_load(browser)
    extract_page_data

  rescue Ferrum::TimeoutError, Ferrum::NodeNotFoundError => e
    retries += 1
    if retries <= max_retries
      puts "Retry #{retries}/#{max_retries} for #{url}"
      sleep(2 ** retries) # Exponential backoff
      retry
    else
      puts "Failed to scrape #{url} after #{max_retries} retries: #{e.message}"
      nil
    end
  end
end

Debugging JavaScript-Heavy Scraping

1. Visual Debugging

browser = Ferrum::Browser.new(headless: false) # Run with GUI for debugging
browser.goto('https://example.com')

# Take screenshots at different stages
browser.screenshot(path: 'before_interaction.png')

# Interact with the page
browser.at_css('.load-more-btn').click

browser.screenshot(path: 'after_interaction.png')

2. Console Logging

# Ferrum has no console-message callback; to watch console output, start the
# browser with a logger, which dumps the raw CDP traffic (console messages
# appear as Runtime.consoleAPICalled events)
browser = Ferrum::Browser.new(logger: $stdout)

# Check for errors the page itself has collected (assumes the page exposes window.errors)
errors = browser.evaluate('window.errors || []')
puts "JavaScript errors: #{errors}" if errors.any?

When to Use Each Approach

  • Ferrum: Best for most Ruby applications, pure Ruby implementation, good performance
  • Watir: Excellent for complex user interactions and testing scenarios
  • Kimurai: Ideal for large-scale scraping projects with built-in data processing
  • API Approach: Always try this first - it's faster and more reliable when available

Performance Considerations

JavaScript execution adds significant overhead compared to static HTML scraping. Consider these optimizations:

  1. Disable unnecessary features:
browser = Ferrum::Browser.new(
  browser_options: {
    'no-sandbox': nil,
    'disable-gpu': nil,
    'disable-dev-shm-usage': nil,
    'blink-settings': 'imagesEnabled=false' # Disable image loading (Chrome has no --disable-images flag)
  }
)
  2. Reuse browser instances for multiple pages (sketched below)
  3. Use connection pooling for concurrent scraping
  4. Implement caching for repeated requests (also sketched below)
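
A small sketch of points 2 and 4 combined: one Ferrum browser is reused across URLs, and the rendered HTML is memoized so repeated requests skip the browser entirely (the cache here is a plain in-memory Hash; swap in Redis or disk storage as needed):

require 'ferrum'

# One shared browser for all pages (point 2) and a simple HTML cache (point 4)
browser = Ferrum::Browser.new(headless: true)
cache = {}

def fetch_rendered_html(browser, cache, url)
  # Serve repeated requests from the cache without touching the browser
  cache[url] ||= begin
    browser.goto(url)
    browser.network.wait_for_idle
    browser.body
  end
end

urls = ['https://example.com/page1', 'https://example.com/page2', 'https://example.com/page1']
urls.each { |url| puts fetch_rendered_html(browser, cache, url).bytesize }

browser.quit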

The same ideas used to handle AJAX requests with Puppeteer carry over to Ruby: watch the network traffic the page generates, wait for the responses you need, and manage browser sessions deliberately when precise timing matters. The Ferrum sketch below shows one way to inspect that traffic.
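
A hedged sketch of reading the AJAX/XHR traffic a page generates via Ferrum's network API (the '/api/' path filter is an illustrative assumption):

require 'ferrum'
require 'json'

browser = Ferrum::Browser.new(headless: true)
browser.goto('https://example.com/spa-page')
browser.network.wait_for_idle

# Each exchange in browser.network.traffic pairs a request with its response
browser.network.traffic.each do |exchange|
  response = exchange.response
  next unless response && response.url.include?('/api/') # assumed JSON endpoint path

  puts "#{response.url} -> #{response.status}"
  begin
    puts JSON.parse(response.body).keys.inspect
  rescue JSON::ParserError
    # Not a JSON body; ignore
  end
end

browser.quit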

Conclusion

Scraping JavaScript-heavy websites in Ruby requires browser automation tools rather than simple HTTP clients. Ferrum offers the best balance of performance and ease of use for most Ruby applications, while Watir excels in complex interaction scenarios. Always consider API alternatives first, as they provide better performance and reliability than browser automation when available.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
