How do I scrape websites that use AJAX for dynamic content loading?

Scraping websites that load content dynamically via AJAX requires different approaches than traditional static HTML scraping. AJAX (Asynchronous JavaScript and XML) allows web pages to update content without full page reloads, making standard HTTP requests insufficient for capturing all data. This guide covers various Ruby techniques to handle AJAX-driven websites effectively.

Understanding AJAX and Dynamic Content

AJAX requests happen after the initial page load, often triggered by user interactions or timers. Traditional scraping tools like Nokogiri can only access the initial HTML, missing content loaded dynamically. You need tools that can execute JavaScript and wait for AJAX requests to complete.
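
To see the limitation in practice, fetch a page with a plain HTTP request and inspect the container that AJAX is supposed to fill. This is a minimal sketch; the URL and the '.dynamic-content' selector are placeholders, and on a real AJAX-driven page the container will typically be empty or missing at this stage.

require 'net/http'
require 'uri'
require 'nokogiri'

# Plain HTTP request: no JavaScript runs, so AJAX content never loads
html = Net::HTTP.get(URI('https://example.com/ajax-page'))
doc = Nokogiri::HTML(html)

# The container that AJAX would populate is usually empty or absent here
puts doc.css('.dynamic-content').map(&:text).inspect  # => likely []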

Method 1: Using Headless Browsers with Selenium

Selenium WebDriver with a headless browser is the most reliable approach for AJAX-heavy sites. It renders JavaScript and waits for dynamic content to load.

Installation

gem install selenium-webdriver

Basic AJAX Scraping with Chrome

require 'selenium-webdriver'
require 'nokogiri'

# Configure Chrome in headless mode
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

driver = Selenium::WebDriver.for :chrome, options: options

begin
  # Navigate to the page
  driver.navigate.to 'https://example.com/ajax-page'

  # Wait for AJAX content to load
  wait = Selenium::WebDriver::Wait.new(timeout: 10)
  wait.until { driver.find_element(css: '.dynamic-content') }

  # Get the fully rendered HTML
  html = driver.page_source
  doc = Nokogiri::HTML(html)

  # Extract data from dynamic content
  dynamic_data = doc.css('.dynamic-content').map(&:text)
  puts dynamic_data

ensure
  driver.quit
end

Waiting for Specific AJAX Requests

require 'selenium-webdriver'

# Reuse the headless Chrome options from the previous example
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless')

driver = Selenium::WebDriver.for :chrome, options: options

begin
  driver.navigate.to 'https://example.com'

  # Click a button that triggers AJAX
  button = driver.find_element(css: '#load-more-btn')
  button.click

  # Wait for specific elements to appear
  wait = Selenium::WebDriver::Wait.new(timeout: 15)

  # Wait until at least 10 items are present and the loading spinner is gone
  wait.until do
    driver.find_elements(css: '.ajax-loaded-item').length >= 10 &&
      driver.find_elements(css: '.loading-spinner').none?(&:displayed?)
  end

  # Extract the loaded content
  items = driver.find_elements(css: '.ajax-loaded-item')
  data = items.map { |item| item.text }

ensure
  driver.quit
end

Method 2: Using Ferrum (Chrome DevTools Protocol)

Ferrum provides a more lightweight alternative to Selenium by communicating directly with Chrome via the DevTools Protocol.

Installation

gem install ferrum

Basic Ferrum Implementation

require 'ferrum'
require 'nokogiri'

browser = Ferrum::Browser.new(headless: true)

begin
  browser.goto('https://example.com/ajax-page')

  # Wait until the network is idle (no in-flight requests)
  browser.network.wait_for_idle

  # Alternative: poll until a specific element appears
  # (Ferrum's at_css does not wait on its own), giving up after ~10 seconds
  10.times do
    break if browser.at_css('.dynamic-content')
    sleep(1)
  end

  # Get rendered HTML
  html = browser.body
  doc = Nokogiri::HTML(html)

  # Extract data
  results = doc.css('.result-item').map do |item|
    {
      title: item.at_css('.title')&.text,
      price: item.at_css('.price')&.text,
      url: item.at_css('a')&.[]('href')
    }
  end

  puts results

ensure
  browser.quit
end

Capturing AJAX Responses with Ferrum

require 'ferrum'
require 'json'

browser = Ferrum::Browser.new(headless: true)

begin
  browser.goto('https://example.com')

  # Trigger AJAX requests
  browser.at_css('#search-button').click

  # Wait for the triggered requests to complete
  browser.network.wait_for_idle

  # Ferrum records all network traffic; filter the exchanges for AJAX/API calls
  ajax_responses = browser.network.traffic.select do |exchange|
    next false unless exchange.response

    headers = exchange.response.headers || {}
    content_type = headers.transform_keys(&:downcase)['content-type']
    exchange.request.url.include?('/api/') || content_type&.include?('application/json')
  end

  # Process the captured AJAX data
  ajax_responses.each do |exchange|
    next unless exchange.request.url.include?('/search')

    data = JSON.parse(exchange.response.body)
    puts "Found #{data['results'].length} items"
  end

ensure
  browser.quit
end

Method 3: Direct API Interaction

Sometimes it's more efficient to identify and call the AJAX endpoints directly, bypassing the browser entirely.

Analyzing Network Traffic

First, use your browser's DevTools Network tab to identify the AJAX endpoint, its HTTP method, and its payload, then replicate the request in Ruby:

require 'net/http'
require 'json'
require 'uri'

# Reverse-engineer the AJAX endpoint
def scrape_ajax_endpoint(query, page = 1)
  uri = URI('https://example.com/api/search')

  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true

  # Mimic browser headers
  headers = {
    'User-Agent' => 'Mozilla/5.0 (compatible; Ruby scraper)',
    'Accept' => 'application/json',
    'Content-Type' => 'application/json',
    'X-Requested-With' => 'XMLHttpRequest'
  }

  # Build request body
  body = {
    query: query,
    page: page,
    limit: 20
  }.to_json

  request = Net::HTTP::Post.new(uri, headers)
  request.body = body

  response = http.request(request)

  if response.code == '200'
    JSON.parse(response.body)
  else
    puts "Error: #{response.code} - #{response.message}"
    nil
  end
end

# Use the function
results = scrape_ajax_endpoint('ruby programming')
puts results['data'] if results

Method 4: Handling Pagination and Infinite Scroll

Many AJAX sites use infinite scroll or pagination that requires special handling:

require 'selenium-webdriver'

def scrape_infinite_scroll(url, max_items = 100)
  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument('--headless')

  driver = Selenium::WebDriver.for :chrome, options: options
  wait = Selenium::WebDriver::Wait.new(timeout: 10)

  driver.navigate.to url
  all_items = []

  loop do
    # Get current items
    current_items = driver.find_elements(css: '.item')

    # Extract data from new items
    new_items = current_items[all_items.length..-1]
    new_data = new_items.map do |item|
      {
        title: item.find_element(css: '.title').text,
        description: item.find_element(css: '.description').text
      }
    end

    all_items.concat(new_data)

    # Break if we have enough items
    break if all_items.length >= max_items

    # Scroll to bottom to trigger more content
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

    # Wait for new content to load
    begin
      wait.until { driver.find_elements(css: '.item').length > current_items.length }
    rescue Selenium::WebDriver::Error::TimeoutError
      # No more content to load
      break
    end

    sleep(1) # Be respectful with delays
  end

  all_items[0...max_items]
ensure
  # Always close the browser, even if scraping raised an error
  driver&.quit
end
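
Handling Paginated AJAX Endpoints

When a site uses classic AJAX pagination (a page parameter on an API endpoint) rather than infinite scroll, you can usually skip the browser and loop over pages directly. This sketch reuses the scrape_ajax_endpoint helper from Method 3; the 'data' and 'has_more' response keys are assumptions and will differ per site.

# Collect results page by page until the API signals the last page
def scrape_all_pages(query, max_pages = 10)
  all_results = []

  (1..max_pages).each do |page|
    response = scrape_ajax_endpoint(query, page)
    break unless response && response['data']

    all_results.concat(response['data'])

    # 'has_more' is a hypothetical flag; check your endpoint's actual schema
    break unless response['has_more']

    sleep(1) # Be respectful between requests
  end

  all_results
end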

Error Handling and Best Practices

Robust Error Handling

require 'selenium-webdriver'
require 'net/http' # for Net::ReadTimeout
require 'retries'  # gem install retries

def scrape_with_retry(url, max_attempts = 3)
  with_retries(max_tries: max_attempts, rescue: [
    Selenium::WebDriver::Error::TimeoutError,
    Selenium::WebDriver::Error::NoSuchElementError,
    Net::ReadTimeout
  ]) do

    options = Selenium::WebDriver::Chrome::Options.new
    options.add_argument('--headless')

    driver = Selenium::WebDriver.for :chrome, options: options

    begin
      driver.navigate.to url

      # Wait for page to be ready
      wait = Selenium::WebDriver::Wait.new(timeout: 15)
      wait.until { driver.execute_script('return document.readyState') == 'complete' }

      # Wait for AJAX content
      wait.until { 
        driver.find_elements(css: '.dynamic-content').any? &&
        driver.find_elements(css: '.loading').empty?
      }

      # Extract data
      elements = driver.find_elements(css: '.result')
      data = elements.map { |el| extract_element_data(el) }

      return data

    ensure
      driver.quit if driver
    end
  end
end

def extract_element_data(element)
  {
    title: safe_extract(element, '.title'),
    price: safe_extract(element, '.price'),
    rating: safe_extract(element, '.rating')
  }
end

def safe_extract(element, selector)
  element.find_element(css: selector).text
rescue Selenium::WebDriver::Error::NoSuchElementError
  nil
end

Performance Optimization

Resource Blocking

Improve scraping speed by blocking unnecessary resources:

require 'ferrum'

# Pass extra Chrome flags to cut down on work the browser does per page
browser = Ferrum::Browser.new(
  headless: true,
  browser_options: {
    'disable-gpu' => nil,
    'disable-extensions' => nil,
    # Disable image loading (remove if you need screenshots)
    'blink-settings' => 'imagesEnabled=false'
  }
)

# Or selectively block resources by intercepting requests before navigation
browser.network.intercept
browser.on(:request) do |request|
  if request.match?(/\.(jpg|jpeg|png|gif|css|woff|woff2)$/i)
    request.abort
  else
    request.continue
  end
end

Parallel Processing

require 'concurrent' # provided by the concurrent-ruby gem
require 'selenium-webdriver'

def scrape_urls_parallel(urls, max_threads = 4)
  pool = Concurrent::FixedThreadPool.new(max_threads)
  futures = []

  urls.each do |url|
    future = Concurrent::Future.execute(executor: pool) do
      scrape_single_url(url) # per-URL scraping routine (sketched below)
    end
    futures << future
  end

  # Wait for all to complete and collect results
  results = futures.map(&:value)
  pool.shutdown
  pool.wait_for_termination

  results.flatten.compact
end
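
The thread pool above assumes a scrape_single_url helper that scrapes one page and returns an array of records. Here is a minimal sketch using Ferrum, where the '.item' and '.title' selectors and the returned hash shape are placeholders:

require 'ferrum'
require 'nokogiri'

# Each thread gets its own browser instance to avoid sharing state
def scrape_single_url(url)
  browser = Ferrum::Browser.new(headless: true)
  browser.goto(url)
  browser.network.wait_for_idle

  doc = Nokogiri::HTML(browser.body)
  doc.css('.item').map { |item| { title: item.at_css('.title')&.text, url: url } }
ensure
  browser&.quit
end

# Usage
results = scrape_urls_parallel(['https://example.com/a', 'https://example.com/b'])
puts results.length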

Advanced Techniques

Handling SPAs and Complex State Management

For complex Single Page Applications, you may need to wait for the framework itself (React, Vue, Angular) to finish rendering before extracting data:

def wait_for_spa_ready(driver, timeout = 15)
  wait = Selenium::WebDriver::Wait.new(timeout: timeout)

  # Poll the page until the framework (React, Vue, etc.) has rendered
  wait.until do
    driver.execute_script(<<~JS)
      if (window.React || document.querySelector('[data-reactroot]')) {
        // React app: ready once a rendered root exists
        return !!document.querySelector('[data-reactroot]');
      } else if (window.Vue) {
        // Vue app: ready once the mount point has rendered children
        var app = document.querySelector('#app');
        return !!(app && app.children.length > 0);
      } else {
        // Generic check
        return document.readyState === 'complete';
      }
    JS
  end
end
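
Usage is the same as any other explicit wait: navigate first, then call the helper before reading the page (the URL below is a placeholder).

driver.navigate.to 'https://example.com/spa'
wait_for_spa_ready(driver)
html = driver.page_source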

Monitoring Network Activity

For applications that make continuous AJAX requests, you can read Chrome's performance log through Selenium to watch network activity over time:

require 'selenium-webdriver'
require 'json'

def monitor_ajax_activity(driver, duration = 30)
  start_time = Time.now
  ajax_calls = []

  # Requires a driver started with performance logging enabled (see below)
  while Time.now - start_time < duration
    # logs.get returns only entries added since the previous call
    driver.manage.logs.get(:performance).each do |entry|
      message = JSON.parse(entry.message)
      next unless message['message']['method'] == 'Network.responseReceived'

      ajax_calls << message['message']['params']
    end

    sleep(1)
  end

  ajax_calls
end
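
For the performance log to be available, the driver has to be created with Chrome's logging preference turned on. A minimal sketch, assuming the standard 'goog:loggingPrefs' capability and a placeholder URL:

options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless')
options.add_option('goog:loggingPrefs', { performance: 'ALL' })

driver = Selenium::WebDriver.for :chrome, options: options
driver.navigate.to 'https://example.com/ajax-page'

calls = monitor_ajax_activity(driver, 15)
puts "Captured #{calls.length} network responses"
driver.quit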

Conclusion

Scraping AJAX-heavy websites requires patience and the right tools. Headless browsers like Chrome with Selenium or Ferrum provide the most reliable solution for complex dynamic content. For better performance, consider intercepting API calls directly when possible. Always implement proper error handling, respect rate limits, and ensure your scraping complies with the website's robots.txt and terms of service.

The key is to understand the specific AJAX patterns of your target website and choose the appropriate technique. Start with simple waits and element detection, then move to more advanced network interception if needed. Remember that handling timeouts carefully in Selenium, Ferrum, and similar tools is crucial for reliable scraping of dynamic content.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
