How do I scrape websites that use AJAX for dynamic content loading?
Scraping websites that load content dynamically via AJAX requires different approaches than traditional static HTML scraping. AJAX (Asynchronous JavaScript and XML) allows web pages to update content without full page reloads, making standard HTTP requests insufficient for capturing all data. This guide covers various Ruby techniques to handle AJAX-driven websites effectively.
Understanding AJAX and Dynamic Content
AJAX requests happen after the initial page load, often triggered by user interactions or timers. Traditional scraping tools like Nokogiri can only access the initial HTML, missing content loaded dynamically. You need tools that can execute JavaScript and wait for AJAX requests to complete.
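To see the problem, compare what a plain HTTP fetch returns with what the browser eventually renders. A minimal check, using a placeholder URL and selector:
require 'net/http'
require 'nokogiri'

# Fetch only the initial HTML, exactly as a non-JavaScript client sees it
html = Net::HTTP.get(URI('https://example.com/ajax-page'))
doc = Nokogiri::HTML(html)

# Content injected later by AJAX simply is not there yet
puts doc.css('.dynamic-content').size # => 0 on an AJAX-driven page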
Method 1: Using Headless Browsers with Selenium
Selenium WebDriver with a headless browser is the most reliable approach for AJAX-heavy sites. It renders JavaScript and waits for dynamic content to load.
Installation
gem install selenium-webdriver
Basic AJAX Scraping with Chrome
require 'selenium-webdriver'
require 'nokogiri'

# Configure Chrome in headless mode
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

driver = Selenium::WebDriver.for :chrome, options: options

begin
  # Navigate to the page
  driver.navigate.to 'https://example.com/ajax-page'

  # Wait for AJAX content to load
  wait = Selenium::WebDriver::Wait.new(timeout: 10)
  wait.until { driver.find_element(css: '.dynamic-content') }

  # Get the fully rendered HTML
  html = driver.page_source
  doc = Nokogiri::HTML(html)

  # Extract data from dynamic content
  dynamic_data = doc.css('.dynamic-content').map(&:text)
  puts dynamic_data
ensure
  driver.quit
end
Waiting for Specific AJAX Requests
require 'selenium-webdriver'

# Reuses the headless Chrome options defined in the previous example
driver = Selenium::WebDriver.for :chrome, options: options

begin
  driver.navigate.to 'https://example.com'

  # Click a button that triggers AJAX
  button = driver.find_element(css: '#load-more-btn')
  button.click

  # Wait for multiple conditions: enough items loaded and no visible spinner
  wait = Selenium::WebDriver::Wait.new(timeout: 15)
  wait.until do
    driver.find_elements(css: '.ajax-loaded-item').length >= 10 &&
      driver.find_elements(css: '.loading-spinner').none?(&:displayed?)
  end

  # Extract the loaded content
  items = driver.find_elements(css: '.ajax-loaded-item')
  data = items.map(&:text)
ensure
  driver.quit
end
Method 2: Using Ferrum (Chrome DevTools Protocol)
Ferrum provides a more lightweight alternative to Selenium by communicating directly with Chrome via the DevTools Protocol.
Installation
gem install ferrum
Basic Ferrum Implementation
require 'ferrum'
require 'nokogiri'
require 'timeout'

browser = Ferrum::Browser.new(headless: true)

begin
  browser.goto('https://example.com/ajax-page')

  # Wait for the network to go idle (no in-flight requests)
  browser.network.wait_for_idle

  # Alternative: poll for a specific element (Ferrum's at_css does not wait on its own)
  Timeout.timeout(10) { sleep 0.1 until browser.at_css('.dynamic-content') }

  # Get rendered HTML
  html = browser.body
  doc = Nokogiri::HTML(html)

  # Extract data
  results = doc.css('.result-item').map do |item|
    {
      title: item.at_css('.title')&.text,
      price: item.at_css('.price')&.text,
      url: item.at_css('a')&.[]('href')
    }
  end
  puts results
ensure
  browser.quit
end
Intercepting AJAX Requests with Ferrum
require 'ferrum'
require 'json'
browser = Ferrum::Browser.new(headless: true)

begin
  browser.goto('https://example.com')

  # Trigger AJAX requests
  browser.at_css('#search-button').click

  # Wait for the triggered requests to complete
  browser.network.wait_for_idle

  # Ferrum records all traffic; pull out the AJAX/API exchanges
  ajax_responses = browser.network.traffic.filter_map do |exchange|
    request = exchange.request
    response = exchange.response
    next if response.nil?

    content_type = response.headers.find { |k, _| k.downcase == 'content-type' }&.last.to_s
    next unless request.url.include?('/api/') || content_type.include?('application/json')

    {
      url: request.url,
      method: request.method,
      status: response.status,
      response_body: response.body
    }
  end

  # Process the captured AJAX data
  ajax_responses.each do |response|
    if response[:url].include?('/search')
      data = JSON.parse(response[:response_body])
      puts "Found #{data['results'].length} items"
    end
  end
ensure
  browser.quit
end
Method 3: Direct API Interaction
Sometimes it's more efficient to identify and call the AJAX endpoints directly, bypassing the browser entirely.
Analyzing Network Traffic
First, inspect the browser's Network tab to identify AJAX endpoints:
require 'net/http'
require 'json'
require 'uri'

# Reverse-engineer the AJAX endpoint
def scrape_ajax_endpoint(query, page = 1)
  uri = URI('https://example.com/api/search')
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true

  # Mimic browser headers
  headers = {
    'User-Agent' => 'Mozilla/5.0 (compatible; Ruby scraper)',
    'Accept' => 'application/json',
    'Content-Type' => 'application/json',
    'X-Requested-With' => 'XMLHttpRequest'
  }

  # Build request body
  body = {
    query: query,
    page: page,
    limit: 20
  }.to_json

  request = Net::HTTP::Post.new(uri, headers)
  request.body = body
  response = http.request(request)

  if response.code == '200'
    JSON.parse(response.body)
  else
    puts "Error: #{response.code} - #{response.message}"
    nil
  end
end

# Use the function
results = scrape_ajax_endpoint('ruby programming')
puts results['data'] if results
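Once the endpoint is known, paginating through results is just a loop. A minimal sketch, assuming the hypothetical response above contains a 'data' array that is empty on the last page:
# Collect all pages from the hypothetical endpoint above
all_results = []
page = 1
loop do
  result = scrape_ajax_endpoint('ruby programming', page)
  break if result.nil? || result['data'].to_a.empty?

  all_results.concat(result['data'])
  page += 1
  sleep(1) # be polite between requests
end
puts "Collected #{all_results.length} records"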
Method 4: Handling Pagination and Infinite Scroll
Many AJAX sites use infinite scroll or pagination that requires special handling:
require 'selenium-webdriver'

def scrape_infinite_scroll(url, max_items = 100)
  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument('--headless')

  driver = Selenium::WebDriver.for :chrome, options: options
  wait = Selenium::WebDriver::Wait.new(timeout: 10)
  driver.navigate.to url

  all_items = []

  loop do
    # Get current items
    current_items = driver.find_elements(css: '.item')

    # Extract data from the newly appended items only
    new_items = current_items[all_items.length..-1]
    new_data = new_items.map do |item|
      {
        title: item.find_element(css: '.title').text,
        description: item.find_element(css: '.description').text
      }
    end
    all_items.concat(new_data)

    # Break if we have enough items
    break if all_items.length >= max_items

    # Scroll to the bottom to trigger more content
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

    # Wait for new content to load
    begin
      wait.until { driver.find_elements(css: '.item').length > current_items.length }
    rescue Selenium::WebDriver::Error::TimeoutError
      # No more content to load
      break
    end

    sleep(1) # Be respectful with delays
  end

  all_items[0...max_items]
ensure
  driver&.quit
end
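Calling the helper is straightforward; the URL and item count below are placeholders:
items = scrape_infinite_scroll('https://example.com/feed', 50)
puts "Scraped #{items.length} items"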
Error Handling and Best Practices
Robust Error Handling
require 'selenium-webdriver'
require 'retries'

def scrape_with_retry(url, max_attempts = 3)
  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument('--headless')

  with_retries(max_tries: max_attempts, rescue: [
    Selenium::WebDriver::Error::TimeoutError,
    Selenium::WebDriver::Error::NoSuchElementError,
    Net::ReadTimeout
  ]) do
    driver = Selenium::WebDriver.for :chrome, options: options
    begin
      driver.navigate.to url

      # Wait for the page to be ready
      wait = Selenium::WebDriver::Wait.new(timeout: 15)
      wait.until { driver.execute_script('return document.readyState') == 'complete' }

      # Wait for AJAX content
      wait.until do
        driver.find_elements(css: '.dynamic-content').any? &&
          driver.find_elements(css: '.loading').empty?
      end

      # Extract data
      elements = driver.find_elements(css: '.result')
      data = elements.map { |el| extract_element_data(el) }
      return data
    ensure
      driver.quit if driver
    end
  end
end

def extract_element_data(element)
  {
    title: safe_extract(element, '.title'),
    price: safe_extract(element, '.price'),
    rating: safe_extract(element, '.rating')
  }
end

def safe_extract(element, selector)
  element.find_element(css: selector).text
rescue Selenium::WebDriver::Error::NoSuchElementError
  nil
end
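A typical call, with a placeholder URL:
results = scrape_with_retry('https://example.com/ajax-page')
puts "Extracted #{results.length} results" if results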
Performance Optimization
Resource Blocking
Improve scraping speed by blocking unnecessary resources:
require 'ferrum'

# Disable images and other extras to speed up loading
# (Avoid disabling JavaScript here, since AJAX content depends on it)
browser = Ferrum::Browser.new(
  headless: true,
  browser_options: {
    'blink-settings' => 'imagesEnabled=false',
    'disable-extensions' => nil,
    'disable-plugins' => nil
  }
)

# Or selectively block resources via request interception
browser.network.intercept
browser.on(:request) do |request|
  if request.match?(/\.(jpg|jpeg|png|gif|css|woff|woff2)(\?.*)?$/i)
    request.abort
  else
    request.continue
  end
end
Parallel Processing
require 'concurrent'
require 'selenium-webdriver'

def scrape_urls_parallel(urls, max_threads = 4)
  pool = Concurrent::FixedThreadPool.new(max_threads)
  futures = []

  urls.each do |url|
    future = Concurrent::Future.execute(executor: pool) do
      scrape_single_url(url)
    end
    futures << future
  end

  # Wait for all to complete and collect results
  results = futures.map(&:value)

  pool.shutdown
  pool.wait_for_termination

  results.flatten.compact
end
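The scrape_single_url helper above is whatever per-page scraping logic you need. A minimal sketch using Selenium, with placeholder selectors:
# Minimal per-URL scraper used by scrape_urls_parallel (placeholder selectors)
def scrape_single_url(url)
  options = Selenium::WebDriver::Chrome::Options.new
  options.add_argument('--headless')
  driver = Selenium::WebDriver.for :chrome, options: options

  driver.navigate.to url
  wait = Selenium::WebDriver::Wait.new(timeout: 10)
  wait.until { driver.find_elements(css: '.item').any? }

  driver.find_elements(css: '.item').map(&:text)
ensure
  driver&.quit
end
Each thread gets its own browser instance here; WebDriver sessions are not thread-safe, so never share one driver across threads.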
Advanced Techniques
Handling SPAs and Complex State Management
For complex Single Page Applications, waiting for the document to load is not enough; you may also need to wait for the front-end framework itself to finish rendering before scraping:
def wait_for_spa_ready(driver)
  # Wait for the framework (e.g., React, Vue) to finish rendering.
  # execute_async_script blocks until the `done` callback is invoked,
  # so the Ruby side only continues once the app reports it is ready.
  driver.execute_async_script(<<~JS)
    var done = arguments[arguments.length - 1];

    if (window.React && window.React.version) {
      // React app: poll until the root node has been rendered
      var checkReact = function () {
        if (document.querySelector('[data-reactroot]')) {
          done(true);
        } else {
          setTimeout(checkReact, 100);
        }
      };
      checkReact();
    } else if (window.Vue) {
      // Vue app: consider it ready once the mount point has content
      var app = document.querySelector('#app');
      done(!!(app && app.children.length > 0));
    } else {
      // Generic check
      done(document.readyState === 'complete');
    }
  JS
end
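In practice you call it right after navigation; the script timeout bounds how long the async check may poll, and the URL below is a placeholder:
driver.manage.timeouts.script_timeout = 15 # seconds execute_async_script may run
driver.navigate.to 'https://example.com/spa'
wait_for_spa_ready(driver)
doc = Nokogiri::HTML(driver.page_source)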
Monitoring Network Activity
For applications that make continuous AJAX requests, you can monitor network activity through Chrome's performance logs and collect responses as they arrive:
require 'json'

def monitor_ajax_activity(driver, duration = 30)
  # Requires performance logging to be enabled when the driver is created (see setup below)
  start_time = Time.now
  ajax_calls = []

  while Time.now - start_time < duration
    # Each call returns the log entries recorded since the previous call
    driver.manage.logs.get(:performance).each do |log|
      message = JSON.parse(log.message)
      if message['message']['method'] == 'Network.responseReceived'
        ajax_calls << message['message']['params']
      end
    end
    sleep(1)
  end

  ajax_calls
end
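Performance log entries are only recorded if logging is enabled when the driver is created. One way to do that is shown below; the goog:loggingPrefs capability is Chrome-specific, and newer selenium-webdriver versions may expose the logs via driver.logs instead of driver.manage.logs:
require 'selenium-webdriver'

options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless')
# Ask chromedriver to record DevTools performance events
options.add_option('goog:loggingPrefs', { performance: 'ALL' })

driver = Selenium::WebDriver.for :chrome, options: options
driver.navigate.to 'https://example.com'
calls = monitor_ajax_activity(driver, 15)
puts "Captured #{calls.length} network responses"
driver.quit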
Conclusion
Scraping AJAX-heavy websites requires patience and the right tools. Headless browsers like Chrome with Selenium or Ferrum provide the most reliable solution for complex dynamic content. For better performance, consider intercepting API calls directly when possible. Always implement proper error handling, respect rate limits, and ensure your scraping complies with the website's robots.txt and terms of service.
The key is to understand the specific AJAX patterns of your target website and choose the appropriate technique. Start with simple waits and element detection, then move to network inspection or direct API calls if needed. Remember that careful timeout and error handling in Selenium and Ferrum is crucial for reliable scraping of dynamic content.