How can I scrape AJAX-loaded content with Nokogiri?

Nokogiri is a Ruby gem for parsing HTML and XML. It works well for scraping static content, but it cannot execute JavaScript or fetch AJAX-loaded content on its own. AJAX content is loaded asynchronously after the initial page load, often in response to a user action or another event, so it is not present in the HTML that Nokogiri receives from the original page request.

To scrape AJAX-loaded content with Nokogiri, you typically need to understand the underlying network requests that fetch the data. Here's how you can approach this:

  1. Inspect Network Requests: Use your browser's developer tools to inspect the network activity when the AJAX content is being loaded. Look for XHR (XMLHttpRequest) or Fetch requests that retrieve the data you are interested in.

  2. Make Direct HTTP Requests: Once you've identified the URLs and parameters for the AJAX requests, you can use Ruby's Net::HTTP library or other HTTP clients like HTTParty or Faraday to make direct requests to those URLs.

  3. Parse the Response: The response from the AJAX endpoint may be in JSON, XML, or HTML format. Use Nokogiri to parse HTML/XML responses, and JSON.parse for JSON responses.

  4. Extract Data: After parsing the response, you can use Nokogiri to extract the required data from the HTML/XML, or work with the JSON data directly if it's JSON.

Here's an example of how you might implement this in Ruby:

require 'nokogiri'  # HTML/XML parsing
require 'json'      # JSON.parse
require 'net/http'  # Net::HTTP and URI

# Step 1: Identify the URL that fetches the AJAX content.
ajax_url = 'http://example.com/ajax_endpoint'

# Step 2: Make a direct HTTP request to the AJAX URL.
uri = URI(ajax_url)
response = Net::HTTP.get(uri)

# Step 3: Parse the response. This example assumes the response is in JSON format.
data = JSON.parse(response)

# Step 4: If the data contains HTML that you need to scrape:
html_content = data['html']  # Assuming the JSON has an HTML field
doc = Nokogiri::HTML(html_content)

# Step 5: Use Nokogiri to extract the data you need.
items = doc.css('.item-class')  # Replace with the correct CSS selector
items.each do |item|
  puts item.text.strip
end

If the AJAX response is in HTML or XML format, you can parse it directly with Nokogiri:

# Parse the response as HTML
doc = Nokogiri::HTML(response)

# Extract data
items = doc.css('.item-class')  # Replace with the correct CSS selector
items.each do |item|
  puts item.text.strip
end
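Because the endpoint may return JSON, HTML, or XML, it can help to decide on the parser from the response's Content-Type header rather than guessing. The following is a minimal sketch of that idea; the helper name ajax_body_format and the list of media types it recognises are assumptions made for illustration, not part of Nokogiri or Net::HTTP.

```ruby
# Hypothetical helper: map a Content-Type header value to a parser choice.
# Returns :json, :html, :xml, or :unknown.
def ajax_body_format(content_type)
  # Drop parameters such as "; charset=utf-8" and normalise case.
  media_type = content_type.to_s.split(';').first.to_s.strip.downcase

  case media_type
  when 'application/json', /\+json\z/ then :json
  when 'text/html', 'application/xhtml+xml' then :html
  when 'application/xml', 'text/xml', /\+xml\z/ then :xml
  else :unknown
  end
end

puts ajax_body_format('application/json; charset=utf-8')
puts ajax_body_format('text/html')
```

With a response object obtained from Net::HTTP.get_response, the header value is available as response['Content-Type']. A result of :json would then go to JSON.parse, :html to Nokogiri::HTML, and :xml to Nokogiri::XML.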

Keep in mind that scraping AJAX-loaded content can be more complex if the requests require headers, cookies, or tokens for authentication or session management. In such cases, you'll need to manage these aspects of HTTP requests within your script.
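As a rough illustration, the request below is built with Net::HTTP::Get so that headers and cookies can be set before it is sent. The helper name build_ajax_request, the header choices, the cookie value, and the token are all placeholders assumed for this sketch; the headers a given site actually expects have to be read from the request shown in the browser's developer tools.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build a GET request that resembles the browser's XHR.
# Returns the parsed URI together with the unsent request object.
def build_ajax_request(url, cookie: nil, csrf_token: nil)
  uri = URI(url)
  request = Net::HTTP::Get.new(uri)
  # Many servers use this header to recognise AJAX calls (site-specific).
  request['X-Requested-With'] = 'XMLHttpRequest'
  request['Accept'] = 'application/json, text/html;q=0.9'
  # Only set session-related headers when values were supplied.
  request['Cookie'] = cookie if cookie
  request['X-CSRF-Token'] = csrf_token if csrf_token
  [uri, request]
end

uri, request = build_ajax_request(
  'https://example.com/ajax_endpoint',  # placeholder URL
  cookie: 'session_id=abc123',          # placeholder cookie
  csrf_token: 'token-from-page'         # placeholder token
)

# Sending is left commented out so the sketch runs without network access:
# response = Net::HTTP.start(uri.hostname, uri.port,
#                            use_ssl: uri.scheme == 'https') do |http|
#   http.request(request)
# end
# body = response.body
```

The cookie and token values usually come from an earlier request to the regular page, for example from its Set-Cookie response header or from a meta tag in its HTML, which Nokogiri can read.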

Also, be aware that web scraping can have legal and ethical implications. Always ensure that you are in compliance with the website's terms of service and applicable laws before scraping a site.
