Can Nokogiri be used to scrape JavaScript-generated content?

Nokogiri is a Ruby library for parsing HTML and XML. It is fast and effective for scraping static web pages, where the content is embedded directly in the HTML source. However, Nokogiri by itself cannot scrape JavaScript-generated content, because it has no ability to execute JavaScript.

When you load a web page in a browser, the browser executes the JavaScript on the page, which can then manipulate the Document Object Model (DOM) and change the page's content dynamically. Nokogiri, on the other hand, only parses the static HTML content it is given and does not run any JavaScript code that may be present.
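To make the distinction concrete, here is a minimal sketch of the static workflow Nokogiri handles on its own (the URL is a placeholder; any page whose content ships in the initial HTML would do):

require 'net/http'
require 'nokogiri'

# Fetch the raw HTML exactly as the server sends it -- no JavaScript runs
html = Net::HTTP.get(URI('http://example.com'))

# Parse and query the static markup
doc = Nokogiri::HTML(html)
puts doc.at_css('title').text # only sees content present in the initial response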

To scrape JavaScript-generated content, you need to use tools or libraries that can render JavaScript like a browser does. Here are a couple of approaches:

Using Selenium with Ruby

Selenium is a tool that automates real web browsers. Run with Chrome or Firefox in headless mode, it renders a page's JavaScript just as a regular browser would, and the resulting HTML can then be handed to Nokogiri for parsing.

require 'selenium-webdriver'
require 'nokogiri'

# Set up Selenium to drive Chrome in headless mode
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless')
driver = Selenium::WebDriver.for(:chrome, options: options)

# Navigate to the page
driver.get('http://example.com')

# Wait for JavaScript to execute (if necessary)
sleep(2) # simple but fragile; see the explicit-wait sketch after this example

# Get the HTML content after JavaScript has executed
html = driver.page_source

# Parse the HTML with Nokogiri
doc = Nokogiri::HTML(html)

# Do the scraping (example: get all links)
doc.css('a').each do |link|
  puts link['href']
end

# Close the browser
driver.quit
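
A fixed sleep can wake up too early or waste time, so an explicit wait is more robust. Here is a sketch using Selenium's Wait helper; '#dynamic-content' is a hypothetical selector standing in for whatever element appears once the JavaScript has finished:

# Poll (up to 10 seconds) until the target element exists in the DOM
wait = Selenium::WebDriver::Wait.new(timeout: 10)
wait.until { driver.find_element(css: '#dynamic-content') } # hypothetical selector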

Using Ferrum with Ruby

You can also drive headless Chrome directly from Ruby with Ferrum, a client for the Chrome DevTools Protocol (the same protocol Puppeteer uses). Grover is another option; it wraps Puppeteer itself from Ruby, though it is geared mainly toward PDF and screenshot generation.

require 'ferrum'

browser = Ferrum::Browser.new

browser.goto('http://example.com')

# Wait for in-flight network requests to settle, a good sign the JavaScript has run
browser.network.wait_for_idle

# Get the HTML content
html = browser.body

# Parse with Nokogiri
doc = Nokogiri::HTML(html)

# Do the scraping
# ...

# Close the browser
browser.quit
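
Ferrum launches Chrome headless by default; a few commonly useful constructor options look like this (the values shown are illustrative):

browser = Ferrum::Browser.new(
  headless: true,          # default; set to false to watch the browser work
  timeout: 10,             # seconds to wait for page operations
  window_size: [1280, 800] # viewport size, which can affect responsive pages
)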

Using other tools

Alternatively, if you're open to using tools other than Ruby and Nokogiri, you can use JavaScript-based tools like Puppeteer or Playwright in Node.js, which are designed to interact with headless browsers and can easily handle JavaScript-rendered content.

Here's an example using Puppeteer with JavaScript:

const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless browser
  const browser = await puppeteer.launch();

  // Open a new page
  const page = await browser.newPage();

  // Navigate to the page
  await page.goto('http://example.com');

  // Wait for an element that signals the dynamic content has loaded
  await page.waitForSelector('#dynamic-content'); // replace with a real selector

  // Get the page content
  const html = await page.content();

  // Close the browser
  await browser.close();

  // Now you can use the "html" variable to parse the content with a library like cheerio (similar to Nokogiri for JavaScript)
})();

Whenever you scrape JavaScript-generated content, make sure you comply with the website's terms of service and that your scraping activities are legal and ethical.
