Nokogiri is a Ruby library for parsing HTML and XML. It is very efficient for scraping static web pages, where the content is embedded directly in the HTML source. However, Nokogiri by itself cannot scrape JavaScript-generated content, because it has no ability to execute JavaScript.
When you load a web page in a browser, the browser executes the JavaScript on the page, which can then manipulate the Document Object Model (DOM) and change the page's content dynamically. Nokogiri, on the other hand, only parses the static HTML content it is given and does not run any JavaScript code that may be present.
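To make the limitation concrete, here is roughly what a plain Nokogiri fetch sees; the parsed document contains only the server-rendered markup (http://example.com stands in for whatever page you are scraping):

require 'nokogiri'
require 'open-uri'

# Fetch the raw HTML exactly as the server sends it; no JavaScript runs here
html = URI.open('http://example.com').read
doc = Nokogiri::HTML(html)

# Only elements present in the static markup are visible;
# anything injected later by JavaScript is simply absent
puts doc.css('a').size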
To scrape JavaScript-generated content, you need a tool that can render the page the way a browser does. Here are a few approaches:
Using Selenium with Ruby
Selenium is a tool that automates web browsers. You can run Chrome or Firefox in headless mode (no visible window) to render dynamic content before handing the HTML to Nokogiri.
require 'selenium-webdriver'
require 'nokogiri'
# Set up Selenium to use Chrome in headless mode
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless')
driver = Selenium::WebDriver.for(:chrome, options: options)
# Navigate to the page
driver.get('http://example.com')
# Wait for JavaScript to execute (a fixed sleep is simple but brittle;
# see the explicit-wait sketch after this example)
sleep(2)
# Get the HTML content after JavaScript has executed
html = driver.page_source
# Parse the HTML with Nokogiri
doc = Nokogiri::HTML(html)
# Do the scraping (example: get all links)
doc.css('a').each do |link|
  puts link['href']
end
# Close the browser
driver.quit
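If you want something sturdier than a fixed sleep, selenium-webdriver ships an explicit wait. Here is a minimal sketch; '#dynamic-content' is a placeholder for whatever element your target page's JavaScript inserts:

# Poll (for up to 10 seconds) until the JavaScript-inserted element appears
wait = Selenium::WebDriver::Wait.new(timeout: 10)
wait.until { driver.find_element(css: '#dynamic-content') }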
Using a headless browser with Ruby
You can also drive headless Chrome directly from Ruby with the Ferrum gem, which speaks the Chrome DevTools Protocol (the same protocol Puppeteer uses under the hood).
require 'ferrum'
browser = Ferrum::Browser.new
browser.goto('http://example.com')
# Wait until network traffic settles, indicating that the JavaScript
# has likely finished inserting content
browser.network.wait_for_idle
# Get the HTML content
html = browser.body
# Parse with Nokogiri
doc = Nokogiri::HTML(html)
# Do the scraping (example: get all links)
doc.css('a').each { |link| puts link['href'] }
# Close the browser
browser.quit
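Note that Ferrum's finders such as at_css return nil immediately when an element is not there yet, so if you need to wait for a specific element rather than network idle, a small polling helper (used after goto, before reading browser.body) does the job. This is only a sketch, with '#dynamic-content' again a placeholder selector:

# Poll until the CSS selector matches, or give up after the timeout
def wait_for(browser, selector, timeout: 10)
  deadline = Time.now + timeout
  until (node = browser.at_css(selector))
    raise "Timed out waiting for #{selector}" if Time.now > deadline
    sleep 0.1
  end
  node
end

wait_for(browser, '#dynamic-content')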
Using other tools
Alternatively, if you're open to using tools other than Ruby and Nokogiri, you can use JavaScript-based tools like Puppeteer or Playwright in Node.js, which are designed to interact with headless browsers and can easily handle JavaScript-rendered content.
Here's an example using Puppeteer with JavaScript:
const puppeteer = require('puppeteer');
(async () => {
  // Launch a headless browser
  const browser = await puppeteer.launch();
  // Open a new page
  const page = await browser.newPage();
  // Navigate to the page
  await page.goto('http://example.com');
  // Wait for an element that signals the JavaScript-rendered content is present
  await page.waitForSelector('selector-that-indicates-loaded-content');
  // Get the fully rendered HTML
  const html = await page.content();
  // Close the browser
  await browser.close();
  // Now you can parse the "html" variable with a library like cheerio (similar to Nokogiri for JavaScript)
})();
Whichever approach you choose, always make sure you are complying with the website's terms of service and that your scraping activities are legal and ethical.