What is Nokogiri and how is it used in web scraping?

What is Nokogiri?

Nokogiri is an open-source Ruby library for parsing HTML and XML. The name "Nokogiri" (鋸) is Japanese for "saw": the library cuts through web pages to extract information, much as a saw cuts through wood. It provides a straightforward interface for navigating, searching, and modifying a document's parse tree. It is well suited to web scraping because it tolerates malformed markup and offers an easy-to-use API for querying and manipulating the DOM (Document Object Model) of a web page.

How is Nokogiri used in web scraping?

In web scraping, Nokogiri parses HTML content that has been read from a local file or fetched from the web, then provides methods to traverse and search the document using CSS selectors or XPath expressions. Once you have identified the relevant elements, you can extract their text, attributes, or HTML, depending on what your scraping task requires.

Here's a basic example of how Nokogiri can be used in Ruby for web scraping:

require 'nokogiri'
require 'open-uri'

# Open a web page and create a Nokogiri HTML document
doc = Nokogiri::HTML(URI.open('https://example.com'))

# Search for nodes by CSS
doc.css('h1').each do |h1_tag|
  puts h1_tag.content
end

# Or use XPath expressions
doc.xpath('//h2').each do |h2_tag|
  puts h2_tag.content
end

# Extract an attribute
doc.css('a').each do |link|
  puts link['href']
end

In this example, we've done the following:

  1. Required the Nokogiri and open-uri libraries.
  2. Opened a web page using URI.open and parsed its HTML content with Nokogiri.
  3. Searched for <h1> tags using the .css method and printed their content.
  4. Searched for <h2> tags using the .xpath method and printed their content.
  5. Extracted and printed the href attribute from all <a> tags.

Nokogiri handles both well-formed and poorly formed HTML, which makes it robust for scraping real-world web pages that do not adhere to strict HTML standards. It can also write XML and HTML documents as well as read them, so you can modify the data you extract and create new documents if needed.

Keep in mind that while web scraping can be a powerful tool, it's important to respect the terms of service of the website you are scraping and to avoid overloading the server with too many requests in a short period. Additionally, personal data should be handled in compliance with privacy laws such as GDPR or CCPA.
