Can Nokogiri parse XML documents as well as HTML?

Yes. Nokogiri is a Ruby gem that parses both XML and HTML documents and provides an easy-to-use interface for parsing, querying, and manipulating their content. On the standard Ruby interpreter (CRuby), Nokogiri is built on the libxml2 C library, a fast XML parser that complies with a wide range of XML standards.

Here's how you can use Nokogiri to parse an XML document in Ruby:

require 'nokogiri'

# Sample XML content
xml_content = <<-XML
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <element attribute="value">Content</element>
</root>
XML

# Parse the XML content with Nokogiri
doc = Nokogiri::XML(xml_content)

# Access elements using XPath or CSS selectors
element = doc.xpath('//element').first
puts element.text # => Content

# You can also access attributes
puts element['attribute'] # => value

Similarly, you can parse an HTML document using Nokogiri as follows:

require 'nokogiri'

# Sample HTML content
html_content = <<-HTML
<!DOCTYPE html>
<html>
<head>
  <title>Sample Page</title>
</head>
<body>
  <h1>Hello, Nokogiri!</h1>
  <p class="description">This is a sample paragraph.</p>
</body>
</html>
HTML

# Parse the HTML content with Nokogiri
doc = Nokogiri::HTML(html_content)

# Access elements using XPath or CSS selectors
heading = doc.css('h1').first
puts heading.text # => Hello, Nokogiri!

# Get the class attribute of the paragraph
paragraph = doc.css('p.description').first
puts paragraph['class'] # => description

Nokogiri's ability to parse both XML and HTML with a consistent API makes it a popular choice for web scraping and data extraction tasks in Ruby. The library's comprehensive documentation provides detailed information on how to handle various parsing and querying scenarios.
