How to Parse XML Documents with Nokogiri
Nokogiri is one of the most widely used XML and HTML parsing libraries for Ruby. Whether you're working with configuration files, API responses, or data feeds, it provides tools for parsing, navigating, and manipulating XML documents.
What is Nokogiri?
Nokogiri is a Ruby gem that wraps the libxml2 C library behind a Ruby API for XML and HTML parsing. It supports both XPath and CSS selectors, so the same document can be queried in whichever syntax fits the task.
Installing Nokogiri
First, add Nokogiri to your Gemfile or install it directly:
# Using Bundler
echo 'gem "nokogiri"' >> Gemfile
bundle install
# Direct installation
gem install nokogiri
Basic XML Parsing
Parsing XML from a String
The most common way to parse XML is from a string using Nokogiri::XML():
require 'nokogiri'
xml_string = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <catalog>
    <book id="1">
      <title>Ruby Programming</title>
      <author>John Doe</author>
      <price currency="USD">29.99</price>
      <category>Programming</category>
    </book>
    <book id="2">
      <title>Web Scraping Guide</title>
      <author>Jane Smith</author>
      <price currency="EUR">24.50</price>
      <category>Web Development</category>
    </book>
  </catalog>
XML
# Parse the XML document
doc = Nokogiri::XML(xml_string)
# Check if parsing was successful
if doc.errors.empty?
  puts "XML parsed successfully"
else
  puts "Parsing errors: #{doc.errors}"
end
Parsing XML from a File
require 'nokogiri'
# Parse XML from a file (the block form closes the file handle afterwards)
doc = File.open('catalog.xml') { |file| Nokogiri::XML(file) }
# Alternative using File.read
xml_content = File.read('catalog.xml')
doc = Nokogiri::XML(xml_content)
Parsing XML from a URL
require 'nokogiri'
require 'open-uri'
# Parse XML from a remote URL
url = 'https://example.com/data.xml'
doc = Nokogiri::XML(URI.open(url))
Navigating XML Documents
Accessing Elements by Tag Name
# Get all book elements
books = doc.xpath('//book')
# or using CSS selectors
books = doc.css('book')
books.each do |book|
  puts "Book ID: #{book['id']}"
  puts "Title: #{book.at('title').text}"
  puts "Author: #{book.at('author').text}"
  puts "---"
end
Using XPath Selectors
XPath provides powerful querying capabilities for XML documents:
# Find books with specific criteria
expensive_books = doc.xpath('//book[price > 25]')
programming_books = doc.xpath('//book[category="Programming"]')
# Get specific attributes
book_ids = doc.xpath('//book/@id').map(&:value)
# Find elements with specific attributes
usd_prices = doc.xpath('//price[@currency="USD"]')
Using CSS Selectors
CSS selectors offer a more familiar syntax for web developers:
# Select elements using CSS selectors
titles = doc.css('book title').map(&:text)
authors = doc.css('book author').map(&:text)
# Select elements with attributes
first_book = doc.css('book[id="1"]').first
usd_books = doc.css('price[currency="USD"]')
Extracting Data from XML
Getting Element Text Content
# Extract text from elements
book = doc.at('book')
title = book.at('title').text
author = book.at('author').text
# Handle missing elements safely
price_element = book.at('price')
price = price_element ? price_element.text : 'N/A'
Accessing Element Attributes
# Get attribute values
book = doc.at('book')
book_id = book['id']
# or using attr method
book_id = book.attr('id')
# Read an attribute and the text of a child element
price_element = book.at('price')
currency = price_element['currency']
price_value = price_element.text
Working with Multiple Elements
# Process all books
books_data = []
doc.css('book').each do |book|
  book_info = {
    id: book['id'],
    title: book.at('title')&.text,
    author: book.at('author')&.text,
    price: book.at('price')&.text,
    currency: book.at('price')&.attr('currency'),
    category: book.at('category')&.text
  }
  books_data << book_info
end
puts books_data.inspect
Advanced XML Parsing Techniques
Handling Namespaces
XML namespaces require special handling in Nokogiri:
xml_with_namespace = <<~XML
  <?xml version="1.0"?>
  <catalog xmlns:book="http://example.com/book">
    <book:item id="1">
      <book:title>Sample Book</book:title>
      <book:author>Author Name</book:author>
    </book:item>
  </catalog>
XML
doc = Nokogiri::XML(xml_with_namespace)
# Define namespace for XPath queries
namespace = { 'book' => 'http://example.com/book' }
# Use namespace in XPath
items = doc.xpath('//book:item', namespace)
titles = doc.xpath('//book:title', namespace)
Parsing Large XML Documents
For large XML documents, consider using SAX parsing for better memory efficiency:
class BookHandler < Nokogiri::XML::SAX::Document
  # Child elements of <book> whose text we want to keep
  FIELDS = %w[title author price].freeze
  attr_reader :books
  def initialize
    @books = []
    @current_book = nil
    @buffer = nil
  end
  def start_element(name, attributes = [])
    if name == 'book'
      # attributes arrives as an array of [name, value] pairs
      @current_book = { id: attributes.to_h['id'] }
    elsif @current_book && FIELDS.include?(name)
      @buffer = +''
    end
  end
  # The parser may deliver one text node in several chunks, so append
  # to a buffer instead of overwriting the value on every call
  def characters(string)
    @buffer << string if @buffer
  end
  def end_element(name)
    if name == 'book'
      @books << @current_book
      @current_book = nil
    elsif @buffer && FIELDS.include?(name)
      @current_book[name.to_sym] = @buffer.strip
      @buffer = nil
    end
  end
end
# Use SAX parser
handler = BookHandler.new
parser = Nokogiri::XML::SAX::Parser.new(handler)
parser.parse(xml_string)
puts handler.books.inspect
Error Handling and Validation
Handling Parsing Errors
xml_with_errors = '<catalog><book><title>Unclosed tag</catalog>'
doc = Nokogiri::XML(xml_with_errors)
unless doc.errors.empty?
  puts "Parsing errors found:"
  doc.errors.each do |error|
    puts " Line #{error.line}: #{error.message}"
  end
end
# Parse with strict error handling
begin
  doc = Nokogiri::XML(xml_with_errors) { |config| config.strict }
rescue Nokogiri::XML::SyntaxError => e
  puts "Strict parsing failed: #{e.message}"
end
Validating Against XML Schema
# Load XML Schema
xsd = Nokogiri::XML::Schema(File.read('catalog.xsd'))
# Validate document
errors = xsd.validate(doc)
if errors.empty?
  puts "Document is valid"
else
  puts "Validation errors:"
  errors.each { |error| puts " #{error.message}" }
end
Performance Optimization Tips
1. Use Appropriate Selectors
# Efficient: Use specific selectors
doc.at('catalog book[id="1"] title')
# Less efficient: Multiple queries
book = doc.css('book').find { |b| b['id'] == '1' }
title = book.at('title')
2. Cache Frequently Used Elements
# Cache the catalog element
catalog = doc.at('catalog')
# Use cached element for subsequent queries
books = catalog.css('book')
titles = catalog.css('title')
3. Use at vs css for Single Elements
# Use 'at' when you need only the first match
first_book = doc.at('book')
# Use 'css' when you need all matches
all_books = doc.css('book')
Integration with Web Scraping
When XML is fetched as part of a web scraping workflow, Nokogiri works with any HTTP library: fetch the response body, then pass it to Nokogiri::XML. If a site generates its XML or markup with JavaScript after the initial page load, Nokogiri alone cannot see that content, so a browser automation tool has to render the page first; Nokogiri can then parse the result.
When a workflow combines Nokogiri with HTTP clients or browser automation, set explicit timeouts on the network and browser steps so that a slow or unresponsive server does not stall the whole job.
Complete Example: RSS Feed Parser
Here's a practical example of parsing an RSS feed:
require 'nokogiri'
require 'open-uri'
class RSSParser
  def initialize(url)
    @url = url
    @doc = nil
  end
  # Returns an array of item hashes, or [] if the feed cannot be fetched or parsed
  def parse
    @doc = Nokogiri::XML(URI.open(@url))
    if @doc.errors.any?
      puts "Warning: XML parsing errors detected"
      @doc.errors.each { |error| puts " #{error.message}" }
    end
    extract_items
  rescue StandardError => e
    puts "Error parsing RSS: #{e.message}"
    []
  end
  private
  def extract_items
    @doc.css('item').map do |item|
      {
        title: item.at('title')&.text&.strip,
        link: item.at('link')&.text&.strip,
        description: item.at('description')&.text&.strip,
        pub_date: item.at('pubDate')&.text&.strip,
        guid: item.at('guid')&.text&.strip
      }
    end
  end
end
# Usage
parser = RSSParser.new('https://example.com/feed.xml')
items = parser.parse
items.each do |item|
  puts "Title: #{item[:title]}"
  puts "Link: #{item[:link]}"
  puts "Published: #{item[:pub_date]}"
  puts "---"
end
Best Practices
- Always handle parsing errors - Check doc.errors after parsing
- Use safe navigation - Use the &. operator to handle missing elements
- Cache parsed documents - Avoid re-parsing the same XML multiple times
- Choose appropriate parsing methods - Use DOM parsing for small documents, SAX for large ones
- Validate inputs - Ensure XML is well-formed before processing
- Handle encodings properly - Specify encoding when dealing with non-UTF-8 content
Conclusion
Nokogiri covers the main needs of XML processing in Ruby: DOM parsing with XPath and CSS selectors, error reporting through doc.errors and strict mode, schema validation, and SAX parsing for documents too large to load at once. The same techniques apply whether you're building web scrapers, processing configuration files, or working with API responses.