How to Parse XML Documents with Nokogiri

Nokogiri is the de facto standard XML/HTML parsing library in the Ruby ecosystem, offering fast and efficient XML document processing. Whether you're working with configuration files, API responses, or data feeds, Nokogiri provides comprehensive tools for parsing, navigating, and manipulating XML documents.

What is Nokogiri?

Nokogiri is a Ruby gem that wraps the libxml2 C library, providing a simple and intuitive Ruby API for XML and HTML parsing. It supports XPath and CSS selectors, making it an excellent choice for both simple and complex XML processing tasks.
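
As a quick taste, the same query can be written either way (a minimal sketch using a made-up snippet):

require 'nokogiri'

doc = Nokogiri::XML('<list><item>a</item><item>b</item></list>')

doc.xpath('//item').map(&:text)  # => ["a", "b"]
doc.css('item').map(&:text)      # => ["a", "b"]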

Installing Nokogiri

First, add Nokogiri to your Gemfile or install it directly:

# Using Bundler: add to your Gemfile and install in one step
bundle add nokogiri

# Direct installation
gem install nokogiri

Basic XML Parsing

Parsing XML from a String

The most common way to parse XML is from a string using Nokogiri::XML():

require 'nokogiri'

xml_string = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <catalog>
    <book id="1">
      <title>Ruby Programming</title>
      <author>John Doe</author>
      <price currency="USD">29.99</price>
      <category>Programming</category>
    </book>
    <book id="2">
      <title>Web Scraping Guide</title>
      <author>Jane Smith</author>
      <price currency="EUR">24.50</price>
      <category>Web Development</category>
    </book>
  </catalog>
XML

# Parse the XML document
doc = Nokogiri::XML(xml_string)

# Check if parsing was successful
if doc.errors.empty?
  puts "XML parsed successfully"
else
  puts "Parsing errors: #{doc.errors}"
end

Parsing XML from a File

require 'nokogiri'

# Parse XML from a file (File.read is the simplest approach and
# closes the file for you)
xml_content = File.read('catalog.xml')
doc = Nokogiri::XML(xml_content)

# Alternative: pass an IO object, closing it when done
File.open('catalog.xml') do |f|
  doc = Nokogiri::XML(f)
end

Parsing XML from a URL

require 'nokogiri'
require 'open-uri'

# Parse XML from a remote URL
url = 'https://example.com/data.xml'
doc = Nokogiri::XML(URI.open(url))
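
Network fetches can fail, so real code should rescue OpenURI's errors (a minimal sketch; the URL is a placeholder):

require 'nokogiri'
require 'open-uri'

begin
  doc = Nokogiri::XML(URI.open('https://example.com/data.xml'))
rescue OpenURI::HTTPError, SocketError => e
  warn "Could not fetch XML: #{e.message}"
end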

Navigating XML Documents

Accessing Elements by Tag Name

# Get all book elements
books = doc.xpath('//book')
# or using CSS selectors
books = doc.css('book')

books.each do |book|
  puts "Book ID: #{book['id']}"
  puts "Title: #{book.at('title').text}"
  puts "Author: #{book.at('author').text}"
  puts "---"
end

Using XPath Selectors

XPath provides powerful querying capabilities for XML documents:

# Find books with specific criteria
expensive_books = doc.xpath('//book[price > 25]')
programming_books = doc.xpath('//book[category="Programming"]')

# Get specific attributes
book_ids = doc.xpath('//book/@id').map(&:value)

# Find elements with specific attributes
usd_prices = doc.xpath('//price[@currency="USD"]')
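
Beyond simple path expressions, XPath offers functions for string matching and aggregation. A few more examples against the same catalog:

# Books whose title contains "Ruby"
ruby_books = doc.xpath('//book[contains(title, "Ruby")]')

# Select text nodes directly
title_texts = doc.xpath('//book/title/text()').map(&:text)

# XPath functions can return scalars instead of node sets
book_count = doc.xpath('count(//book)').to_i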

Using CSS Selectors

CSS selectors offer a more familiar syntax for web developers:

# Select elements using CSS selectors
titles = doc.css('book title').map(&:text)
authors = doc.css('book author').map(&:text)

# Select elements with attributes
first_book = doc.css('book[id="1"]').first
usd_books = doc.css('price[currency="USD"]')

Extracting Data from XML

Getting Element Text Content

# Extract text from elements
book = doc.at('book')
title = book.at('title').text
author = book.at('author').text

# Handle missing elements safely
price_element = book.at('price')
price = price_element ? price_element.text : 'N/A'

Accessing Element Attributes

# Get attribute values
book = doc.at('book')
book_id = book['id']
# or using attr method
book_id = book.attr('id')

# Read another attribute alongside the element's text
price_element = book.at('price')
currency = price_element['currency']
price_value = price_element.text
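
To enumerate every attribute on an element, the attributes method returns a hash mapping attribute names to Nokogiri::XML::Attr objects:

# Iterate over all attributes of the <price> element
price_element.attributes.each do |name, attr|
  puts "#{name} = #{attr.value}"
end
# => currency = USD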

Working with Multiple Elements

# Process all books
books_data = []

doc.css('book').each do |book|
  book_info = {
    id: book['id'],
    title: book.at('title')&.text,
    author: book.at('author')&.text,
    price: book.at('price')&.text,
    currency: book.at('price')&.attr('currency'),
    category: book.at('category')&.text
  }
  books_data << book_info
end

puts books_data.inspect

Advanced XML Parsing Techniques

Handling Namespaces

XML namespaces require special handling in Nokogiri:

xml_with_namespace = <<~XML
  <?xml version="1.0"?>
  <catalog xmlns:book="http://example.com/book">
    <book:item id="1">
      <book:title>Sample Book</book:title>
      <book:author>Author Name</book:author>
    </book:item>
  </catalog>
XML

doc = Nokogiri::XML(xml_with_namespace)

# Define namespace for XPath queries
namespace = { 'book' => 'http://example.com/book' }

# Use namespace in XPath
items = doc.xpath('//book:item', namespace)
titles = doc.xpath('//book:title', namespace)
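
Documents that use a default (unprefixed) namespace are a common stumbling block: plain //item queries silently return nothing because every element lives inside the namespace. A minimal sketch (the xmlns URI here is made up):

require 'nokogiri'

xml = <<~XML
  <?xml version="1.0"?>
  <feed xmlns="http://example.com/feed">
    <entry>First</entry>
  </feed>
XML

doc = Nokogiri::XML(xml)

# Bind the default namespace to an explicit prefix for XPath
entries = doc.xpath('//f:entry', 'f' => 'http://example.com/feed')
puts entries.map(&:text).inspect  # => ["First"]

# Or, if namespaces carry no meaning for your task, strip them
doc.remove_namespaces!
puts doc.xpath('//entry').map(&:text).inspect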

Parsing Large XML Documents

For large XML documents, consider using SAX parsing for better memory efficiency:

class BookHandler < Nokogiri::XML::SAX::Document
  def initialize
    @current_element = nil
    @books = []
    @current_book = {}
  end

  def start_element(name, attributes = [])
    @current_element = name
    # attributes arrive as [name, value] pairs
    @current_book = { id: attributes.to_h['id'] } if name == 'book'
  end

  def characters(string)
    # characters may be invoked several times for a single text node,
    # so append to any text accumulated so far instead of overwriting it
    case @current_element
    when 'title', 'author', 'price'
      key = @current_element.to_sym
      @current_book[key] = (@current_book[key] || +'') << string
    end
  end

  def end_element(name)
    if name == 'book'
      # Trim the whitespace accumulated around text nodes
      @current_book.transform_values! { |v| v.is_a?(String) ? v.strip : v }
      @books << @current_book
      @current_book = {}
    end
    @current_element = nil
  end

  attr_reader :books
end

# Use SAX parser
handler = BookHandler.new
parser = Nokogiri::XML::SAX::Parser.new(handler)
parser.parse(xml_string)

puts handler.books.inspect
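
If you want streaming memory behavior with a less callback-heavy style, Nokogiri also ships a pull parser, Nokogiri::XML::Reader. A sketch that walks the catalog file one <book> fragment at a time:

require 'nokogiri'

File.open('catalog.xml') do |f|
  Nokogiri::XML::Reader(f).each do |node|
    # Only react to opening <book> tags
    next unless node.name == 'book' &&
                node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT

    # Each fragment is small, so full DOM parsing of it is cheap
    book = Nokogiri::XML(node.outer_xml).at('book')
    puts book.at('title')&.text
  end
end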

Error Handling and Validation

Handling Parsing Errors

xml_with_errors = '<catalog><book><title>Unclosed tag</catalog>'

doc = Nokogiri::XML(xml_with_errors)

unless doc.errors.empty?
  puts "Parsing errors found:"
  doc.errors.each do |error|
    puts "  Line #{error.line}: #{error.message}"
  end
end

# Parse with strict error handling
begin
  doc = Nokogiri::XML(xml_with_errors) { |config| config.strict }
rescue Nokogiri::XML::SyntaxError => e
  puts "Strict parsing failed: #{e.message}"
end
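
When the XML comes from untrusted sources, it's also worth blocking network access during parsing so external entities can't be fetched; NONET is a standard libxml2 parse option that Nokogiri exposes. Reusing the well-formed xml_string from earlier:

begin
  # strict: raise on malformed XML; nonet: forbid network lookups
  doc = Nokogiri::XML(xml_string) { |config| config.strict.nonet }
rescue Nokogiri::XML::SyntaxError => e
  puts "Parsing failed: #{e.message}"
end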

Validating Against XML Schema

# Load XML Schema
xsd = Nokogiri::XML::Schema(File.read('catalog.xsd'))

# Validate document
errors = xsd.validate(doc)

if errors.empty?
  puts "Document is valid"
else
  puts "Validation errors:"
  errors.each { |error| puts "  #{error.message}" }
end

Performance Optimization Tips

1. Use Appropriate Selectors

# Efficient: Use specific selectors
doc.at('catalog book[id="1"] title')

# Less efficient: Multiple queries
book = doc.css('book').find { |b| b['id'] == '1' }
title = book.at('title')

2. Cache Frequently Used Elements

# Cache the catalog element
catalog = doc.at('catalog')

# Use cached element for subsequent queries
books = catalog.css('book')
titles = catalog.css('title')

3. Use at vs css for Single Elements

# Use 'at' when you need only the first match
first_book = doc.at('book')

# Use 'css' when you need all matches
all_books = doc.css('book')

Integration with Web Scraping

When working with XML data in web scraping scenarios, Nokogiri integrates well with standard HTTP libraries such as Net::HTTP, HTTParty, or Faraday. For JavaScript-heavy sites that generate XML dynamically, you may need a browser automation tool to render the page first; once the XML has been retrieved, Nokogiri excels at parsing it.

For scraping workflows that combine XML parsing with live HTTP requests, configuring sensible connection and read timeouts is crucial so a slow or unresponsive server doesn't stall your pipeline.
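
As an illustration, a plain Net::HTTP fetch with explicit timeouts feeding into Nokogiri might look like this (a sketch; the URL is a placeholder):

require 'net/http'
require 'nokogiri'

uri = URI('https://example.com/data.xml')

response = Net::HTTP.start(uri.host, uri.port,
                           use_ssl: uri.scheme == 'https',
                           open_timeout: 5,   # seconds to establish the connection
                           read_timeout: 10) do |http|  # seconds to wait for data
  http.get(uri.request_uri)
end

doc = Nokogiri::XML(response.body)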

Complete Example: RSS Feed Parser

Here's a practical example of parsing an RSS feed:

require 'nokogiri'
require 'open-uri'

class RSSParser
  def initialize(url)
    @url = url
    @doc = nil
  end

  def parse
    begin
      @doc = Nokogiri::XML(URI.open(@url))

      if @doc.errors.any?
        puts "Warning: XML parsing errors detected"
        @doc.errors.each { |error| puts "  #{error.message}" }
      end

      extract_items
    rescue => e
      puts "Error parsing RSS: #{e.message}"
      []
    end
  end

  private

  def extract_items
    items = []

    @doc.css('item').each do |item|
      items << {
        title: item.at('title')&.text&.strip,
        link: item.at('link')&.text&.strip,
        description: item.at('description')&.text&.strip,
        pub_date: item.at('pubDate')&.text&.strip,
        guid: item.at('guid')&.text&.strip
      }
    end

    items
  end
end

# Usage
parser = RSSParser.new('https://example.com/feed.xml')
items = parser.parse

items.each do |item|
  puts "Title: #{item[:title]}"
  puts "Link: #{item[:link]}"
  puts "Published: #{item[:pub_date]}"
  puts "---"
end

Best Practices

  1. Always handle parsing errors - Check doc.errors after parsing
  2. Use safe navigation - Use &. operator to handle missing elements
  3. Cache parsed documents - Avoid re-parsing the same XML multiple times
  4. Choose appropriate parsing methods - Use DOM parsing for small documents, SAX for large ones
  5. Validate inputs - Ensure XML is well-formed before processing
  6. Handle encodings properly - Specify the encoding when dealing with non-UTF-8 content (see the sketch below)
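
For the encoding point, Nokogiri::XML accepts the document's encoding as its third argument. A minimal sketch (catalog-latin1.xml is a hypothetical file):

# Signature: Nokogiri::XML(string_or_io, url = nil, encoding = nil)
raw = File.read('catalog-latin1.xml', mode: 'rb')
doc = Nokogiri::XML(raw, nil, 'ISO-8859-1')
puts doc.encoding  # => "ISO-8859-1"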

Conclusion

Nokogiri provides a robust and efficient solution for parsing XML documents in Ruby applications. Its combination of XPath and CSS selector support, along with comprehensive error handling capabilities, makes it an excellent choice for both simple and complex XML processing tasks. Whether you're building web scrapers, processing configuration files, or working with API responses, mastering Nokogiri's XML parsing capabilities will significantly enhance your Ruby development toolkit.

The library's performance optimizations and memory-efficient parsing options ensure that your applications can handle XML documents of various sizes while maintaining good performance characteristics.
