How to Parse XML Documents with Nokogiri

Nokogiri is the de facto standard XML/HTML parsing library in the Ruby ecosystem, offering fast and efficient XML document processing. Whether you're working with configuration files, API responses, or data feeds, Nokogiri provides comprehensive tools for parsing, navigating, and manipulating XML documents.

What is Nokogiri?

Nokogiri is a Ruby gem that wraps the libxml2 C library, providing a simple and intuitive Ruby API for XML and HTML parsing. It supports XPath and CSS selectors, making it an excellent choice for both simple and complex XML processing tasks.
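
As a quick taste, the same query can be written either way (a minimal sketch using a made-up snippet):

require 'nokogiri'

doc = Nokogiri::XML('<list><item>a</item><item>b</item></list>')

doc.xpath('//item').map(&:text)  # => ["a", "b"]
doc.css('item').map(&:text)      # => ["a", "b"]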

Installing Nokogiri

First, add Nokogiri to your Gemfile or install it directly:

# Using Bundler: add to your Gemfile and install in one step
bundle add nokogiri

# Direct installation
gem install nokogiri

Basic XML Parsing

Parsing XML from a String

The most common way to parse XML is from a string using Nokogiri::XML():

require 'nokogiri'

xml_string = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <catalog>
    <book id="1">
      <title>Ruby Programming</title>
      <author>John Doe</author>
      <price currency="USD">29.99</price>
      <category>Programming</category>
    </book>
    <book id="2">
      <title>Web Scraping Guide</title>
      <author>Jane Smith</author>
      <price currency="EUR">24.50</price>
      <category>Web Development</category>
    </book>
  </catalog>
XML

# Parse the XML document
doc = Nokogiri::XML(xml_string)

# Check if parsing was successful
if doc.errors.empty?
  puts "XML parsed successfully"
else
  puts "Parsing errors: #{doc.errors}"
end

Parsing XML from a File

require 'nokogiri'

# Parse XML from a file (File.read is the simplest approach and
# closes the file for you)
xml_content = File.read('catalog.xml')
doc = Nokogiri::XML(xml_content)

# Alternative: pass an IO object, closing it when done
File.open('catalog.xml') do |f|
  doc = Nokogiri::XML(f)
end

Parsing XML from a URL

require 'nokogiri'
require 'open-uri'

# Parse XML from a remote URL
url = 'https://example.com/data.xml'
doc = Nokogiri::XML(URI.open(url))
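
Network fetches can fail, so real code should rescue OpenURI's errors (a minimal sketch; the URL is a placeholder):

require 'nokogiri'
require 'open-uri'

begin
  doc = Nokogiri::XML(URI.open('https://example.com/data.xml'))
rescue OpenURI::HTTPError, SocketError => e
  warn "Could not fetch XML: #{e.message}"
end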

Navigating XML Documents

Accessing Elements by Tag Name

# Get all book elements
books = doc.xpath('//book')
# or using CSS selectors
books = doc.css('book')

books.each do |book|
  puts "Book ID: #{book['id']}"
  puts "Title: #{book.at('title').text}"
  puts "Author: #{book.at('author').text}"
  puts "---"
end

Using XPath Selectors

XPath provides powerful querying capabilities for XML documents:

# Find books with specific criteria
expensive_books = doc.xpath('//book[price > 25]')
programming_books = doc.xpath('//book[category="Programming"]')

# Get specific attributes
book_ids = doc.xpath('//book/@id').map(&:value)

# Find elements with specific attributes
usd_prices = doc.xpath('//price[@currency="USD"]')
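
Beyond simple path expressions, XPath offers functions for string matching and aggregation. A few more examples against the same catalog:

# Books whose title contains "Ruby"
ruby_books = doc.xpath('//book[contains(title, "Ruby")]')

# Select text nodes directly
title_texts = doc.xpath('//book/title/text()').map(&:text)

# XPath functions can return scalars instead of node sets
book_count = doc.xpath('count(//book)').to_i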

Using CSS Selectors

CSS selectors offer a more familiar syntax for web developers:

# Select elements using CSS selectors
titles = doc.css('book title').map(&:text)
authors = doc.css('book author').map(&:text)

# Select elements with attributes
first_book = doc.css('book[id="1"]').first
usd_books = doc.css('price[currency="USD"]')

Extracting Data from XML

Getting Element Text Content

# Extract text from elements
book = doc.at('book')
title = book.at('title').text
author = book.at('author').text

# Handle missing elements safely
price_element = book.at('price')
price = price_element ? price_element.text : 'N/A'

Accessing Element Attributes

# Get attribute values
book = doc.at('book')
book_id = book['id']
# or using attr method
book_id = book.attr('id')

# Read another attribute alongside the element's text
price_element = book.at('price')
currency = price_element['currency']
price_value = price_element.text
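
To enumerate every attribute on an element, the attributes method returns a hash mapping attribute names to Nokogiri::XML::Attr objects:

# Iterate over all attributes of the <price> element
price_element.attributes.each do |name, attr|
  puts "#{name} = #{attr.value}"
end
# => currency = USD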

Working with Multiple Elements

# Process all books
books_data = []

doc.css('book').each do |book|
  book_info = {
    id: book['id'],
    title: book.at('title')&.text,
    author: book.at('author')&.text,
    price: book.at('price')&.text,
    currency: book.at('price')&.attr('currency'),
    category: book.at('category')&.text
  }
  books_data << book_info
end

puts books_data.inspect

Advanced XML Parsing Techniques

Handling Namespaces

XML namespaces require special handling in Nokogiri:

xml_with_namespace = <<~XML
  <?xml version="1.0"?>
  <catalog xmlns:book="http://example.com/book">
    <book:item id="1">
      <book:title>Sample Book</book:title>
      <book:author>Author Name</book:author>
    </book:item>
  </catalog>
XML

doc = Nokogiri::XML(xml_with_namespace)

# Define namespace for XPath queries
namespace = { 'book' => 'http://example.com/book' }

# Use namespace in XPath
items = doc.xpath('//book:item', namespace)
titles = doc.xpath('//book:title', namespace)
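
Documents that use a default (unprefixed) namespace are a common stumbling block: plain //item queries silently return nothing because every element lives inside the namespace. A minimal sketch (the xmlns URI here is made up):

require 'nokogiri'

xml = <<~XML
  <?xml version="1.0"?>
  <feed xmlns="http://example.com/feed">
    <entry>First</entry>
  </feed>
XML

doc = Nokogiri::XML(xml)

# Bind the default namespace to an explicit prefix for XPath
entries = doc.xpath('//f:entry', 'f' => 'http://example.com/feed')
puts entries.map(&:text).inspect  # => ["First"]

# Or, if namespaces carry no meaning for your task, strip them
doc.remove_namespaces!
puts doc.xpath('//entry').map(&:text).inspect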

Parsing Large XML Documents

For large XML documents, consider using SAX parsing for better memory efficiency:

class BookHandler < Nokogiri::XML::SAX::Document
  def initialize
    @current_element = nil
    @books = []
    @current_book = {}
  end

  def start_element(name, attributes = [])
    @current_element = name
    # attributes arrive as [name, value] pairs
    @current_book = { id: attributes.to_h['id'] } if name == 'book'
  end

  def characters(string)
    # characters may be invoked several times for a single text node,
    # so append to any text accumulated so far instead of overwriting it
    case @current_element
    when 'title', 'author', 'price'
      key = @current_element.to_sym
      @current_book[key] = (@current_book[key] || +'') << string
    end
  end

  def end_element(name)
    if name == 'book'
      # Trim the whitespace accumulated around text nodes
      @current_book.transform_values! { |v| v.is_a?(String) ? v.strip : v }
      @books << @current_book
      @current_book = {}
    end
    @current_element = nil
  end

  attr_reader :books
end

# Use SAX parser
handler = BookHandler.new
parser = Nokogiri::XML::SAX::Parser.new(handler)
parser.parse(xml_string)

puts handler.books.inspect
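
If you want streaming memory behavior with a less callback-heavy style, Nokogiri also ships a pull parser, Nokogiri::XML::Reader. A sketch that walks the catalog file one <book> fragment at a time:

require 'nokogiri'

File.open('catalog.xml') do |f|
  Nokogiri::XML::Reader(f).each do |node|
    # Only react to opening <book> tags
    next unless node.name == 'book' &&
                node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT

    # Each fragment is small, so full DOM parsing of it is cheap
    book = Nokogiri::XML(node.outer_xml).at('book')
    puts book.at('title')&.text
  end
end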

Error Handling and Validation

Handling Parsing Errors

xml_with_errors = '<catalog><book><title>Unclosed tag</catalog>'

doc = Nokogiri::XML(xml_with_errors)

unless doc.errors.empty?
  puts "Parsing errors found:"
  doc.errors.each do |error|
    puts "  Line #{error.line}: #{error.message}"
  end
end

# Parse with strict error handling
begin
  doc = Nokogiri::XML(xml_with_errors) { |config| config.strict }
rescue Nokogiri::XML::SyntaxError => e
  puts "Strict parsing failed: #{e.message}"
end
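
When the XML comes from untrusted sources, it's also worth blocking network access during parsing so external entities can't be fetched; NONET is a standard libxml2 parse option that Nokogiri exposes. Reusing the well-formed xml_string from earlier:

begin
  # strict: raise on malformed XML; nonet: forbid network lookups
  doc = Nokogiri::XML(xml_string) { |config| config.strict.nonet }
rescue Nokogiri::XML::SyntaxError => e
  puts "Parsing failed: #{e.message}"
end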

Validating Against XML Schema

# Load XML Schema
xsd = Nokogiri::XML::Schema(File.read('catalog.xsd'))

# Validate document
errors = xsd.validate(doc)

if errors.empty?
  puts "Document is valid"
else
  puts "Validation errors:"
  errors.each { |error| puts "  #{error.message}" }
end

Performance Optimization Tips

1. Use Appropriate Selectors

# Efficient: Use specific selectors
doc.at('catalog book[id="1"] title')

# Less efficient: Multiple queries
book = doc.css('book').find { |b| b['id'] == '1' }
title = book.at('title')

2. Cache Frequently Used Elements

# Cache the catalog element
catalog = doc.at('catalog')

# Use cached element for subsequent queries
books = catalog.css('book')
titles = catalog.css('title')

3. Use at vs css for Single Elements

# Use 'at' when you need only the first match
first_book = doc.at('book')

# Use 'css' when you need all matches
all_books = doc.css('book')

Integration with Web Scraping

When working with XML data in web scraping scenarios, Nokogiri integrates well with standard HTTP libraries such as Net::HTTP, HTTParty, or Faraday. For JavaScript-heavy sites that generate XML dynamically, you may need a browser automation tool to render the page first; once the XML has been retrieved, Nokogiri excels at parsing it.

For scraping workflows that combine XML parsing with live HTTP requests, configuring sensible connection and read timeouts is crucial so a slow or unresponsive server doesn't stall your pipeline.
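
As an illustration, a plain Net::HTTP fetch with explicit timeouts feeding into Nokogiri might look like this (a sketch; the URL is a placeholder):

require 'net/http'
require 'nokogiri'

uri = URI('https://example.com/data.xml')

response = Net::HTTP.start(uri.host, uri.port,
                           use_ssl: uri.scheme == 'https',
                           open_timeout: 5,   # seconds to establish the connection
                           read_timeout: 10) do |http|  # seconds to wait for data
  http.get(uri.request_uri)
end

doc = Nokogiri::XML(response.body)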

Complete Example: RSS Feed Parser

Here's a practical example of parsing an RSS feed:

require 'nokogiri'
require 'open-uri'

class RSSParser
  def initialize(url)
    @url = url
    @doc = nil
  end

  def parse
    begin
      @doc = Nokogiri::XML(URI.open(@url))

      if @doc.errors.any?
        puts "Warning: XML parsing errors detected"
        @doc.errors.each { |error| puts "  #{error.message}" }
      end

      extract_items
    rescue => e
      puts "Error parsing RSS: #{e.message}"
      []
    end
  end

  private

  def extract_items
    items = []

    @doc.css('item').each do |item|
      items << {
        title: item.at('title')&.text&.strip,
        link: item.at('link')&.text&.strip,
        description: item.at('description')&.text&.strip,
        pub_date: item.at('pubDate')&.text&.strip,
        guid: item.at('guid')&.text&.strip
      }
    end

    items
  end
end

# Usage
parser = RSSParser.new('https://example.com/feed.xml')
items = parser.parse

items.each do |item|
  puts "Title: #{item[:title]}"
  puts "Link: #{item[:link]}"
  puts "Published: #{item[:pub_date]}"
  puts "---"
end

Best Practices

  1. Always handle parsing errors - Check doc.errors after parsing
  2. Use safe navigation - Use &. operator to handle missing elements
  3. Cache parsed documents - Avoid re-parsing the same XML multiple times
  4. Choose appropriate parsing methods - Use DOM parsing for small documents, SAX for large ones
  5. Validate inputs - Ensure XML is well-formed before processing
  6. Handle encodings properly - Specify the encoding when dealing with non-UTF-8 content (see the sketch below)
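
For the encoding point, Nokogiri::XML accepts the document's encoding as its third argument. A minimal sketch (catalog-latin1.xml is a hypothetical file):

# Signature: Nokogiri::XML(string_or_io, url = nil, encoding = nil)
raw = File.read('catalog-latin1.xml', mode: 'rb')
doc = Nokogiri::XML(raw, nil, 'ISO-8859-1')
puts doc.encoding  # => "ISO-8859-1"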

Conclusion

Nokogiri provides a robust and efficient solution for parsing XML documents in Ruby applications. Its combination of XPath and CSS selector support, along with comprehensive error handling capabilities, makes it an excellent choice for both simple and complex XML processing tasks. Whether you're building web scrapers, processing configuration files, or working with API responses, mastering Nokogiri's XML parsing capabilities will significantly enhance your Ruby development toolkit.

The library's performance optimizations and memory-efficient parsing options ensure that your applications can handle XML documents of various sizes while maintaining good performance characteristics.
