How do I scrape XML data using Ruby and what tools should I use?
Ruby provides several powerful libraries for parsing and scraping XML data, making it an excellent choice for XML processing tasks. This comprehensive guide covers the best tools available, implementation strategies, and practical examples for effective XML scraping with Ruby.
Popular Ruby XML Libraries
1. Nokogiri (Recommended)
Nokogiri is the most popular and feature-rich XML/HTML parsing library for Ruby. It's built on top of libxml2 and libxslt, offering excellent performance and comprehensive functionality.
Installation:
gem install nokogiri
Basic XML Parsing Example:
require 'nokogiri'
require 'open-uri'

# Parse XML from a URL
xml_doc = Nokogiri::XML(URI.open('https://example.com/data.xml'))

# Parse XML from a string
xml_string = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <books>
    <book id="1">
      <title>Ruby Programming</title>
      <author>John Doe</author>
      <price currency="USD">29.99</price>
    </book>
    <book id="2">
      <title>Web Scraping Guide</title>
      <author>Jane Smith</author>
      <price currency="EUR">24.99</price>
    </book>
  </books>
XML

doc = Nokogiri::XML(xml_string)

# Extract data using CSS selectors
books = doc.css('book')
books.each do |book|
  title = book.css('title').text
  author = book.css('author').text
  price = book.css('price').text
  currency = book.at_css('price')['currency']
  puts "Title: #{title}"
  puts "Author: #{author}"
  puts "Price: #{price} #{currency}"
  puts "---"
end
Advanced Nokogiri Features:
require 'nokogiri'

# XPath selectors for complex queries
doc = Nokogiri::XML(xml_string)

# Find books with price greater than 25
expensive_books = doc.xpath('//book[price > 25]')

# Find books by specific author
john_books = doc.xpath('//book[author="John Doe"]')

# Extract specific attributes
book_ids = doc.xpath('//book/@id').map(&:value)

# Namespace handling
xml_with_namespace = <<~XML
  <?xml version="1.0"?>
  <catalog xmlns:book="http://example.com/book">
    <book:item>
      <book:title>Ruby Guide</book:title>
    </book:item>
  </catalog>
XML

ns_doc = Nokogiri::XML(xml_with_namespace)
title = ns_doc.at_xpath('//book:title', 'book' => 'http://example.com/book').text
2. REXML (Built-in)
REXML ships with Ruby, so it works without third-party C extensions. (Since Ruby 3.0 it is distributed as a bundled gem, so Bundler-managed projects need to add it to their Gemfile.) While slower than Nokogiri, it's sufficient for smaller XML processing tasks.
require 'rexml/document'
require 'open-uri'

# Parse XML document
xml_data = URI.open('https://example.com/feed.xml').read
doc = REXML::Document.new(xml_data)

# Navigate through elements
doc.elements.each('//item') do |item|
  title = item.elements['title'].text
  description = item.elements['description'].text
  link = item.elements['link'].text
  puts "Title: #{title}"
  puts "Description: #{description}"
  puts "Link: #{link}"
  puts "---"
end

# Using XPath
titles = REXML::XPath.match(doc, '//item/title')
titles.each { |title| puts title.text }
3. Ox (High Performance)
Ox is an XML parser optimized for raw speed, which makes it especially useful when processing large XML files.
Installation:
gem install ox
require 'ox'

# Parse XML string
xml = <<~XML
  <root>
    <users>
      <user id="1" name="Alice" email="alice@example.com"/>
      <user id="2" name="Bob" email="bob@example.com"/>
    </users>
  </root>
XML

doc = Ox.parse(xml)

# Locate the user elements and read their attributes
# (Ox symbolizes attribute keys by default)
doc.locate('users/user').each do |user|
  puts "ID: #{user[:id]}"
  puts "Name: #{user[:name]}"
  puts "Email: #{user[:email]}"
  puts "---"
end
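Ox can also parse straight from disk; assuming a local data.xml, a single call does it:
doc = Ox.load_file('data.xml')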
Web Scraping XML from URLs
Using Net::HTTP with Nokogiri
require 'nokogiri'
require 'net/http'
require 'uri'

def scrape_xml_feed(url)
  uri = URI(url)
  response = Net::HTTP.get_response(uri)
  if response.code == '200'
    doc = Nokogiri::XML(response.body)
    # Extract RSS feed items
    items = []
    doc.css('item').each do |item|
      items << {
        title: item.css('title').text,
        description: item.css('description').text,
        link: item.css('link').text,
        pub_date: item.css('pubDate').text
      }
    end
    items
  else
    puts "Error: #{response.code} #{response.message}"
    []
  end
end

# Usage
feed_url = 'https://example.com/rss.xml'
articles = scrape_xml_feed(feed_url)
articles.each { |article| puts article[:title] }
Using HTTParty for Enhanced HTTP Handling
require 'httparty'
require 'nokogiri'

class XMLScraper
  include HTTParty

  def initialize(base_url)
    @base_url = base_url
    self.class.headers(
      'User-Agent' => 'Mozilla/5.0 (compatible; XMLScraper/1.0)',
      'Accept' => 'application/xml, text/xml'
    )
  end

  def fetch_and_parse(endpoint)
    response = self.class.get("#{@base_url}/#{endpoint}")
    if response.success?
      Nokogiri::XML(response.body)
    else
      raise "Failed to fetch XML: #{response.code}"
    end
  end
end

# Usage
scraper = XMLScraper.new('https://api.example.com')
doc = scraper.fetch_and_parse('feed.xml')
Handling Different XML Formats
RSS Feeds
require 'nokogiri'
require 'time' # for Time.parse

def parse_rss_feed(xml_content)
  doc = Nokogiri::XML(xml_content)
  feed_info = {
    title: doc.css('channel title').first&.text,
    description: doc.css('channel description').first&.text,
    items: []
  }
  doc.css('item').each do |item|
    feed_info[:items] << {
      title: item.css('title').text,
      description: item.css('description').text,
      link: item.css('link').text,
      guid: item.css('guid').text,
      # The inline rescue needs parentheses: a bare rescue modifier
      # inside a hash literal is a syntax error
      pub_date: (Time.parse(item.css('pubDate').text) rescue nil)
    }
  end
  feed_info
end
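A quick usage sketch (the feed URL is illustrative):
require 'open-uri'
feed = parse_rss_feed(URI.open('https://example.com/rss.xml').read)
puts feed[:title]
feed[:items].each { |item| puts "#{item[:title]} (#{item[:pub_date]})" }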
Atom Feeds
def parse_atom_feed(xml_content)
  doc = Nokogiri::XML(xml_content)
  doc.remove_namespaces! # Simplify namespace handling
  {
    title: doc.css('feed title').text,
    subtitle: doc.css('feed subtitle').text,
    entries: doc.css('entry').map do |entry|
      {
        title: entry.css('title').text,
        summary: entry.css('summary').text,
        # Guard against entries without a link element
        link: entry.at_css('link')&.[]('href'),
        updated: (Time.parse(entry.css('updated').text) rescue nil)
      }
    end
  }
end
XML Sitemaps
def parse_sitemap(xml_content)
  doc = Nokogiri::XML(xml_content)
  doc.remove_namespaces!
  urls = []
  doc.css('url').each do |url_element|
    urls << {
      loc: url_element.css('loc').text,
      lastmod: url_element.css('lastmod').text,
      changefreq: url_element.css('changefreq').text,
      priority: url_element.css('priority').text.to_f
    }
  end
  urls
end
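Usage follows the same pattern (the sitemap URL is illustrative):
require 'open-uri'
urls = parse_sitemap(URI.open('https://example.com/sitemap.xml').read)
urls.first(10).each { |u| puts "#{u[:loc]} (priority #{u[:priority]})" }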
Advanced XML Scraping Techniques
Error Handling and Validation
require 'nokogiri'

def safe_xml_parse(xml_content)
  doc = Nokogiri::XML(xml_content) do |config|
    config.strict.nonet # strict: raise on malformed XML; nonet: no network access during parsing
  end
  # In strict mode failures raise, but check collected errors as a belt-and-braces step
  if doc.errors.any?
    puts "XML parsing errors:"
    doc.errors.each { |error| puts "  #{error}" }
    return nil
  end
  doc
rescue Nokogiri::XML::SyntaxError => e
  puts "Invalid XML: #{e.message}"
  nil
end
# Validate against XML Schema
def validate_xml_schema(xml_doc, schema_path)
schema = Nokogiri::XML::Schema(File.read(schema_path))
errors = schema.validate(xml_doc)
if errors.empty?
puts "XML is valid"
return true
else
puts "Validation errors:"
errors.each { |error| puts " #{error}" }
return false
end
end
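Putting the two helpers together (books.xml and books.xsd are hypothetical paths):
doc = safe_xml_parse(File.read('books.xml'))
validate_xml_schema(doc, 'books.xsd') if doc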
Streaming Large XML Files
For large XML files, use streaming parsers to avoid memory issues:
require 'nokogiri'

class XMLStreamer < Nokogiri::XML::SAX::Document
  def initialize
    @inside_item = false
    @current_text = ""
    @item_attributes = {}
    @items = []
  end

  def start_element(name, attributes = [])
    return unless name == 'item'
    @inside_item = true
    @item_attributes = Hash[attributes] # SAX delivers attributes as [name, value] pairs
    @current_text = ""
  end

  def characters(text)
    # Accumulate all text while inside an item element
    @current_text += text if @inside_item
  end

  def end_element(name)
    return unless name == 'item'
    @items << {
      content: @current_text.strip,
      attributes: @item_attributes
    }
    @inside_item = false
  end

  attr_reader :items
end

# Usage
streamer = XMLStreamer.new
parser = Nokogiri::XML::SAX::Parser.new(streamer)
parser.parse_file('large_file.xml')
puts "Processed #{streamer.items.count} items"
Handling Complex Nested Structures
def extract_nested_data(xml_content)
  doc = Nokogiri::XML(xml_content)
  # Example: Extract product catalog with categories and subcategories
  categories = []
  doc.css('category').each do |category|
    category_data = {
      name: category['name'],
      id: category['id'],
      subcategories: [],
      products: []
    }
    # Extract subcategories
    category.css('subcategory').each do |subcategory|
      category_data[:subcategories] << {
        name: subcategory['name'],
        id: subcategory['id']
      }
    end
    # Extract products
    category.css('product').each do |product|
      category_data[:products] << {
        name: product.css('name').text,
        price: product.css('price').text.to_f,
        description: product.css('description').text,
        attributes: extract_product_attributes(product)
      }
    end
    categories << category_data
  end
  categories
end

def extract_product_attributes(product_element)
  attributes = {}
  product_element.css('attribute').each do |attr|
    attributes[attr['name']] = attr.text
  end
  attributes
end
Best Practices for XML Scraping
1. Handle Encoding Issues
require 'nokogiri'

def parse_xml_with_encoding(xml_content)
  # Retag the raw bytes as UTF-8 (force_encoding does not transcode)
  xml_content = xml_content.force_encoding('UTF-8')
  # Replace any invalid byte sequences with '?'
  xml_content = xml_content.scrub('?')
  Nokogiri::XML(xml_content)
end
2. Implement Retry Logic
require 'nokogiri'
require 'net/http'

def fetch_xml_with_retry(url, max_retries = 3)
  retries = 0
  begin
    uri = URI(url)
    response = Net::HTTP.get_response(uri)
    if response.code == '200'
      Nokogiri::XML(response.body)
    else
      raise "HTTP Error: #{response.code}"
    end
  rescue => e
    retries += 1
    if retries <= max_retries
      puts "Retry #{retries}/#{max_retries}: #{e.message}"
      sleep(2 ** retries) # Exponential backoff
      retry
    else
      raise "Failed after #{max_retries} retries: #{e.message}"
    end
  end
end
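Usage is then a one-liner (the URL is illustrative):
doc = fetch_xml_with_retry('https://example.com/feed.xml')
puts doc.css('item').size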
3. Performance Optimization
# CSS selectors are compiled to XPath internally, so the two perform
# comparably; prefer whichever reads more clearly
doc.css('item title')
doc.xpath('//item/title')

# Cache frequently accessed node sets instead of re-querying
items = doc.css('item') # Cache this
items.each do |item|
  title = item.css('title').text
  author = item.css('author').text
end

# Remove namespaces if you don't need them
doc.remove_namespaces!

# Use at_css for single elements instead of css(...).first
title = doc.at_css('title').text
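When performance matters, measure rather than assume. A minimal sketch using Ruby's built-in Benchmark module (feed.xml stands in for your own document):
require 'benchmark'
require 'nokogiri'

doc = Nokogiri::XML(File.read('feed.xml'))
Benchmark.bm(8) do |x|
  x.report('css')   { 1_000.times { doc.css('item title') } }
  x.report('xpath') { 1_000.times { doc.xpath('//item/title') } }
end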
4. Memory Management for Large Files
def process_large_xml_efficiently(file_path)
  # Use streaming for large files
  if File.size(file_path) > 100_000_000 # 100 MB
    process_with_sax_parser(file_path)
  else
    # Use regular DOM parsing for smaller files
    doc = Nokogiri::XML(File.read(file_path))
    process_xml_document(doc)
  end
end

def process_with_sax_parser(file_path)
  # CustomSAXHandler is a placeholder for a handler like the XMLStreamer above
  handler = CustomSAXHandler.new
  parser = Nokogiri::XML::SAX::Parser.new(handler)
  parser.parse_file(file_path)
  handler.results
end
Integration with Dynamic Content
When dealing with XML content that's generated dynamically by JavaScript, you might need to combine Ruby with browser automation tools. For sites that load XML data via AJAX requests, consider using techniques for handling AJAX requests in web scraping before processing the XML with Ruby.
For complex scenarios involving authentication or session management, you can leverage browser session handling techniques to obtain the necessary XML data first.
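As a rough sketch, a headless browser can render the page and hand the result to Nokogiri. This assumes the selenium-webdriver gem and an illustrative URL; treat it as a starting point, not a definitive recipe:
require 'selenium-webdriver'
require 'nokogiri'

options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless=new')
driver = Selenium::WebDriver.for(:chrome, options: options)
begin
  driver.get('https://example.com/page-that-loads-xml')
  sleep 2 # crude wait for client-side scripts; prefer an explicit wait in real code
  doc = Nokogiri::XML(driver.page_source)
  puts doc.errors.any? ? 'Not well-formed XML' : "Root element: #{doc.root&.name}"
ensure
  driver.quit
end
Note that browsers often wrap raw XML in their own viewer markup, so once a session is established it is usually more reliable to fetch the underlying resource directly with the cookies captured from the browser.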
Testing Your XML Scrapers
require 'rspec'
require 'nokogiri'

RSpec.describe 'XML Scraper' do
  let(:sample_xml) do
    <<~XML
      <?xml version="1.0"?>
      <products>
        <product id="1">
          <name>Test Product</name>
          <price>19.99</price>
        </product>
      </products>
    XML
  end

  it 'parses XML correctly' do
    doc = Nokogiri::XML(sample_xml)
    product = doc.at_css('product')
    expect(product['id']).to eq('1')
    expect(product.at_css('name').text).to eq('Test Product')
    expect(product.at_css('price').text).to eq('19.99')
  end

  it 'handles malformed XML gracefully' do
    malformed_xml = '<unclosed><tag>'
    doc = Nokogiri::XML(malformed_xml)
    expect(doc.errors).not_to be_empty
  end
end
Conclusion
Ruby offers excellent tools for XML scraping, with Nokogiri being the go-to choice for most applications due to its performance, feature completeness, and active maintenance. REXML works well for simple tasks without external dependencies, while Ox provides superior performance for large-scale XML processing.
When scraping XML data, always consider error handling, encoding issues, and performance implications. For complex scraping scenarios involving dynamic content, you may need to combine Ruby XML parsing with browser automation tools or specialized web scraping APIs.
The key to successful XML scraping with Ruby is choosing the right tool for your specific use case and implementing robust error handling and validation mechanisms to ensure reliable data extraction. Whether you're processing RSS feeds, API responses, or complex XML documents, Ruby's rich ecosystem provides the tools you need for effective XML scraping.