How can I validate XML documents using Nokogiri?

XML validation checks that a document conforms to a declared structure and set of data types. Nokogiri, a Ruby gem for parsing and manipulating XML and HTML, can validate documents against several schema languages. This guide covers validation with XSD (XML Schema Definition), DTD (Document Type Definition), and RelaxNG.

Understanding XML Validation

XML validation verifies that an XML document adheres to a predefined structure and follows specific rules. This process helps ensure data integrity, consistency, and compliance with expected formats. Nokogiri supports multiple validation standards:

  • XSD (XML Schema Definition): The W3C schema language, with built-in data types and namespace support
  • DTD (Document Type Definition): Legacy but still widely used
  • RelaxNG: A powerful alternative to XSD with simpler syntax

Installing and Setting Up Nokogiri

Before diving into validation examples, ensure you have Nokogiri installed:

gem install nokogiri

Or add it to your Gemfile:

gem 'nokogiri'

Then run:

bundle install

XSD Schema Validation

XSD validation is the most robust method for validating XML documents. Here's how to implement it with Nokogiri:

Basic XSD Validation

require 'nokogiri'

# Load the XSD schema
xsd_content = File.read('schema.xsd')
xsd = Nokogiri::XML::Schema(xsd_content)

# Load the XML document to validate
xml_content = File.read('document.xml')
xml_doc = Nokogiri::XML(xml_content)

# Perform validation
errors = xsd.validate(xml_doc)

if errors.empty?
  puts "XML document is valid!"
else
  puts "Validation errors found:"
  errors.each do |error|
    puts "Line #{error.line}: #{error.message}"
  end
end

Advanced XSD Validation with Error Handling

require 'nokogiri'

class XMLValidator
  def initialize(schema_path)
    @schema = load_schema(schema_path)
  end

  def validate(xml_path)
    xml_doc = load_xml_document(xml_path)
    validation_result = @schema.validate(xml_doc)

    {
      valid: validation_result.empty?,
      errors: format_errors(validation_result),
      document: xml_doc
    }
  end

  private

  def load_schema(schema_path)
    schema_content = File.read(schema_path)
    Nokogiri::XML::Schema(schema_content)
  rescue => e
    raise "Failed to load schema: #{e.message}"
  end

  def load_xml_document(xml_path)
    xml_content = File.read(xml_path)
    Nokogiri::XML(xml_content)
  rescue => e
    raise "Failed to load XML document: #{e.message}"
  end

  def format_errors(errors)
    errors.map do |error|
      {
        line: error.line,
        column: error.column,
        level: error.level,
        message: error.message.strip
      }
    end
  end
end

# Usage example
validator = XMLValidator.new('product_schema.xsd')
result = validator.validate('product_data.xml')

if result[:valid]
  puts "✅ XML document is valid"
else
  puts "❌ Validation failed with #{result[:errors].length} errors:"
  result[:errors].each do |error|
    puts "  Line #{error[:line]}: #{error[:message]}"
  end
end

Sample XSD Schema

Here's an example XSD schema for validating a product catalog:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="catalog">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="product" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="price" type="xs:decimal"/>
              <xs:element name="category" type="xs:string"/>
              <xs:element name="in_stock" type="xs:boolean"/>
            </xs:sequence>
            <xs:attribute name="id" type="xs:string" use="required"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

DTD Validation

DTD validation is an older but still relevant method for XML validation. Nokogiri supports it through Document#validate, which checks a document against the DTD declared in its DOCTYPE:

require 'nokogiri'

# XML document with the DTD declared in its internal subset
xml_content = <<~XML
  <?xml version="1.0"?>
  <!DOCTYPE books [
    <!ELEMENT books (book+)>
    <!ELEMENT book (title, author, year)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT author (#PCDATA)>
    <!ELEMENT year (#PCDATA)>
    <!ATTLIST book id CDATA #REQUIRED>
  ]>
  <books>
    <book id="1">
      <title>Ruby Programming</title>
      <author>John Doe</author>
      <year>2023</year>
    </book>
  </books>
XML

xml_doc = Nokogiri::XML(xml_content)

# Validate against the document's own DTD
# (Document#validate returns nil if the document declares no DTD)
errors = xml_doc.validate

if errors.empty?
  puts "Document is valid according to DTD"
else
  puts "DTD validation errors:"
  errors.each { |error| puts error.message }
end

RelaxNG Validation

RelaxNG provides a more flexible and expressive validation approach:

require 'nokogiri'

# RelaxNG schema
rng_content = <<~RNG
  <grammar xmlns="http://relaxng.org/ns/structure/1.0">
    <start>
      <element name="person">
        <element name="name">
          <text/>
        </element>
        <element name="email">
          <text/>
        </element>
        <optional>
          <element name="phone">
            <text/>
          </element>
        </optional>
      </element>
    </start>
  </grammar>
RNG

# Create RelaxNG schema
rng = Nokogiri::XML::RelaxNG(rng_content)

# XML to validate
xml_content = <<~XML
  <?xml version="1.0"?>
  <person>
    <name>Jane Smith</name>
    <email>jane@example.com</email>
    <phone>555-1234</phone>
  </person>
XML

xml_doc = Nokogiri::XML(xml_content)

# Validate
errors = rng.validate(xml_doc)

if errors.empty?
  puts "Document is valid according to RelaxNG schema"
else
  errors.each { |error| puts "Error: #{error.message}" }
end

Handling Validation in Web Scraping Context

When parsing XML documents from web sources, validation becomes crucial for ensuring data quality. Here's how to integrate validation into a web scraping workflow:

require 'nokogiri'
require 'net/http'

class WebXMLValidator
  def initialize(schema_path)
    @schema = Nokogiri::XML::Schema(File.read(schema_path))
  end

  def scrape_and_validate(url)
    # Fetch XML from web source
    xml_content = fetch_xml(url)

    # Parse the XML
    xml_doc = Nokogiri::XML(xml_content)

    # Validate the document
    validation_errors = @schema.validate(xml_doc)

    if validation_errors.empty?
      process_valid_xml(xml_doc)
    else
      handle_validation_errors(validation_errors, url)
    end
  rescue => e
    puts "Error processing XML from #{url}: #{e.message}"
  end

  private

  def fetch_xml(url)
    uri = URI(url)
    response = Net::HTTP.get_response(uri)

    unless response.is_a?(Net::HTTPSuccess)
      raise "HTTP Error: #{response.code} #{response.message}"
    end

    response.body
  end

  def process_valid_xml(xml_doc)
    # Process the validated XML document
    puts "Successfully validated and processing XML"
    # Extract data, save to database, etc.
  end

  def handle_validation_errors(errors, source_url)
    puts "Validation errors for XML from #{source_url}:"
    errors.each do |error|
      puts "  Line #{error.line}: #{error.message}"
    end
  end
end

Custom Validation Rules

You can implement custom validation logic alongside schema validation:

require 'nokogiri'

class CustomXMLValidator
  def initialize(schema_path = nil)
    @schema = schema_path ? Nokogiri::XML::Schema(File.read(schema_path)) : nil
  end

  def validate_with_custom_rules(xml_doc)
    errors = []

    # Schema validation (if schema is provided); keep only the messages so
    # the result is a uniform array of strings
    if @schema
      errors.concat(@schema.validate(xml_doc).map(&:message))
    end

    # Custom business logic validation
    errors.concat(validate_business_rules(xml_doc))

    errors
  end

  private

  def validate_business_rules(xml_doc)
    errors = []

    # Example: Ensure all prices are positive
    xml_doc.xpath('//price').each do |price_node|
      price_value = price_node.content.to_f
      if price_value <= 0
        errors << "Invalid price: #{price_value} at line #{price_node.line}"
      end
    end

    # Example: Ensure email format
    xml_doc.xpath('//email').each do |email_node|
      email = email_node.content
      unless email.match?(/\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i)
        errors << "Invalid email format: #{email} at line #{email_node.line}"
      end
    end

    errors
  end
end

Performance Considerations

When validating large XML documents or processing multiple files, consider these performance optimizations:

require 'nokogiri'

class OptimizedXMLValidator
  def initialize(schema_path)
    # Load schema once and reuse
    @schema = Nokogiri::XML::Schema(File.read(schema_path))
  end

  def validate_batch(xml_files)
    results = {}

    xml_files.each do |file_path|
      start_time = Time.now

      # Strict mode raises Nokogiri::XML::SyntaxError on malformed XML instead
      # of silently recovering; nonet blocks network access during parsing
      xml_doc = Nokogiri::XML(File.read(file_path)) do |config|
        config.strict.nonet
      end

      errors = @schema.validate(xml_doc)
      processing_time = Time.now - start_time

      results[file_path] = {
        valid: errors.empty?,
        errors: errors.map(&:message),
        processing_time: processing_time
      }
    end

    results
  end
end

Error Handling Best Practices

Implement robust error handling for production environments:

require 'nokogiri'

class ProductionXMLValidator
  class ValidationError < StandardError; end
  class SchemaLoadError < StandardError; end

  def initialize(schema_path)
    load_schema(schema_path)
  end

  def safe_validate(xml_content)
    begin
      xml_doc = parse_xml_safely(xml_content)
      errors = @schema.validate(xml_doc)

      {
        success: true,
        valid: errors.empty?,
        errors: errors.map { |e| format_error(e) },
        document: xml_doc
      }
    rescue => e
      {
        success: false,
        error: e.message,
        error_type: e.class.name
      }
    end
  end

  private

  def load_schema(schema_path)
    @schema = Nokogiri::XML::Schema(File.read(schema_path))
  rescue => e
    raise SchemaLoadError, "Failed to load schema from #{schema_path}: #{e.message}"
  end

  def parse_xml_safely(xml_content)
    Nokogiri::XML(xml_content) do |config|
      config.strict.nonet.noblanks
    end
  rescue Nokogiri::XML::SyntaxError => e
    raise ValidationError, "XML parsing failed: #{e.message}"
  end

  def format_error(error)
    {
      line: error.line,
      column: error.column,
      message: error.message.strip,
      severity: error.level
    }
  end
end

Conclusion

Nokogiri supports XML validation against XSD, DTD, and RelaxNG schemas. Whether you're validating single documents, processing batches of files, or integrating validation into web scraping workflows, these features help confirm that your XML data matches the structure you expect.

Key takeaways for effective XML validation with Nokogiri:

  1. Choose the right schema type: XSD for comprehensive validation, DTD for legacy compatibility, RelaxNG for flexibility
  2. Implement proper error handling: Always handle parsing and validation errors gracefully
  3. Optimize for performance: Reuse schema objects across documents, and use strict parsing to fail fast on malformed input
  4. Combine with custom validation: Supplement schema validation with business logic validation
  5. Monitor validation results: Track validation success rates and common error patterns

By following these practices and examples, you'll be able to implement robust XML validation in your Ruby applications using Nokogiri's powerful validation capabilities.
