How can I validate XML documents using Nokogiri?
XML validation checks that a document conforms to a declared structure and set of data types. Nokogiri, a Ruby gem for parsing and manipulating XML and HTML, can validate documents against three schema languages: XSD (XML Schema Definition), DTD (Document Type Definition), and RelaxNG. This guide covers each approach, along with error handling, custom rules, and batch validation.
Understanding XML Validation
XML validation verifies that a document adheres to a predefined structure and set of rules. This is a stricter test than being well-formed, which only requires correctly nested and closed tags. Validation helps ensure data integrity, consistency, and compliance with expected formats. Nokogiri supports multiple validation standards:
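- XSD (XML Schema Definition): The W3C schema language, with built-in data types, namespace support, and occurrence constraints
- DTD (Document Type Definition): The original XML schema mechanism, with limited typing but still common in older formats
- RelaxNG: An alternative to XSD with a simpler, pattern-based syntax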
Installing and Setting Up Nokogiri
Before diving into validation examples, ensure you have Nokogiri installed:
gem install nokogiri
Or add it to your Gemfile:
gem 'nokogiri'
Then run:
bundle install
XSD Schema Validation
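XSD checks both structure and data types (for example, that a price is a decimal). Build a schema object with Nokogiri::XML::Schema, then pass a parsed document to its validate method, which returns an array of errors that is empty when the document is valid: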
Basic XSD Validation
require 'nokogiri'
# Load the XSD schema
xsd_content = File.read('schema.xsd')
xsd = Nokogiri::XML::Schema(xsd_content)
# Load the XML document to validate
xml_content = File.read('document.xml')
xml_doc = Nokogiri::XML(xml_content)
# Perform validation
errors = xsd.validate(xml_doc)
if errors.empty?
puts "XML document is valid!"
else
puts "Validation errors found:"
errors.each do |error|
puts "Line #{error.line}: #{error.message}"
end
end
Advanced XSD Validation with Error Handling
require 'nokogiri'
class XMLValidator
def initialize(schema_path)
@schema = load_schema(schema_path)
end
def validate(xml_path)
xml_doc = load_xml_document(xml_path)
validation_result = @schema.validate(xml_doc)
{
valid: validation_result.empty?,
errors: format_errors(validation_result),
document: xml_doc
}
end
private
def load_schema(schema_path)
schema_content = File.read(schema_path)
Nokogiri::XML::Schema(schema_content)
rescue => e
raise "Failed to load schema: #{e.message}"
end
def load_xml_document(xml_path)
xml_content = File.read(xml_path)
Nokogiri::XML(xml_content)
rescue => e
raise "Failed to load XML document: #{e.message}"
end
def format_errors(errors)
errors.map do |error|
{
line: error.line,
column: error.column,
level: error.level,
message: error.message.strip
}
end
end
end
# Usage example
validator = XMLValidator.new('product_schema.xsd')
result = validator.validate('product_data.xml')
if result[:valid]
puts "✅ XML document is valid"
else
puts "❌ Validation failed with #{result[:errors].length} errors:"
result[:errors].each do |error|
puts " Line #{error[:line]}: #{error[:message]}"
end
end
Sample XSD Schema
Here's an example XSD schema for validating a product catalog:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="catalog">
<xs:complexType>
<xs:sequence>
<xs:element name="product" maxOccurs="unbounded">
<xs:complexType>
<xs:sequence>
<xs:element name="name" type="xs:string"/>
<xs:element name="price" type="xs:decimal"/>
<xs:element name="category" type="xs:string"/>
<xs:element name="in_stock" type="xs:boolean"/>
</xs:sequence>
<xs:attribute name="id" type="xs:string" use="required"/>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
DTD Validation
DTD validation is an older but still relevant method for XML validation. Nokogiri exposes a document's DTD as a Nokogiri::XML::DTD object (for example, through internal_subset), and that object's validate method checks a document against it:
require 'nokogiri'
# XML document with the DTD embedded as an internal subset
xml_content = <<~XML
<?xml version="1.0"?>
<!DOCTYPE books [
<!ELEMENT books (book+)>
<!ELEMENT book (title, author, year)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ATTLIST book id CDATA #REQUIRED>
]>
<books>
<book id="1">
<title>Ruby Programming</title>
<author>John Doe</author>
<year>2023</year>
</book>
</books>
XML
xml_doc = Nokogiri::XML(xml_content)
# The internal subset is the DTD declared inside the document itself
dtd = xml_doc.internal_subset
# Validate against DTD
errors = dtd.validate(xml_doc)
if errors.empty?
puts "Document is valid according to DTD"
else
puts "DTD validation errors:"
errors.each { |error| puts error.message }
end
RelaxNG Validation
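RelaxNG describes a document as a pattern of nested elements, which often makes schemas shorter and easier to read than the equivalent XSD. Nokogiri compiles one with Nokogiri::XML::RelaxNG: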
require 'nokogiri'
# RelaxNG schema
rng_content = <<~RNG
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
<start>
<element name="person">
<element name="name">
<text/>
</element>
<element name="email">
<text/>
</element>
<optional>
<element name="phone">
<text/>
</element>
</optional>
</element>
</start>
</grammar>
RNG
# Create RelaxNG schema
rng = Nokogiri::XML::RelaxNG(rng_content)
# XML to validate
xml_content = <<~XML
<?xml version="1.0"?>
<person>
<name>Jane Smith</name>
<email>jane@example.com</email>
<phone>555-1234</phone>
</person>
XML
xml_doc = Nokogiri::XML(xml_content)
# Validate
errors = rng.validate(xml_doc)
if errors.empty?
puts "Document is valid according to RelaxNG schema"
else
errors.each { |error| puts "Error: #{error.message}" }
end
Handling Validation in Web Scraping Context
XML fetched from web sources can be malformed or can change structure without notice, so validate it before processing. The class below fetches a document, validates it against an XSD schema, and only processes documents that pass:
require 'nokogiri'
require 'net/http'
class WebXMLValidator
def initialize(schema_path)
@schema = Nokogiri::XML::Schema(File.read(schema_path))
end
def scrape_and_validate(url)
# Fetch XML from web source
xml_content = fetch_xml(url)
# Parse the XML
xml_doc = Nokogiri::XML(xml_content)
# Validate the document
validation_errors = @schema.validate(xml_doc)
if validation_errors.empty?
process_valid_xml(xml_doc)
else
handle_validation_errors(validation_errors, url)
end
rescue => e
puts "Error processing XML from #{url}: #{e.message}"
end
private
def fetch_xml(url)
uri = URI(url)
response = Net::HTTP.get_response(uri)
# Accept any 2xx response, not only 200
unless response.is_a?(Net::HTTPSuccess)
raise "HTTP Error: #{response.code} #{response.message}"
end
response.body
end
def process_valid_xml(xml_doc)
# Process the validated XML document
puts "Successfully validated and processing XML"
# Extract data, save to database, etc.
end
def handle_validation_errors(errors, source_url)
puts "Validation errors for XML from #{source_url}:"
errors.each do |error|
puts " Line #{error.line}: #{error.message}"
end
end
end
Custom Validation Rules
You can implement custom validation logic alongside schema validation, for rules that a schema cannot express or that are easier to write in Ruby:
require 'nokogiri'
class CustomXMLValidator
def initialize(schema_path = nil)
@schema = schema_path ? Nokogiri::XML::Schema(File.read(schema_path)) : nil
end
def validate_with_custom_rules(xml_doc)
errors = []
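# Schema validation (if schema is provided); keep only the messages so the
# result is a uniform array of strings, like the business-rule errors
if @schema
errors.concat(@schema.validate(xml_doc).map(&:message))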
end
# Custom business logic validation
errors.concat(validate_business_rules(xml_doc))
errors
end
private
def validate_business_rules(xml_doc)
errors = []
# Example: Ensure all prices are positive
xml_doc.xpath('//price').each do |price_node|
price_value = price_node.content.to_f
if price_value <= 0
errors << "Invalid price: #{price_value} at line #{price_node.line}"
end
end
# Example: Ensure email format
xml_doc.xpath('//email').each do |email_node|
email = email_node.content
unless email.match?(/\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i)
errors << "Invalid email format: #{email} at line #{email_node.line}"
end
end
errors
end
end
Performance Considerations
When validating large XML documents or processing multiple files, compile the schema once and reuse it for every document rather than rebuilding it per file:
require 'nokogiri'
class OptimizedXMLValidator
def initialize(schema_path)
# Load schema once and reuse
@schema = Nokogiri::XML::Schema(File.read(schema_path))
end
def validate_batch(xml_files)
results = {}
xml_files.each do |file_path|
start_time = Time.now
# Parse strictly (raise on malformed XML) and disallow network access
xml_doc = Nokogiri::XML(File.read(file_path)) do |config|
config.strict.nonet
end
errors = @schema.validate(xml_doc)
processing_time = Time.now - start_time
results[file_path] = {
valid: errors.empty?,
errors: errors.map(&:message),
processing_time: processing_time
}
end
results
end
end
Error Handling Best Practices
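In production, distinguish a schema that fails to load, XML that cannot be parsed, and a document that parses but fails validation. The class below raises SchemaLoadError at construction time, while safe_validate returns a result hash instead of raising:
require 'nokogiri'
class ProductionXMLValidator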
class ValidationError < StandardError; end
class SchemaLoadError < StandardError; end
def initialize(schema_path)
load_schema(schema_path)
end
def safe_validate(xml_content)
begin
xml_doc = parse_xml_safely(xml_content)
errors = @schema.validate(xml_doc)
{
success: true,
valid: errors.empty?,
errors: errors.map { |e| format_error(e) },
document: xml_doc
}
rescue => e
{
success: false,
error: e.message,
error_type: e.class.name
}
end
end
private
def load_schema(schema_path)
@schema = Nokogiri::XML::Schema(File.read(schema_path))
rescue => e
raise SchemaLoadError, "Failed to load schema from #{schema_path}: #{e.message}"
end
def parse_xml_safely(xml_content)
Nokogiri::XML(xml_content) do |config|
config.strict.nonet.noblanks
end
rescue Nokogiri::XML::SyntaxError => e
raise ValidationError, "XML parsing failed: #{e.message}"
end
def format_error(error)
{
line: error.line,
column: error.column,
message: error.message.strip,
severity: error.level
}
end
end
Conclusion
Nokogiri provides comprehensive XML validation capabilities through XSD, DTD, and RelaxNG schemas. Whether you're validating single documents, processing batch files, or integrating validation into web scraping workflows, Nokogiri's validation features ensure your XML data meets required standards.
Key takeaways for effective XML validation with Nokogiri:
- Choose the right schema type: XSD for comprehensive validation, DTD for legacy compatibility, RelaxNG for flexibility
- Implement proper error handling: Parse untrusted input with the strict and nonet options, and handle parsing and validation errors gracefully
- Optimize for performance: Compile schema objects once and reuse them across documents
- Combine with custom validation: Supplement schema validation with business logic validation
- Monitor validation results: Track validation success rates and common error patterns
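The monitoring point can be sketched in plain Ruby. The results hash below is made-up sample data, shaped like the hashes that validate_batch returns, and the error messages are illustrative rather than real Nokogiri output.

```ruby
# Made-up sample results: file name => { valid:, errors: }.
results = {
  'a.xml' => { valid: true,  errors: [] },
  'b.xml' => { valid: false, errors: ["Element 'price': not a valid decimal"] },
  'c.xml' => { valid: false, errors: ["Element 'price': not a valid decimal",
                                      "Element 'name': missing"] }
}

# Share of files that passed validation.
valid_count = results.count { |_file, result| result[:valid] }
success_rate = valid_count.fdiv(results.size)

# Count errors by the element named in each message.
errors_by_element = results.values
                           .flat_map { |result| result[:errors] }
                           .map { |message| message[/Element '([^']+)'/, 1] }
                           .compact
                           .tally

puts format('Success rate: %.0f%%', success_rate * 100)
errors_by_element.each { |element, count| puts "#{element}: #{count}" }
```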
Together, these practices give you a repeatable way to validate XML in Ruby applications with Nokogiri.