How does Nokogiri handle XML namespaces when querying documents?

Nokogiri is a popular Ruby library for parsing HTML and XML, with a simple interface for navigating and manipulating documents. XML documents often use namespaces to avoid element name conflicts, and a namespaced element is identified by its namespace URI together with its local name. Queries against such documents therefore have to account for the namespace, and Nokogiri supports this in both XPath and CSS searches.

The sections below cover the common cases:

Dealing with Namespaces

When Nokogiri parses an XML document, it records the namespace declarations and attaches a namespace to each element and attribute that has one. Matching is done on the namespace URI, not on the prefix text, so an XPath or CSS query for a namespaced element must use a prefix that is bound to the right URI.

Using XPath

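To query namespaced elements with XPath, pass a hash that maps a prefix to the namespace URI as an extra argument to the xpath method, then use that prefix in the expression. The prefix is local to the query and does not have to match the one used in the document; only the URI has to match.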

Here's an example:

require 'nokogiri'

xml_str = <<-XML
<root xmlns:foo="http://example.com/foo">
  <foo:bar>Hello World</foo:bar>
</root>
XML

doc = Nokogiri::XML(xml_str)

# Register the namespace prefix 'f' for the URI
doc.xpath('//f:bar', 'f' => 'http://example.com/foo').each do |node|
  puts node.content
end

This will output:

Hello World

Using CSS

CSS selectors address namespaced elements with the | (pipe) symbol between the namespace prefix and the element name, as in foo|bar. As with XPath, the prefix-to-URI mapping can be passed as a hash argument to the css method.

Here's an example:

require 'nokogiri'

xml_str = <<-XML
<root xmlns:foo="http://example.com/foo">
  <foo:bar>Baz</foo:bar>
</root>
XML

doc = Nokogiri::XML(xml_str)

# Map the prefix used in the selector to the namespace URI
doc.css('foo|bar', 'foo' => 'http://example.com/foo').each do |node|
  puts node.content
end

This will output:

Baz

Ignoring Namespaces

Sometimes you want to ignore namespaces and match elements by local name only. The standard XPath function local-name() does this: it returns an element's name without its prefix, so a predicate on it matches the element in any namespace (or in none).

Here's an example:

require 'nokogiri'

xml_str = <<-XML
<root xmlns:foo="http://example.com/foo">
  <foo:bar>Qux</foo:bar>
</root>
XML

doc = Nokogiri::XML(xml_str)

# Ignore the namespace and select all 'bar' elements
doc.xpath('//*[local-name()="bar"]').each do |node|
  puts node.content
end

This will output:

Qux

Default Namespaces

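If an element is in a default namespace (declared with xmlns="..." and no prefix), querying it can be tricky because XPath never applies a default namespace to unprefixed names in an expression: //bar looks for a bar element in no namespace and matches nothing. Bind a prefix of your choice to the default namespace URI and use that prefix in the query.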

Here's an example:

require 'nokogiri'

xml_str = <<-XML
<root xmlns="http://example.com/default">
  <bar>Default Namespace</bar>
</root>
XML

doc = Nokogiri::XML(xml_str)

# Assign a prefix 'd' to the default namespace and use it in XPath
doc.xpath('//d:bar', 'd' => 'http://example.com/default').each do |node|
  puts node.content
end

This will output:

Default Namespace

In summary, Nokogiri matches namespaced elements by namespace URI. You can bind a prefix to that URI by passing a hash to xpath or css, or you can ignore namespaces and match on local names with local-name(). For a default namespace, XPath queries need a prefix bound to the default namespace URI, because unprefixed names in XPath never refer to a namespaced element.
