What is XPath?
XPath (XML Path Language) is a powerful query language used to navigate and select nodes in XML and HTML documents. It treats documents as tree structures and uses path expressions similar to file system paths to locate elements. XPath is particularly valuable for web scraping because it provides precise control over element selection, even in complex nested HTML structures.
Unlike CSS selectors, XPath can traverse both forward and backward through the document tree (for example, from an element up to its parent or back to a preceding sibling), which makes it well suited to complex data extraction.
Setting Up XPath with Nokogiri
Nokogiri is Ruby's most popular HTML/XML parsing library. It evaluates XPath 1.0 expressions through its xpath and at_xpath methods:
# Add to your Gemfile
gem 'nokogiri'
# Or install directly
# gem install nokogiri
Basic XPath Usage in Ruby
Here's a complete example demonstrating XPath with Nokogiri:
require 'nokogiri'
require 'net/http'
require 'uri'
# Fetch and parse HTML document
def fetch_and_parse(url)
  uri = URI(url)
  response = Net::HTTP.get_response(uri)
  Nokogiri::HTML(response.body)
rescue => e
  puts "Error fetching #{url}: #{e.message}"
  nil
end
# Example usage
doc = fetch_and_parse('https://example.com')
return unless doc
# Basic element selection
titles = doc.xpath('//h1')
titles.each { |title| puts title.text.strip }
Essential XPath Syntax
Basic Selectors
# Absolute path from root
doc.xpath('/html/body/div')
# Relative path - anywhere in document
doc.xpath('//div')
# Current node
doc.xpath('.')
# Parent node
doc.xpath('..')
# All child elements (use node() to include text and comment nodes as well)
doc.xpath('*')
Attribute Selection
# Select by attribute value
doc.xpath('//div[@class="content"]')
# Select by partial attribute match
doc.xpath('//div[contains(@class, "product")]')
# Select by attribute existence
doc.xpath('//img[@alt]')
# Get attribute values
hrefs = doc.xpath('//a/@href')
hrefs.each { |href| puts href.value }
Position-based Selection
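# First div within each parent (not the first div in the document)
doc.xpath('//div[1]')
# First div in the whole document
doc.xpath('(//div)[1]')
# Last div within each parent
doc.xpath('//div[last()]')
# First two divs within each parent
doc.xpath('//div[position() <= 2]')
# All but the first div within each parent
doc.xpath('//div[position() > 1]')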
Advanced XPath Techniques
Text Content Selection
# Elements whose first text node contains specific text
# (use contains(., "Product") to also search text inside nested tags)
doc.xpath('//h2[contains(text(), "Product")]')
# Elements with exact text match
doc.xpath('//button[text()="Submit"]')
# Get text content directly
prices = doc.xpath('//span[@class="price"]/text()')
prices.each { |price| puts price.to_s.strip }
Complex Predicates
# Multiple conditions with 'and'
doc.xpath('//div[@class="item" and @data-id]')
# Multiple conditions with 'or'
doc.xpath('//input[@type="text" or @type="email"]')
# Negation with 'not()'
doc.xpath('//div[not(@class="hidden")]')
# div elements that have at least one p child
doc.xpath('//div[child::p]')
Axes Navigation
# Following sibling elements
doc.xpath('//h2/following-sibling::p')
# Preceding sibling elements
doc.xpath('//p/preceding-sibling::h2')
# Ancestor elements
doc.xpath('//span/ancestor::div')
# Descendant elements
doc.xpath('//article/descendant::a')
Practical Web Scraping Examples
E-commerce Product Scraping
require 'nokogiri'
require 'net/http'
class ProductScraper
  def initialize(url)
    @doc = fetch_page(url)
  end

  def extract_products
    return [] unless @doc

    products = []
    # Extract product information using XPath
    product_nodes = @doc.xpath('//div[@class="product-item"]')
    product_nodes.each do |node|
      product = {
        name: extract_text(node, './/h3[@class="product-title"]'),
        price: extract_text(node, './/span[@class="price"]'),
        image: extract_attribute(node, './/img', 'src'),
        link: extract_attribute(node, './/a', 'href'),
        rating: extract_text(node, './/div[@class="rating"]/@data-rating')
      }
      products << product if product[:name] && product[:price]
    end
    products
  end

  private

  def fetch_page(url)
    uri = URI(url)
    response = Net::HTTP.get_response(uri)
    return nil unless response.code == '200'

    Nokogiri::HTML(response.body)
  rescue => e
    puts "Error: #{e.message}"
    nil
  end

  def extract_text(node, xpath)
    element = node.xpath(xpath).first
    element&.text&.strip
  end

  def extract_attribute(node, xpath, attribute)
    element = node.xpath(xpath).first
    element&.attr(attribute)
  end
end
# Usage
scraper = ProductScraper.new('https://example-store.com/products')
products = scraper.extract_products
products.each { |product| puts product.inspect }
Table Data Extraction
def extract_table_data(doc, table_xpath)
  table_data = []
  # Select only rows that contain data cells; this skips header rows
  # even when the table wraps its rows in thead/tbody
  rows = doc.xpath("#{table_xpath}//tr[td]")
  rows.each do |row|
    row_data = []
    # Extract data from each cell
    cells = row.xpath('.//td')
    cells.each do |cell|
      # Remove extra whitespace and newlines
      text = cell.text.gsub(/\s+/, ' ').strip
      row_data << text
    end
    table_data << row_data unless row_data.empty?
  end
  table_data
end
# Usage
table_data = extract_table_data(doc, '//table[@id="results"]')
table_data.each { |row| puts row.join(' | ') }
XPath vs CSS Selectors
| Feature | XPath | CSS Selectors |
|---------|--------|---------------|
| Syntax | //div[@class="item"] | div.item |
| Text selection | //p[contains(text(), "Hello")] | No standard equivalent |
| Backward navigation | //span/parent::div | No parent or ancestor combinator |
| Position-based | //div[3] | div:nth-of-type(3) |
| Performance | Evaluated directly | Compiled to XPath by Nokogiri, so comparable |
| Flexibility | More powerful | Simpler syntax |
Best Practices and Tips
Performance Optimization
# Use specific paths when possible
# Good: //div[@id="content"]//p
# Avoid: //p (searches entire document)
# Cache frequently used elements
content_div = doc.xpath('//div[@id="content"]').first
if content_div
  paragraphs = content_div.xpath('.//p')
  links = content_div.xpath('.//a')
end
# Use CSS selectors for simple queries where they read more clearly
titles = doc.css('h1, h2, h3') # Nokogiri compiles CSS to XPath, so speed is comparable
Error Handling
def safe_xpath_extract(doc, xpath, default = nil)
  elements = doc.xpath(xpath)
  return default if elements.empty?

  elements.first.text.strip
rescue => e
  puts "XPath error: #{e.message}"
  default
end
# Usage
title = safe_xpath_extract(doc, '//h1', 'No title found')
Debugging XPath Expressions
# Test XPath expressions in browser console
# $x('//div[@class="product"]') in Chrome/Firefox
# Debug in Ruby
def debug_xpath(doc, xpath)
  elements = doc.xpath(xpath)
  puts "XPath: #{xpath}"
  puts "Found #{elements.length} elements"
  elements.first(3).each_with_index do |el, i|
    puts "#{i + 1}: #{el.to_s[0..100]}..."
  end
end
debug_xpath(doc, '//div[@class="product"]')
Common XPath Functions
# String functions
doc.xpath('//div[starts-with(@class, "product")]')
doc.xpath('//p[string-length(text()) > 50]')
doc.xpath('//a[normalize-space(text())="Click here"]')
# Numeric functions
doc.xpath('//div[count(child::p) > 2]')
doc.xpath('//tr[position() mod 2 = 0]') # Even rows
# Boolean functions and operators
doc.xpath('//input[not(@disabled)]')
doc.xpath('//div[@class and @id]')
XPath gives you precise, expressive control over HTML parsing in Ruby web scraping projects. CSS selectors are simpler for basic tasks, while XPath's axes, predicates, and functions handle extraction that CSS cannot express. Knowing both lets you pick the clearer tool for each query.