What is XPath and how is it used in Ruby web scraping?

What is XPath?

XPath (XML Path Language) is a powerful query language used to navigate and select nodes in XML and HTML documents. It treats documents as tree structures and uses path expressions similar to file system paths to locate elements. XPath is particularly valuable for web scraping because it provides precise control over element selection, even in complex nested HTML structures.

Unlike CSS selectors, XPath can traverse both forward and backward through the document tree, making it ideal for complex data extraction scenarios.
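For example, once you have located an element you can step back up to its container with the parent axis, something standard CSS selectors cannot express. A minimal sketch using made-up markup:

require 'nokogiri'

# Hypothetical markup, purely for illustration
html = '<div class="card"><span class="price">$9.99</span></div>'
doc = Nokogiri::HTML(html)

# Find the price span, then walk backward to its containing div
card = doc.xpath('//span[@class="price"]/parent::div').first
puts card['class']  # => "card"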

Setting Up XPath with Nokogiri

Nokogiri is Ruby's most popular HTML/XML parsing library and provides excellent XPath support:

# Add to your Gemfile
gem 'nokogiri'

# Or install directly
# gem install nokogiri
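After installation, a quick sanity check (a minimal sketch, not tied to any real site) confirms that Nokogiri loads and can evaluate an XPath expression:

require 'nokogiri'

# Print the installed Nokogiri version and run a trivial XPath query
puts Nokogiri::VERSION
doc = Nokogiri::HTML('<html><body><p>Hello</p></body></html>')
puts doc.xpath('//p').first.text  # => "Hello"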

Basic XPath Usage in Ruby

Here's a complete example demonstrating XPath with Nokogiri:

require 'nokogiri'
require 'net/http'
require 'uri'

# Fetch and parse HTML document
def fetch_and_parse(url)
  uri = URI(url)
  response = Net::HTTP.get_response(uri)
  Nokogiri::HTML(response.body)
rescue => e
  puts "Error fetching #{url}: #{e.message}"
  nil
end

# Example usage
doc = fetch_and_parse('https://example.com')
return unless doc

# Basic element selection
titles = doc.xpath('//h1')
titles.each { |title| puts title.text.strip }

Essential XPath Syntax

Basic Selectors

# Absolute path from root
doc.xpath('/html/body/div')

# Relative path - anywhere in document
doc.xpath('//div')

# Current node (usually used relative to an element, not the whole document)
doc.xpath('.')

# Parent of the current node
doc.xpath('..')

# All child elements of the current node
doc.xpath('*')

Attribute Selection

# Select by attribute value
doc.xpath('//div[@class="content"]')

# Select by partial attribute match
doc.xpath('//div[contains(@class, "product")]')

# Select by attribute existence
doc.xpath('//img[@alt]')

# Get attribute values
hrefs = doc.xpath('//a/@href')
hrefs.each { |href| puts href.value }

Position-based Selection

# Every div that is the first div among its siblings
doc.xpath('//div[1]')

# Every div that is the last div among its siblings
doc.xpath('//div[last()]')

# Divs that are the first or second div within their parent
doc.xpath('//div[position() <= 2]')

# Divs that are not the first div within their parent
doc.xpath('//div[position() > 1]')
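Note that //div[1] does not mean "the first div in the document": the predicate is evaluated relative to each element's parent, so every div that is the first div child of its parent matches. To take the first match in document order, parenthesize the path first. A small sketch with made-up markup illustrates the difference:

require 'nokogiri'

# Hypothetical markup, purely for illustration
html = '<div id="a"><div id="b"></div></div><div id="c"></div>'
doc = Nokogiri::HTML(html)

# Every div that is the first div child of its parent
puts doc.xpath('//div[1]').map { |d| d['id'] }.inspect    # => ["a", "b"]

# The first div in document order across the whole page
puts doc.xpath('(//div)[1]').map { |d| d['id'] }.inspect  # => ["a"]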

Advanced XPath Techniques

Text Content Selection

# Elements containing specific text
doc.xpath('//h2[contains(text(), "Product")]')

# Elements with exact text match
doc.xpath('//button[text()="Submit"]')

# Get text content directly
prices = doc.xpath('//span[@class="price"]/text()')
prices.each { |price| puts price.text.strip }

Complex Predicates

# Multiple conditions with 'and'
doc.xpath('//div[@class="item" and @data-id]')

# Multiple conditions with 'or'
doc.xpath('//input[@type="text" or @type="email"]')

# Negation with 'not()'
doc.xpath('//div[not(@class="hidden")]')

# Elements with child elements
doc.xpath('//div[child::p]')

Axes Navigation

# Following sibling elements
doc.xpath('//h2/following-sibling::p')

# Preceding sibling elements
doc.xpath('//p/preceding-sibling::h2')

# Ancestor elements
doc.xpath('//span/ancestor::div')

# Descendant elements
doc.xpath('//article/descendant::a')

Practical Web Scraping Examples

E-commerce Product Scraping

require 'nokogiri'
require 'net/http'

class ProductScraper
  def initialize(url)
    @doc = fetch_page(url)
  end

  def extract_products
    return [] unless @doc

    products = []

    # Extract product information using XPath
    product_nodes = @doc.xpath('//div[@class="product-item"]')

    product_nodes.each do |node|
      product = {
        name: extract_text(node, './/h3[@class="product-title"]'),
        price: extract_text(node, './/span[@class="price"]'),
        image: extract_attribute(node, './/img', 'src'),
        link: extract_attribute(node, './/a', 'href'),
        rating: extract_text(node, './/div[@class="rating"]/@data-rating')
      }

      products << product if product[:name] && product[:price]
    end

    products
  end

  private

  def fetch_page(url)
    uri = URI(url)
    response = Net::HTTP.get_response(uri)
    return nil unless response.code == '200'

    Nokogiri::HTML(response.body)
  rescue => e
    puts "Error: #{e.message}"
    nil
  end

  def extract_text(node, xpath)
    element = node.xpath(xpath).first
    element&.text&.strip
  end

  def extract_attribute(node, xpath, attribute)
    element = node.xpath(xpath).first
    element&.attr(attribute)
  end
end

# Usage
scraper = ProductScraper.new('https://example-store.com/products')
products = scraper.extract_products
products.each { |product| puts product.inspect }

Table Data Extraction

def extract_table_data(doc, table_xpath)
  table_data = []

  # Get all table rows except header
  rows = doc.xpath("#{table_xpath}//tr[position() > 1]")

  rows.each do |row|
    row_data = []

    # Extract data from each cell
    cells = row.xpath('.//td')
    cells.each do |cell|
      # Remove extra whitespace and newlines
      text = cell.text.gsub(/\s+/, ' ').strip
      row_data << text
    end

    table_data << row_data unless row_data.empty?
  end

  table_data
end

# Usage
table_data = extract_table_data(doc, '//table[@id="results"]')
table_data.each { |row| puts row.join(' | ') }
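The position() > 1 trick assumes the header row sits alongside the data rows. If the table wraps its header in thead and its data in tbody, it is often safer to select rows by structure instead; the table id below is the same hypothetical one used above:

# Alternative: skip header rows by structure rather than position.
# Rows containing <td> cells are data rows; <th>-only header rows are excluded.
rows = doc.xpath('//table[@id="results"]//tr[td]')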

XPath vs CSS Selectors

| Feature | XPath | CSS Selectors |
|---------|-------|---------------|
| Syntax | //div[@class="item"] | div.item |
| Text selection | //p[contains(text(), "Hello")] | Not possible |
| Backward navigation | //span/parent::div | Not possible |
| Position-based | //div[3] | div:nth-child(3) |
| Performance | Slower for simple queries | Faster for simple queries |
| Flexibility | More powerful | Simpler syntax |
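One subtlety worth knowing: the CSS selector div.item matches the class token "item" even when other classes are present, while //div[@class="item"] requires an exact attribute match. A short sketch of the equivalent XPath idiom (the class name is made up for illustration):

# CSS: matches any div whose class list contains the token "item"
doc.css('div.item')

# XPath: exact attribute match only ("item featured" would not match)
doc.xpath('//div[@class="item"]')

# XPath equivalent of the CSS token match
doc.xpath('//div[contains(concat(" ", normalize-space(@class), " "), " item ")]')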

Best Practices and Tips

Performance Optimization

# Use specific paths when possible
# Good: //div[@id="content"]//p
# Avoid: //p (searches entire document)

# Cache frequently used elements
content_div = doc.xpath('//div[@id="content"]').first
if content_div
  paragraphs = content_div.xpath('.//p')
  links = content_div.xpath('.//a')
end

# Use CSS selectors for simple queries
titles = doc.css('h1, h2, h3')  # Often faster than XPath

Error Handling

def safe_xpath_extract(doc, xpath, default = nil)
  elements = doc.xpath(xpath)
  return default if elements.empty?

  elements.first.text.strip
rescue => e
  puts "XPath error: #{e.message}"
  default
end

# Usage
title = safe_xpath_extract(doc, '//h1', 'No title found')

Debugging XPath Expressions

# Test XPath expressions in browser console
# $x('//div[@class="product"]') in Chrome/Firefox

# Debug in Ruby
def debug_xpath(doc, xpath)
  elements = doc.xpath(xpath)
  puts "XPath: #{xpath}"
  puts "Found #{elements.length} elements"
  elements.first(3).each_with_index do |el, i|
    puts "#{i + 1}: #{el.to_s[0..100]}..."
  end
end

debug_xpath(doc, '//div[@class="product"]')

Common XPath Functions

# String functions
doc.xpath('//div[starts-with(@class, "product")]')
doc.xpath('//p[string-length(text()) > 50]')
doc.xpath('//a[normalize-space(text())="Click here"]')

# Numeric functions
doc.xpath('//div[count(child::p) > 2]')
doc.xpath('//tr[position() mod 2 = 0]')  # Even rows

# Boolean functions
doc.xpath('//input[not(@disabled)]')
doc.xpath('//div[@class and @id]')

XPath provides exceptional flexibility for HTML parsing in Ruby web scraping projects. While CSS selectors are simpler for basic tasks, XPath's advanced features make it indispensable for complex data extraction. Master both approaches to become an efficient Ruby web scraper.
