How do I extract data from nested HTML structures using Nokogiri?

Extracting data from nested HTML structures is a common challenge in web scraping, especially when dealing with complex layouts like product listings, comment threads, or hierarchical menus. Nokogiri provides powerful tools to navigate and extract data from deeply nested elements using CSS selectors, XPath expressions, and DOM traversal methods.

Understanding Nested HTML Structures

Before diving into extraction techniques, it's important to understand what constitutes nested HTML. Consider this typical e-commerce product structure:

<div class="product-container">
  <div class="product-header">
    <h2 class="product-title">Wireless Headphones</h2>
    <div class="product-meta">
      <span class="brand">AudioTech</span>
      <span class="model">AT-WH-500</span>
    </div>
  </div>
  <div class="product-details">
    <div class="pricing">
      <span class="price current">$89.99</span>
      <span class="price original">$129.99</span>
    </div>
    <div class="specifications">
      <ul class="spec-list">
        <li><strong>Battery Life:</strong> 30 hours</li>
        <li><strong>Weight:</strong> 250g</li>
        <li><strong>Connectivity:</strong> Bluetooth 5.0</li>
      </ul>
    </div>
  </div>
</div>

Basic Nested Data Extraction

Let's start with a simple example of extracting data from the nested product structure:

require 'nokogiri'

# Parse the HTML document
html_content = <<~HTML
  <div class="product-container">
    <div class="product-header">
      <h2 class="product-title">Wireless Headphones</h2>
      <div class="product-meta">
        <span class="brand">AudioTech</span>
        <span class="model">AT-WH-500</span>
      </div>
    </div>
    <div class="product-details">
      <div class="pricing">
        <span class="price current">$89.99</span>
        <span class="price original">$129.99</span>
      </div>
    </div>
  </div>
HTML

doc = Nokogiri::HTML(html_content)

# Extract nested data using CSS selectors
product_data = {
  title: doc.css('.product-container .product-title').text.strip,
  brand: doc.css('.product-container .brand').text.strip,
  model: doc.css('.product-container .model').text.strip,
  current_price: doc.css('.product-container .price.current').text.strip,
  original_price: doc.css('.product-container .price.original').text.strip
}

puts product_data
# => {:title=>"Wireless Headphones", :brand=>"AudioTech", :model=>"AT-WH-500", 
#     :current_price=>"$89.99", :original_price=>"$129.99"}

Advanced CSS Selector Techniques

For more complex nested structures, you can use advanced CSS selectors with combinators:

require 'nokogiri'

def extract_nested_products(html_content)
  doc = Nokogiri::HTML(html_content)
  products = []

  # Extract multiple products from a nested structure
  doc.css('.product-container').each do |product|
    # Use descendant selectors within each product container
    product_info = {
      title: product.css('.product-title').text.strip,
      brand: product.css('.product-meta .brand').text.strip,
      model: product.css('.product-meta .model').text.strip,

      # Direct child selector
      current_price: product.css('> .product-details .price.current').text.strip,

      # Adjacent sibling selector
      original_price: product.css('.price.current + .price.original').text.strip,

      # Key/value specifications parsed from the nested <ul class="spec-list">
      specifications: extract_specifications(product)
    }

    products << product_info
  end

  products
end

def extract_specifications(product_node)
  specs = {}

  product_node.css('.spec-list li').each do |spec_item|
    # Extract key-value pairs from nested text
    text = spec_item.text.strip
    if text.include?(':')
      key, value = text.split(':', 2)
      specs[key.strip] = value.strip
    end
  end

  specs
end

# Example usage with multiple nested products
html_with_multiple_products = <<~HTML
  <div class="products-grid">
    <div class="product-container">
      <div class="product-header">
        <h2 class="product-title">Wireless Headphones</h2>
        <div class="product-meta">
          <span class="brand">AudioTech</span>
          <span class="model">AT-WH-500</span>
        </div>
      </div>
      <div class="product-details">
        <div class="pricing">
          <span class="price current">$89.99</span>
          <span class="price original">$129.99</span>
        </div>
        <div class="specifications">
          <ul class="spec-list">
            <li><strong>Battery Life:</strong> 30 hours</li>
            <li><strong>Weight:</strong> 250g</li>
          </ul>
        </div>
      </div>
    </div>
  </div>
HTML

products = extract_nested_products(html_with_multiple_products)
puts products.first[:specifications]
# => {"Battery Life"=>"30 hours", "Weight"=>"250g"}

XPath for Complex Nested Navigation

XPath can express traversals that CSS selectors cannot, such as moving up to a parent or ancestor, selecting preceding siblings, and filtering a container by what it contains:

require 'nokogiri'

def extract_with_xpath(html_content)
  doc = Nokogiri::HTML(html_content)

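  # XPath expressions for nested data extraction.
  # Note: @class="price current" compares the whole attribute value, so it only
  # matches elements whose class attribute is exactly that string.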
  results = {
    # Descendant axis - finds elements anywhere in the subtree
    all_prices: doc.xpath('//div[@class="product-container"]//span[contains(@class, "price")]')
                   .map(&:text),

    # Parent axis - navigate up the tree
    price_containers: doc.xpath('//span[@class="price current"]/parent::div/@class')
                         .map(&:value),

    # Following-sibling axis - find siblings after current element
    original_prices: doc.xpath('//span[@class="price current"]/following-sibling::span[@class="price original"]')
                        .map(&:text),

    # Preceding-sibling axis - find siblings before current element
    current_before_original: doc.xpath('//span[@class="price original"]/preceding-sibling::span[@class="price current"]')
                                .map(&:text),

    # Ancestor axis - navigate up to find containing elements
    product_titles_with_prices: doc.xpath('//span[@class="price current"]/ancestor::div[@class="product-container"]//h2[@class="product-title"]')
                                   .map(&:text)
  }

  results
end

# Advanced XPath for conditional extraction
def extract_conditional_data(html_content)
  doc = Nokogiri::HTML(html_content)

  # Find products with discounts (both current and original prices)
  discounted_products = doc.xpath('//div[@class="product-container"][.//span[@class="price original"]]')

  discounted_products.map do |product|
    {
      title: product.xpath('.//h2[@class="product-title"]').text.strip,
      current_price: product.xpath('.//span[@class="price current"]').text.strip,
      original_price: product.xpath('.//span[@class="price original"]').text.strip,
      discount_amount: calculate_discount(product)
    }
  end
end

def calculate_discount(product_node)
  current = product_node.xpath('.//span[@class="price current"]').text.gsub(/[^\d.]/, '').to_f
  original = product_node.xpath('.//span[@class="price original"]').text.gsub(/[^\d.]/, '').to_f

  return 0 if original == 0

  ((original - current) / original * 100).round(2)
end

DOM Traversal Methods

Nokogiri nodes also expose Ruby methods such as children and parent for stepping through the tree, and these can be combined with scoped css calls:

require 'nokogiri'

class NestedDataExtractor
  def initialize(html_content)
    @doc = Nokogiri::HTML(html_content)
  end

  def extract_with_traversal
    products = []

    @doc.css('.product-container').each do |container|
      product_data = {}

      # Navigate to child elements
      header = container.children.css('.product-header').first
      if header
        product_data[:title] = header.css('.product-title').text.strip

        # Locate the metadata block nested inside the header
        meta = header.css('.product-meta').first
        if meta
          product_data[:brand] = meta.children.css('.brand').text.strip
          product_data[:model] = meta.children.css('.model').text.strip
        end
      end

      # Locate the details section within the same container
      details = container.css('.product-details').first
      if details
        # Navigate through nested pricing structure
        pricing = details.children.css('.pricing').first
        if pricing
          pricing.children.each do |price_element|
            if price_element.name == 'span' && price_element['class']
              case price_element['class']
              when 'price current'
                product_data[:current_price] = price_element.text.strip
              when 'price original'
                product_data[:original_price] = price_element.text.strip
              end
            end
          end
        end

        # Extract specifications using parent-child navigation
        specs_section = details.css('.specifications').first
        if specs_section
          product_data[:specifications] = extract_nested_specs(specs_section)
        end
      end

      products << product_data
    end

    products
  end

  private

  def extract_nested_specs(specs_container)
    specs = {}

    # Navigate through nested list structure
    spec_list = specs_container.css('.spec-list').first
    return specs unless spec_list

    spec_list.css('li').each do |item|
      # Handle nested strong tags and text nodes
      strong_element = item.css('strong').first
      if strong_element
        key = strong_element.text.strip.gsub(':', '')
        # Get text after the strong element
        value = item.text.gsub(strong_element.text, '').strip.gsub(/^:\s*/, '')
        specs[key] = value
      end
    end

    specs
  end
end

# Usage example
html_content = File.read('product_page.html') # Your HTML content
extractor = NestedDataExtractor.new(html_content)
products = extractor.extract_with_traversal

Handling Dynamic Nested Structures

For websites with varying nested structures, create flexible extraction methods:

require 'nokogiri'

class FlexibleNestedExtractor
  def initialize(html_content)
    @doc = Nokogiri::HTML(html_content)
  end

  def extract_flexible_data
    # Handle multiple possible structures
    containers = find_product_containers

    containers.map do |container|
      extract_product_data(container)
    end
  end

  private

  def find_product_containers
    # Try multiple selectors for different layouts
    selectors = [
      '.product-container',
      '.product-item',
      '.item-container',
      '[data-product-id]',
      '.product'
    ]

    selectors.each do |selector|
      elements = @doc.css(selector)
      return elements if elements.any?
    end

    []
  end

  def extract_product_data(container)
    data = {}

    # Flexible title extraction
    data[:title] = extract_title(container)
    data[:price] = extract_price(container)
    data[:description] = extract_description(container)
    data[:images] = extract_images(container)
    data[:metadata] = extract_metadata(container)

    data.compact # Remove nil values
  end

  def extract_title(container)
    title_selectors = [
      'h1', 'h2', 'h3',
      '.title', '.product-title', '.name',
      '[data-title]'
    ]

    title_selectors.each do |selector|
      element = container.css(selector).first
      return element.text.strip if element && !element.text.strip.empty?
    end

    nil
  end

  def extract_price(container)
    price_selectors = [
      '.price', '.cost', '.amount',
      '[data-price]', '.price-current',
      '.price .current'
    ]

    prices = []

    price_selectors.each do |selector|
      container.css(selector).each do |element|
        price_text = element.text.strip
        if price_text.match?(/[\$£€¥]\d+|\d+[\$£€¥]|\d+\.\d{2}/)
          prices << {
            value: price_text,
            class: element['class'],
            context: element.parent&.[]('class')
          }
        end
      end
    end

    prices
  end

  def extract_description(container)
    desc_selectors = [
      '.description', '.product-description',
      '.summary', '.details', 'p'
    ]

    desc_selectors.each do |selector|
      element = container.css(selector).first
      next unless element

      text = element.text.strip
      return text if text.length > 20 # Ensure substantial content
    end

    nil
  end

  def extract_images(container)
    images = []

    container.css('img').each do |img|
      src = img['src'] || img['data-src'] || img['data-lazy']
      alt = img['alt']

      if src && !src.strip.empty?
        images << {
          src: src.strip,
          alt: alt&.strip,
          class: img['class']
        }
      end
    end

    images
  end

  def extract_metadata(container)
    metadata = {}

    # Extract data attributes
    container.attributes.each do |name, attr|
      if name.start_with?('data-')
        key = name.delete_prefix('data-').tr('-', '_')
        metadata[key] = attr.value
      end
    end

    # Extract nested metadata from specific containers
    container.css('.metadata, .meta, .attributes').each do |meta_container|
      meta_container.css('span, div').each do |element|
        class_name = element['class']
        if class_name && !element.text.strip.empty?
          metadata[class_name] = element.text.strip
        end
      end
    end

    metadata
  end
end

Performance Optimization for Large Nested Structures

When dealing with large documents with many nested elements, optimize your extraction:

require 'nokogiri'
require 'benchmark'

class OptimizedNestedExtractor
  def initialize(html_content)
    @doc = Nokogiri::HTML(html_content) { |config| config.noblanks }
  end

  def extract_efficiently
    results = []

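    # Select the containers once. Nokogiri compiles CSS selectors to XPath internally,
    # so benchmark before assuming either style is faster on your documents.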
    product_containers = @doc.xpath('//div[@class="product-container"]')

    product_containers.each do |container|
      # Extract all needed data in a single pass
      product_data = extract_single_pass(container)
      results << product_data if product_data.any?
    end

    results
  end

  private

  def extract_single_pass(container)
    # Look up each field with one XPath query scoped to this container
    elements = {
      title: container.xpath('.//h2[@class="product-title"]').first,
      brand: container.xpath('.//span[@class="brand"]').first,
      model: container.xpath('.//span[@class="model"]').first,
      current_price: container.xpath('.//span[@class="price current"]').first,
      original_price: container.xpath('.//span[@class="price original"]').first,
      specs: container.xpath('.//ul[@class="spec-list"]/li')
    }

    # Extract text content efficiently
    data = {}

    elements.each do |key, element_or_elements|
      case key
      when :specs
        data[key] = element_or_elements.map do |li|
          text = li.text.strip
          if text.include?(':')
            key_part, value_part = text.split(':', 2)
            [key_part.strip, value_part.strip]
          end
        end.compact.to_h
      else
        data[key] = element_or_elements&.text&.strip
      end
    end

    data.compact
  end
end

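# Benchmark different approaches
# Assumes extract_nested_products and extract_with_xpath from the earlier sections are loaded
def benchmark_extraction_methods(html_content)
  Benchmark.bm(20) do |x|
    x.report("CSS Selectors:") do
      1000.times { extract_nested_products(html_content) }
    end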

    x.report("XPath:") do
      1000.times { extract_with_xpath(html_content) }
    end

    x.report("Optimized:") do
      1000.times { OptimizedNestedExtractor.new(html_content).extract_efficiently }
    end
  end
end

Error Handling for Nested Extraction

Robust error handling is crucial when working with complex nested structures:

require 'nokogiri'

class RobustNestedExtractor
  def initialize(html_content)
    begin
      @doc = Nokogiri::HTML(html_content)
    rescue StandardError => e
      raise "Failed to parse HTML: #{e.message}"
    end
  end

  def safe_extract
    products = []

    begin
      containers = @doc.css('.product-container')

      containers.each_with_index do |container, index|
        begin
          product_data = extract_with_fallbacks(container)
          products << product_data if product_data && product_data.any?
        rescue StandardError => e
          puts "Error extracting product #{index}: #{e.message}"
          # Continue with next product
          next
        end
      end

    rescue StandardError => e
      puts "Critical error during extraction: #{e.message}"
      return []
    end

    products
  end

  private

  def extract_with_fallbacks(container)
    data = {}

    # Title extraction with multiple fallbacks
    data[:title] = safe_extract_text(container, [
      '.product-title',
      'h1', 'h2', 'h3',
      '.title', '.name'
    ])

    # Price extraction with validation
    data[:price] = safe_extract_price(container)

    # Description with length validation
    data[:description] = safe_extract_description(container)

    data.compact
  end

  def safe_extract_text(container, selectors)
    selectors.each do |selector|
      begin
        element = container.css(selector).first
        return element.text.strip if element && !element.text.strip.empty?
      rescue StandardError
        next
      end
    end

    nil
  end

  def safe_extract_price(container)
    price_selectors = ['.price', '.cost', '.amount']

    price_selectors.each do |selector|
      begin
        element = container.css(selector).first
        next unless element

        price_text = element.text.strip
        # Validate price format: optional currency symbol, digits, optional decimals
        if price_text.match?(/\A[\$£€¥]?\s*\d[\d,]*(?:\.\d+)?\z/)
          return price_text
        end
      rescue StandardError
        next
      end
    end

    nil
  end

  def safe_extract_description(container)
    desc_selectors = ['.description', '.summary', 'p']

    desc_selectors.each do |selector|
      begin
        element = container.css(selector).first
        next unless element

        text = element.text.strip
        # Ensure minimum length and reasonable maximum
        return text if text.length.between?(10, 1000)
      rescue StandardError
        next
      end
    end

    nil
  end
end

Integration with Modern Web Scraping Workflows

While Nokogiri excels at parsing static HTML, it does not execute JavaScript, so content that a page loads dynamically will not appear in the parsed document. For those pages, combine Nokogiri with a browser automation tool: for example, use Puppeteer to handle AJAX requests and to navigate between pages, then pass the rendered HTML to Nokogiri for extraction.

Conclusion

Extracting data from nested HTML structures using Nokogiri comes down to understanding the document structure and choosing a matching selection method: CSS selectors for descendant, child, and sibling lookups, XPath when you need to move up the tree or filter on conditions, and DOM traversal methods when you need to step between neighboring nodes.

Handle missing elements explicitly, measure before optimizing for large documents, and keep selectors scoped to a container so the extraction code stays maintainable as page layouts change.
