How do I extract data from nested HTML structures using Nokogiri?
Extracting data from nested HTML structures is a common challenge in web scraping, especially when dealing with complex layouts like product listings, comment threads, or hierarchical menus. Nokogiri provides powerful tools to navigate and extract data from deeply nested elements using CSS selectors, XPath expressions, and DOM traversal methods.
Understanding Nested HTML Structures
Before diving into extraction techniques, it's important to understand what constitutes nested HTML. Consider this typical e-commerce product structure:
<div class="product-container">
  <div class="product-header">
    <h2 class="product-title">Wireless Headphones</h2>
    <div class="product-meta">
      <span class="brand">AudioTech</span>
      <span class="model">AT-WH-500</span>
    </div>
  </div>
  <div class="product-details">
    <div class="pricing">
      <span class="price current">$89.99</span>
      <span class="price original">$129.99</span>
    </div>
    <div class="specifications">
      <ul class="spec-list">
        <li><strong>Battery Life:</strong> 30 hours</li>
        <li><strong>Weight:</strong> 250g</li>
        <li><strong>Connectivity:</strong> Bluetooth 5.0</li>
      </ul>
    </div>
  </div>
</div>
Basic Nested Data Extraction
Let's start with a simple example of extracting data from the nested product structure:
require 'nokogiri'
# Parse the HTML document
html_content = <<~HTML
<div class="product-container">
<div class="product-header">
<h2 class="product-title">Wireless Headphones</h2>
<div class="product-meta">
<span class="brand">AudioTech</span>
<span class="model">AT-WH-500</span>
</div>
</div>
<div class="product-details">
<div class="pricing">
<span class="price current">$89.99</span>
<span class="price original">$129.99</span>
</div>
</div>
</div>
HTML
doc = Nokogiri::HTML(html_content)
# Extract nested data using CSS selectors
product_data = {
title: doc.css('.product-container .product-title').text.strip,
brand: doc.css('.product-container .brand').text.strip,
model: doc.css('.product-container .model').text.strip,
current_price: doc.css('.product-container .price.current').text.strip,
original_price: doc.css('.product-container .price.original').text.strip
}
puts product_data
# => {:title=>"Wireless Headphones", :brand=>"AudioTech", :model=>"AT-WH-500",
# :current_price=>"$89.99", :original_price=>"$129.99"}
Advanced CSS Selector Techniques
For more complex nested structures, you can combine class selectors with combinators such as the descendant (space), child (>), and adjacent sibling (+) combinators:
require 'nokogiri'
def extract_nested_products(html_content)
doc = Nokogiri::HTML(html_content)
products = []
# Extract multiple products from a nested structure
doc.css('.product-container').each do |product|
# Use descendant selectors within each product container
product_info = {
title: product.css('.product-title').text.strip,
brand: product.css('.product-meta .brand').text.strip,
model: product.css('.product-meta .model').text.strip,
# Direct child selector
current_price: product.css('> .product-details .price.current').text.strip,
# Adjacent sibling selector
original_price: product.css('.price.current + .price.original').text.strip,
# Key-value specifications parsed from the nested list
specifications: extract_specifications(product)
}
products << product_info
end
products
end
def extract_specifications(product_node)
specs = {}
product_node.css('.spec-list li').each do |spec_item|
# Extract key-value pairs from nested text
text = spec_item.text.strip
if text.include?(':')
key, value = text.split(':', 2)
specs[key.strip] = value.strip
end
end
specs
end
# Example usage with multiple nested products
html_with_multiple_products = <<~HTML
<div class="products-grid">
<div class="product-container">
<div class="product-header">
<h2 class="product-title">Wireless Headphones</h2>
<div class="product-meta">
<span class="brand">AudioTech</span>
<span class="model">AT-WH-500</span>
</div>
</div>
<div class="product-details">
<div class="pricing">
<span class="price current">$89.99</span>
<span class="price original">$129.99</span>
</div>
<div class="specifications">
<ul class="spec-list">
<li><strong>Battery Life:</strong> 30 hours</li>
<li><strong>Weight:</strong> 250g</li>
</ul>
</div>
</div>
</div>
</div>
HTML
products = extract_nested_products(html_with_multiple_products)
puts products.first[:specifications]
# => {"Battery Life"=>"30 hours", "Weight"=>"250g"}
XPath for Complex Nested Navigation
XPath adds axes that CSS selectors lack, such as parent, ancestor, and preceding-sibling, which let you navigate up and sideways from a matched element in deeply nested structures:
require 'nokogiri'
def extract_with_xpath(html_content)
doc = Nokogiri::HTML(html_content)
# XPath expressions for nested data extraction
results = {
# Descendant axis - finds elements anywhere in the subtree
all_prices: doc.xpath('//div[@class="product-container"]//span[contains(@class, "price")]')
.map(&:text),
# Parent axis - navigate up the tree
price_containers: doc.xpath('//span[@class="price current"]/parent::div/@class')
.map(&:value),
# Following-sibling axis - find siblings after current element
original_prices: doc.xpath('//span[@class="price current"]/following-sibling::span[@class="price original"]')
.map(&:text),
# Preceding-sibling axis - find siblings before current element
current_before_original: doc.xpath('//span[@class="price original"]/preceding-sibling::span[@class="price current"]')
.map(&:text),
# Ancestor axis - navigate up to find containing elements
product_titles_with_prices: doc.xpath('//span[@class="price current"]/ancestor::div[@class="product-container"]//h2[@class="product-title"]')
.map(&:text)
}
results
end
# Advanced XPath for conditional extraction
def extract_conditional_data(html_content)
doc = Nokogiri::HTML(html_content)
# Find products with discounts (both current and original prices)
discounted_products = doc.xpath('//div[@class="product-container"][.//span[@class="price original"]]')
discounted_products.map do |product|
{
title: product.xpath('.//h2[@class="product-title"]').text.strip,
current_price: product.xpath('.//span[@class="price current"]').text.strip,
original_price: product.xpath('.//span[@class="price original"]').text.strip,
discount_amount: calculate_discount(product)
}
end
end
def calculate_discount(product_node)
current = product_node.xpath('.//span[@class="price current"]').text.gsub(/[^\d.]/, '').to_f
original = product_node.xpath('.//span[@class="price original"]').text.gsub(/[^\d.]/, '').to_f
return 0 if original == 0
((original - current) / original * 100).round(2)
end
DOM Traversal Methods
Nokogiri nodes also expose Ruby methods such as children, parent, and name, so you can walk a nested structure step by step:
require 'nokogiri'
class NestedDataExtractor
def initialize(html_content)
@doc = Nokogiri::HTML(html_content)
end
def extract_with_traversal
products = []
@doc.css('.product-container').each do |container|
product_data = {}
# Navigate to child elements
header = container.children.css('.product-header').first
if header
product_data[:title] = header.css('.product-title').text.strip
# Navigate to the metadata block nested inside the header
meta = header.css('.product-meta').first
if meta
product_data[:brand] = meta.children.css('.brand').text.strip
product_data[:model] = meta.children.css('.model').text.strip
end
end
# Locate the details section inside the container
details = container.css('.product-details').first
if details
# Navigate through nested pricing structure
pricing = details.children.css('.pricing').first
if pricing
pricing.children.each do |price_element|
if price_element.name == 'span' && price_element['class']
case price_element['class']
when 'price current'
product_data[:current_price] = price_element.text.strip
when 'price original'
product_data[:original_price] = price_element.text.strip
end
end
end
end
# Extract specifications using parent-child navigation
specs_section = details.css('.specifications').first
if specs_section
product_data[:specifications] = extract_nested_specs(specs_section)
end
end
products << product_data
end
products
end
private
def extract_nested_specs(specs_container)
specs = {}
# Navigate through nested list structure
spec_list = specs_container.css('.spec-list').first
return specs unless spec_list
spec_list.css('li').each do |item|
# Handle nested strong tags and text nodes
strong_element = item.css('strong').first
if strong_element
key = strong_element.text.strip.gsub(':', '')
# Get text after the strong element
value = item.text.gsub(strong_element.text, '').strip.gsub(/^:\s*/, '')
specs[key] = value
end
end
specs
end
end
# Usage example
html_content = File.read('product_page.html') # Your HTML content
extractor = NestedDataExtractor.new(html_content)
products = extractor.extract_with_traversal
Handling Dynamic Nested Structures
For websites with varying nested structures, create flexible extraction methods:
require 'nokogiri'
class FlexibleNestedExtractor
def initialize(html_content)
@doc = Nokogiri::HTML(html_content)
end
def extract_flexible_data
# Handle multiple possible structures
containers = find_product_containers
containers.map do |container|
extract_product_data(container)
end
end
private
def find_product_containers
# Try multiple selectors for different layouts
selectors = [
'.product-container',
'.product-item',
'.item-container',
'[data-product-id]',
'.product'
]
selectors.each do |selector|
elements = @doc.css(selector)
return elements if elements.any?
end
[]
end
def extract_product_data(container)
data = {}
# Flexible title extraction
data[:title] = extract_title(container)
data[:price] = extract_price(container)
data[:description] = extract_description(container)
data[:images] = extract_images(container)
data[:metadata] = extract_metadata(container)
data.compact # Remove nil values
end
def extract_title(container)
title_selectors = [
'h1', 'h2', 'h3',
'.title', '.product-title', '.name',
'[data-title]'
]
title_selectors.each do |selector|
element = container.css(selector).first
return element.text.strip if element && !element.text.strip.empty?
end
nil
end
def extract_price(container)
price_selectors = [
'.price', '.cost', '.amount',
'[data-price]', '.price-current',
'.price .current'
]
prices = []
price_selectors.each do |selector|
container.css(selector).each do |element|
price_text = element.text.strip
if price_text.match?(/[\$£€¥]\d+|\d+[\$£€¥]|\d+\.\d{2}/)
prices << {
value: price_text,
class: element['class'],
context: element.parent&.[]('class')
}
end
end
end
prices
end
def extract_description(container)
desc_selectors = [
'.description', '.product-description',
'.summary', '.details', 'p'
]
desc_selectors.each do |selector|
element = container.css(selector).first
next unless element
text = element.text.strip
return text if text.length > 20 # Ensure substantial content
end
nil
end
def extract_images(container)
images = []
container.css('img').each do |img|
src = img['src'] || img['data-src'] || img['data-lazy']
alt = img['alt']
if src && !src.strip.empty?
images << {
src: src.strip,
alt: alt&.strip,
class: img['class']
}
end
end
images
end
def extract_metadata(container)
metadata = {}
# Extract data attributes
container.attributes.each do |name, attr|
if name.start_with?('data-')
key = name.sub(/\Adata-/, '').tr('-', '_')
metadata[key] = attr.value
end
end
# Extract nested metadata from specific containers
container.css('.metadata, .meta, .attributes').each do |meta_container|
meta_container.css('span, div').each do |element|
class_name = element['class']
if class_name && !element.text.strip.empty?
metadata[class_name] = element.text.strip
end
end
end
metadata
end
end
Performance Optimization for Large Nested Structures
When dealing with large documents with many nested elements, optimize your extraction:
require 'nokogiri'
require 'benchmark'
class OptimizedNestedExtractor
def initialize(html_content)
@doc = Nokogiri::HTML(html_content) { |config| config.noblanks }
end
def extract_efficiently
results = []
# Select every container once, then query relative to each container
product_containers = @doc.xpath('//div[@class="product-container"]')
product_containers.each do |container|
# Extract all needed data in a single pass
product_data = extract_single_pass(container)
results << product_data if product_data.any?
end
results
end
private
def extract_single_pass(container)
# Look up each field with a relative XPath query scoped to this container
elements = {
title: container.xpath('.//h2[@class="product-title"]').first,
brand: container.xpath('.//span[@class="brand"]').first,
model: container.xpath('.//span[@class="model"]').first,
current_price: container.xpath('.//span[@class="price current"]').first,
original_price: container.xpath('.//span[@class="price original"]').first,
specs: container.xpath('.//ul[@class="spec-list"]/li')
}
# Extract text content efficiently
data = {}
elements.each do |key, element_or_elements|
case key
when :specs
data[key] = element_or_elements.map do |li|
text = li.text.strip
if text.include?(':')
key_part, value_part = text.split(':', 2)
[key_part.strip, value_part.strip]
end
end.compact.to_h
else
data[key] = element_or_elements&.text&.strip
end
end
data.compact
end
end
# Benchmark different approaches (each iteration re-parses the HTML, so parsing time is included)
def benchmark_extraction_methods(html_content)
Benchmark.bm(20) do |x|
x.report("CSS Selectors:") do
1000.times { extract_nested_products(html_content) }
end
x.report("XPath:") do
1000.times { extract_with_xpath(html_content) }
end
x.report("Optimized:") do
1000.times { OptimizedNestedExtractor.new(html_content).extract_efficiently }
end
end
end
Error Handling for Nested Extraction
Robust error handling is crucial when working with complex nested structures:
require 'nokogiri'
class RobustNestedExtractor
def initialize(html_content)
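# Nokogiri recovers from malformed markup rather than raising, so this rescue
# mainly covers input that cannot be read at all
begin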
@doc = Nokogiri::HTML(html_content)
rescue StandardError => e
raise "Failed to parse HTML: #{e.message}"
end
end
def safe_extract
products = []
begin
containers = @doc.css('.product-container')
containers.each_with_index do |container, index|
begin
product_data = extract_with_fallbacks(container)
products << product_data if product_data && product_data.any?
rescue StandardError => e
puts "Error extracting product #{index}: #{e.message}"
# Continue with next product
next
end
end
rescue StandardError => e
puts "Critical error during extraction: #{e.message}"
return []
end
products
end
private
def extract_with_fallbacks(container)
data = {}
# Title extraction with multiple fallbacks
data[:title] = safe_extract_text(container, [
'.product-title',
'h1', 'h2', 'h3',
'.title', '.name'
])
# Price extraction with validation
data[:price] = safe_extract_price(container)
# Description with length validation
data[:description] = safe_extract_description(container)
data.compact
end
def safe_extract_text(container, selectors)
selectors.each do |selector|
begin
element = container.css(selector).first
return element.text.strip if element && !element.text.strip.empty?
rescue StandardError
next
end
end
nil
end
def safe_extract_price(container)
price_selectors = ['.price', '.cost', '.amount']
price_selectors.each do |selector|
begin
element = container.css(selector).first
next unless element
price_text = element.text.strip
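# Validate price format: optional currency symbol, digits with optional
# thousands separators, optional decimal part
if price_text.match?(/\A[\$£€¥]?\s?\d[\d,]*(?:\.\d+)?\z/)
return price_text
end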
rescue StandardError
next
end
end
nil
end
def safe_extract_description(container)
desc_selectors = ['.description', '.summary', 'p']
desc_selectors.each do |selector|
begin
element = container.css(selector).first
next unless element
text = element.text.strip
# Ensure minimum length and reasonable maximum
return text if text.length.between?(10, 1000)
rescue StandardError
next
end
end
nil
end
end
Integration with Modern Web Scraping Workflows
While Nokogiri excels at parsing static HTML, it does not execute JavaScript, so content that a page loads dynamically never appears in the markup it parses. For JavaScript-rendered pages, render the page with a browser automation tool first and pass the resulting HTML to Nokogiri. Related topics include handling AJAX requests using Puppeteer for dynamic content and navigating to different pages using Puppeteer for multi-page extraction workflows.
Conclusion
Extracting data from nested HTML structures using Nokogiri requires understanding both the document structure and the appropriate selection methods. Whether using CSS selectors for simple cases, XPath for complex navigation, or DOM traversal methods for dynamic scenarios, the key is to match your approach to the complexity of the data structure.
Implement proper error handling, optimize for performance when dealing with large documents, and keep your extraction code maintainable. Combined with Ruby's flexible syntax, these techniques let you extract data from even complex nested HTML structures in your web scraping projects.