How do I iterate through collections of elements in Nokogiri?
Iterating through collections of elements is a fundamental operation when web scraping with Nokogiri. Whether you're extracting data from multiple table rows, processing a list of articles, or working with navigation menus, understanding how to efficiently loop through element collections is essential for effective data extraction.
Understanding Nokogiri Collections
When you search for elements using Nokogiri's search, css, or xpath methods, you get back a Nokogiri::XML::NodeSet object. This collection behaves like a Ruby array and includes all the Enumerable methods you're familiar with.
```ruby
require 'nokogiri'
require 'open-uri'

# Parse HTML document
html = <<-HTML
  <div class="container">
    <div class="item">Item 1</div>
    <div class="item">Item 2</div>
    <div class="item">Item 3</div>
    <ul class="list">
      <li>Apple</li>
      <li>Banana</li>
      <li>Cherry</li>
    </ul>
  </div>
HTML

doc = Nokogiri::HTML(html)
items = doc.css('.item')

puts items.class  # => Nokogiri::XML::NodeSet
puts items.length # => 3
```
Basic Iteration with Each
The most common way to iterate through a collection is the each method:
```ruby
# Basic iteration
doc.css('.item').each do |item|
  puts item.text
end
# Output:
# Item 1
# Item 2
# Item 3

# With index
doc.css('.item').each_with_index do |item, index|
  puts "#{index + 1}: #{item.text}"
end
# Output:
# 1: Item 1
# 2: Item 2
# 3: Item 3
```
Using Map for Transformation
When you need to transform elements into a new collection, use map:
```ruby
# Extract text content into an array
item_texts = doc.css('.item').map(&:text)
p item_texts # => ["Item 1", "Item 2", "Item 3"]

# Extract specific attributes
links = doc.css('a').map { |link| link['href'] }

# More complex transformations
item_data = doc.css('.item').map do |item|
  {
    text: item.text.strip,
    class: item['class'],
    position: item.parent.children.index(item)
  }
end
```
Practical Examples
Extracting Table Data
```ruby
html_table = <<-HTML
  <table>
    <thead>
      <tr><th>Name</th><th>Age</th><th>City</th></tr>
    </thead>
    <tbody>
      <tr><td>John</td><td>25</td><td>New York</td></tr>
      <tr><td>Jane</td><td>30</td><td>London</td></tr>
      <tr><td>Bob</td><td>35</td><td>Paris</td></tr>
    </tbody>
  </table>
HTML

doc = Nokogiri::HTML(html_table)

# Extract headers
headers = doc.css('thead th').map(&:text)

# Extract all rows
rows = doc.css('tbody tr').map do |row|
  cells = row.css('td').map(&:text)
  Hash[headers.zip(cells)]
end

pp rows
# Output:
# [{"Name"=>"John", "Age"=>"25", "City"=>"New York"},
#  {"Name"=>"Jane", "Age"=>"30", "City"=>"London"},
#  {"Name"=>"Bob", "Age"=>"35", "City"=>"Paris"}]
```
Processing Article Lists
```ruby
html_articles = <<-HTML
  <div class="articles">
    <article>
      <h2>First Article</h2>
      <p class="meta">By Author 1</p>
      <p>Article content...</p>
    </article>
    <article>
      <h2>Second Article</h2>
      <p class="meta">By Author 2</p>
      <p>Different content...</p>
    </article>
  </div>
HTML

doc = Nokogiri::HTML(html_articles)

articles = doc.css('article').map do |article|
  {
    title: article.at('h2')&.text&.strip,
    author: article.at('.meta')&.text&.strip,
    content: article.css('p:not(.meta)').map(&:text).join(' ')
  }
end

articles.each do |article|
  puts "Title: #{article[:title]}"
  puts "Author: #{article[:author]}"
  puts "Content: #{article[:content][0..50]}..."
  puts "---"
end
```
Advanced Iteration Techniques
Filtering While Iterating
```ruby
# Using select to filter elements
active_items = doc.css('.item').select { |item| item['class']&.include?('active') }

# Using reject to exclude elements
non_empty_paragraphs = doc.css('p').reject { |p| p.text.strip.empty? }

# Combining with other enumerable methods
first_three_items = doc.css('.item').first(3)
last_item = doc.css('.item').last
```
Nested Iterations
```ruby
html_nested = <<-HTML
  <div class="sections">
    <section>
      <h3>Section 1</h3>
      <ul>
        <li>Item A</li>
        <li>Item B</li>
      </ul>
    </section>
    <section>
      <h3>Section 2</h3>
      <ul>
        <li>Item C</li>
        <li>Item D</li>
      </ul>
    </section>
  </div>
HTML

doc = Nokogiri::HTML(html_nested)

doc.css('section').each do |section|
  title = section.at('h3').text
  puts "Section: #{title}"
  section.css('li').each do |item|
    puts "  - #{item.text}"
  end
end
```
Working with Complex Structures
```ruby
# Processing form elements
doc.css('form').each do |form|
  form_data = {
    action: form['action'],
    method: form['method'] || 'GET',
    fields: []
  }

  form.css('input, select, textarea').each do |field|
    field_info = {
      name: field['name'],
      type: field['type'] || field.name,
      required: field.key?('required')
    }
    form_data[:fields] << field_info
  end

  puts form_data
end
```
Performance Considerations
Efficient Element Access
```ruby
# Inefficient - searches the entire document each time
doc.css('.item').each do |item|
  siblings = doc.css('.item') # Don't do this!
  # Process item...
end

# Efficient - cache the collection
items = doc.css('.item')
items.each do |item|
  # Use the cached collection
  # Process item...
end
```
Memory-Efficient Processing
```ruby
# For large documents, consider processing in chunks
def process_large_collection(doc, selector, chunk_size = 100)
  elements = doc.css(selector)

  elements.each_slice(chunk_size) do |chunk|
    chunk.each do |element|
      # Process element
      yield element if block_given?
    end

    # Optional: garbage collection for very large datasets
    GC.start if chunk_size > 1000
  end
end

# Usage
process_large_collection(doc, '.item', 50) do |item|
  puts item.text
end
```
Error Handling in Iterations
```ruby
require 'date'

doc.css('.item').each do |item|
  begin
    # Extract data that might fail
    title = item.at('h2').text
    date = Date.parse(item.at('.date').text)
    puts "#{title} - #{date}"
  rescue StandardError => e
    puts "Error processing item: #{e.message}"
    # Log the problematic HTML for debugging
    puts "Problematic HTML: #{item.to_html}"
    next # Continue with the next item
  end
end
```
JavaScript-Heavy Content Considerations
While Nokogiri excels at parsing static HTML, modern web applications often load data dynamically through JavaScript, which Nokogiri alone cannot execute. For such sites you may need a browser automation tool to render the page first, then hand the resulting HTML to Nokogiri for parsing. The same applies to content that loads after the initial page load and to workflows that navigate through multiple pages.
Best Practices for Element Iteration
1. Cache Collections for Performance
```ruby
# Cache the NodeSet instead of re-querying
products = doc.css('.product')

products.each do |product|
  name = product.at('.name').text
  price = product.at('.price').text
  # Process each product...
end
```
2. Use Appropriate Enumerable Methods
```ruby
# Use each for side effects (printing, saving to database)
doc.css('.item').each { |item| puts item.text }

# Use map for transformations
item_data = doc.css('.item').map { |item| item.text.strip }

# Use select for filtering (safe navigation guards against a missing class attribute)
active_items = doc.css('.item').select { |item| item['class']&.include?('active') }

# Use find for the first matching element
first_match = doc.css('.item').find { |item| item.text.include?('search term') }
```
3. Handle Missing Elements Gracefully
```ruby
doc.css('.product').each do |product|
  # Use safe navigation to handle missing elements
  name = product.at('.name')&.text&.strip
  price = product.at('.price')&.text&.strip

  # Skip products with missing required data
  next unless name && price

  puts "#{name}: #{price}"
end
```
4. Optimize for Large Documents
```ruby
# Use streaming (SAX) for very large XML documents
require 'nokogiri'

class ProductParser < Nokogiri::XML::SAX::Document
  attr_reader :products

  def initialize
    @products = []
  end

  def start_element(name, attributes = [])
    if name == 'product'
      @current_product = {}
    elsif @current_product
      # Start buffering text for a field inside <product>
      @current_field = name
      @buffer = +''
    end
  end

  def characters(string)
    @buffer << string if @buffer
  end

  def end_element(name)
    if name == 'product'
      @products << @current_product
      @current_product = nil
    elsif @current_product && name == @current_field
      @current_product[name] = @buffer.strip
      @current_field = nil
      @buffer = nil
    end
  end
end

# For extremely large XML files
handler = ProductParser.new
Nokogiri::XML::SAX::Parser.new(handler).parse(File.open('large_file.xml'))
```
Common Patterns and Use Cases
Processing Navigation Menus
```ruby
# Extract navigation structure
nav_items = doc.css('nav ul li').map do |item|
  link = item.at('a')
  {
    text: link&.text&.strip,
    href: link&.[]('href'),
    active: item['class']&.include?('active')
  }
end
```
Extracting Product Information
```ruby
# E-commerce product listings
products = doc.css('.product-card').map do |card|
  {
    title: card.at('.product-title')&.text&.strip,
    price: card.at('.price')&.text&.gsub(/[^\d.]/, '').to_f,
    rating: card.css('.star.filled').length,
    image_url: card.at('img')&.[]('src'),
    availability: card.at('.stock-status')&.text&.strip
  }
end

# Filter available products
available_products = products.select { |p| p[:availability] == 'In Stock' }
```
Processing Data Tables
```ruby
# Extract table data with headers
table = doc.at('table')
headers = table.css('thead th').map(&:text)

data = table.css('tbody tr').map do |row|
  cells = row.css('td').map(&:text)
  Hash[headers.zip(cells)]
end
```
Integration with Data Storage
Saving to JSON
```ruby
require 'json'
require 'time' # needed for Time#iso8601

# Extract and save data
scraped_data = doc.css('.item').map do |item|
  {
    id: item['data-id'],
    title: item.at('.title').text,
    description: item.at('.description').text,
    timestamp: Time.now.iso8601
  }
end

# Save to JSON file
File.write('scraped_data.json', JSON.pretty_generate(scraped_data))
```
Database Integration
```ruby
# Example with ActiveRecord (Rails)
doc.css('.article').each do |article_element|
  Article.create!(
    title: article_element.at('h2').text,
    content: article_element.at('.content').text,
    author: article_element.at('.author').text,
    published_at: Time.parse(article_element.at('.date').text)
  )
end
```
Conclusion
Iterating through collections in Nokogiri is straightforward thanks to Ruby's powerful enumerable methods. Whether you're extracting simple lists or processing complex nested structures, the combination of CSS selectors and enumerable methods provides a flexible and efficient approach to data extraction.
Key takeaways for effective iteration:
- Cache collections to avoid repeated DOM queries
- Choose the right enumerable method for your use case
- Handle missing elements with safe navigation
- Process large datasets efficiently with chunking and streaming
- Combine with error handling for robust scrapers
- Consider browser automation for JavaScript-heavy content
With these techniques, you'll be able to efficiently extract data from any HTML or XML document structure while maintaining clean, readable, and performant code.