How do I iterate through collections of elements in Nokogiri?
Iterating through collections of elements is a fundamental operation when web scraping with Nokogiri. Whether you're extracting data from multiple table rows, processing a list of articles, or working with navigation menus, understanding how to efficiently loop through element collections is essential for effective data extraction.
Understanding Nokogiri Collections
When you search for elements using Nokogiri's search, css, or xpath methods, you get back a Nokogiri::XML::NodeSet object. This collection behaves like a Ruby array and includes all the Enumerable methods you're familiar with.
```ruby
require 'nokogiri'
require 'open-uri'

# Parse HTML document
html = <<-HTML
  <div class="container">
    <div class="item">Item 1</div>
    <div class="item">Item 2</div>
    <div class="item">Item 3</div>
    <ul class="list">
      <li>Apple</li>
      <li>Banana</li>
      <li>Cherry</li>
    </ul>
  </div>
HTML

doc = Nokogiri::HTML(html)
items = doc.css('.item')

puts items.class  # => Nokogiri::XML::NodeSet
puts items.length # => 3
```
Basic Iteration with Each
The most common way to iterate through a collection is the each method:
```ruby
# Basic iteration
doc.css('.item').each do |item|
  puts item.text
end
# Output:
# Item 1
# Item 2
# Item 3

# With index
doc.css('.item').each_with_index do |item, index|
  puts "#{index + 1}: #{item.text}"
end
# Output:
# 1: Item 1
# 2: Item 2
# 3: Item 3
```
Using Map for Transformation
When you need to transform elements into a new collection, use map:
```ruby
# Extract text content into an array
item_texts = doc.css('.item').map(&:text)
p item_texts # => ["Item 1", "Item 2", "Item 3"]

# Extract specific attributes
links = doc.css('a').map { |link| link['href'] }

# More complex transformations
item_data = doc.css('.item').map do |item|
  {
    text: item.text.strip,
    class: item['class'],
    position: item.parent.children.index(item)
  }
end
```
Practical Examples
Extracting Table Data
```ruby
html_table = <<-HTML
  <table>
    <thead>
      <tr><th>Name</th><th>Age</th><th>City</th></tr>
    </thead>
    <tbody>
      <tr><td>John</td><td>25</td><td>New York</td></tr>
      <tr><td>Jane</td><td>30</td><td>London</td></tr>
      <tr><td>Bob</td><td>35</td><td>Paris</td></tr>
    </tbody>
  </table>
HTML

doc = Nokogiri::HTML(html_table)

# Extract headers
headers = doc.css('thead th').map(&:text)

# Extract all rows
rows = doc.css('tbody tr').map do |row|
  cells = row.css('td').map(&:text)
  Hash[headers.zip(cells)]
end

pp rows
# Output:
# [{"Name"=>"John", "Age"=>"25", "City"=>"New York"},
#  {"Name"=>"Jane", "Age"=>"30", "City"=>"London"},
#  {"Name"=>"Bob", "Age"=>"35", "City"=>"Paris"}]
```
Processing Article Lists
```ruby
html_articles = <<-HTML
  <div class="articles">
    <article>
      <h2>First Article</h2>
      <p class="meta">By Author 1</p>
      <p>Article content...</p>
    </article>
    <article>
      <h2>Second Article</h2>
      <p class="meta">By Author 2</p>
      <p>Different content...</p>
    </article>
  </div>
HTML

doc = Nokogiri::HTML(html_articles)

articles = doc.css('article').map do |article|
  {
    title: article.at('h2')&.text&.strip,
    author: article.at('.meta')&.text&.strip,
    content: article.css('p:not(.meta)').map(&:text).join(' ')
  }
end

articles.each do |article|
  puts "Title: #{article[:title]}"
  puts "Author: #{article[:author]}"
  puts "Content: #{article[:content][0..50]}..."
  puts "---"
end
```
Advanced Iteration Techniques
Filtering While Iterating
```ruby
# Using select to filter elements
active_items = doc.css('.item').select { |item| item['class']&.include?('active') }

# Using reject to exclude elements
non_empty_paragraphs = doc.css('p').reject { |p| p.text.strip.empty? }

# Combining with other enumerable methods
first_three_items = doc.css('.item').first(3)
last_item = doc.css('.item').last
```
Nested Iterations
```ruby
html_nested = <<-HTML
  <div class="sections">
    <section>
      <h3>Section 1</h3>
      <ul>
        <li>Item A</li>
        <li>Item B</li>
      </ul>
    </section>
    <section>
      <h3>Section 2</h3>
      <ul>
        <li>Item C</li>
        <li>Item D</li>
      </ul>
    </section>
  </div>
HTML

doc = Nokogiri::HTML(html_nested)

doc.css('section').each do |section|
  title = section.at('h3').text
  puts "Section: #{title}"
  section.css('li').each do |item|
    puts "  - #{item.text}"
  end
end
```
Working with Complex Structures
```ruby
# Processing form elements
doc.css('form').each do |form|
  form_data = {
    action: form['action'],
    method: form['method'] || 'GET',
    fields: []
  }

  form.css('input, select, textarea').each do |field|
    field_info = {
      name: field['name'],
      type: field['type'] || field.name,
      required: field.key?('required')
    }
    form_data[:fields] << field_info
  end

  puts form_data
end
```
Performance Considerations
Efficient Element Access
```ruby
# Inefficient - searches the entire document each time
doc.css('.item').each do |item|
  siblings = doc.css('.item') # Don't do this!
  # Process item...
end

# Efficient - cache the collection
items = doc.css('.item')
items.each do |item|
  # Use the cached collection
  # Process item...
end
```
Memory-Efficient Processing
```ruby
# For large documents, consider processing in chunks
def process_large_collection(doc, selector, chunk_size = 100)
  elements = doc.css(selector)

  elements.each_slice(chunk_size) do |chunk|
    chunk.each do |element|
      # Process element
      yield element if block_given?
    end

    # Optional: garbage collection for very large datasets
    GC.start if chunk_size > 1000
  end
end

# Usage
process_large_collection(doc, '.item', 50) do |item|
  puts item.text
end
```
Error Handling in Iterations
```ruby
require 'date'

doc.css('.item').each do |item|
  begin
    # Extract data that might fail
    title = item.at('h2').text
    date = Date.parse(item.at('.date').text)
    puts "#{title} - #{date}"
  rescue StandardError => e
    puts "Error processing item: #{e.message}"
    # Log the problematic HTML for debugging
    puts "Problematic HTML: #{item.to_html}"
    next # Continue with the next item
  end
end
```
JavaScript-Heavy Content Considerations
While Nokogiri excels at parsing static HTML, modern web applications often load data dynamically through JavaScript, which Nokogiri alone cannot execute. For such sites you may need a browser automation tool to render the page first, then hand the resulting HTML to Nokogiri for parsing. The same applies to content that loads after the initial page load and to workflows that navigate through multiple pages.
Best Practices for Element Iteration
1. Cache Collections for Performance
```ruby
# Cache the NodeSet instead of re-querying
products = doc.css('.product')

products.each do |product|
  name = product.at('.name').text
  price = product.at('.price').text
  # Process each product...
end
```
2. Use Appropriate Enumerable Methods
```ruby
# Use each for side effects (printing, saving to database)
doc.css('.item').each { |item| puts item.text }

# Use map for transformations
item_data = doc.css('.item').map { |item| item.text.strip }

# Use select for filtering (safe navigation guards against a missing class attribute)
active_items = doc.css('.item').select { |item| item['class']&.include?('active') }

# Use find for the first matching element
first_match = doc.css('.item').find { |item| item.text.include?('search term') }
```
3. Handle Missing Elements Gracefully
```ruby
doc.css('.product').each do |product|
  # Use safe navigation to handle missing elements
  name = product.at('.name')&.text&.strip
  price = product.at('.price')&.text&.strip

  # Skip products with missing required data
  next unless name && price

  puts "#{name}: #{price}"
end
```
4. Optimize for Large Documents
```ruby
# Use streaming (SAX) for very large XML documents
require 'nokogiri'

class ProductParser < Nokogiri::XML::SAX::Document
  attr_reader :products

  def initialize
    @products = []
  end

  def start_element(name, attributes = [])
    if name == 'product'
      @current_product = {}
    elsif @current_product
      # Start buffering text for a field inside <product>
      @current_field = name
      @buffer = +''
    end
  end

  def characters(string)
    @buffer << string if @buffer
  end

  def end_element(name)
    if name == 'product'
      @products << @current_product
      @current_product = nil
    elsif @current_product && name == @current_field
      @current_product[name] = @buffer.strip
      @current_field = nil
      @buffer = nil
    end
  end
end

# For extremely large XML files
handler = ProductParser.new
Nokogiri::XML::SAX::Parser.new(handler).parse(File.open('large_file.xml'))
```
Common Patterns and Use Cases
Processing Navigation Menus
```ruby
# Extract navigation structure
nav_items = doc.css('nav ul li').map do |item|
  link = item.at('a')
  {
    text: link&.text&.strip,
    href: link&.[]('href'),
    active: item['class']&.include?('active')
  }
end
```
Extracting Product Information
```ruby
# E-commerce product listings
products = doc.css('.product-card').map do |card|
  {
    title: card.at('.product-title')&.text&.strip,
    price: card.at('.price')&.text&.gsub(/[^\d.]/, '').to_f,
    rating: card.css('.star.filled').length,
    image_url: card.at('img')&.[]('src'),
    availability: card.at('.stock-status')&.text&.strip
  }
end

# Filter available products
available_products = products.select { |p| p[:availability] == 'In Stock' }
```
Processing Data Tables
```ruby
# Extract table data with headers
table = doc.at('table')
headers = table.css('thead th').map(&:text)

data = table.css('tbody tr').map do |row|
  cells = row.css('td').map(&:text)
  Hash[headers.zip(cells)]
end
```
Integration with Data Storage
Saving to JSON
```ruby
require 'json'
require 'time' # needed for Time#iso8601

# Extract and save data
scraped_data = doc.css('.item').map do |item|
  {
    id: item['data-id'],
    title: item.at('.title').text,
    description: item.at('.description').text,
    timestamp: Time.now.iso8601
  }
end

# Save to JSON file
File.write('scraped_data.json', JSON.pretty_generate(scraped_data))
```
Database Integration
```ruby
# Example with ActiveRecord (Rails)
doc.css('.article').each do |article_element|
  Article.create!(
    title: article_element.at('h2').text,
    content: article_element.at('.content').text,
    author: article_element.at('.author').text,
    published_at: Time.parse(article_element.at('.date').text)
  )
end
```
Conclusion
Iterating through collections in Nokogiri is straightforward thanks to Ruby's powerful enumerable methods. Whether you're extracting simple lists or processing complex nested structures, the combination of CSS selectors and enumerable methods provides a flexible and efficient approach to data extraction.
Key takeaways for effective iteration:
- Cache collections to avoid repeated DOM queries
- Choose the right enumerable method for your use case
- Handle missing elements with safe navigation
- Process large datasets efficiently with chunking and streaming
- Combine with error handling for robust scrapers
- Consider browser automation for JavaScript-heavy content
With these techniques, you'll be able to efficiently extract data from any HTML or XML document structure while maintaining clean, readable, and performant code.