What are the best practices for memory management with Nokogiri?

Memory management is crucial when working with Nokogiri, especially when processing large XML or HTML documents or running long-lived applications. Nokogiri uses libxml2 under the hood, which means proper memory management requires understanding both Ruby's garbage collection and the underlying C library's memory handling.

Understanding Nokogiri's Memory Model

Nokogiri creates C-level libxml2 structures that are wrapped by Ruby objects. Most of a parsed document's memory lives outside the Ruby heap, so Ruby's garbage collector frees it only when it finalizes the wrapper objects, which may not happen promptly. This can lead to memory bloat in applications that process many documents.
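
You can observe this skew directly. Here is a minimal sketch (assuming a Linux or macOS `ps`) comparing what Ruby's object-size accounting sees against the process's actual resident memory:

require 'nokogiri'
require 'objspace'

html = '<div>' + ('<p>row</p>' * 50_000) + '</div>'
doc  = Nokogiri::HTML(html)

# ObjectSpace mostly sees the Ruby wrapper; the bulk of the document lives
# in libxml2 structures outside the Ruby heap, so this number can be
# misleadingly small.
puts "Ruby-visible size: #{ObjectSpace.memsize_of(doc)} bytes"

# Process RSS (in KB) includes the C-side allocations as well.
puts "Process RSS: #{`ps -o rss= -p #{Process.pid}`.to_i} KB"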

Basic Memory Management Principles

# Good: Explicitly remove references
doc = Nokogiri::HTML(html_content)
# Process the document
results = doc.css('div.content').map(&:text)
doc = nil  # Remove reference to help GC

# Good: Use blocks for automatic cleanup
File.open('large_file.xml') do |file|
  doc = Nokogiri::XML(file)
  # Process document within block scope
  doc.css('item').each { |item| process_item(item) }
  # Document goes out of scope automatically
end

Force Garbage Collection for Large Documents

When processing large documents or many documents in sequence, manually triggering garbage collection can help free memory more aggressively:

def process_large_documents(file_paths)
  file_paths.each_with_index do |path, index|
    doc = Nokogiri::XML(File.read(path))
    extract_data(doc)
    doc = nil

    # Force GC every 10 documents
    if (index + 1) % 10 == 0
      GC.start
      GC.compact if GC.respond_to?(:compact)
    end
  end
end
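
A usage sketch (assuming a directory of XML exports):

process_large_documents(Dir.glob('exports/*.xml'))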

Use Streaming for Very Large Files

For extremely large XML files, consider using Nokogiri's SAX parser instead of DOM parsing to avoid loading the entire document into memory:

class MyHandler < Nokogiri::XML::SAX::Document
  def start_element(name, attributes = [])
    if name == 'target_element'
      @current_element = {}
      @inside_target = true
    end
  end

  def characters(string)
    if @inside_target
      @current_element[:content] ||= ""
      @current_element[:content] += string
    end
  end

  def end_element(name)
    if name == 'target_element'
      process_element(@current_element)
      @current_element = nil
      @inside_target = false
    end
  end

  private

  def process_element(element)
    # Process element data without keeping full document in memory
    puts element[:content]
  end
end

# Process large file with near-constant memory usage
parser = Nokogiri::XML::SAX::Parser.new(MyHandler.new)
File.open('very_large_file.xml') { |f| parser.parse(f) }  # Block form closes the file handle

Limit Node Collections and Use Iterators

Instead of keeping a long-lived reference to every matching node, process nodes as you iterate to reduce the retained memory footprint:

# Memory-intensive: Retains a reference to every matched node
all_items = doc.css('item')  # Could be thousands of nodes
results = all_items.map { |item| expensive_processing(item) }

# Memory-efficient: Avoid keeping an extra long-lived reference to the NodeSet
results = []
doc.css('item').each do |item|
  results << expensive_processing(item)
  # The Ruby wrapper for each node can be collected after processing;
  # the underlying C nodes are freed only when the document itself is freed
end

# Lazy evaluation pays off when you only need part of the results
first_results = doc.css('item').lazy.map { |item| expensive_processing(item) }.first(100)

Remove Nodes from Documents

When you no longer need certain parts of a document, explicitly remove them to free memory:

doc = Nokogiri::HTML(large_html)

# Remove unnecessary sections; note that unlinked nodes typically remain
# owned by the document until it is freed, so the main win is faster
# traversal and smaller serialized output
doc.css('script, style, .advertisement').remove

# Process remaining content
content_nodes = doc.css('.content')
process_content(content_nodes)

# Drop the reference so the whole document can be collected
doc = nil

Handle Character Encoding Efficiently

Encoding detection and transcoding create extra intermediate strings, and the overhead compounds when parsing in a loop. Always specify the encoding when it is known:

# Good: Specify encoding to avoid conversion overhead
doc = Nokogiri::HTML(html_string, nil, 'UTF-8')

# Good: Reuse the encoding already tagged on the Ruby string
detected_encoding = html_string.encoding.name
doc = Nokogiri::HTML(html_string, nil, detected_encoding)

# Avoid: Letting Nokogiri auto-detect encoding repeatedly
# This can cause memory overhead in loops

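For example, when you know a site serves UTF-8, tag each raw body once instead of letting detection run on every document. A sketch (`raw_pages` is a hypothetical array of response bodies):

raw_pages.each do |raw|
  html = raw.dup.force_encoding('UTF-8')  # Tag the bytes; avoids a transcoding pass
  doc  = Nokogiri::HTML(html, nil, 'UTF-8')
  extract_data(doc)
  doc = nil
end
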
For more information about handling encoding issues specifically, see our guide on how to handle encoding issues in Nokogiri.

Optimize XPath and CSS Selectors

Inefficient selectors can cause memory issues by creating unnecessary intermediate collections:

# Memory-intensive: Multiple traversals
doc.css('div').css('.item').css('a')

# Memory-efficient: Single precise selector
doc.css('div .item a')

# Memory-intensive: Descendant selector on large documents
doc.css('table td')

# Memory-efficient: More specific selector
doc.css('table.data-table tbody td')
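
You can also scope queries to a subtree you have already located, so libxml2 searches a smaller context. A sketch:

table = doc.at_css('table.data-table')
cells = table ? table.css('tbody td') : []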

Use Connection Pooling for Web Scraping

When scraping multiple pages, reuse HTTP connections and manage document lifecycle properly:

require 'net/http'
require 'nokogiri'

class MemoryEfficientScraper
  def initialize
    @http = Net::HTTP.new('example.com', 80)
    @http.start  # Keep the connection alive across requests
  end

  def scrape_pages(paths)
    paths.each do |path|
      # Net::HTTP#get expects a request path (e.g. '/products'), not a full URL
      response = @http.get(path)

      # Process each page independently
      process_page(response.body)

      # Drop the reference so the response body can be collected
      response = nil
      GC.start if rand(10) == 0  # Occasional GC
    end
  ensure
    @http.finish if @http.started?
  end

  private

  def process_page(html)
    doc = Nokogiri::HTML(html)

    # Extract data efficiently
    data = doc.css('.target-class').map do |node|
      {
        title: node.at_css('.title')&.text&.strip,
        link: node.at_css('a')&.[]('href')
      }
    end

    # Process data and clear document reference
    save_data(data)
    doc = nil
  end
end
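
A usage sketch (the paths are illustrative):

scraper = MemoryEfficientScraper.new
scraper.scrape_pages(['/products?page=1', '/products?page=2'])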

Monitor Memory Usage

Keep track of memory usage during development and production:

def monitor_memory_usage
  # Resident set size in KB, read via ps (Linux/macOS)
  before = `ps -o rss= -p #{Process.pid}`.to_i

  yield  # Execute the block

  after = `ps -o rss= -p #{Process.pid}`.to_i
  puts "Memory usage: #{after - before} KB increase"
end

# Usage
monitor_memory_usage do
  doc = Nokogiri::HTML(large_html_content)
  process_document(doc)
  doc = nil
end

Advanced Memory Optimization Techniques

Use Fragment Parsing for Partial Content

When you only need specific parts of a document, use fragment parsing:

# Instead of parsing entire document
doc = Nokogiri::HTML(full_html)
target_content = doc.css('#specific-section').first

# Parse only the needed fragment
fragment_html = extract_section_html(full_html)  # Custom extraction
fragment = Nokogiri::HTML::DocumentFragment.parse(fragment_html)
target_content = fragment.css('.target-class')

Implement Document Caching with Memory Limits

class DocumentCache
  def initialize(max_size: 100)
    @cache = {}
    @max_size = max_size
    @access_order = []
  end

  def get_or_parse(key, html_content)
    if @cache.key?(key)
      # Move to end (most recently used)
      @access_order.delete(key)
      @access_order.push(key)
      return @cache[key]
    end

    # Parse new document
    doc = Nokogiri::HTML(html_content)

    # Evict oldest if at capacity
    if @cache.size >= @max_size
      oldest_key = @access_order.shift
      @cache.delete(oldest_key)
    end

    @cache[key] = doc
    @access_order.push(key)
    doc
  end

  def clear
    @cache.clear
    @access_order.clear
    GC.start
  end
end
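
Usage is straightforward, but remember that every cached entry pins a fully parsed document (including its libxml2 memory) until eviction, so keep max_size conservative:

cache = DocumentCache.new(max_size: 50)
doc = cache.get_or_parse('homepage', html_content)
titles = doc.css('h1').map(&:text)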

Performance Considerations for Large Documents

When dealing with large documents, understanding the performance implications of using Nokogiri for large documents becomes crucial for effective memory management.

Batch Processing Strategy

# NOTE: RecordBoundaryParser is a hypothetical, application-specific helper
# standing in for whatever logic detects and extracts one record per line.
def process_large_xml_file(file_path, batch_size: 1000)
  File.open(file_path) do |file|
    parser = RecordBoundaryParser.new

    batch = []

    file.each_line do |line|
      if parser.record_start?(line)
        # Flush the current batch once it is full
        if batch.size >= batch_size
          process_batch(batch)
          batch.clear
          GC.start  # Force cleanup between batches
        end

        batch << parser.extract_record(line)
      end
    end

    # Process remaining records
    process_batch(batch) unless batch.empty?
  end
end

Memory Management in Multi-threaded Applications

When using Nokogiri in multi-threaded applications, be extra careful about memory management:

require 'concurrent'

# Thread-safe document processing
def parallel_document_processing(documents)
  # Limit concurrent threads to control memory usage
  pool = Concurrent::FixedThreadPool.new(4)

  documents.each do |doc_data|
    pool.post do
      begin
        doc = Nokogiri::HTML(doc_data[:content])
        result = process_document(doc)
        doc_data[:result] = result
      ensure
        doc = nil
        # Note: GC.start is process-wide, not per-thread; running it after
        # every task is aggressive and mainly worthwhile for large documents
        GC.start
      end
    end
  end

  pool.shutdown
  pool.wait_for_termination
end

Best Practices Summary

  1. Always clear document references after processing
  2. Use SAX parsing for very large files to maintain constant memory usage
  3. Process nodes iteratively instead of collecting them all at once
  4. Force garbage collection periodically when processing many documents
  5. Remove unnecessary nodes from documents to reduce memory footprint
  6. Use specific CSS/XPath selectors to avoid creating large intermediate collections
  7. Monitor memory usage during development and in production
  8. Implement caching strategies with proper size limits
  9. Consider fragment parsing when you only need specific document sections
  10. Use connection pooling for web scraping to reduce overhead

By following these memory management best practices, you can build robust applications that handle large XML and HTML documents efficiently without running into memory-related issues. Regular monitoring and profiling will help you identify and address any memory bottlenecks in your specific use case.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
