What are the best practices for memory management with Nokogiri?
Memory management is crucial when working with Nokogiri, especially when processing large XML or HTML documents or running long-lived applications. Nokogiri uses libxml2 under the hood, which means proper memory management requires understanding both Ruby's garbage collection and the underlying C library's memory handling.
Understanding Nokogiri's Memory Model
Nokogiri creates C-level objects that are wrapped by Ruby objects. When Ruby's garbage collector runs, it may not immediately free the underlying C memory, which can lead to memory bloat in applications that process many documents.
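In practice this means that holding a reference to any single node keeps the whole parsed tree reachable. A minimal sketch of the idea (the markup here is just a stand-in):
require 'nokogiri'

doc = Nokogiri::HTML('<div class="content">Hello</div>')
node = doc.at_css('div.content')
node.document # => the same parsed document; the node keeps the entire tree reachable

# Prefer copying out plain Ruby values (Strings, Hashes) when data must
# outlive the document; they do not pin the parsed tree in memory
text = node.text
node = nil
doc = nil # only now is the document (and its C-level memory) eligible for collection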
Basic Memory Management Principles
# Good: Explicitly remove references
doc = Nokogiri::HTML(html_content)
# Process the document
results = doc.css('div.content').map(&:text)
doc = nil # Remove reference to help GC
# Good: Use blocks for automatic cleanup
File.open('large_file.xml') do |file|
  doc = Nokogiri::XML(file)
  # Process document within block scope
  doc.css('item').each { |item| process_item(item) }
  # Document goes out of scope automatically
end
Force Garbage Collection for Large Documents
When processing large documents or many documents in sequence, manually triggering garbage collection can help free memory more aggressively:
def process_large_documents(file_paths)
  file_paths.each_with_index do |path, index|
    doc = Nokogiri::XML(File.read(path))
    extract_data(doc)
    doc = nil

    # Force GC every 10 documents
    if (index + 1) % 10 == 0
      GC.start
      GC.compact if GC.respond_to?(:compact)
    end
  end
end
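A note on the design choice: GC.start runs a full, stop-the-world collection, and GC.compact only exists on Ruby 2.7 and later (which is what the respond_to? guard handles), so treat the every-ten-documents interval as a starting point and tune it against your own throughput measurements.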
Use Streaming for Very Large Files
For extremely large XML files, consider using Nokogiri's SAX parser instead of DOM parsing to avoid loading the entire document into memory:
class MyHandler < Nokogiri::XML::SAX::Document
  def start_element(name, attributes = [])
    if name == 'target_element'
      @current_element = {}
      @inside_target = true
    end
  end

  def characters(string)
    if @inside_target
      @current_element[:content] ||= ""
      @current_element[:content] += string
    end
  end

  def end_element(name)
    if name == 'target_element'
      process_element(@current_element)
      @current_element = nil
      @inside_target = false
    end
  end

  private

  def process_element(element)
    # Process element data without keeping full document in memory
    puts element[:content]
  end
end
# Process the large file with roughly constant memory usage
parser = Nokogiri::XML::SAX::Parser.new(MyHandler.new)
File.open('very_large_file.xml') do |file|
  parser.parse(file) # the block form also ensures the file handle is closed
end
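If the callback style of SAX feels unwieldy, Nokogiri also provides a pull parser, Nokogiri::XML::Reader, which walks the file node by node without building the whole tree. A minimal sketch, reusing the same hypothetical file and element name as above:
File.open('very_large_file.xml') do |file|
  Nokogiri::XML::Reader(file).each do |node|
    # Only react to opening <target_element> tags
    next unless node.name == 'target_element' &&
                node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    # inner_xml materializes just this record, not the whole document
    puts node.inner_xml
  end
end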
Limit Node Collections and Use Iterators
Instead of collecting all matching nodes at once, process them iteratively to reduce memory footprint:
# Memory-intensive: materializes the full NodeSet and a second array of results
all_items = doc.css('item') # Could be thousands of nodes
results = all_items.map { |item| expensive_processing(item) }
# Better: iterate the NodeSet directly instead of keeping a named reference to it
results = []
doc.css('item').each do |item|
  results << expensive_processing(item)
end
# Even better when you only need part of the output: lazy evaluation stops
# calling expensive_processing as soon as enough results have been taken
first_ten = doc.css('item').lazy.map { |item| expensive_processing(item) }.first(10)
Remove Nodes from Documents
When you no longer need certain parts of a document, remove them so later traversals and serializations touch fewer nodes. Keep in mind that the underlying libxml2 memory for unlinked nodes is generally not reclaimed until the document itself is released:
doc = Nokogiri::HTML(large_html)
# Remove unnecessary sections to reduce memory usage
doc.css('script, style, .advertisement').remove
# Process remaining content
content_nodes = doc.css('.content')
process_content(content_nodes)
# Clear the document
doc = nil
Handle Character Encoding Efficiently
Improper encoding handling can create needless intermediate string copies from detection and transcoding. Always specify the encoding when you know it:
# Good: Specify encoding to avoid conversion overhead
doc = Nokogiri::HTML(html_string, nil, 'UTF-8')
# Good: Handle encoding detection properly
detected_encoding = html_string.encoding.name
doc = Nokogiri::HTML(html_string, nil, detected_encoding)
# Avoid: Letting Nokogiri auto-detect encoding repeatedly
# This can cause memory overhead in loops
For more information about handling encoding issues specifically, see our guide on how to handle encoding issues in Nokogiri.
Optimize XPath and CSS Selectors
Inefficient selectors can cause memory issues by creating unnecessary intermediate collections:
# Memory-intensive: each chained call builds another intermediate NodeSet
doc.css('div').css('.item').css('a')
# Memory-efficient: a single selector builds one NodeSet
doc.css('div .item a')
# Broad: matches every cell in every table in the document
doc.css('table td')
# Narrower: a more specific selector keeps the result set small
doc.css('table.data-table tbody td')
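Relatedly, when you only need the first match, at_css and at_xpath return a single node (or nil) instead of handing you a NodeSet to keep around:
# Builds a NodeSet only to keep its first element
first_cell = doc.css('table.data-table tbody td').first
# Returns a single node; nothing in your code retains the full match set
first_cell = doc.at_css('table.data-table tbody td')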
Use Connection Pooling for Web Scraping
When scraping multiple pages, reuse HTTP connections and manage document lifecycle properly:
require 'net/http'
require 'nokogiri'
class MemoryEfficientScraper
  def initialize
    @http = Net::HTTP.new('example.com', 80)
    @http.start # Keep the connection alive across requests
  end

  # Takes request paths (e.g. '/products?page=1'); Net::HTTP#get expects a
  # path, not a full URL, when reusing an open connection
  def scrape_pages(paths)
    paths.each do |path|
      response = @http.get(path)
      # Process each page independently
      process_page(response.body)
      # Drop the reference so the response body can be collected
      response = nil
      GC.start if rand(10) == 0 # Occasional GC
    end
  ensure
    @http.finish if @http.started?
  end

  private

  def process_page(html)
    doc = Nokogiri::HTML(html)
    # Extract plain Ruby values so nothing retains the parsed document
    data = doc.css('.target-class').map do |node|
      {
        title: node.at_css('.title')&.text&.strip,
        link: node.at_css('a')&.[]('href')
      }
    end
    # Persist the data and clear the document reference
    save_data(data)
    doc = nil
  end
end
Monitor Memory Usage
Keep track of memory usage during development and production:
def monitor_memory_usage
  before = `ps -o pid,rss -p #{Process.pid}`.split("\n").last.split.last.to_i
  yield # Execute the block
  after = `ps -o pid,rss -p #{Process.pid}`.split("\n").last.split.last.to_i
  puts "Memory usage: #{after - before} KB increase"
end

# Usage
monitor_memory_usage do
  doc = Nokogiri::HTML(large_html_content)
  process_document(doc)
  doc = nil
end
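RSS is the right thing to measure here: the parsed tree lives in memory allocated by libxml2 rather than as ordinary Ruby objects, so object-level counters such as ObjectSpace.count_objects will not show its true size. If shelling out to ps feels awkward, the get_process_mem gem (an extra dependency, not something Nokogiri requires) reports the same process-level figure:
require 'get_process_mem' # gem install get_process_mem

mem = GetProcessMem.new
puts "RSS: #{mem.mb.to_f.round(1)} MB"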
Advanced Memory Optimization Techniques
Use Fragment Parsing for Partial Content
When you only need specific parts of a document, use fragment parsing:
# Instead of parsing entire document
doc = Nokogiri::HTML(full_html)
target_content = doc.css('#specific-section').first
# Parse only the needed fragment
fragment_html = extract_section_html(full_html) # Custom extraction
fragment = Nokogiri::HTML::DocumentFragment.parse(fragment_html)
target_content = fragment.css('.target-class')
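This only pays off when extract_section_html hands Nokogiri a much smaller slice of markup than the original page; the fragment still gets a small backing document internally, so the saving comes from parsing less input, not from the fragment being an inherently cheaper structure.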
Implement Document Caching with Memory Limits
class DocumentCache
  def initialize(max_size: 100)
    @cache = {}
    @max_size = max_size
    @access_order = []
  end

  def get_or_parse(key, html_content)
    if @cache.key?(key)
      # Move to end (most recently used)
      @access_order.delete(key)
      @access_order.push(key)
      return @cache[key]
    end

    # Parse new document
    doc = Nokogiri::HTML(html_content)

    # Evict oldest if at capacity
    if @cache.size >= @max_size
      oldest_key = @access_order.shift
      @cache.delete(oldest_key)
    end

    @cache[key] = doc
    @access_order.push(key)
    doc
  end

  def clear
    @cache.clear
    @access_order.clear
    GC.start
  end
end
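A brief usage sketch (fetch_page below is a hypothetical helper standing in for however you retrieve the markup):
cache = DocumentCache.new(max_size: 50)

url = 'https://example.com/products'
doc = cache.get_or_parse(url, fetch_page(url))
doc.css('.product-title').each { |node| puts node.text }

cache.clear # drop every cached document once the batch is finished
Remember that each cached document keeps its full parsed tree alive for as long as it sits in the cache, so choose max_size based on the memory you can afford, not on hit rate alone.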
Performance Considerations for Large Documents
When inputs get very large, memory management and parsing speed become intertwined; for a deeper look, see our guide on the performance implications of using Nokogiri for large documents.
Batch Processing Strategy
def process_large_xml_file(file_path, batch_size: 1000)
  File.open(file_path) do |file|
    # RecordBoundaryParser is an application-specific helper (not part of
    # Nokogiri) that recognizes where one record ends and the next begins
    parser = RecordBoundaryParser.new
    batch = []

    file.each_line do |line|
      if parser.record_start?(line)
        # Flush the current batch once it is full
        if batch.size >= batch_size
          process_batch(batch)
          batch.clear
          GC.start # Force cleanup between batches
        end
        batch << parser.extract_record(line)
      end
    end

    # Process remaining records
    process_batch(batch) unless batch.empty?
  end
end
Memory Management in Multi-threaded Applications
When using Nokogiri in multi-threaded applications, be extra careful about memory management:
require 'concurrent'
# Thread-safe document processing
def parallel_document_processing(documents)
  # Limit concurrent threads to control memory usage
  pool = Concurrent::FixedThreadPool.new(4)

  documents.each do |doc_data|
    pool.post do
      begin
        doc = Nokogiri::HTML(doc_data[:content])
        result = process_document(doc)
        doc_data[:result] = result
      ensure
        doc = nil
        # Note: GC.start affects the whole process, not just this thread;
        # collecting after every document is aggressive and may cost throughput
        GC.start
      end
    end
  end

  pool.shutdown
  pool.wait_for_termination
end
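Keep each parsed document confined to the thread that created it. Nokogiri documents are not safe to mutate from multiple threads concurrently, and confinement also means every document becomes collectable as soon as its task finishes.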
Best Practices Summary
- Always clear document references after processing
- Use SAX parsing for very large files to maintain constant memory usage
- Process nodes iteratively instead of collecting them all at once
- Force garbage collection periodically when processing many documents
- Remove unnecessary nodes from documents to reduce memory footprint
- Use specific CSS/XPath selectors to avoid creating large intermediate collections
- Monitor memory usage during development and in production
- Implement caching strategies with proper size limits
- Consider fragment parsing when you only need specific document sections
- Use connection pooling for web scraping to reduce overhead
By following these memory management best practices, you can build robust applications that handle large XML and HTML documents efficiently without running into memory-related issues. Regular monitoring and profiling will help you identify and address any memory bottlenecks in your specific use case.