What is the Proper Way to Close and Clean Up Nokogiri Documents?

Proper memory management is crucial when working with Nokogiri documents, especially in production applications that process large volumes of HTML or XML data. Unlike languages that require explicit resource cleanup, Ruby handles most memory management automatically through its garbage collector. However, Nokogiri documents can hold significant memory, and understanding proper cleanup techniques keeps performance predictable and prevents memory bloat.

Understanding Nokogiri Memory Management

Nokogiri is a Ruby wrapper around the C libraries libxml2 and libxslt. Nokogiri objects therefore hold references to memory allocated on the C heap, outside the Ruby object heap that the garbage collector tracks directly; that native memory is released only when the wrapping Ruby objects are collected. While the garbage collector will eventually reclaim these resources, being deliberate about when documents become unreachable can significantly improve your application's memory efficiency.
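
To see this in practice, you can watch the process's resident set size before and after a large document becomes unreachable. This is only a rough sketch (the synthetic document and the ps-based measurement are illustrative, and freed pages are not always returned to the operating system immediately):

require 'nokogiri'

# Rough resident set size of the current process, in MB (Linux/macOS)
def rss_mb
  `ps -o rss= -p #{Process.pid}`.to_i / 1024
end

baseline = rss_mb
doc = Nokogiri::HTML("<html><body>#{'<p>data</p>' * 200_000}</body></html>")
puts "With document live:  +#{rss_mb - baseline} MB"

doc = nil  # drop the only Ruby reference to the document
GC.start   # the libxml2 tree is freed once the wrapper object is collected
puts "After dropping + GC: +#{rss_mb - baseline} MB"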

Automatic Cleanup vs Manual Cleanup

By default, Nokogiri documents are automatically cleaned up when they go out of scope and Ruby's garbage collector runs:

def parse_document
  doc = Nokogiri::HTML(html_content)
  # Process the document
  extracted_data = doc.css('div.content').text
  return extracted_data
  # doc goes out of scope here and will be garbage collected
end

However, for better control over memory usage, especially when processing large documents or multiple documents in succession, manual cleanup is recommended.

Explicit Document Cleanup Methods

Using the remove Method

The most direct cleanup step is to call the remove method (an alias for unlink) and then drop your reference to the document so the garbage collector can reclaim the underlying libxml2 memory:

require 'nokogiri'

# Parse a document (the block form closes the file handle as soon as parsing finishes)
doc = File.open('large_file.html') { |f| Nokogiri::HTML(f) }

# Process the document
titles = doc.css('h1, h2, h3').map(&:text)

# Explicitly remove the document and drop the reference to it
doc.remove
doc = nil

# The underlying libxml2 memory is released once the unreferenced
# document is garbage collected
puts titles

Cleaning Up Node Collections

When working with node collections, you can also clean up individual nodes:

doc = Nokogiri::HTML(html_content)
nodes = doc.css('div.large-content')

nodes.each do |node|
  # Process the node
  process_node(node)
  # Clean up the individual node
  node.remove
end

# Clean up the entire document
doc.remove

Memory Management Best Practices

Use Blocks for Automatic Cleanup

One of the most effective patterns for ensuring cleanup is to use blocks that automatically handle resource management:

def with_nokogiri_document(html_content)
  doc = Nokogiri::HTML(html_content)
  begin
    yield(doc)
  ensure
    doc.remove if doc
  end
end

# Usage
data = with_nokogiri_document(html_content) do |doc|
  doc.css('table tr').map do |row|
    row.css('td').map(&:text)
  end
end
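
Return Plain Data, Not Nodes

One related point worth keeping in mind: every Nokogiri::XML::Node holds a reference to its parent document, so returning nodes from a method keeps the whole document alive no matter what cleanup you attempt. Extract plain strings or hashes before the document goes out of scope. A small sketch of the difference (extract_rows_bad and extract_rows_good are illustrative names):

# ❌ Retains the entire document: each returned node references its document
def extract_rows_bad(html_content)
  doc = Nokogiri::HTML(html_content)
  doc.css('table tr')
end

# ✅ Returns plain strings; the document can be collected afterwards
def extract_rows_good(html_content)
  doc = Nokogiri::HTML(html_content)
  rows = doc.css('table tr').map { |row| row.css('td').map(&:text) }
  doc.remove
  rows
end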

Process Documents in Batches

When processing multiple documents, clean up each document before moving to the next:

def process_multiple_files(file_paths)
  results = []

  file_paths.each do |path|
    doc = nil
    begin
      doc = File.open(path) { |f| Nokogiri::HTML(f) } # block form closes the file handle

      # Extract data
      data = {
        title: doc.at_css('title')&.text,
        links: doc.css('a').map { |link| link['href'] }
      }

      results << data

    ensure
      # Always clean up, even if an error occurs
      doc&.remove
    end
  end

  results
end
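
A usage sketch, assuming a directory of previously saved pages (the glob pattern is illustrative):

results = process_multiple_files(Dir.glob('saved_pages/*.html'))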

Monitoring Memory Usage

You can monitor memory usage to verify that cleanup is working effectively:

require 'nokogiri'

def measure_memory
  GC.start
  # Resident set size of the current process, in MB. Cumulative GC.stat counters
  # such as total_allocated_objects never decrease, so they cannot show freed memory.
  `ps -o rss= -p #{Process.pid}`.to_i / 1024
end

puts "Initial memory: #{measure_memory} MB"

# Process documents with cleanup
100.times do |i|
  doc = Nokogiri::HTML("<html><body>#{'x' * 10000}</body></html>")
  # Process document
  doc.remove

  if i % 10 == 0
    puts "After #{i} documents: #{measure_memory} MB"
  end
end

Advanced Cleanup Techniques

Custom Cleanup Classes

For complex applications, consider creating wrapper classes that handle cleanup automatically:

class NokogiriProcessor
  def initialize(html_content)
    @doc = Nokogiri::HTML(html_content)
  end

  def extract_data
    {
      title: @doc.at_css('title')&.text,
      headings: @doc.css('h1, h2, h3').map(&:text),
      links: @doc.css('a[href]').map { |a| a['href'] }
    }
  end

  def cleanup
    @doc&.remove
    @doc = nil
  end

  # Note: Ruby does not call this method automatically on garbage collection;
  # a real GC hook would require ObjectSpace.define_finalizer. Cleanup is
  # instead guaranteed by the ensure block in self.process below.
  def finalize
    cleanup
  end

  private_class_method :new

  def self.process(html_content, &block)
    processor = new(html_content)
    begin
      block.call(processor)
    ensure
      processor.cleanup
    end
  end
end

# Usage
data = NokogiriProcessor.process(html_content) do |processor|
  processor.extract_data
end

SAX Parser for Large Documents

For extremely large XML documents, consider using Nokogiri's SAX parser, which doesn't load the entire document into memory:

class DataExtractor < Nokogiri::XML::SAX::Document
  attr_reader :extracted_data

  def initialize
    @extracted_data = []
    @current_element = nil
  end

  def start_element(name, attributes = [])
    @current_element = name
  end

  def characters(string)
    if @current_element == 'title'
      @extracted_data << string.strip
    end
  end

  def end_element(name)
    @current_element = nil
  end
end

# Process a large XML file without loading the entire document into memory
parser = Nokogiri::XML::SAX::Parser.new(DataExtractor.new)
File.open('very_large_file.xml') { |f| parser.parse(f) } # block form closes the file
data = parser.document.extracted_data
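
If callback-style SAX handlers feel awkward, Nokogiri also provides a pull parser, Nokogiri::XML::Reader, which moves through the document like a cursor without building a full tree. A minimal sketch that collects <title> text from the same hypothetical very_large_file.xml:

require 'nokogiri'

titles = []

# Stream the file node by node instead of building a full DOM tree
File.open('very_large_file.xml') do |io|
  Nokogiri::XML::Reader(io).each do |node|
    if node.name == 'title' && node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
      # inner_xml materializes only this element's content, not the whole file
      titles << node.inner_xml.strip
    end
  end
end

puts titles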

Integration with Web Scraping Workflows

When integrating Nokogiri cleanup with web scraping workflows, consider the interaction with HTTP clients and other resources. While tools like Puppeteer offer sophisticated browser session management, Nokogiri's cleanup is primarily about memory management rather than session handling.
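
For example, if you reuse a single keep-alive connection for several pages, Net::HTTP.start with a block closes the connection when the block exits, while the familiar ensure pattern covers each parsed document. A minimal sketch (the host, paths, and helper name are illustrative):

require 'net/http'
require 'nokogiri'

# Hypothetical helper: fetch several paths over one keep-alive connection
# and return their <title> texts, cleaning up each document as we go
def titles_from_paths(host, paths)
  Net::HTTP.start(host, 443, use_ssl: true) do |http|
    paths.map do |path|
      doc = nil
      begin
        response = http.get(path)
        doc = Nokogiri::HTML(response.body)
        doc.at_css('title')&.text
      ensure
        doc&.remove
      end
    end
  end
  # Net::HTTP.start closes the connection automatically when its block returns
end

# titles_from_paths('example.com', ['/', '/about'])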

Cleanup in Web Scraping Loops

require 'net/http'
require 'nokogiri'

def scrape_multiple_pages(urls)
  results = []

  urls.each do |url|
    doc = nil

    begin
      # Fetch the page
      response = Net::HTTP.get_response(URI(url))

      if response.code == '200'
        doc = Nokogiri::HTML(response.body)

        # Extract data
        page_data = {
          url: url,
          title: doc.at_css('title')&.text,
          description: doc.at_css('meta[name="description"]')&.[]('content')
        }

        results << page_data
      end

    rescue StandardError => e
      puts "Error processing #{url}: #{e.message}"

    ensure
      # Always cleanup the document
      doc&.remove

      # Optional: Force garbage collection periodically
      GC.start if results.length % 50 == 0
    end
  end

  results
end

Common Pitfalls and Solutions

Memory Leaks in Long-Running Processes

In long-running processes like web servers or background jobs, failing to clean up Nokogiri documents can lead to memory bloat:

# ❌ Bad: No cleanup in background job
class DataProcessingJob
  def perform(html_content)
    doc = Nokogiri::HTML(html_content)
    # Process data...
    # Document never gets cleaned up explicitly
  end
end

# ✅ Good: Explicit cleanup
class DataProcessingJob
  def perform(html_content)
    doc = nil

    begin
      doc = Nokogiri::HTML(html_content)
      # Process data...

    ensure
      doc&.remove
    end
  end
end

Error Handling During Cleanup

Ensure cleanup happens even when errors occur during document processing:

def safe_nokogiri_processing(html_content)
  doc = Nokogiri::HTML(html_content)

  begin
    # Risky operations that might raise exceptions
    complex_data_extraction(doc)

  rescue StandardError => e
    logger.error("Processing failed: #{e.message}")
    raise e

  ensure
    # This runs regardless of success or failure
    doc&.remove
  end
end

Performance Testing and Monitoring

To verify that your cleanup strategy is effective, implement monitoring:

class MemoryMonitor
  def self.measure_nokogiri_impact
    initial_memory = memory_usage

    yield

    final_memory = memory_usage
    memory_diff = final_memory - initial_memory

    puts "Memory change: #{memory_diff} MB"
    memory_diff
  end

  def self.memory_usage
    `ps -o rss= -p #{Process.pid}`.to_i / 1024.0
  end
  # `private` does not apply to singleton methods, so hide the helper explicitly
  private_class_method :memory_usage
end

# Test cleanup effectiveness
MemoryMonitor.measure_nokogiri_impact do
  1000.times do
    doc = Nokogiri::HTML("<html><body>#{'x' * 1000}</body></html>")
    doc.css('body').text
    doc.remove
  end
end

Conclusion

Proper cleanup of Nokogiri documents is essential for maintaining predictable memory usage in Ruby applications. While Ruby's garbage collector will eventually reclaim Nokogiri objects, explicit cleanup with remove, combined with dropping references as soon as you are done with a document, prevents large parse trees from lingering in memory.

Key recommendations include:

  1. Use explicit cleanup with doc.remove, and drop document references promptly, in production applications
  2. Implement cleanup in ensure blocks to handle errors gracefully
  3. Use wrapper patterns for consistent cleanup across your application
  4. Monitor memory usage to verify cleanup effectiveness
  5. Consider SAX parsing for extremely large documents

By following these practices, you'll ensure that your Nokogiri-based applications remain memory-efficient and performant, even when processing large volumes of HTML or XML data.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
