Threading Considerations When Using Nokogiri

When building high-performance web scraping applications in Ruby, understanding how Nokogiri behaves in multi-threaded environments is crucial for both performance and stability. Nokogiri, being a wrapper around native C libraries (libxml2 and libxslt), has specific threading characteristics that developers need to consider.

Nokogiri's Thread Safety Model

Nokogiri documents and nodes are not thread-safe. This means that sharing Nokogiri objects between threads without proper synchronization can lead to crashes, memory corruption, or unpredictable behavior. However, the Nokogiri parsing methods themselves are generally thread-safe, allowing multiple threads to parse different documents simultaneously.
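
If a single document absolutely must be shared between threads, the only safe option is to serialize every access through one Mutex. A minimal sketch (not a pattern to prefer over per-thread documents):

require 'nokogiri'

doc  = Nokogiri::HTML('<html><body><a href="/a">A</a></body></html>')
lock = Mutex.new

threads = 4.times.map do
  Thread.new do
    # Every read of the shared document goes through the same lock
    lock.synchronize { doc.css('a').map { |a| a['href'] } }
  end
end

threads.each(&:join)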

Key Threading Rules

  1. Never share documents between threads - Each thread should have its own Nokogiri document instances
  2. Parser methods are thread-safe - Multiple threads can call Nokogiri::HTML() or Nokogiri::XML() concurrently
  3. Memory management requires attention - Documents are reclaimed by Ruby's garbage collector; drop references promptly in threaded environments

Thread-Safe Parsing Patterns

Basic Multi-threaded Parsing

Here's a safe approach to parsing multiple HTML documents in parallel:

require 'nokogiri'
require 'net/http'
require 'uri'

urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
]

# Thread-safe parsing with individual documents per thread
threads = []
results = Queue.new

urls.each do |url|
  threads << Thread.new do
    begin
      # Each thread gets its own HTTP connection and document
      uri = URI(url)
      response = Net::HTTP.get_response(uri)

      # Parse in the current thread - this is thread-safe
      doc = Nokogiri::HTML(response.body)

      # Extract data using thread-local document
      title = doc.at_css('title')&.text
      links = doc.css('a').map { |link| link['href'] }

      results << {
        url: url,
        title: title,
        links: links,
        thread_id: Thread.current.object_id
      }
    rescue => e
      # Record the failure so result counts stay consistent
      results << { url: url, error: e.message }
    end
  end
end

# Wait for all threads to complete
threads.each(&:join)

# Collect results
parsed_results = []
until results.empty?
  parsed_results << results.pop
end

puts "Parsed #{parsed_results.length} pages across #{threads.length} threads"

Thread Pool Pattern

For better resource management, use a thread pool approach:

require 'nokogiri'
require 'net/http'
require 'uri'
require 'concurrent' # the concurrent-ruby gem is loaded via 'concurrent'

class NokogiriThreadPool
  def initialize(pool_size: 4)
    @executor = Concurrent::ThreadPoolExecutor.new(
      min_threads: 1,
      max_threads: pool_size,
      max_queue: 0,
      fallback_policy: :caller_runs
    )
  end

  def parse_urls(urls)
    futures = urls.map do |url|
      Concurrent::Future.execute(executor: @executor) do
        parse_single_url(url)
      end
    end

    # Wait for all futures to complete and collect results
    futures.map(&:value)
  end

  def shutdown
    @executor.shutdown
    @executor.wait_for_termination(30)
  end

  private

  def parse_single_url(url)
    uri = URI(url)
    response = Net::HTTP.get_response(uri)

    # Each thread gets its own document instance
    doc = Nokogiri::HTML(response.body)

    {
      url: url,
      title: doc.at_css('title')&.text,
      meta_description: doc.at_css('meta[name="description"]')&.[]('content'),
      headings: doc.css('h1, h2, h3').map(&:text),
      processed_at: Time.now,
      thread_id: Thread.current.object_id
    }
  rescue => e
    { url: url, error: e.message }
  end
end

# Usage
pool = NokogiriThreadPool.new(pool_size: 8)
urls = ['https://example.com'] * 20

results = pool.parse_urls(urls)
pool.shutdown

puts "Successfully parsed #{results.count { |r| !r[:error] }} URLs"

Memory Management in Threaded Environments

Document Lifecycle Management

Proper memory management becomes critical in threaded applications where multiple documents are created and destroyed:

class ThreadSafeParser
  def self.parse_with_cleanup(html_content)
    doc = Nokogiri::HTML(html_content)

    # Perform parsing operations
    extract_data(doc)
  ensure
    # There is no manual free for Nokogiri documents: they are reclaimed
    # by Ruby's GC once unreferenced, so just drop the local reference
    doc = nil

    # Periodically trigger GC in long-running threads
    GC.start if rand(100) < 5
  end

  def self.extract_data(doc)
    {
      title: doc.at_css('title')&.text,
      paragraphs: doc.css('p').map(&:text),
      images: doc.css('img').map { |img| img['src'] }
    }
  end
  private_class_method :extract_data # a bare `private` does not apply to class methods
end
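
A usage sketch, assuming html_content holds the HTML to parse:

data = ThreadSafeParser.parse_with_cleanup(html_content)
puts data[:title]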

Monitoring Memory Usage

For production applications, implement memory monitoring:

require 'get_process_mem'

class MemoryAwareParser
  def initialize(memory_limit_mb: 512)
    @memory_limit = memory_limit_mb * 1024 * 1024 # Convert to bytes
    @mem_monitor = GetProcessMem.new
  end

  def parse_safely(html_content)
    check_memory_usage

    doc = Nokogiri::HTML(html_content)
    result = yield(doc) if block_given?

    # Drop the reference; the GC reclaims the document
    doc = nil
    result
  end

  private

  def check_memory_usage
    current_memory = @mem_monitor.bytes

    if current_memory > @memory_limit
      GC.start

      # Re-check with a fresh reading after GC
      current_memory = @mem_monitor.bytes
      if current_memory > @memory_limit
        raise "Memory limit exceeded: #{(current_memory / 1024 / 1024).to_i}MB"
      end
    end
  end
end
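
A usage sketch, passing the extraction logic as a block (html_content is assumed to hold the HTML):

parser = MemoryAwareParser.new(memory_limit_mb: 256)
title = parser.parse_safely(html_content) { |doc| doc.at_css('title')&.text }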

Common Threading Pitfalls

Sharing Documents Between Threads (DON'T DO THIS)

# DANGEROUS - Never do this!
doc = Nokogiri::HTML(html_content)

threads = []
10.times do |i|
  threads << Thread.new do
    # This will cause crashes and corruption
    title = doc.at_css('title')&.text  # UNSAFE!
    links = doc.css('a')               # UNSAFE!
  end
end

threads.each(&:join)

Proper Thread Isolation

# SAFE - Each thread gets its own document
html_content = fetch_html_content()

threads = []
10.times do |i|
  threads << Thread.new do
    # Parse in each thread - this is safe
    local_doc = Nokogiri::HTML(html_content)

    title = local_doc.at_css('title')&.text  # SAFE
    links = local_doc.css('a')               # SAFE

    # No manual free is needed; the document is GC'd when the thread exits
  end
end

threads.each(&:join)

Performance Optimization Strategies

Connection Pooling

When scraping multiple pages, combine threading with connection pooling for optimal performance:

require 'nokogiri'
require 'net/http/persistent'
require 'concurrent'
require 'uri'

class OptimizedScraper
  def initialize(thread_count: 4)
    @thread_count = thread_count
    @http_pool = Concurrent::Hash.new do |hash, key|
      hash[key] = Net::HTTP::Persistent.new(name: "scraper_#{key}")
    end
  end

  def scrape_urls(urls)
    url_queue = Queue.new
    urls.each { |url| url_queue << url }

    results = Queue.new
    threads = []

    @thread_count.times do |thread_id|
      threads << Thread.new do
        http_client = @http_pool[thread_id]

        until url_queue.empty?
          begin
            url = url_queue.pop(true) # Non-blocking pop

            uri = URI(url)
            response = http_client.request(uri)

            # Each thread gets its own document
            doc = Nokogiri::HTML(response.body)

            result = process_document(doc, url)
            results << result
            # No manual free needed: the document is GC'd once `doc` goes out of scope
          rescue ThreadError
            # Queue is empty
            break
          rescue => e
            results << { url: url, error: e.message }
          end
        end
      end
    end

    threads.each(&:join)

    # Collect all results
    collected_results = []
    until results.empty?
      collected_results << results.pop
    end

    collected_results
  end

  private

  def process_document(doc, url)
    {
      url: url,
      title: doc.at_css('title')&.text,
      word_count: doc.text.split.length,
      link_count: doc.css('a').length
    }
  end
end
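
A usage sketch:

scraper = OptimizedScraper.new(thread_count: 6)
summaries = scraper.scrape_urls(['https://example.com/a', 'https://example.com/b'])
summaries.each do |s|
  puts s[:error] ? "#{s[:url]} failed: #{s[:error]}" : "#{s[:url]}: #{s[:title]}"
end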

Integration with Web Scraping Workflows

For complex scraping operations that require JavaScript rendering, you may need to combine Nokogiri with a headless browser. Tools like Puppeteer can run multiple pages in parallel to handle JavaScript-heavy sites, while Nokogiri remains excellent for parsing the resulting HTML in a thread-safe manner.

When dealing with dynamic content, you might also want to implement proper browser session management before passing the rendered HTML to Nokogiri for parsing, as in the sketch below.
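
A minimal sketch, assuming a hypothetical render_with_browser helper (it might wrap Ferrum, Selenium, or a remote rendering API) that returns the fully rendered HTML; the per-thread Nokogiri pattern stays the same:

require 'nokogiri'

# Hypothetical helper: drive a headless browser session here and
# return the rendered page HTML. Stubbed so the sketch runs standalone.
def render_with_browser(url)
  "<html><body><h1>Rendered #{url}</h1></body></html>"
end

urls = ['https://example.com/app1', 'https://example.com/app2']

threads = urls.map do |url|
  Thread.new do
    html = render_with_browser(url)  # browser handles the JavaScript
    doc  = Nokogiri::HTML(html)      # per-thread document, as above
    doc.css('h1').map(&:text)
  end
end

headings = threads.flat_map(&:value)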

Best Practices Summary

  1. Always create separate document instances per thread - Never share Nokogiri documents between threads
  2. Use thread pools - Limit the number of concurrent threads to prevent resource exhaustion
  3. Implement proper cleanup - Drop document references when done so the GC can reclaim them, and consider periodic garbage collection
  4. Monitor memory usage - Set limits and monitor memory consumption in long-running applications
  5. Handle errors gracefully - Wrap parsing operations in begin/rescue blocks
  6. Consider connection pooling - Reuse HTTP connections when scraping multiple URLs
  7. Test under load - Always test multi-threaded scraping code under realistic load conditions (see the sketch below)
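
For the last point, a minimal load-test sketch: parse a fixture document repeatedly across several threads and print the process RSS afterwards (assumes the get_process_mem gem shown earlier):

require 'nokogiri'
require 'get_process_mem'

fixture = '<html><title>fixture</title><body>' + ('<p>x</p>' * 1_000) + '</body></html>'

threads = 8.times.map do
  Thread.new do
    100.times { Nokogiri::HTML(fixture).css('p').size }
  end
end

threads.each(&:join)
puts "RSS after load test: #{GetProcessMem.new.mb.round(1)}MB"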

Debugging Threading Issues

When troubleshooting threading problems with Nokogiri:

require 'concurrent'

# Fail fast: surface exceptions raised inside any thread
Thread.abort_on_exception = true

# Add thread identification to logs
def log_with_thread(message)
  puts "[Thread #{Thread.current.object_id}] #{message}"
end

# Monitor document creation/destruction
class DocumentTracker
  @@created = Concurrent::AtomicFixnum.new(0)
  @@destroyed = Concurrent::AtomicFixnum.new(0)

  def self.track_creation
    @@created.increment
    puts "Documents created: #{@@created.value}"
  end

  def self.track_destruction
    @@destroyed.increment
    puts "Documents destroyed: #{@@destroyed.value}"
  end

  def self.stats
    puts "Created: #{@@created.value}, Destroyed: #{@@destroyed.value}"
  end
end

Conclusion

Threading with Nokogiri requires careful attention to document isolation and memory management. By following these patterns and best practices, you can build robust, high-performance web scraping applications that scale effectively across multiple threads while maintaining stability and memory efficiency. Remember that the key to success is never sharing Nokogiri objects between threads and always implementing proper cleanup procedures.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"


Get Started Now

WebScraping.AI provides rotating proxies, Chromium rendering and built-in HTML parser for web scraping