Threading Considerations When Using Nokogiri
When building high-performance web scraping applications in Ruby, understanding how Nokogiri behaves in multi-threaded environments is crucial for both performance and stability. Nokogiri, being a wrapper around native C libraries (libxml2 and libxslt), has specific threading characteristics that developers need to consider.
Nokogiri's Thread Safety Model
Nokogiri documents and nodes are not thread-safe. Sharing Nokogiri objects between threads without proper synchronization can lead to crashes, memory corruption, or unpredictable behavior. However, the Nokogiri parsing methods themselves are generally thread-safe, allowing multiple threads to parse different documents simultaneously. Keep in mind, too, that under CRuby the global VM lock (GVL) serializes Ruby execution, so the practical gain from threading a scraper comes mostly from overlapping network I/O rather than from parallel parsing.
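If a shared document is truly unavoidable, every access must be serialized through a lock. Below is a minimal sketch of that approach using a plain Mutex; the SynchronizedDocument wrapper is illustrative, not part of Nokogiri, and per-thread documents (shown throughout this article) are usually the better choice:
require 'nokogiri'

# Illustrative wrapper that serializes all access to one shared document
class SynchronizedDocument
  def initialize(html)
    @doc = Nokogiri::HTML(html)
    @lock = Mutex.new
  end

  # All reads and writes go through the lock, one thread at a time
  def with_doc
    @lock.synchronize { yield @doc }
  end
end

shared = SynchronizedDocument.new('<html><title>Demo</title></html>')
threads = 4.times.map do
  Thread.new do
    shared.with_doc { |doc| doc.at_css('title')&.text }
  end
end
threads.each(&:join)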
Key Threading Rules
- Never share documents between threads - Each thread should have its own Nokogiri document instances
- Parser methods are thread-safe - Multiple threads can call Nokogiri::HTML() or Nokogiri::XML() concurrently
- Memory management requires attention - Proper cleanup is essential in threaded environments
Thread-Safe Parsing Patterns
Basic Multi-threaded Parsing
Here's a safe approach to parsing multiple HTML documents in parallel:
require 'nokogiri'
require 'net/http' # also pulls in URI; require 'thread' is no longer needed for Queue

urls = [
  'https://example.com/page1',
  'https://example.com/page2',
  'https://example.com/page3'
]

# Thread-safe parsing with individual documents per thread
threads = []
results = Queue.new

urls.each do |url|
  threads << Thread.new do
    begin
      # Each thread gets its own HTTP connection and document
      uri = URI(url)
      response = Net::HTTP.get_response(uri)

      # Parse in the current thread - this is thread-safe
      doc = Nokogiri::HTML(response.body)

      # Extract data using the thread-local document
      title = doc.at_css('title')&.text
      links = doc.css('a').map { |link| link['href'] }

      results << {
        url: url,
        title: title,
        links: links,
        thread_id: Thread.current.object_id
      }
    rescue => e
      puts "Error processing #{url}: #{e.message}"
    end
  end
end

# Wait for all threads to complete
threads.each(&:join)

# Collect results
parsed_results = []
until results.empty?
  parsed_results << results.pop
end

puts "Parsed #{parsed_results.length} pages across #{threads.length} threads"
Thread Pool Pattern
For better resource management, use a thread pool approach:
require 'nokogiri'
require 'net/http'
require 'concurrent' # the concurrent-ruby gem is required as 'concurrent'

class NokogiriThreadPool
  def initialize(pool_size: 4)
    @executor = Concurrent::ThreadPoolExecutor.new(
      min_threads: 1,
      max_threads: pool_size,
      max_queue: 0, # 0 = unbounded work queue
      fallback_policy: :caller_runs
    )
  end

  def parse_urls(urls)
    futures = urls.map do |url|
      Concurrent::Future.execute(executor: @executor) do
        parse_single_url(url)
      end
    end
    # Wait for all futures to complete and collect results
    futures.map(&:value)
  end

  # Must stay public so callers can shut the pool down
  def shutdown
    @executor.shutdown
    @executor.wait_for_termination(30)
  end

  private

  def parse_single_url(url)
    uri = URI(url)
    response = Net::HTTP.get_response(uri)
    # Each thread gets its own document instance
    doc = Nokogiri::HTML(response.body)
    {
      url: url,
      title: doc.at_css('title')&.text,
      meta_description: doc.at_css('meta[name="description"]')&.[]('content'),
      headings: doc.css('h1, h2, h3').map(&:text),
      processed_at: Time.now,
      thread_id: Thread.current.object_id
    }
  rescue => e
    { url: url, error: e.message }
  end
end

# Usage
pool = NokogiriThreadPool.new(pool_size: 8)
urls = ['https://example.com'] * 20
results = pool.parse_urls(urls)
pool.shutdown
puts "Successfully parsed #{results.count { |r| !r[:error] }} URLs"
Memory Management in Threaded Environments
Document Lifecycle Management
Proper memory management becomes critical in threaded applications where multiple documents are created and destroyed:
class ThreadSafeParser
  def self.parse_with_cleanup(html_content)
    doc = Nokogiri::HTML(html_content)
    extract_data(doc)
  ensure
    # There is no manual free in Nokogiri's public API: the libxml2 memory
    # behind a document is reclaimed by Ruby's GC once the document becomes
    # unreachable, which happens when this method returns.
    # Optionally nudge the GC from time to time in long-running threads
    GC.start if rand(100) < 5
  end

  def self.extract_data(doc)
    {
      title: doc.at_css('title')&.text,
      paragraphs: doc.css('p').map(&:text),
      images: doc.css('img').map { |img| img['src'] }
    }
  end
  # `private` does not apply to singleton methods; use private_class_method
  private_class_method :extract_data
end
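A quick usage sketch (the HTML string here is illustrative):
data = ThreadSafeParser.parse_with_cleanup('<html><title>Demo</title><p>Hello</p></html>')
puts data[:title] # => "Demo"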
Monitoring Memory Usage
For production applications, implement memory monitoring:
require 'nokogiri'
require 'get_process_mem'

class MemoryAwareParser
  def initialize(memory_limit_mb: 512)
    @memory_limit = memory_limit_mb * 1024 * 1024 # convert to bytes
    @mem_monitor = GetProcessMem.new
  end

  def parse_safely(html_content)
    check_memory_usage
    doc = Nokogiri::HTML(html_content)
    # The document is reclaimed by the GC after it falls out of scope;
    # return whatever the caller's block extracted from it
    yield(doc) if block_given?
  end

  private

  def check_memory_usage
    return if @mem_monitor.bytes <= @memory_limit
    GC.start
    # Re-check after GC and only fail if we are still over the limit
    current_memory = @mem_monitor.bytes
    if current_memory > @memory_limit
      raise "Memory limit exceeded: #{(current_memory / 1024 / 1024).to_i}MB"
    end
  end
end
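Usage is block-based, so extracted data survives after the document itself goes out of scope (the limit and HTML below are illustrative):
parser = MemoryAwareParser.new(memory_limit_mb: 256)
data = parser.parse_safely('<html><title>Hi</title></html>') do |doc|
  { title: doc.at_css('title')&.text }
end
puts data[:title] # => "Hi"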
Common Threading Pitfalls
Sharing Documents Between Threads (DON'T DO THIS)
# DANGEROUS - never do this!
doc = Nokogiri::HTML(html_content)
threads = []

10.times do
  threads << Thread.new do
    # Concurrent access to a shared document can crash or corrupt memory
    title = doc.at_css('title')&.text # UNSAFE!
    links = doc.css('a')              # UNSAFE!
  end
end
threads.each(&:join)
Proper Thread Isolation
# SAFE - each thread gets its own document
html_content = fetch_html_content() # assume this returns an HTML string
threads = []

10.times do
  threads << Thread.new do
    # Parse in each thread - this is safe
    local_doc = Nokogiri::HTML(html_content)
    title = local_doc.at_css('title')&.text # SAFE
    links = local_doc.css('a')              # SAFE
    # No manual cleanup needed: the document is GC'd once it goes out of scope
  end
end
threads.each(&:join)
Performance Optimization Strategies
Connection Pooling
When scraping multiple pages, combine threading with connection pooling for optimal performance:
require 'nokogiri'
require 'net/http/persistent'
require 'concurrent' # for Concurrent::Hash

class OptimizedScraper
  def initialize(thread_count: 4)
    @thread_count = thread_count
    # Lazily create one persistent HTTP client per worker thread
    @http_pool = Concurrent::Hash.new do |hash, key|
      hash[key] = Net::HTTP::Persistent.new(name: "scraper_#{key}")
    end
  end

  def scrape_urls(urls)
    url_queue = Queue.new
    urls.each { |url| url_queue << url }
    results = Queue.new
    threads = []

    @thread_count.times do |thread_id|
      threads << Thread.new do
        http_client = @http_pool[thread_id]
        until url_queue.empty?
          begin
            url = url_queue.pop(true) # non-blocking pop
            uri = URI(url)
            response = http_client.request(uri)
            # Each thread gets its own document
            doc = Nokogiri::HTML(response.body)
            results << process_document(doc, url)
          rescue ThreadError
            break # queue is empty
          rescue => e
            results << { url: url, error: e.message }
          end
        end
      end
    end
    threads.each(&:join)

    # Collect all results
    collected_results = []
    until results.empty?
      collected_results << results.pop
    end
    collected_results
  end

  private

  def process_document(doc, url)
    {
      url: url,
      title: doc.at_css('title')&.text,
      word_count: doc.text.split.length,
      link_count: doc.css('a').length
    }
  end
end
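A usage sketch (the URLs are placeholders):
scraper = OptimizedScraper.new(thread_count: 4)
results = scraper.scrape_urls([
  'https://example.com/page1',
  'https://example.com/page2'
])
results.each do |r|
  puts r[:error] ? "#{r[:url]} failed: #{r[:error]}" : "#{r[:url]} -> #{r[:title]}"
end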
Integration with Web Scraping Workflows
For complex scraping operations that require JavaScript rendering, you may need to pair Nokogiri with a headless browser. Tools such as Puppeteer can run multiple pages in parallel to handle JavaScript-heavy sites, while Nokogiri remains excellent for parsing the rendered HTML in a thread-safe manner.
When dealing with dynamic content, you might also want to consider implementing proper browser session management before passing the rendered HTML to Nokogiri for parsing.
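As a sketch of that hand-off in pure Ruby, the example below uses the Ferrum gem (a headless-Chrome driver, chosen here for illustration since Puppeteer itself is a Node.js tool) to render pages in worker threads and pass the resulting HTML to thread-local Nokogiri documents:
require 'nokogiri'
require 'ferrum'

urls = ['https://example.com/a', 'https://example.com/b'] # placeholders

threads = urls.map do |url|
  Thread.new do
    # One browser (and one Nokogiri document) per thread
    browser = Ferrum::Browser.new(headless: true)
    begin
      browser.go_to(url)
      html = browser.body # HTML after JavaScript has run
      doc = Nokogiri::HTML(html)
      { url: url, title: doc.at_css('title')&.text }
    rescue => e
      { url: url, error: e.message }
    ensure
      browser.quit
    end
  end
end

results = threads.map(&:value)
Spawning a browser per thread is heavyweight; for real workloads you would typically share one browser and create a page per thread, but that is beyond this sketch.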
Best Practices Summary
- Always create separate document instances per thread - Never share Nokogiri documents between threads
- Use thread pools - Limit the number of concurrent threads to prevent resource exhaustion
- Implement proper cleanup - Drop document references as soon as you are done so the GC can reclaim the underlying memory, and consider periodic garbage collection in long-running processes
- Monitor memory usage - Set limits and monitor memory consumption in long-running applications
- Handle errors gracefully - Wrap parsing operations in begin/rescue blocks
- Consider connection pooling - Reuse HTTP connections when scraping multiple URLs
- Test under load - Always test multi-threaded scraping code under realistic load conditions, as sketched below
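For that last point, even a crude harness helps catch isolation bugs before production. Here is a minimal sketch that hammers the parser from many threads over many iterations; the thread and iteration counts are arbitrary:
require 'nokogiri'

sample_html = '<html><title>Load test</title><body>' +
              ('<p>paragraph</p>' * 50) + '</body></html>'

errors = Queue.new
threads = 16.times.map do
  Thread.new do
    begin
      500.times do
        doc = Nokogiri::HTML(sample_html) # fresh document every iteration
        raise 'bad parse' unless doc.at_css('title')&.text == 'Load test'
        doc.css('p').each(&:text) # exercise node traversal
      end
    rescue => e
      errors << e
    end
  end
end
threads.each(&:join)

puts errors.empty? ? 'Load test passed' : "#{errors.size} thread(s) failed"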
Debugging Threading Issues
When troubleshooting threading problems with Nokogiri:
require 'concurrent' # for Concurrent::AtomicFixnum below

# Surface unhandled exceptions from any thread instead of losing them silently
Thread.abort_on_exception = true
# Add thread identification to logs
def log_with_thread(message)
  puts "[Thread #{Thread.current.object_id}] #{message}"
end

# Monitor document creation/destruction
class DocumentTracker
  @@created = Concurrent::AtomicFixnum.new(0)
  @@destroyed = Concurrent::AtomicFixnum.new(0)

  def self.track_creation
    @@created.increment
    puts "Documents created: #{@@created.value}"
  end

  def self.track_destruction
    @@destroyed.increment
    puts "Documents destroyed: #{@@destroyed.value}"
  end

  def self.stats
    puts "Created: #{@@created.value}, Destroyed: #{@@destroyed.value}"
  end
end
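One way to wire the tracker in is to wrap parsing so every create/release is counted; tracked_parse below is a hypothetical helper, not a Nokogiri API:
def tracked_parse(html)
  doc = Nokogiri::HTML(html)
  DocumentTracker.track_creation
  yield doc
ensure
  # The document becomes unreachable after the block; count it as released
  DocumentTracker.track_destruction if doc
end

tracked_parse('<html><title>t</title></html>') { |doc| doc.at_css('title')&.text }
DocumentTracker.stats # prints "Created: 1, Destroyed: 1"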
Conclusion
Threading with Nokogiri requires careful attention to document isolation and memory management. By following these patterns and best practices, you can build robust, high-performance web scraping applications that scale across multiple threads while remaining stable and memory-efficient. The keys are simple: never share Nokogiri objects between threads, and keep document lifetimes short so the garbage collector can reclaim them.