What are the Security Considerations When Using Nokogiri with Untrusted HTML?
When working with Nokogiri to parse HTML from untrusted sources, security should be your top priority. Nokogiri, while powerful and widely used in Ruby applications, can expose your application to various security vulnerabilities if not configured properly. This comprehensive guide covers the essential security considerations and best practices for safely parsing untrusted HTML content.
Understanding the Security Landscape
Parsing untrusted HTML introduces several attack vectors that malicious actors can exploit. The primary security concerns include XML External Entity (XXE) attacks, script injection, denial of service through malformed markup, and potential memory exhaustion attacks. Understanding these threats is crucial for implementing proper security measures.
XML External Entity (XXE) Attacks
XXE attacks are among the most serious vulnerabilities in XML processing, and the same parser options are exposed when Nokogiri parses HTML. These attacks occur when a parser resolves external entity references, potentially allowing attackers to read local files, trigger network requests, or cause denial of service.
Disabling External Entity Processing
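The most effective defense against XXE attacks is to keep entity substitution turned off and to block network access. Note that, despite its name, config.noent turns entity substitution on (it sets libxml2's NOENT flag), so it must not be used on untrusted input:
require 'nokogiri'
# Secure configuration - no entity substitution, no network access
doc = Nokogiri::HTML::Document.parse(untrusted_html) do |config|
  config.nonoent  # Do not substitute entities (clears the NOENT flag)
  config.nonet    # Do not allow network connections
  config.noblanks # Remove blank nodes
  config.strict   # Turn off error recovery; malformed markup may raise
end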
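For XML parsing, where XXE is the primary risk, use a similar configuration:
# Secure XML parsing
xml_doc = Nokogiri::XML::Document.parse(untrusted_xml) do |config|
  config.nonoent   # Do not substitute entities
  config.nodtdload # Do not load external DTDs
  config.nonet     # Do not allow network connections
  config.strict    # Raise on malformed XML instead of recovering
end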
Custom Parser Configuration
Create a reusable secure parser configuration:
class SecureNokogiriParser
  def self.parse_html(html_content)
    Nokogiri::HTML::Document.parse(html_content) do |config|
      config.nonoent  # Do not substitute entities
      config.nonet    # Do not allow network connections
      config.noblanks # Remove blank nodes
      config.strict   # Turn off error recovery; malformed markup may raise
    end
  rescue Nokogiri::XML::SyntaxError => e
    Rails.logger.warn "Failed to parse HTML: #{e.message}"
    nil
  end
end
# Usage
doc = SecureNokogiriParser.parse_html(untrusted_html)
Script Injection and XSS Prevention
When extracting content from parsed HTML, be vigilant about potential script injection attacks. Even with secure parsing, the extracted content might contain malicious scripts.
Safe Content Extraction
Always sanitize extracted content before displaying it:
require 'nokogiri'
require 'sanitize'
def extract_safe_content(html)
  doc = Nokogiri::HTML::Document.parse(html) do |config|
    config.nonoent # Do not substitute entities
    config.nonet   # Do not allow network connections
  end
  # Extract text content only (escape it again before inserting it into HTML)
  text_content = doc.text
  # Or use Sanitize to keep an allowlisted subset of the HTML
  html_content = Sanitize.fragment(doc.to_html, Sanitize::Config::RELAXED)
  [text_content, html_content]
end
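Text returned by doc.text has its entities decoded, so an input containing &lt;script&gt; comes back as a literal script tag in the string. A minimal sketch of escaping such text before placing it in a page, using only the standard library; render_extracted_text is a hypothetical helper name, not part of Nokogiri or Sanitize:

```ruby
require 'cgi'

# Hypothetical helper: wraps already-extracted text in a paragraph tag,
# escaping it first so it cannot introduce markup of its own.
def render_extracted_text(text)
  "<p>#{CGI.escapeHTML(text.to_s)}</p>"
end

puts render_extracted_text('<script>alert("XSS")</script> & more')
```

In a Rails view the template engine escapes strings by default, so a helper like this matters mostly when HTML is assembled by hand.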
Removing Dangerous Elements
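Strip potentially dangerous elements and attributes. Hand-written stripping like this is easy to leave incomplete (it does not cover javascript: URLs or embedded frames, for example), so prefer an allowlist-based sanitizer where possible:
def sanitize_html_document(html)
  doc = Nokogiri::HTML::Document.parse(html) do |config|
    config.nonoent # Do not substitute entities
    config.nonet   # Do not allow network connections
  end
  # Remove script tags
  doc.search('script').remove
  # Remove inline event handler attributes (onclick, onload, etc.)
  doc.search('*').each do |element|
    element.attributes.each do |name, attr|
      attr.remove if name.downcase.start_with?('on')
    end
  end
  doc
end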
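Event handler attributes are only one way to smuggle script into markup; URL attributes such as href and src can carry javascript: or data: URLs. The sketch below is one possible allowlist check, written with the standard library only. The name safe_url? and the list of allowed schemes are assumptions made for illustration, and the check is not a replacement for a full sanitizer.

```ruby
# Assumed allowlist; adjust to what the application actually needs.
ALLOWED_SCHEMES = %w[http https mailto].freeze

# Returns true when the value is a relative URL or uses an allowed scheme.
def safe_url?(value)
  # Browsers ignore whitespace and control characters inside a scheme,
  # so strip them before looking for one.
  cleaned = value.to_s.scrub('').gsub(/[\x00-\x20]/, '')
  return false if cleaned.empty?

  scheme = cleaned[/\A([a-zA-Z][a-zA-Z0-9+.\-]*):/, 1]
  return true if scheme.nil? # no scheme: treat as a relative URL

  ALLOWED_SCHEMES.include?(scheme.downcase)
end

puts safe_url?('https://example.com/page')  # allowed scheme
puts safe_url?("java\tscript:alert(1)")     # obfuscated javascript: URL
```

Inside a loop over elements, attributes whose value fails this check could be removed with element.remove_attribute, in the same way the event handlers are removed.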
Memory and Resource Management
Untrusted HTML can be crafted to consume excessive memory or processing time, leading to denial of service attacks.
Implementing Size Limits
Set strict limits on input size:
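require 'nokogiri'
require 'timeout'
class SafeHtmlParser
  # Plain numbers keep the class usable outside Rails
  MAX_HTML_SIZE = 1_048_576 # bytes (1 MiB)
  MAX_PARSE_TIME = 10       # seconds
  def self.parse_with_limits(html_content)
    # Check size limit
    raise SecurityError, "HTML too large" if html_content.bytesize > MAX_HTML_SIZE
    # Parse with timeout. Timeout may not interrupt time spent inside
    # native libxml2 code, so treat it as a best-effort guard.
    Timeout.timeout(MAX_PARSE_TIME) do
      Nokogiri::HTML::Document.parse(html_content) do |config|
        config.nonoent # Do not substitute entities
        config.nonet   # Do not allow network connections
        config.strict  # Turn off error recovery
      end
    end
  rescue Timeout::Error
    raise SecurityError, "HTML parsing timeout"
  end
end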
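A size check on a string only helps once the whole input is already in memory. When the content arrives as a stream (an upload or an HTTP response body, for instance), the limit can be enforced while reading. The following is a rough sketch using only the standard library; read_capped and MAX_BYTES are hypothetical names, and ArgumentError stands in for whatever error class the application prefers.

```ruby
require 'stringio'

MAX_BYTES = 1_048_576 # assumed limit of 1 MiB

# Reads at most limit bytes from io and raises if more data is available.
def read_capped(io, limit = MAX_BYTES)
  # Ask for one byte more than allowed: receiving it proves the input is too big,
  # without ever buffering more than limit + 1 bytes.
  data = io.read(limit + 1) || ''
  raise ArgumentError, "input exceeds #{limit} bytes" if data.bytesize > limit

  data
end

html = read_capped(StringIO.new('<p>small document</p>'))
puts html.bytesize
```

The returned string can then be handed to a parser configured as shown above.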
Monitoring Resource Usage
Implement monitoring for suspicious parsing patterns:
class HtmlParsingMonitor
  def self.monitored_parse(html_content, source: 'unknown')
    start_time = Time.current
    memory_before = get_memory_usage
    doc = Nokogiri::HTML::Document.parse(html_content) do |config|
      config.nonoent # Do not substitute entities
      config.nonet   # Do not allow network connections
    end
    memory_after = get_memory_usage
    parse_time = Time.current - start_time
    # Log suspicious activity
    if parse_time > 5.seconds || (memory_after - memory_before) > 50.megabytes
      Rails.logger.warn "Suspicious HTML parsing: source=#{source}, time=#{parse_time}, memory=#{memory_after - memory_before}"
    end
    doc
  end
  def self.get_memory_usage
    # Resident set size of this process; ps reports it in kilobytes
    `ps -o rss= -p #{Process.pid}`.to_i.kilobytes
  end
  # A bare "private" does not apply to methods defined with "def self."
  private_class_method :get_memory_usage
end
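Wall-clock time can jump when the system clock is adjusted, which distorts short measurements. A small sketch of timing a block with the monotonic clock from the standard library instead; measure is a hypothetical helper name and is independent of Rails:

```ruby
# Runs the block and returns its result together with the elapsed seconds.
def measure
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  [result, elapsed]
end

value, seconds = measure { (1..1000).sum }
puts "result=#{value} elapsed=#{seconds.round(6)}s"
```

The block passed to measure would be the parse call, and the elapsed value can be compared against the alerting threshold.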
Input Validation and Preprocessing
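Validate and preprocess HTML content before parsing to catch obvious malicious patterns. Pattern matching of this kind is a coarse first filter that can produce false positives and can be bypassed, so treat it as one layer of defense rather than a substitute for secure parser options and output sanitization.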
Content Validation
class HtmlValidator
  SUSPICIOUS_PATTERNS = [
    /<!ENTITY/i,                # XML entities
    /<!DOCTYPE[^>]*\[/i,        # DOCTYPE with internal subset (may span lines)
    /SYSTEM\s+["'][^"']*["']/i, # External system references
    /<\?xml/i,                  # XML processing instructions
    /javascript:/i,             # JavaScript URLs
    /data:/i,                   # Data URLs (broad: also matches ordinary text)
    /vbscript:/i                # VBScript URLs
  ].freeze
  def self.validate_html(html_content)
    SUSPICIOUS_PATTERNS.each do |pattern|
      if html_content.match?(pattern)
        raise SecurityError, "Potentially malicious HTML detected: #{pattern.inspect}"
      end
    end
    true
  end
end
# Usage
begin
  HtmlValidator.validate_html(untrusted_html)
  doc = Nokogiri::HTML::Document.parse(untrusted_html) do |config|
    config.nonoent # Do not substitute entities
    config.nonet   # Do not allow network connections
  end
rescue SecurityError => e
  Rails.logger.error "Security violation: #{e.message}"
  doc = nil
end
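How broad a pattern is determines how often it rejects harmless input. The sketch below compares a bare data: pattern with a narrower one that only looks at href and src attribute values. Both constant names and the narrower regular expression are assumptions made for illustration; neither is a complete defense on its own.

```ruby
# Matches the substring anywhere, including inside ordinary prose.
BROAD_DATA_PATTERN = /data:/i

# Hypothetical narrower variant: only data: URLs in href or src attributes.
NARROW_DATA_PATTERN = /(?:href|src)\s*=\s*["']?\s*data:/i

def flagged?(html, pattern)
  html.match?(pattern)
end

benign = '<p>Metadata: updated daily</p>'
risky  = '<a href="data:text/html;base64,PHNjcmlwdD4=">link</a>'

puts "broad  flags benign: #{flagged?(benign, BROAD_DATA_PATTERN)}"
puts "narrow flags benign: #{flagged?(benign, NARROW_DATA_PATTERN)}"
puts "narrow flags risky:  #{flagged?(risky, NARROW_DATA_PATTERN)}"
```

Checking attribute values after parsing, rather than matching raw markup, avoids most of these edge cases because the parser has already resolved quoting and entities.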
Error Handling and Logging
Proper error handling is crucial for security and debugging when dealing with potentially malicious content.
Comprehensive Error Handling
class SecureHtmlProcessor
  def self.process_untrusted_html(html_content, options = {})
    return nil if html_content.blank?
    begin
      # Validate input
      HtmlValidator.validate_html(html_content)
      # Parse securely
      doc = Nokogiri::HTML::Document.parse(html_content) do |config|
        config.nonoent  # Do not substitute entities
        config.nonet    # Do not allow network connections
        config.noblanks # Remove blank nodes
        config.strict   # Turn off error recovery
      end
      # Process and return safe content
      process_document(doc, options)
    rescue Nokogiri::XML::SyntaxError => e
      Rails.logger.info "HTML syntax error: #{e.message}"
      nil
    rescue SecurityError => e
      Rails.logger.error "Security violation in HTML processing: #{e.message}"
      nil
    rescue StandardError => e
      Rails.logger.error "Unexpected error in HTML processing: #{e.class} - #{e.message}"
      nil
    end
  end
  def self.process_document(doc, options)
    # Your document processing logic here
    # Drop script and style elements so their contents do not end up in the text
    doc.search('script, style').remove
    extracted_content = doc.css(options[:selector] || 'body').text
    extracted_content.strip
  end
  # A bare "private" does not apply to methods defined with "def self."
  private_class_method :process_document
end
Production Deployment Considerations
When deploying applications that parse untrusted HTML, additional security measures should be implemented.
Environment Configuration
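Nokogiri does not pick up parse options from Rails configuration, so define the secure defaults once and pass them to every parse call:
# config/initializers/nokogiri.rb
# SAFE_XML_OPTIONS is a hypothetical constant name. It starts from Nokogiri's
# defaults, makes sure entity substitution and DTD loading are off,
# and forbids network access.
SAFE_XML_OPTIONS =
  (Nokogiri::XML::ParseOptions::DEFAULT_XML &
   ~Nokogiri::XML::ParseOptions::NOENT &
   ~Nokogiri::XML::ParseOptions::DTDLOAD) |
  Nokogiri::XML::ParseOptions::NONET
# Usage
doc = Nokogiri::XML::Document.parse(untrusted_xml, nil, nil, SAFE_XML_OPTIONS)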
Rate Limiting and Monitoring
Implement rate limiting for HTML processing endpoints:
# Using rack-attack or similar
class Rack::Attack
  throttle('html_processing', limit: 10, period: 1.minute) do |request|
    request.ip if request.path.start_with?('/api/html/')
  end
end
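To make the idea behind a throttle rule concrete, here is a rough, dependency-free sketch of a fixed-window counter. FixedWindowLimiter is a hypothetical class written for illustration; it is not rack-attack's implementation, it keeps its state in process memory, and it is not thread-safe.

```ruby
# Allows at most `limit` calls per key within each window of `period` seconds.
class FixedWindowLimiter
  def initialize(limit:, period:)
    @limit = limit
    @period = period
    @counts = Hash.new(0)
  end

  # Returns true if the request identified by key is still within the limit.
  def allow?(key, now: Time.now.to_f)
    window = (now / @period).floor
    # Forget counters that belong to earlier windows.
    @counts.delete_if { |(_key, counted_window), _count| counted_window < window }
    @counts[[key, window]] += 1
    @counts[[key, window]] <= @limit
  end
end

limiter = FixedWindowLimiter.new(limit: 2, period: 60)
3.times { |i| puts "request #{i + 1}: #{limiter.allow?('203.0.113.7', now: 0.0)}" }
```

In production a shared store is needed so that the limit holds across several application processes, which is what a library such as rack-attack provides.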
Sandboxing and Isolation
For high-security environments, consider running HTML parsing in isolated environments.
Container-Based Isolation
Use Docker containers to isolate parsing operations:
# Dockerfile for secure parsing service
FROM ruby:3.1-alpine
RUN adduser -D -s /bin/sh parser
USER parser
WORKDIR /app
COPY --chown=parser:parser . .
RUN bundle config set --local without 'development test' && bundle install
CMD ["ruby", "secure_parser.rb"]
Process Isolation
Run parsing in separate processes with limited privileges:
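require 'json'
class IsolatedHtmlParser
  def self.parse_in_subprocess(html_content)
    read_pipe, write_pipe = IO.pipe
    pid = fork do
      read_pipe.close
      begin
        # Drop privileges: group first, because the group can no longer
        # be changed once the user ID has been dropped
        Process::GID.change_privilege(1000)
        Process::UID.change_privilege(1000)
        result = SecureHtmlProcessor.process_untrusted_html(html_content)
        write_pipe.write(JSON.generate(result: result))
      rescue => e
        write_pipe.write(JSON.generate(error: e.message))
      ensure
        write_pipe.close
        exit!(0) # Leave the child without running the parent's at_exit handlers
      end
    end
    write_pipe.close
    # JSON is used instead of Marshal so that a compromised child
    # cannot make the parent deserialize arbitrary objects
    response = JSON.parse(read_pipe.read)
    Process.wait(pid)
    read_pipe.close
    response.key?('error') ? response : response['result']
  end
end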
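A separate process also makes it possible to stop a parse that hangs, because the parent can kill the child outright. The sketch below shows that pattern with the standard library only, on a platform that supports fork. The method name run_isolated and the shape of the returned hash are assumptions for illustration, and privilege dropping is left out to keep the example short.

```ruby
require 'json'
require 'timeout'

# Runs the block in a forked child and returns its result as a hash.
# The child is killed if it does not finish within timeout_seconds.
def run_isolated(timeout_seconds: 5)
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    payload = begin
      { 'ok' => true, 'value' => yield }
    rescue StandardError => e
      { 'ok' => false, 'error' => e.message }
    end
    writer.write(JSON.generate(payload))
    writer.close
    exit!(0) # skip at_exit handlers inherited from the parent
  end
  writer.close

  begin
    # Waiting on the pipe is interruptible, unlike time spent in native parsing code.
    raw = Timeout.timeout(timeout_seconds) { reader.read }
    Process.wait(pid)
    JSON.parse(raw)
  rescue Timeout::Error
    Process.kill('KILL', pid)
    Process.wait(pid)
    { 'ok' => false, 'error' => 'timed out' }
  rescue JSON::ParserError
    { 'ok' => false, 'error' => 'child exited without a result' }
  ensure
    reader.close
  end
end

puts run_isolated { 'parsed text'.upcase }.inspect
```

The block would contain the actual parsing and extraction call. Its return value has to be something JSON can represent, such as a string or an array of strings.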
Testing Security Measures
Regular security testing ensures your Nokogiri implementation remains secure.
Security Test Cases
# spec/security/nokogiri_security_spec.rb
require 'timeout'
RSpec.describe 'Nokogiri Security' do
  describe 'XXE attack prevention' do
    it 'rejects XML with external entities' do
      malicious_html = '<!DOCTYPE html [<!ENTITY xxe SYSTEM "file:///etc/passwd">]><html>&xxe;</html>'
      expect {
        SecureHtmlProcessor.process_untrusted_html(malicious_html)
      }.not_to raise_error
      # Should not contain file contents (to_s guards against a nil result)
      result = SecureHtmlProcessor.process_untrusted_html(malicious_html)
      expect(result.to_s).not_to include('root:')
    end
  end
  describe 'resource exhaustion prevention' do
    it 'handles extremely nested HTML' do
      nested_html = '<div>' * 10000 + 'content' + '</div>' * 10000
      # RSpec does not accept not_to raise_error with a specific error class,
      # so the expectation is that nothing is raised at all
      expect {
        Timeout.timeout(5) do
          SecureHtmlProcessor.process_untrusted_html(nested_html)
        end
      }.not_to raise_error
    end
  end
  describe 'script injection prevention' do
    it 'removes dangerous script tags' do
      malicious_html = '<div><script>alert("XSS")</script>Safe content</div>'
      result = SecureHtmlProcessor.process_untrusted_html(malicious_html)
      expect(result).to eq('Safe content')
      expect(result).not_to include('alert')
    end
  end
end
Performance Considerations
Security measures can impact performance, so it's important to balance security and efficiency.
Caching Validated Content
require 'digest'
class CachedSecureParser
  def self.parse_with_cache(html_content)
    # Key the cache on a digest of the content, not on the content itself
    cache_key = Digest::SHA256.hexdigest(html_content)
    Rails.cache.fetch("secure_html_#{cache_key}", expires_in: 1.hour) do
      SecureHtmlProcessor.process_untrusted_html(html_content)
    end
  end
end
Batch Processing
Process multiple HTML documents efficiently:
class BatchHtmlProcessor
  def self.process_batch(html_documents)
    results = []
    html_documents.each_slice(10) do |batch|
      batch_results = batch.map do |html|
        SecureHtmlProcessor.process_untrusted_html(html)
      end
      results.concat(batch_results)
      # Allow other processes to run
      sleep(0.01) if batch_results.size == 10
    end
    results
  end
end
Conclusion
Securing Nokogiri when processing untrusted HTML requires a multi-layered approach combining proper parser configuration, input validation, resource management, and comprehensive error handling. The key principles are to disable external entity processing, validate and sanitize all input, implement strict resource limits, and maintain thorough logging and monitoring.
When building applications that process HTML from external sources, consider using additional security tools and regularly update your dependencies. For complex scraping scenarios involving dynamic content, you might also want to explore how to handle authentication in Puppeteer for more sophisticated security challenges, or learn about handling errors in Puppeteer for robust error management.
Remember that security is an ongoing process, and staying informed about new vulnerabilities and best practices is essential for maintaining a secure application. Regular security audits and penetration testing should be part of your development lifecycle when handling untrusted content.