How do I handle malformed HTML with Nokogiri?
Nokogiri is designed to handle malformed HTML gracefully, making it an excellent choice for web scraping real-world websites that often contain imperfect markup. This guide covers various strategies and techniques for dealing with malformed HTML documents using Nokogiri's robust parsing capabilities.
Understanding Nokogiri's HTML Parser
Nokogiri uses the libxml2 library under the hood, which includes an HTML parser specifically designed to handle broken or malformed HTML. Unlike strict XML parsers, Nokogiri's HTML parser automatically corrects common HTML errors and builds a usable DOM tree. Recent Nokogiri versions also ship Nokogiri::HTML5, a browser-grade parser, but the examples below use the classic libxml2-backed Nokogiri::HTML.
Basic HTML Parsing with Error Tolerance
require 'nokogiri'
# Malformed HTML example
malformed_html = <<~HTML
<html>
<head>
<title>Test Page
</head>
<body>
<div>
<p>Unclosed paragraph
<span>Nested span</div>
</div>
</body>
</html>
HTML
# Parse with default HTML parser (automatically handles errors)
doc = Nokogiri::HTML(malformed_html)
puts doc.title # "Test Page"
puts doc.at_css('p').text # "Unclosed paragraph"
Configuring Parser Options
Nokogiri provides several parsing options to control how malformed HTML is handled:
require 'nokogiri'
# Custom parsing options
doc = Nokogiri::HTML(malformed_html) do |config|
config.options = Nokogiri::XML::ParseOptions::RECOVER |
Nokogiri::XML::ParseOptions::NOERROR |
Nokogiri::XML::ParseOptions::NOWARNING
end
# Alternative syntax
doc = Nokogiri::HTML::Document.parse(malformed_html, nil, nil,
Nokogiri::XML::ParseOptions::RECOVER)
Common Parse Options
- RECOVER: Attempt to recover from parsing errors
- NOERROR: Suppress error messages
- NOWARNING: Suppress warning messages
- HUGE: Relax the parser's hardcoded size limits for very large documents
- COMPACT: Create a compact, read-only representation of small text nodes
Handling Specific Malformed HTML Issues
Missing or Mismatched Tags
# HTML with missing closing tags
broken_html = '<div><p>Text<span>More text</div>'
doc = Nokogiri::HTML(broken_html)
# Nokogiri automatically closes unclosed tags
puts doc.at_css('div').to_html
# => <div><p>Text<span>More text</span></p></div>
Invalid Nesting
# Invalid nesting (block element inside inline element)
invalid_nesting = '<span><div>This is wrong</div></span>'
doc = Nokogiri::HTML(invalid_nesting)
# Nokogiri restructures to valid HTML
puts doc.css('body').inner_html
Encoding Issues
# Handle encoding problems
def parse_with_encoding_detection(html_content)
# Try UTF-8 first
begin
doc = Nokogiri::HTML(html_content, nil, 'UTF-8')
return doc if doc.errors.empty?
rescue
# Fallback to auto-detection or specific encoding
end
# Try with encoding detection
doc = Nokogiri::HTML(html_content, nil, nil) do |config|
config.options = Nokogiri::XML::ParseOptions::RECOVER
end
doc
end
Error Detection and Handling
Checking for Parse Errors
doc = Nokogiri::HTML(malformed_html)
# Check if there were parsing errors
unless doc.errors.empty?
puts "Parse errors found:"
doc.errors.each do |error|
puts "Line #{error.line}: #{error.message}"
end
end
# Get error details
doc.errors.each do |error|
puts "Level: #{error.level}" # 1=warning, 2=error, 3=fatal
puts "Code: #{error.code}" # Error code
puts "Domain: #{error.domain}" # Parser domain
puts "Message: #{error.message}" # Error description
puts "Line: #{error.line}" # Line number
puts "Column: #{error.column}" # Column number
end
Custom Error Handling
class HTMLCleaner
def self.parse_and_clean(html_content)
doc = Nokogiri::HTML(html_content) do |config|
config.options = Nokogiri::XML::ParseOptions::RECOVER |
Nokogiri::XML::ParseOptions::NOERROR
end
# Additional cleanup
clean_document(doc)
end
def self.clean_document(doc)
# Remove empty elements (skip void elements such as <br> and <img>)
doc.css('*').each do |element|
next if %w[br hr img input meta link area].include?(element.name)
element.remove if element.content.strip.empty? && element.children.empty?
end
# Fix common attribute issues
doc.css('[src], [href]').each do |element|
%w[src href].each do |attr|
if element[attr] && element[attr].strip.empty?
element.remove_attribute(attr)
end
end
end
doc
end
private_class_method :clean_document
end
Advanced Malformed HTML Scenarios
Handling Multiple Root Elements
# HTML with multiple root elements (invalid)
multiple_roots = '<div>First</div><div>Second</div><span>Third</span>'
doc = Nokogiri::HTML(multiple_roots)
# Nokogiri wraps in proper html/body structure
puts doc.css('body > div, body > span').count # 3
Processing Fragmented HTML
# Parse HTML fragments
fragment_html = '<td>Cell 1</td><td>Cell 2</td>'
# Use HTML fragment parsing
fragment = Nokogiri::HTML::DocumentFragment.parse(fragment_html)
puts fragment.css('td').count # 2
# Or parse as a complete document (adds an html/body wrapper);
# handling of table elements outside a <table> can vary by parser version
doc = Nokogiri::HTML(fragment_html)
puts doc.css('td').count
Dealing with JavaScript-Generated Content
While Nokogiri can't execute JavaScript, you can clean up HTML that contains JavaScript artifacts:
def clean_js_artifacts(html)
doc = Nokogiri::HTML(html)
# Remove script tags
doc.css('script').remove
# Remove HTML comments that might contain JS
doc.xpath('//comment()').remove
# Clean up onclick and other JS event attributes
doc.css('*').each do |element|
element.attributes.each do |name, attr|
if name.start_with?('on') || attr.value.to_s.include?('javascript:')
element.remove_attribute(name)
end
end
end
doc
end
Sanitization and Security
HTML Sanitization
require 'nokogiri'
class HTMLSanitizer
ALLOWED_TAGS = %w[p br strong em ul ol li h1 h2 h3 h4 h5 h6].freeze
ALLOWED_ATTRIBUTES = %w[class id].freeze
def self.sanitize(html)
doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.css('*').each do |element|
# Remove dangerous tags and their content entirely
if %w[script style].include?(element.name.downcase)
element.remove
next
end
# Unwrap other disallowed tags but keep their children
unless ALLOWED_TAGS.include?(element.name.downcase)
element.replace(element.children)
next
end
# Remove disallowed attributes
element.attributes.each do |name, attr|
unless ALLOWED_ATTRIBUTES.include?(name.downcase)
element.remove_attribute(name)
end
end
end
doc.to_html
end
end
# Usage
dirty_html = '<div onclick="alert()"><p>Safe content</p><script>alert("xss")</script></div>'
clean_html = HTMLSanitizer.sanitize(dirty_html)
puts clean_html # <p>Safe content</p>
Performance Considerations
Efficient Parsing for Large Documents
def parse_large_malformed_html(html_content)
# HUGE lifts libxml2's size limits; COMPACT trades mutability for
# memory (the resulting document is read-only)
doc = Nokogiri::HTML(html_content) do |config|
config.options = Nokogiri::XML::ParseOptions::RECOVER |
Nokogiri::XML::ParseOptions::HUGE |
Nokogiri::XML::ParseOptions::COMPACT
end
doc
end
# Memory-efficient processing
# Caution: splitting raw HTML into fixed-size chunks cuts through
# tags and yields garbage fragments, so only split at boundaries
# you control (e.g., one self-contained record per chunk)
def process_records(html_records)
html_records.map do |record|
Nokogiri::HTML::DocumentFragment.parse(record)
end
end
Integration with Web Scraping APIs
When dealing with complex JavaScript-heavy sites that generate malformed HTML, you might encounter scenarios where traditional parsing isn't sufficient. In such cases, using a web scraping API that handles JavaScript rendering can provide you with properly rendered HTML that Nokogiri can then parse more reliably.
For sites with particularly challenging content structures, consider preprocessing the HTML with automated tools before applying Nokogiri's parsing capabilities. This approach is especially useful when handling dynamic content that loads after page initialization.
Real-World Examples
Scraping E-commerce Sites
require 'open-uri'
require 'nokogiri'
def scrape_product_info(url)
html = URI.open(url).read
doc = Nokogiri::HTML(html) do |config|
config.options = Nokogiri::XML::ParseOptions::RECOVER
end
# Handle missing or malformed product data
{
title: extract_safe_text(doc, '.product-title, h1'),
price: extract_safe_text(doc, '.price, .cost'),
description: extract_safe_text(doc, '.description, .product-desc')
}
end
def extract_safe_text(doc, selector)
element = doc.at_css(selector)
element ? element.text.strip : 'Not found'
rescue => e
"Error: #{e.message}"
end
Cleaning User-Generated Content
def clean_user_html(user_input)
doc = Nokogiri::HTML::DocumentFragment.parse(user_input)
# Remove dangerous elements
doc.css('script, object, embed, iframe').remove
# Clean attributes
doc.css('*').each do |element|
element.attributes.each do |name, attr|
if name.start_with?('on') || attr.value.include?('javascript:')
element.remove_attribute(name)
end
end
end
doc.to_html
end
Testing with Malformed HTML
require 'rspec'
require 'nokogiri'
RSpec.describe 'Malformed HTML parsing' do
it 'handles unclosed tags gracefully' do
html = '<div><p>Text<span>More text</div>'
doc = Nokogiri::HTML(html)
expect(doc.css('div').size).to eq(1)
expect(doc.css('p').size).to eq(1)
expect(doc.css('span').size).to eq(1)
end
it 'recovers from invalid nesting' do
html = '<span><div>Invalid nesting</div></span>'
doc = Nokogiri::HTML(html)
# Nokogiri recovers without raising; doc.errors may still record
# the mismatch, so assert on the recovered content instead
expect(doc.at_css('div')).not_to be_nil
expect(doc.at_css('div').text).to include('Invalid nesting')
end
end
Best Practices
- Always Use Error Recovery: Enable the
RECOVER
option for real-world HTML - Validate Critical Data: Check for expected elements before accessing them
- Handle Encoding Properly: Specify encoding when possible, let Nokogiri auto-detect when uncertain
- Sanitize User Input: Always sanitize HTML from untrusted sources
- Test with Real Data: Test your parsing logic with actual malformed HTML from target websites
- Use Defensive Programming: Implement safe text extraction methods with error handling
- Monitor Parse Errors: Log parsing errors in production to identify problematic sources
Common Malformed HTML Patterns
# Common patterns and how Nokogiri handles them
patterns = {
'Unclosed divs' => '<div><p>Content</div>',
'Wrong nesting' => '<em><p>Emphasis in paragraph</p></em>',
'Unquoted attribute values' => '<img src=image.jpg alt=description>',
'Self-closing non-void' => '<div />',
'Multiple roots' => '<div>First</div><div>Second</div>'
}
patterns.each do |description, html|
doc = Nokogiri::HTML(html)
puts "#{description}:"
puts " Original: #{html}"
puts " Parsed: #{doc.css('body').inner_html.strip}"
puts " Errors: #{doc.errors.any? ? doc.errors.first.message : 'None'}"
puts
end
Conclusion
Nokogiri's robust HTML parser makes handling malformed HTML straightforward in most cases. By understanding the available parsing options, implementing proper error handling, and following security best practices, you can reliably extract data from even the most poorly formatted web pages. Remember to always test your parsing logic with real-world data and implement appropriate fallbacks for critical data extraction scenarios.
The key to successful malformed HTML handling is combining Nokogiri's built-in recovery capabilities with defensive programming practices and thorough testing. This approach ensures your scraping applications remain robust when encountering the inevitable HTML quality issues found across the web.