How do I handle malformed HTML with Nokogiri?
Nokogiri is designed to handle malformed HTML gracefully, making it an excellent choice for web scraping real-world websites that often contain imperfect markup. This guide covers various strategies and techniques for dealing with malformed HTML documents using Nokogiri's robust parsing capabilities.
Understanding Nokogiri's HTML Parser
Nokogiri uses the libxml2 library under the hood, which includes an HTML parser specifically designed to handle broken or malformed HTML. Unlike strict XML parsers, Nokogiri's HTML parser automatically corrects common HTML errors and builds a usable DOM tree. Recent Nokogiri versions also ship Nokogiri::HTML5, a browser-grade parser, but the examples below use the classic libxml2-backed Nokogiri::HTML.
Basic HTML Parsing with Error Tolerance
require 'nokogiri'
# Malformed HTML example
malformed_html = <<~HTML
<html>
<head>
<title>Test Page
</head>
<body>
<div>
<p>Unclosed paragraph
<span>Nested span</div>
</div>
</body>
</html>
HTML
# Parse with default HTML parser (automatically handles errors)
doc = Nokogiri::HTML(malformed_html)
puts doc.title # "Test Page"
puts doc.at_css('p').text # "Unclosed paragraph"
Configuring Parser Options
Nokogiri provides several parsing options to control how malformed HTML is handled:
require 'nokogiri'
# Custom parsing options
doc = Nokogiri::HTML(malformed_html) do |config|
config.options = Nokogiri::XML::ParseOptions::RECOVER |
Nokogiri::XML::ParseOptions::NOERROR |
Nokogiri::XML::ParseOptions::NOWARNING
end
# Alternative syntax
doc = Nokogiri::HTML::Document.parse(malformed_html, nil, nil,
Nokogiri::XML::ParseOptions::RECOVER)
Common Parse Options
- RECOVER: Attempt to recover from parsing errors
- NOERROR: Suppress error messages
- NOWARNING: Suppress warning messages
- HUGE: Relax the parser's hardcoded size limits for very large documents
- COMPACT: Create a compact, read-only representation of small text nodes
Handling Specific Malformed HTML Issues
Missing or Mismatched Tags
# HTML with missing closing tags
broken_html = '<div><p>Text<span>More text</div>'
doc = Nokogiri::HTML(broken_html)
# Nokogiri automatically closes unclosed tags
puts doc.at_css('div').to_html
# => <div><p>Text<span>More text</span></p></div>
Invalid Nesting
# Invalid nesting (block element inside inline element)
invalid_nesting = '<span><div>This is wrong</div></span>'
doc = Nokogiri::HTML(invalid_nesting)
# Nokogiri restructures to valid HTML
puts doc.css('body').inner_html
Encoding Issues
# Handle encoding problems
def parse_with_encoding_detection(html_content)
# Try UTF-8 first
begin
doc = Nokogiri::HTML(html_content, nil, 'UTF-8')
return doc if doc.errors.empty?
rescue
# Fallback to auto-detection or specific encoding
end
# Try with encoding detection
doc = Nokogiri::HTML(html_content, nil, nil) do |config|
config.options = Nokogiri::XML::ParseOptions::RECOVER
end
doc
end
Error Detection and Handling
Checking for Parse Errors
doc = Nokogiri::HTML(malformed_html)
# Check if there were parsing errors
unless doc.errors.empty?
puts "Parse errors found:"
doc.errors.each do |error|
puts "Line #{error.line}: #{error.message}"
end
end
# Get error details
doc.errors.each do |error|
puts "Level: #{error.level}" # 1=warning, 2=error, 3=fatal
puts "Code: #{error.code}" # Error code
puts "Domain: #{error.domain}" # Parser domain
puts "Message: #{error.message}" # Error description
puts "Line: #{error.line}" # Line number
puts "Column: #{error.column}" # Column number
end
Custom Error Handling
class HTMLCleaner
def self.parse_and_clean(html_content)
doc = Nokogiri::HTML(html_content) do |config|
config.options = Nokogiri::XML::ParseOptions::RECOVER |
Nokogiri::XML::ParseOptions::NOERROR
end
# Additional cleanup
clean_document(doc)
end
def self.clean_document(doc)
# Remove empty elements (skip void elements such as <br> and <img>)
doc.css('*').each do |element|
next if %w[br hr img input meta link area].include?(element.name)
element.remove if element.content.strip.empty? && element.children.empty?
end
# Fix common attribute issues
doc.css('[src], [href]').each do |element|
%w[src href].each do |attr|
if element[attr] && element[attr].strip.empty?
element.remove_attribute(attr)
end
end
end
doc
end
private_class_method :clean_document
end
Advanced Malformed HTML Scenarios
Handling Multiple Root Elements
# HTML with multiple root elements (invalid)
multiple_roots = '<div>First</div><div>Second</div><span>Third</span>'
doc = Nokogiri::HTML(multiple_roots)
# Nokogiri wraps in proper html/body structure
puts doc.css('body > div, body > span').count # 3
Processing Fragmented HTML
# Parse HTML fragments
fragment_html = '<td>Cell 1</td><td>Cell 2</td>'
# Use HTML fragment parsing
fragment = Nokogiri::HTML::DocumentFragment.parse(fragment_html)
puts fragment.css('td').count # 2
# Or parse as a complete document (adds an html/body wrapper);
# handling of table elements outside a <table> can vary by parser version
doc = Nokogiri::HTML(fragment_html)
puts doc.css('td').count
Dealing with JavaScript-Generated Content
While Nokogiri can't execute JavaScript, you can clean up HTML that contains JavaScript artifacts:
def clean_js_artifacts(html)
doc = Nokogiri::HTML(html)
# Remove script tags
doc.css('script').remove
# Remove HTML comments that might contain JS
doc.xpath('//comment()').remove
# Clean up onclick and other JS event attributes
doc.css('*').each do |element|
element.attributes.each do |name, attr|
if name.start_with?('on') || attr.value.to_s.include?('javascript:')
element.remove_attribute(name)
end
end
end
doc
end
Sanitization and Security
HTML Sanitization
require 'nokogiri'
class HTMLSanitizer
ALLOWED_TAGS = %w[p br strong em ul ol li h1 h2 h3 h4 h5 h6].freeze
ALLOWED_ATTRIBUTES = %w[class id].freeze
def self.sanitize(html)
doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.css('*').each do |element|
# Remove dangerous tags and their content entirely
if %w[script style].include?(element.name.downcase)
element.remove
next
end
# Unwrap other disallowed tags but keep their children
unless ALLOWED_TAGS.include?(element.name.downcase)
element.replace(element.children)
next
end
# Remove disallowed attributes
element.attributes.each do |name, attr|
unless ALLOWED_ATTRIBUTES.include?(name.downcase)
element.remove_attribute(name)
end
end
end
doc.to_html
end
end
# Usage
dirty_html = '<div onclick="alert()"><p>Safe content</p><script>alert("xss")</script></div>'
clean_html = HTMLSanitizer.sanitize(dirty_html)
puts clean_html # <p>Safe content</p>
Performance Considerations
Efficient Parsing for Large Documents
def parse_large_malformed_html(html_content)
# HUGE lifts libxml2's size limits; COMPACT trades mutability for
# memory (the resulting document is read-only)
doc = Nokogiri::HTML(html_content) do |config|
config.options = Nokogiri::XML::ParseOptions::RECOVER |
Nokogiri::XML::ParseOptions::HUGE |
Nokogiri::XML::ParseOptions::COMPACT
end
doc
end
# Memory-efficient processing
# Caution: splitting raw HTML into fixed-size chunks cuts through
# tags and yields garbage fragments, so only split at boundaries
# you control (e.g., one self-contained record per chunk)
def process_records(html_records)
html_records.map do |record|
Nokogiri::HTML::DocumentFragment.parse(record)
end
end
Integration with Web Scraping APIs
When dealing with complex JavaScript-heavy sites that generate malformed HTML, you might encounter scenarios where traditional parsing isn't sufficient. In such cases, using a web scraping API that handles JavaScript rendering can provide you with properly rendered HTML that Nokogiri can then parse more reliably.
For sites with particularly challenging content structures, consider preprocessing the HTML with automated tools before applying Nokogiri's parsing capabilities. This approach is especially useful when handling dynamic content that loads after page initialization.
Real-World Examples
Scraping E-commerce Sites
require 'open-uri'
require 'nokogiri'
def scrape_product_info(url)
html = URI.open(url).read
doc = Nokogiri::HTML(html) do |config|
config.options = Nokogiri::XML::ParseOptions::RECOVER
end
# Handle missing or malformed product data
{
title: extract_safe_text(doc, '.product-title, h1'),
price: extract_safe_text(doc, '.price, .cost'),
description: extract_safe_text(doc, '.description, .product-desc')
}
end
def extract_safe_text(doc, selector)
element = doc.at_css(selector)
element ? element.text.strip : 'Not found'
rescue => e
"Error: #{e.message}"
end
Cleaning User-Generated Content
def clean_user_html(user_input)
doc = Nokogiri::HTML::DocumentFragment.parse(user_input)
# Remove dangerous elements
doc.css('script, object, embed, iframe').remove
# Clean attributes
doc.css('*').each do |element|
element.attributes.each do |name, attr|
if name.start_with?('on') || attr.value.include?('javascript:')
element.remove_attribute(name)
end
end
end
doc.to_html
end
Testing with Malformed HTML
require 'rspec'
require 'nokogiri'
RSpec.describe 'Malformed HTML parsing' do
it 'handles unclosed tags gracefully' do
html = '<div><p>Text<span>More text</div>'
doc = Nokogiri::HTML(html)
expect(doc.css('div').size).to eq(1)
expect(doc.css('p').size).to eq(1)
expect(doc.css('span').size).to eq(1)
end
it 'recovers from invalid nesting' do
html = '<span><div>Invalid nesting</div></span>'
doc = Nokogiri::HTML(html)
# Nokogiri recovers without raising; doc.errors may still record
# the mismatch, so assert on the recovered content instead
expect(doc.at_css('div')).not_to be_nil
expect(doc.at_css('div').text).to include('Invalid nesting')
end
end
Best Practices
- Always Use Error Recovery: Enable the
RECOVER
option for real-world HTML - Validate Critical Data: Check for expected elements before accessing them
- Handle Encoding Properly: Specify encoding when possible, let Nokogiri auto-detect when uncertain
- Sanitize User Input: Always sanitize HTML from untrusted sources
- Test with Real Data: Test your parsing logic with actual malformed HTML from target websites
- Use Defensive Programming: Implement safe text extraction methods with error handling
- Monitor Parse Errors: Log parsing errors in production to identify problematic sources
Common Malformed HTML Patterns
# Common patterns and how Nokogiri handles them
patterns = {
'Unclosed divs' => '<div><p>Content</div>',
'Wrong nesting' => '<em><p>Emphasis in paragraph</p></em>',
'Unquoted attribute values' => '<img src=image.jpg alt=description>',
'Self-closing non-void' => '<div />',
'Multiple roots' => '<div>First</div><div>Second</div>'
}
patterns.each do |description, html|
doc = Nokogiri::HTML(html)
puts "#{description}:"
puts " Original: #{html}"
puts " Parsed: #{doc.css('body').inner_html.strip}"
puts " Errors: #{doc.errors.any? ? doc.errors.first.message : 'None'}"
puts
end
Conclusion
Nokogiri's robust HTML parser makes handling malformed HTML straightforward in most cases. By understanding the available parsing options, implementing proper error handling, and following security best practices, you can reliably extract data from even the most poorly formatted web pages. Remember to always test your parsing logic with real-world data and implement appropriate fallbacks for critical data extraction scenarios.
The key to successful malformed HTML handling is combining Nokogiri's built-in recovery capabilities with defensive programming practices and thorough testing. This approach ensures your scraping applications remain robust when encountering the inevitable HTML quality issues found across the web.