How do I convert Nokogiri documents back to HTML strings?

Converting Nokogiri documents back to HTML strings is a common requirement when you need to serialize parsed HTML content for storage, transmission, or further processing. Nokogiri provides several methods to accomplish this task, each with different options and use cases.

Basic HTML String Conversion

The most straightforward way to convert a Nokogiri document to an HTML string is using the to_html method:

require 'nokogiri'

html = '<html><body><h1>Hello World</h1><p>This is a paragraph.</p></body></html>'
doc = Nokogiri::HTML(html)

# Convert entire document to HTML string
html_string = doc.to_html
puts html_string

This will output the complete HTML document, including the DOCTYPE declaration and any HTML structure that Nokogiri automatically adds.

Converting Specific Elements

You can also convert individual elements or node sets to HTML strings:

require 'nokogiri'

html = '<div><h1>Title</h1><p>Content</p><ul><li>Item 1</li><li>Item 2</li></ul></div>'
doc = Nokogiri::HTML::DocumentFragment.parse(html)

# Convert specific elements
title = doc.at_css('h1')
puts title.to_html  # Output: <h1>Title</h1>

# Convert multiple elements
list_items = doc.css('li')
list_items.each do |item|
  puts item.to_html
end

# Convert the entire fragment
puts doc.to_html

Serialization Options and Formatting

Nokogiri provides various options to control the HTML output format:

Pretty Printing

require 'nokogiri'

html = '<html><head><title>Test</title></head><body><div><p>Paragraph</p></div></body></html>'
doc = Nokogiri::HTML(html)

# Pretty print with indentation
formatted_html = doc.to_html(indent: 2)
puts formatted_html

Encoding Control

require 'nokogiri'

html = '<html><body><p>Hello 世界</p></body></html>'
doc = Nokogiri::HTML(html)

# Specify encoding
html_string = doc.to_html(encoding: 'UTF-8')
puts html_string

# Force ASCII encoding with entity encoding
ascii_html = doc.to_html(encoding: 'US-ASCII')
puts ascii_html

Save Options

You can pass the save_with option for finer-grained control over the output:

require 'nokogiri'

html = '<html><body><div>  <p>Text</p>  </div></body></html>'
doc = Nokogiri::HTML(html)

# Various save options
options = Nokogiri::XML::Node::SaveOptions::FORMAT | 
          Nokogiri::XML::Node::SaveOptions::NO_EMPTY_TAGS

formatted_html = doc.to_html(save_with: options, indent: 2)
puts formatted_html

Working with Document Fragments

When working with HTML fragments (partial HTML without a complete document structure), use DocumentFragment:

require 'nokogiri'

fragment_html = '<div class="container"><h2>Section Title</h2><p>Section content here.</p></div>'
fragment = Nokogiri::HTML::DocumentFragment.parse(fragment_html)

# Convert fragment to HTML string
output = fragment.to_html
puts output

# Modify and convert
fragment.at_css('h2').content = 'Updated Title'
modified_html = fragment.to_html
puts modified_html

Advanced Serialization Techniques

Custom Serialization with Builder

For complete control over HTML generation, you can use Nokogiri's Builder:

require 'nokogiri'

# Parse existing HTML
html = '<article><h1>Original Title</h1><p>Original content</p></article>'
doc = Nokogiri::HTML::DocumentFragment.parse(html)

# Extract data and rebuild with Builder
title = doc.at_css('h1').text
content = doc.at_css('p').text

builder = Nokogiri::HTML::Builder.new do |b|
  b.div(class: 'modernized-article') {
    b.header {
      b.h1(title, class: 'article-title')
    }
    b.main {
      b.p(content, class: 'article-content')
    }
  }
end

puts builder.to_html

Preserving Original Formatting

Sometimes you need to preserve the original HTML formatting as much as possible:

require 'nokogiri'

html = <<~HTML
  <div>
    <h1>Important Title</h1>
    <!-- This is a comment -->
    <p>Paragraph with <strong>bold</strong> text.</p>
  </div>
HTML

# Fragment parsing keeps comments and surrounding whitespace by default
doc = Nokogiri::HTML::DocumentFragment.parse(html)

# Convert back maintaining structure
preserved_html = doc.to_html
puts preserved_html

Handling Different Content Types

Converting XML to HTML

When working with XML documents that need to be re-serialized with HTML rules (note that tag names are kept as-is; only the serialization format changes):

require 'nokogiri'

xml = '<root><item>Content 1</item><item>Content 2</item></root>'
xml_doc = Nokogiri::XML(xml)

# Convert to HTML document fragment
html_fragment = Nokogiri::HTML::DocumentFragment.parse(xml_doc.to_html)
puts html_fragment.to_html

Text Content Extraction vs HTML Conversion

Understanding the difference between text extraction and HTML conversion:

require 'nokogiri'

html = '<div><p>This is <em>emphasized</em> text with <a href="#">a link</a>.</p></div>'
doc = Nokogiri::HTML::DocumentFragment.parse(html)

# Get HTML string (preserves markup)
html_output = doc.to_html
puts "HTML: #{html_output}"

# Get text content only (strips markup)
text_output = doc.text
puts "Text: #{text_output}"

# Get inner HTML of specific element
paragraph = doc.at_css('p')
inner_html = paragraph.inner_html
puts "Inner HTML: #{inner_html}"

Performance Considerations

When converting large documents or performing frequent conversions, consider these optimization strategies:

require 'nokogiri'
require 'benchmark'

# Sample large HTML content
large_html = '<div>' + ('<p>Sample paragraph content.</p>' * 1000) + '</div>'
doc = Nokogiri::HTML::DocumentFragment.parse(large_html)

# Benchmark different approaches
Benchmark.bm do |x|
  x.report("to_html:") { doc.to_html }
  x.report("to_html with options:") { doc.to_html(indent: 2) }
  x.report("children.map(&:to_html):") { doc.children.map(&:to_html).join }
end

Error Handling and Edge Cases

Handle potential errors defensively when converting documents. Keep in mind that Nokogiri's HTML parser is lenient: it repairs most malformed markup rather than raising, so rescues like the ones below mainly guard against nil or non-string input:

require 'nokogiri'

def safe_html_conversion(html_content)
  doc = Nokogiri::HTML::DocumentFragment.parse(html_content)
  doc.to_html
rescue Nokogiri::SyntaxError => e
  puts "Parse error: #{e.message}"
  html_content # Return original if parsing fails
rescue StandardError => e
  puts "Unexpected error: #{e.message}"
  ""
end

# Test with various inputs
test_cases = [
  '<div>Valid HTML</div>',
  '<div>Unclosed tag',
  '',
  nil
]

test_cases.each do |test_html|
  result = safe_html_conversion(test_html)
  puts "Input: #{test_html.inspect} -> Output: #{result.inspect}"
end

Integration with Web Scraping Workflows

When building web scraping applications, you might need to process scraped content and convert it back to HTML for storage or API responses. While tools like Puppeteer handle dynamic content extraction, Nokogiri excels at HTML manipulation and serialization tasks.

require 'nokogiri'
require 'net/http'

def scrape_and_process_html(url)
  # Fetch HTML content
  uri = URI(url)
  response = Net::HTTP.get_response(uri)

  return nil unless response.code == '200'

  # Parse with Nokogiri
  doc = Nokogiri::HTML(response.body)

  # Remove unwanted elements
  doc.css('script, style, nav, footer, .ads').remove

  # Extract main content
  main_content = doc.at_css('main, .content, article, .post')

  # Convert back to clean HTML string
  return main_content ? main_content.to_html : doc.to_html
rescue => e
  puts "Error processing #{url}: #{e.message}"
  return nil
end

Real-World Use Cases

Content Management Systems

require 'nokogiri'

def sanitize_user_content(user_html)
  doc = Nokogiri::HTML::DocumentFragment.parse(user_html)

  # Remove dangerous elements and attributes
  doc.css('script, iframe, object, embed').remove
  doc.css('*').each do |element|
    element.remove_attribute('onclick')
    element.remove_attribute('onload')
    element.remove_attribute('onerror')
  end

  # Return cleaned HTML
  doc.to_html
end

user_input = '<p>Safe content</p><script>alert("danger")</script>'
clean_html = sanitize_user_content(user_input)
puts clean_html  # Output: <p>Safe content</p>

Template Processing

require 'nokogiri'

def process_email_template(template_html, variables)
  doc = Nokogiri::HTML::DocumentFragment.parse(template_html)

  # Replace placeholders with actual values
  variables.each do |key, value|
    doc.css("[data-placeholder='#{key}']").each do |element|
      element.content = value
      element.remove_attribute('data-placeholder')
    end
  end

  doc.to_html
end

template = '<div><h1 data-placeholder="title">Title</h1><p data-placeholder="content">Content</p></div>'
variables = { 'title' => 'Welcome!', 'content' => 'Thank you for signing up.' }
processed = process_email_template(template, variables)
puts processed

Comparison with Other Serialization Methods

to_html vs to_s vs to_xml

require 'nokogiri'

html = '<div><p>Test content</p></div>'
doc = Nokogiri::HTML::DocumentFragment.parse(html)

puts "to_html: #{doc.to_html}"
puts "to_s: #{doc.to_s}"
puts "to_xml: #{doc.to_xml}"

Understanding when to use each method:

  • to_html: best for HTML output with HTML serialization rules
  • to_s: an alias for to_html on HTML documents and fragments
  • to_xml: use when you need XML-compliant output

Conclusion

Converting Nokogiri documents back to HTML strings is straightforward with the to_html method, but Nokogiri provides extensive options for controlling the output format, encoding, and structure. Whether you're building a web scraper that needs to clean and reformat HTML content, or developing an application that manipulates HTML documents, understanding these conversion methods will help you generate the exact HTML output you need.

For complex web scraping scenarios involving dynamic content, consider combining Nokogiri's HTML manipulation capabilities with tools that can handle JavaScript-heavy websites and modern web applications for a complete scraping solution.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
