How do I convert Nokogiri documents back to HTML strings?
Converting Nokogiri documents back to HTML strings is a common requirement when you need to serialize parsed HTML content for storage, transmission, or further processing. Nokogiri provides several methods to accomplish this task, each with different options and use cases.
Basic HTML String Conversion
The most straightforward way to convert a Nokogiri document to an HTML string is to call the to_html method:
require 'nokogiri'
html = '<html><body><h1>Hello World</h1><p>This is a paragraph.</p></body></html>'
doc = Nokogiri::HTML(html)
# Convert entire document to HTML string
html_string = doc.to_html
puts html_string
This will output the complete HTML document, including the DOCTYPE declaration and any HTML structure that Nokogiri automatically adds.
Converting Specific Elements
You can also convert individual elements or node sets to HTML strings:
require 'nokogiri'
html = '<div><h1>Title</h1><p>Content</p><ul><li>Item 1</li><li>Item 2</li></ul></div>'
doc = Nokogiri::HTML::DocumentFragment.parse(html)
# Convert specific elements
title = doc.at_css('h1')
puts title.to_html # Output: <h1>Title</h1>
# Convert multiple elements
list_items = doc.css('li')
list_items.each do |item|
  puts item.to_html
end
# Convert the entire fragment
puts doc.to_html
Serialization Options and Formatting
Nokogiri provides various options to control the HTML output format:
Pretty Printing
require 'nokogiri'
html = '<html><head><title>Test</title></head><body><div><p>Paragraph</p></div></body></html>'
doc = Nokogiri::HTML(html)
# Pretty print with indentation
formatted_html = doc.to_html(indent: 2)
puts formatted_html
Encoding Control
require 'nokogiri'
html = '<html><body><p>Hello 世界</p></body></html>'
doc = Nokogiri::HTML(html)
# Specify encoding
html_string = doc.to_html(encoding: 'UTF-8')
puts html_string
# Force ASCII encoding with entity encoding
ascii_html = doc.to_html(encoding: 'US-ASCII')
puts ascii_html
Save Options
You can pass save_with options for more control over the output:
require 'nokogiri'
html = '<html><body><div> <p>Text</p> </div></body></html>'
doc = Nokogiri::HTML(html)
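# Combine save options; save_with replaces the defaults, so AS_HTML is included explicitly
options = Nokogiri::XML::Node::SaveOptions::AS_HTML |
          Nokogiri::XML::Node::SaveOptions::FORMAT |
          Nokogiri::XML::Node::SaveOptions::NO_EMPTY_TAGS
formatted_html = doc.to_html(save_with: options, indent: 2)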
puts formatted_html
Working with Document Fragments
When working with HTML fragments (partial HTML without a complete document structure), use DocumentFragment:
require 'nokogiri'
fragment_html = '<div class="container"><h2>Section Title</h2><p>Section content here.</p></div>'
fragment = Nokogiri::HTML::DocumentFragment.parse(fragment_html)
# Convert fragment to HTML string
output = fragment.to_html
puts output
# Modify and convert
fragment.at_css('h2').content = 'Updated Title'
modified_html = fragment.to_html
puts modified_html
Advanced Serialization Techniques
Custom Serialization with Builder
For complete control over HTML generation, you can use Nokogiri's Builder:
require 'nokogiri'
# Parse existing HTML
html = '<article><h1>Original Title</h1><p>Original content</p></article>'
doc = Nokogiri::HTML::DocumentFragment.parse(html)
# Extract data and rebuild with Builder
title = doc.at_css('h1').text
content = doc.at_css('p').text
builder = Nokogiri::HTML::Builder.new do |html|
  html.div(class: 'modernized-article') {
    html.header {
      html.h1(title, class: 'article-title')
    }
    html.main {
      html.p(content, class: 'article-content')
    }
  }
end
puts builder.to_html
Preserving Original Formatting
Sometimes you need to preserve the original HTML formatting as much as possible:
require 'nokogiri'
html = <<~HTML
  <div>
    <h1>Important Title</h1>
    <!-- This is a comment -->
    <p>Paragraph with <strong>bold</strong> text.</p>
  </div>
HTML
# Comments are kept as comment nodes by default; no extra parse option is needed
doc = Nokogiri::HTML::DocumentFragment.parse(html)
# Convert back maintaining structure
preserved_html = doc.to_html
puts preserved_html
Handling Different Content Types
Converting XML to HTML
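When working with XML documents whose markup needs to be emitted as HTML, note that to_html only applies HTML serialization rules; it does not map XML elements to HTML elements: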
require 'nokogiri'
xml = '<root><item>Content 1</item><item>Content 2</item></root>'
xml_doc = Nokogiri::XML(xml)
# Convert to HTML document fragment
html_fragment = Nokogiri::HTML::DocumentFragment.parse(xml_doc.to_html)
puts html_fragment.to_html
Text Content Extraction vs HTML Conversion
Understanding the difference between text extraction and HTML conversion:
require 'nokogiri'
html = '<div><p>This is <em>emphasized</em> text with <a href="#">a link</a>.</p></div>'
doc = Nokogiri::HTML::DocumentFragment.parse(html)
# Get HTML string (preserves markup)
html_output = doc.to_html
puts "HTML: #{html_output}"
# Get text content only (strips markup)
text_output = doc.text
puts "Text: #{text_output}"
# Get inner HTML of specific element
paragraph = doc.at_css('p')
inner_html = paragraph.inner_html
puts "Inner HTML: #{inner_html}"
Performance Considerations
When converting large documents or performing frequent conversions, benchmark the serialization calls you rely on before optimizing:
require 'nokogiri'
require 'benchmark'
# Sample large HTML content
large_html = '<div>' + ('<p>Sample paragraph content.</p>' * 1000) + '</div>'
doc = Nokogiri::HTML::DocumentFragment.parse(large_html)
# Benchmark different approaches
Benchmark.bm(24) do |x|
  x.report("to_html:") { doc.to_html }
  x.report("to_html with options:") { doc.to_html(indent: 2) }
  # Serializes each child separately and joins the strings
  x.report("children.map(&:to_html):") { doc.children.map(&:to_html).join }
end
Error Handling and Edge Cases
Always handle potential errors when converting documents:
require 'nokogiri'
def safe_html_conversion(html_content)
  doc = Nokogiri::HTML::DocumentFragment.parse(html_content)
  doc.to_html
rescue Nokogiri::SyntaxError => e
  puts "Parse error: #{e.message}"
  html_content # Return original if parsing fails
rescue => e
  puts "Unexpected error: #{e.message}"
  ""
end
# Test with various inputs
test_cases = [
  '<div>Valid HTML</div>',
  '<div>Unclosed tag',
  '',
  nil
]
test_cases.each do |test_html|
  result = safe_html_conversion(test_html)
  puts "Input: #{test_html.inspect} -> Output: #{result.inspect}"
end
Integration with Web Scraping Workflows
When building web scraping applications, you might need to process scraped content and convert it back to HTML for storage or API responses. While tools like Puppeteer handle dynamic content extraction, Nokogiri excels at HTML manipulation and serialization tasks.
require 'nokogiri'
require 'net/http'
def scrape_and_process_html(url)
  # Fetch HTML content
  uri = URI(url)
  response = Net::HTTP.get_response(uri)
  return nil unless response.code == '200'
  # Parse with Nokogiri
  doc = Nokogiri::HTML(response.body)
  # Remove unwanted elements
  doc.css('script, style, nav, footer, .ads').remove
  # Extract main content
  main_content = doc.at_css('main, .content, article, .post')
  # Convert back to clean HTML string
  main_content ? main_content.to_html : doc.to_html
rescue => e
  puts "Error processing #{url}: #{e.message}"
  nil
end
Real-World Use Cases
Content Management Systems
require 'nokogiri'
def sanitize_user_content(user_html)
  doc = Nokogiri::HTML::DocumentFragment.parse(user_html)
  # Remove elements that can execute or embed external code
  doc.css('script, iframe, object, embed').remove
  doc.css('*').each do |element|
    # Strip every inline event handler (onclick, onload, onerror, ...)
    element.attribute_nodes.each do |attr|
      element.remove_attribute(attr.name) if attr.name.downcase.start_with?('on')
    end
  end
  # Return cleaned HTML (a basic filter, not a complete sanitizer)
  doc.to_html
end
user_input = '<p>Safe content</p><script>alert("danger")</script>'
clean_html = sanitize_user_content(user_input)
puts clean_html # Output: <p>Safe content</p>
Template Processing
require 'nokogiri'
def process_email_template(template_html, variables)
  doc = Nokogiri::HTML::DocumentFragment.parse(template_html)
  # Replace placeholders with actual values
  variables.each do |key, value|
    doc.css("[data-placeholder='#{key}']").each do |element|
      element.content = value
      element.remove_attribute('data-placeholder')
    end
  end
  doc.to_html
end
template = '<div><h1 data-placeholder="title">Title</h1><p data-placeholder="content">Content</p></div>'
variables = { 'title' => 'Welcome!', 'content' => 'Thank you for signing up.' }
processed = process_email_template(template, variables)
puts processed
Comparison with Other Serialization Methods
to_html vs to_s vs to_xml
require 'nokogiri'
html = '<div><p>Test content</p></div>'
doc = Nokogiri::HTML::DocumentFragment.parse(html)
puts "to_html: #{doc.to_html}"
puts "to_s: #{doc.to_s}"
puts "to_xml: #{doc.to_xml}"
Understanding when to use each method:
- to_html: Best for HTML output with proper HTML formatting
- to_s: Returns the same output as to_html for nodes in an HTML document, and the same output as to_xml for nodes in an XML document
- to_xml: Use when you need XML-compliant output
Conclusion
Converting Nokogiri documents back to HTML strings is straightforward with the to_html method, but Nokogiri provides extensive options for controlling the output format, encoding, and structure. Whether you're building a web scraper that needs to clean and reformat HTML content, or developing an application that manipulates HTML documents, understanding these conversion methods will help you generate the exact HTML output you need.
For complex web scraping scenarios involving dynamic content, consider combining Nokogiri's HTML manipulation capabilities with tools that can handle JavaScript-heavy websites and modern web applications for a complete scraping solution.