How Do I Parse HTML from a String Using Nokogiri?
Nokogiri is a powerful Ruby gem for parsing HTML and XML documents. One of its most common use cases is parsing HTML content from strings, whether you've retrieved HTML from an API response, read it from a file, or need to process HTML fragments. This comprehensive guide will show you how to effectively parse HTML strings using Nokogiri.
Basic HTML String Parsing
The simplest way to parse HTML from a string is with Nokogiri::HTML():
require 'nokogiri'
html_string = <<-HTML
  <html>
    <head>
      <title>Sample Page</title>
    </head>
    <body>
      <div class="container">
        <h1>Welcome to My Site</h1>
        <p>This is a sample paragraph.</p>
        <ul>
          <li>Item 1</li>
          <li>Item 2</li>
          <li>Item 3</li>
        </ul>
      </div>
    </body>
  </html>
HTML
# Parse the HTML string
doc = Nokogiri::HTML(html_string)
# Extract the title
title = doc.at('title').text
puts "Page title: #{title}" # Output: Page title: Sample Page
# Find all list items
items = doc.css('li').map(&:text)
puts "List items: #{items}" # Output: List items: ["Item 1", "Item 2", "Item 3"]
Parsing HTML Fragments
When working with HTML fragments (partial HTML without a complete document structure), keep in mind that Nokogiri::HTML() wraps the input in a full document, adding html and body tags. To parse a fragment as-is, use Nokogiri::HTML.fragment:
require 'nokogiri'
# HTML fragment without html/body tags
fragment = '<div class="product"><h2>Product Name</h2><span class="price">$19.99</span></div>'
doc = Nokogiri::HTML.fragment(fragment)
# Extract product information
product_name = doc.at('h2').text
price = doc.at('.price').text
puts "Product: #{product_name}, Price: #{price}"
# Output: Product: Product Name, Price: $19.99
Advanced Parsing Options
Nokogiri provides several options to customize the parsing behavior:
require 'nokogiri'
html_string = '<html><body><p>Some content</p></body></html>'
# Parse with specific options
doc = Nokogiri::HTML(html_string) do |config|
  config.strict    # Raise Nokogiri::XML::SyntaxError on malformed HTML
  config.noblanks  # Remove blank text nodes
  config.noent     # Substitute entities
  config.noerror   # Suppress error reports
  config.nowarning # Suppress warning reports
end
# Note: these are listed together for illustration; strict raises on bad
# input, which defeats noerror/nowarning, so pick only the options you need
# Alternative syntax for options
doc = Nokogiri::HTML(html_string, nil, 'UTF-8', Nokogiri::XML::ParseOptions::NOBLANKS)
Working with Malformed HTML
Nokogiri is forgiving with malformed HTML and will attempt to fix common issues:
require 'nokogiri'
# Malformed HTML with unclosed tags
malformed_html = '<div><p>Paragraph without closing tag<span>Span content</div>'
doc = Nokogiri::HTML(malformed_html)
# Nokogiri automatically closes the unclosed tags and produces a valid tree
puts doc.to_html
Extracting Data from Real-World HTML
Here's a practical example of parsing HTML from a web scraping scenario:
require 'nokogiri'
# Sample HTML as it might arrive from an HTTP response body
html_response = <<-HTML
  <html>
    <body>
      <div class="article-list">
        <article class="post" data-id="1">
          <h2 class="title">First Blog Post</h2>
          <div class="meta">
            <span class="author">John Doe</span>
            <time datetime="2024-01-15">January 15, 2024</time>
          </div>
          <p class="excerpt">This is the first blog post excerpt...</p>
        </article>
        <article class="post" data-id="2">
          <h2 class="title">Second Blog Post</h2>
          <div class="meta">
            <span class="author">Jane Smith</span>
            <time datetime="2024-01-16">January 16, 2024</time>
          </div>
          <p class="excerpt">This is the second blog post excerpt...</p>
        </article>
      </div>
    </body>
  </html>
HTML
doc = Nokogiri::HTML(html_response)
# Extract structured data from all articles
articles = doc.css('article.post').map do |article|
  {
    id: article['data-id'],
    title: article.at('h2.title').text.strip,
    author: article.at('.author').text.strip,
    date: article.at('time')['datetime'],
    excerpt: article.at('.excerpt').text.strip
  }
end
articles.each do |article|
  puts "ID: #{article[:id]}"
  puts "Title: #{article[:title]}"
  puts "Author: #{article[:author]}"
  puts "Date: #{article[:date]}"
  puts "Excerpt: #{article[:excerpt]}"
  puts "---"
end
Error Handling and Validation
Always implement proper error handling when parsing HTML strings:
require 'nokogiri'
def safe_parse_html(html_string)
  return nil if html_string.nil? || html_string.empty?

  begin
    doc = Nokogiri::HTML(html_string)
    # Report any recoverable problems the parser noted
    if doc.errors.any?
      puts "Parsing warnings/errors:"
      doc.errors.each { |error| puts "  #{error}" }
    end
    doc
  rescue => e
    puts "Failed to parse HTML: #{e.message}"
    nil
  end
end
# Usage
html = '<div><p>Test content</p></div>'
doc = safe_parse_html(html)
if doc
  content = doc.at('p')&.text
  puts "Extracted content: #{content}"
else
  puts "Failed to parse HTML"
end
Working with Different Encodings
When dealing with HTML strings from various sources, encoding can be important:
require 'nokogiri'
# HTML with specific encoding
html_with_encoding = <<-HTML
  <!DOCTYPE html>
  <html>
    <head>
      <meta charset="UTF-8">
      <title>Spécial Charactërs</title>
    </head>
    <body>
      <p>Café, naïve, résumé</p>
    </body>
  </html>
HTML
# Parse with explicit encoding
doc = Nokogiri::HTML(html_with_encoding, nil, 'UTF-8')
# Extract text with proper encoding
text_content = doc.at('p').text
puts text_content # Output: Café, naïve, résumé
Performance Considerations
For large HTML strings or high-frequency parsing, consider these performance tips:
require 'nokogiri'
require 'benchmark'
large_html = '<div>' + ('<p>Content</p>' * 1000) + '</div>'
# Benchmark different parsing approaches
Benchmark.bm do |x|
  x.report("Standard parsing:") do
    1000.times { Nokogiri::HTML(large_html) }
  end
  x.report("Fragment parsing:") do
    1000.times { Nokogiri::HTML::DocumentFragment.parse(large_html) }
  end
  x.report("With NOBLANKS:") do
    1000.times do
      Nokogiri::HTML(large_html, nil, 'UTF-8', Nokogiri::XML::ParseOptions::NOBLANKS)
    end
  end
end
Common Use Cases and Patterns
Cleaning HTML Content
require 'nokogiri'
def clean_html(html_string)
  doc = Nokogiri::HTML(html_string)
  # Remove script and style tags
  doc.search('script, style').remove
  # Strip all attributes except an allow-list
  allowed_attrs = %w[href src alt title]
  doc.search('*').each do |element|
    element.attributes.each do |name, attr|
      attr.remove unless allowed_attrs.include?(name)
    end
  end
  doc.at('body').inner_html
end
dirty_html = '<div onclick="malicious()"><script>alert("xss")</script><p>Clean content</p></div>'
clean_content = clean_html(dirty_html)
puts clean_content # Output: <div><p>Clean content</p></div>
Extracting Links and Images
require 'nokogiri'
html_content = <<-HTML
  <div>
    <a href="https://example.com/page1">Link 1</a>
    <a href="/relative-link">Link 2</a>
    <img src="image1.jpg" alt="Image 1">
    <img src="https://example.com/image2.png" alt="Image 2">
  </div>
HTML
doc = Nokogiri::HTML(html_content)
# Extract all links
links = doc.css('a').map do |link|
  {
    text: link.text.strip,
    href: link['href']
  }
end
# Extract all images
images = doc.css('img').map do |img|
  {
    src: img['src'],
    alt: img['alt']
  }
end
puts "Links found:"
links.each { |link| puts " #{link[:text]} -> #{link[:href]}" }
puts "Images found:"
images.each { |img| puts " #{img[:alt]} -> #{img[:src]}" }
Integration with Web Scraping Workflows
When building web scraping applications, parsing HTML from strings is often part of a larger workflow. While Nokogiri is excellent for parsing static HTML content, you might need additional tools for handling dynamic content that requires JavaScript execution or managing complex authentication flows.
Best Practices
- Always handle errors: Wrap parsing operations in begin/rescue blocks
- Validate input: Check for nil or empty strings before parsing
- Use appropriate selectors: CSS selectors are often more readable than XPath
- Consider encoding: Specify encoding when dealing with international content
- Memory management: For large documents, consider using streaming parsers
- Sanitize content: Remove potentially dangerous elements when processing untrusted HTML
Conclusion
Nokogiri provides a robust and flexible way to parse HTML from strings in Ruby applications. Whether you're processing simple HTML fragments or complex documents, understanding these parsing techniques will help you extract data efficiently and reliably. Remember to always validate your input, handle errors gracefully, and choose the parsing options that best fit your specific use case.
For more complex web scraping scenarios involving dynamic content, you might also want to explore browser automation tools that can handle JavaScript-rendered pages alongside Nokogiri's powerful HTML parsing capabilities.