How Do I Parse HTML from a String Using Nokogiri?

Nokogiri is a powerful Ruby gem for parsing HTML and XML documents. One of its most common use cases is parsing HTML content from strings, whether you've retrieved HTML from an API response, read it from a file, or need to process HTML fragments. This comprehensive guide will show you how to effectively parse HTML strings using Nokogiri.

Basic HTML String Parsing

The simplest way to parse HTML from a string is using Nokogiri::HTML():

require 'nokogiri'

html_string = <<-HTML
  <html>
    <head>
      <title>Sample Page</title>
    </head>
    <body>
      <div class="container">
        <h1>Welcome to My Site</h1>
        <p>This is a sample paragraph.</p>
        <ul>
          <li>Item 1</li>
          <li>Item 2</li>
          <li>Item 3</li>
        </ul>
      </div>
    </body>
  </html>
HTML

# Parse the HTML string
doc = Nokogiri::HTML(html_string)

# Extract the title
title = doc.at('title').text
puts "Page title: #{title}"  # Output: Page title: Sample Page

# Find all list items
items = doc.css('li').map(&:text)
puts "List items: #{items}"  # Output: List items: ["Item 1", "Item 2", "Item 3"]

Parsing HTML Fragments

When you pass an HTML fragment (partial HTML without a complete document structure) to Nokogiri::HTML(), the parser supplies the missing html and body elements, so your selectors work as usual:

require 'nokogiri'

# HTML fragment without html/body tags
fragment = '<div class="product"><h2>Product Name</h2><span class="price">$19.99</span></div>'

doc = Nokogiri::HTML(fragment)

# Extract product information
product_name = doc.at('h2').text
price = doc.at('.price').text

puts "Product: #{product_name}, Price: #{price}"
# Output: Product: Product Name, Price: $19.99
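
Because Nokogiri::HTML() returns a full document, wrapper tags and all, serializing it gives back more markup than you put in. To keep a fragment as a fragment, use the dedicated fragment API:

require 'nokogiri'

fragment = '<div class="product"><h2>Product Name</h2><span class="price">$19.99</span></div>'

# Parse as a true fragment: no html/head/body wrapper is added
frag = Nokogiri::HTML.fragment(fragment)

puts frag.at('.price').text  # Output: $19.99
puts frag.to_html            # Round-trips without wrapper tags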

Advanced Parsing Options

Nokogiri provides several options to customize the parsing behavior:

require 'nokogiri'

html_string = '<html><body><p>Some content</p></body></html>'

# Parse with specific options (listed together for illustration;
# in practice, pick only the ones you need)
doc = Nokogiri::HTML(html_string) do |config|
  config.strict    # Raise an error on malformed markup instead of recovering
  config.noblanks  # Remove blank text nodes
  config.noent     # Substitute entities with their replacement text
  config.noerror   # Suppress error reports
  config.nowarning # Suppress warning reports
end

# Alternative syntax for options
doc = Nokogiri::HTML(html_string, nil, 'UTF-8', Nokogiri::XML::ParseOptions::NOBLANKS)
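
The ParseOptions flags are plain integers, so several can be combined with a bitwise OR in the positional-argument form:

# Combine multiple parse options
options = Nokogiri::XML::ParseOptions::NOBLANKS |
          Nokogiri::XML::ParseOptions::NOERROR
doc = Nokogiri::HTML(html_string, nil, 'UTF-8', options)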

Working with Malformed HTML

Nokogiri is forgiving with malformed HTML and will attempt to fix common issues:

require 'nokogiri'

# Malformed HTML with unclosed tags
malformed_html = '<div><p>Paragraph without closing tag<span>Span content</div>'

doc = Nokogiri::HTML(malformed_html)

# Nokogiri automatically fixes the structure
puts doc.to_html
# Nokogiri will properly close tags and create valid HTML structure
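
Even in the default recovery mode, the parser records everything it had to repair in doc.errors. The exact messages come from libxml2 and vary by version:

# Inspect what the parser recovered from
doc.errors.each { |error| puts error.message }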

Extracting Data from Real-World HTML

Here's a practical example of parsing HTML from a web scraping scenario:

require 'nokogiri'

# Sample HTML that might come from a web scraping request
html_response = <<-HTML
  <html>
    <body>
      <div class="article-list">
        <article class="post" data-id="1">
          <h2 class="title">First Blog Post</h2>
          <div class="meta">
            <span class="author">John Doe</span>
            <time datetime="2024-01-15">January 15, 2024</time>
          </div>
          <p class="excerpt">This is the first blog post excerpt...</p>
        </article>
        <article class="post" data-id="2">
          <h2 class="title">Second Blog Post</h2>
          <div class="meta">
            <span class="author">Jane Smith</span>
            <time datetime="2024-01-16">January 16, 2024</time>
          </div>
          <p class="excerpt">This is the second blog post excerpt...</p>
        </article>
      </div>
    </body>
  </html>
HTML

doc = Nokogiri::HTML(html_response)

# Extract structured data from all articles
articles = doc.css('article.post').map do |article|
  {
    id: article['data-id'],
    title: article.at('h2.title').text.strip,
    author: article.at('.author').text.strip,
    date: article.at('time')['datetime'],
    excerpt: article.at('.excerpt').text.strip
  }
end

articles.each do |article|
  puts "ID: #{article[:id]}"
  puts "Title: #{article[:title]}"
  puts "Author: #{article[:author]}"
  puts "Date: #{article[:date]}"
  puts "Excerpt: #{article[:excerpt]}"
  puts "---"
end
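
In real scraped HTML, an element you expect may be missing, and at returns nil in that case. Guarding each lookup with the safe-navigation operator keeps one malformed article from raising NoMethodError (the 'Untitled' fallback is just an illustrative default):

# Tolerate a missing <h2 class="title"> instead of crashing
title = article.at('h2.title')&.text&.strip || 'Untitled'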

Error Handling and Validation

Always implement proper error handling when parsing HTML strings:

require 'nokogiri'

def safe_parse_html(html_string)
  return nil if html_string.nil? || html_string.empty?

  begin
    doc = Nokogiri::HTML(html_string)

    # The default recovery mode rarely raises; problems are
    # recorded on the document instead
    if doc.errors.any?
      puts "Parsing warnings/errors:"
      doc.errors.each { |error| puts "  #{error}" }
    end

    doc
  rescue => e
    puts "Failed to parse HTML: #{e.message}"
    nil
  end
end

# Usage
html = '<div><p>Test content</p></div>'
doc = safe_parse_html(html)

if doc
  content = doc.at('p')&.text
  puts "Extracted content: #{content}"
else
  puts "Failed to parse HTML"
end
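
Each entry in doc.errors is a Nokogiri::XML::SyntaxError, which exposes severity predicates, so you can ignore recoverable noise and react only to serious problems:

# Keep only the fatal problems, ignoring warnings and recoverable errors
fatal = doc.errors.select(&:fatal?)
puts "Fatal errors: #{fatal.size}"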

Working with Different Encodings

When dealing with HTML strings from various sources, encoding can be important:

require 'nokogiri'

# HTML with specific encoding
html_with_encoding = <<-HTML
  <!DOCTYPE html>
  <html>
    <head>
      <meta charset="UTF-8">
      <title>Spécial Charactërs</title>
    </head>
    <body>
      <p>Café, naïve, résumé</p>
    </body>
  </html>
HTML

# Parse with explicit encoding
doc = Nokogiri::HTML(html_with_encoding, nil, 'UTF-8')

# Extract text with proper encoding
text_content = doc.at('p').text
puts text_content  # Output: Café, naïve, résumé
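
If the string arrived as raw bytes (a Net::HTTP response body is often tagged ASCII-8BIT, for example), you may need to label it with its actual encoding before parsing. A minimal sketch, assuming the bytes really are UTF-8:

# Tag raw bytes with their real encoding before handing them to Nokogiri
raw = html_with_encoding.dup.force_encoding('UTF-8')
doc = Nokogiri::HTML(raw, nil, 'UTF-8')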

Performance Considerations

For large HTML strings or high-frequency parsing, consider these performance tips:

require 'nokogiri'
require 'benchmark'

large_html = '<div>' + ('<p>Content</p>' * 1000) + '</div>'

# Benchmark different parsing approaches
Benchmark.bm do |x|
  x.report("Standard parsing:") do
    1000.times { Nokogiri::HTML(large_html) }
  end

  x.report("Fragment parsing:") do
    1000.times { Nokogiri::HTML::DocumentFragment.parse(large_html) }
  end

  x.report("With NOBLANKS:") do
    1000.times { 
      Nokogiri::HTML(large_html, nil, 'UTF-8', Nokogiri::XML::ParseOptions::NOBLANKS) 
    }
  end
end
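
Fragment parsing skips the document scaffolding, which often makes it the cheaper choice for partial markup. Beyond parser options, though, the biggest win is usually to avoid re-parsing identical input at all. A minimal memoization sketch (the cache scheme here is an illustration, not a Nokogiri feature):

require 'nokogiri'
require 'digest'

# Hypothetical process-wide cache keyed by a digest of the input string
PARSED_CACHE = {}

def parse_cached(html)
  key = Digest::SHA256.hexdigest(html)
  PARSED_CACHE[key] ||= Nokogiri::HTML(html)
end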

Common Use Cases and Patterns

Cleaning HTML Content

require 'nokogiri'

def clean_html(html_string)
  doc = Nokogiri::HTML(html_string)

  # Remove script and style tags
  doc.search('script, style').remove

  # Remove all attributes except specific ones
  doc.search('*').each do |element|
    allowed_attrs = %w[href src alt title]
    element.attributes.each do |name, attr|
      attr.remove unless allowed_attrs.include?(name)
    end
  end

  doc.at('body').inner_html
end

dirty_html = '<div onclick="malicious()"><script>alert("xss")</script><p>Clean content</p></div>'
clean_content = clean_html(dirty_html)
puts clean_content  # Output: <div><p>Clean content</p></div>
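
The same approach extends to other node types. Inside clean_html you could also drop HTML comments with an XPath query:

# Strip HTML comments as well
doc.xpath('//comment()').remove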

Extracting Links and Images

require 'nokogiri'

html_content = <<-HTML
  <div>
    <a href="https://example.com/page1">Link 1</a>
    <a href="/relative-link">Link 2</a>
    <img src="image1.jpg" alt="Image 1">
    <img src="https://example.com/image2.png" alt="Image 2">
  </div>
HTML

doc = Nokogiri::HTML(html_content)

# Extract all links
links = doc.css('a').map do |link|
  {
    text: link.text.strip,
    href: link['href']
  }
end

# Extract all images
images = doc.css('img').map do |img|
  {
    src: img['src'],
    alt: img['alt']
  }
end

puts "Links found:"
links.each { |link| puts "  #{link[:text]} -> #{link[:href]}" }

puts "Images found:"
images.each { |img| puts "  #{img[:alt]} -> #{img[:src]}" }
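
Scraped href values are often relative, like /relative-link above. If you know the page's base URL (assumed here to be https://example.com), Ruby's standard URI library can absolutize them:

require 'uri'

# Resolve each href against the page's base URL
base = URI('https://example.com')
links.each do |link|
  absolute = URI.join(base, link[:href]).to_s
  puts "  #{link[:text]} -> #{absolute}"
end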

Integration with Web Scraping Workflows

When building web scraping applications, parsing HTML from strings is often part of a larger workflow. While Nokogiri is excellent for parsing static HTML content, you might need additional tools for handling dynamic content that requires JavaScript execution or managing complex authentication flows.

Best Practices

  1. Always handle errors: Wrap parsing operations in begin/rescue blocks
  2. Validate input: Check for nil or empty strings before parsing
  3. Use appropriate selectors: CSS selectors are often more readable than XPath
  4. Consider encoding: Specify the encoding when dealing with international content
  5. Memory management: For very large documents, consider a streaming (SAX) parser, as in the sketch below
  6. Sanitize content: Remove potentially dangerous elements when processing untrusted HTML
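
For point 5, Nokogiri ships a SAX parser that streams through the markup and fires callbacks instead of building a full DOM in memory. A minimal sketch that counts links (the LinkCounter class is illustrative):

require 'nokogiri'

class LinkCounter < Nokogiri::XML::SAX::Document
  attr_reader :count

  def initialize
    @count = 0
  end

  # Called once for every opening tag the parser encounters
  def start_element(name, attrs = [])
    @count += 1 if name == 'a'
  end
end

handler = LinkCounter.new
Nokogiri::HTML::SAX::Parser.new(handler).parse('<div><a href="/a">A</a> <a href="/b">B</a></div>')
puts handler.count  # Output: 2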

Conclusion

Nokogiri provides a robust and flexible way to parse HTML from strings in Ruby applications. Whether you're processing simple HTML fragments or complex documents, understanding these parsing techniques will help you extract data efficiently and reliably. Remember to always validate your input, handle errors gracefully, and choose the parsing options that best fit your specific use case.

For more complex web scraping scenarios involving dynamic content, you might also want to explore browser automation tools that can handle JavaScript-rendered pages alongside Nokogiri's powerful HTML parsing capabilities.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
