Nokogiri is a powerful Ruby library for parsing HTML and XML documents. This guide shows you how to extract text content from HTML elements using various methods and selectors.
Installation
First, install Nokogiri in your Ruby project:
# Add to Gemfile
gem 'nokogiri'
# Or install directly
gem install nokogiri
Basic Text Extraction
Parsing HTML and Extracting Text
require 'nokogiri'
html_content = <<-HTML
<html>
<body>
<h1>Welcome to My Website</h1>
<p>This is a paragraph with <a href="#">a link</a>.</p>
<div class="content">
<span>Important note</span>
</div>
</body>
</html>
HTML
# Parse the HTML document
doc = Nokogiri::HTML(html_content)
# Extract text from specific elements
title = doc.css('h1').text
puts title # Output: Welcome to My Website
paragraph = doc.css('p').text
puts paragraph # Output: This is a paragraph with a link.
Text Extraction Methods
1. Using the .text Method
The .text method extracts all text content, including the text of nested elements:
# Extract text including nested elements
paragraph = doc.css('p').first
full_text = paragraph.text
puts full_text # Output: This is a paragraph with a link.
2. Using the .inner_text Method
On a Nokogiri node, .inner_text is an alias for .text (and .content), so it returns the same string with the same whitespace:
element = doc.css('div.content').first
content = element.inner_text.strip # strip drops the newlines around the nested <span>
puts content # Output: Important note
3. Extracting Text from Direct Children Only
To get text from only the immediate element (excluding nested tags):
paragraph = doc.css('p').first
# Method 1: Using text nodes
direct_text = paragraph.children.select(&:text?).map(&:content).join
puts direct_text # Output: This is a paragraph with .
# Method 2: Using xpath for text nodes
direct_text = paragraph.xpath('text()').map(&:content).join
puts direct_text # Output: This is a paragraph with .
Element Selection Methods
CSS Selectors
# By tag name
titles = doc.css('h1')
# By class
content_divs = doc.css('.content')
# By ID
header = doc.css('#header')
# By attribute
links = doc.css('a[href]')
# Complex selectors
nested_spans = doc.css('div.content span')
XPath Expressions
# By tag name
titles = doc.xpath('//h1')
# By class (matches only when the class attribute is exactly 'content')
content_divs = doc.xpath("//div[@class='content']")
# By text content
links_with_text = doc.xpath("//a[contains(text(), 'link')]")
# Complex expressions
spans_in_content = doc.xpath("//div[@class='content']//span")
Handling Multiple Elements
Extract Text from All Matching Elements
html_content = <<-HTML
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
HTML
doc = Nokogiri::HTML(html_content)
# Method 1: Extract all at once
all_items = doc.css('li').map(&:text)
p all_items # Output: ["Item 1", "Item 2", "Item 3"]
# Method 2: Iterate through elements
doc.css('li').each do |item|
puts item.text
end
Error Handling and Safety
Checking if Elements Exist
require 'nokogiri'
html_content = "<html><body><h1>Title</h1></body></html>"
doc = Nokogiri::HTML(html_content)
# Safe text extraction
h1_element = doc.css('h1').first
if h1_element
title = h1_element.text
puts "Title: #{title}"
else
puts "H1 element not found"
end
# Using safe navigation (Ruby 2.3+)
title = doc.css('h1').first&.text
puts title || "No title found"
# One-liner with fallback
title = doc.css('h1').first&.text || "Default Title"
Working with Different Content Sources
From Files
require 'nokogiri'
# Reading from a file (the block form closes the file handle after parsing)
doc = File.open('index.html') { |file| Nokogiri::HTML(file) }
title = doc.css('title').text
From URLs (with HTTP libraries)
require 'nokogiri'
require 'net/http'
uri = URI('https://example.com')
response = Net::HTTP.get_response(uri)
if response.code == '200'
doc = Nokogiri::HTML(response.body)
title = doc.css('title').text
puts title
end
Complete Example: Web Scraping
require 'nokogiri'
html_content = <<-HTML
<html>
<head><title>Product Page</title></head>
<body>
<h1 class="product-title">Amazing Widget</h1>
<div class="price">$29.99</div>
<div class="description">
This is an <strong>amazing</strong> product that will
<em>change your life</em>!
</div>
<ul class="features">
<li>Feature 1</li>
<li>Feature 2</li>
<li>Feature 3</li>
</ul>
</body>
</html>
HTML
doc = Nokogiri::HTML(html_content)
# Extract product information
product = {
title: doc.css('.product-title').first&.text&.strip,
price: doc.css('.price').first&.text&.strip,
description: doc.css('.description').first&.text&.strip,
features: doc.css('.features li').map(&:text)
}
puts "Product: #{product[:title]}"
puts "Price: #{product[:price]}"
puts "Description: #{product[:description]}"
puts "Features: #{product[:features].join(', ')}"
Best Practices
- Always check for element existence before calling .text
- Use .strip to remove leading/trailing whitespace
- Choose the right selector: CSS for simple selections, XPath for complex logic
- Handle encoding issues by specifying encoding when parsing
- Use safe navigation (&.) to avoid nil errors
Together, these techniques cover most text-extraction tasks you will meet when using Nokogiri in Ruby applications.