How do I extract text from an element using Nokogiri?

Nokogiri is a powerful Ruby library for parsing HTML and XML documents. This guide shows you how to extract text content from HTML elements using various methods and selectors.

Installation

First, install Nokogiri in your Ruby project:

# Add to your Gemfile, then run `bundle install`
gem 'nokogiri'

# Or install it directly from the command line
gem install nokogiri

Basic Text Extraction

Parsing HTML and Extracting Text

require 'nokogiri'

html_content = <<-HTML
<html>
  <body>
    <h1>Welcome to My Website</h1>
    <p>This is a paragraph with <a href="#">a link</a>.</p>
    <div class="content">
      <span>Important note</span>
    </div>
  </body>
</html>
HTML

# Parse the HTML document
doc = Nokogiri::HTML(html_content)

# Extract text from specific elements
title = doc.css('h1').text
puts title  # Output: Welcome to My Website

paragraph = doc.css('p').text
puts paragraph  # Output: This is a paragraph with a link.

Text Extraction Methods

1. Using .text Method

The .text method extracts all text content, including nested elements:

# Extract text including nested elements
paragraph = doc.css('p').first
full_text = paragraph.text
puts full_text  # Output: This is a paragraph with a link.

2. Using .inner_text Method

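On a node, .inner_text is an alias for .text (both are aliases of .content), so the two return the same string, whitespace included:

element = doc.css('div.content').first
content = element.inner_text.strip  # strip the indentation around the <span>
puts content  # Output: Important note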

3. Extracting Text from Direct Children Only

To get text from only the immediate element (excluding nested tags):

paragraph = doc.css('p').first

# Method 1: Using text nodes
direct_text = paragraph.children.select(&:text?).map(&:content).join
puts direct_text  # Output: This is a paragraph with .

# Method 2: Using xpath for text nodes
direct_text = paragraph.xpath('text()').map(&:content).join
puts direct_text  # Output: This is a paragraph with .

Element Selection Methods

CSS Selectors

# By tag name
titles = doc.css('h1')

# By class
content_divs = doc.css('.content')

# By ID
header = doc.css('#header')

# By attribute
links = doc.css('a[href]')

# Complex selectors
nested_spans = doc.css('div.content span')

XPath Expressions

# By tag name
titles = doc.xpath('//h1')

# By class (exact match on the whole class attribute value)
content_divs = doc.xpath("//div[@class='content']")

# By text content
links_with_text = doc.xpath("//a[contains(text(), 'link')]")

# Complex expressions
spans_in_content = doc.xpath("//div[@class='content']//span")

Handling Multiple Elements

Extract Text from All Matching Elements

html_content = <<-HTML
<ul>
  <li>Item 1</li>
  <li>Item 2</li>
  <li>Item 3</li>
</ul>
HTML

doc = Nokogiri::HTML(html_content)

# Method 1: Extract all at once
all_items = doc.css('li').map(&:text)
p all_items  # Output: ["Item 1", "Item 2", "Item 3"]

# Method 2: Iterate through elements
doc.css('li').each do |item|
  puts item.text
end

Error Handling and Safety

Checking if Elements Exist

require 'nokogiri'

html_content = "<html><body><h1>Title</h1></body></html>"
doc = Nokogiri::HTML(html_content)

# Safe text extraction
h1_element = doc.css('h1').first
if h1_element
  title = h1_element.text
  puts "Title: #{title}"
else
  puts "H1 element not found"
end

# Using safe navigation (Ruby 2.3+)
title = doc.css('h1').first&.text
puts title || "No title found"

# One-liner with fallback
title = doc.css('h1').first&.text || "Default Title"

Working with Different Content Sources

From Files

require 'nokogiri'

# Reading from a file (the block form closes the file handle once parsing finishes)
doc = File.open('index.html') { |f| Nokogiri::HTML(f) }
title = doc.css('title').text

From URLs (with HTTP libraries)

require 'nokogiri'
require 'net/http'

uri = URI('https://example.com')
response = Net::HTTP.get_response(uri)

if response.code == '200'
  doc = Nokogiri::HTML(response.body)
  title = doc.css('title').text
  puts title
end

Complete Example: Web Scraping

require 'nokogiri'

html_content = <<-HTML
<html>
  <head><title>Product Page</title></head>
  <body>
    <h1 class="product-title">Amazing Widget</h1>
    <div class="price">$29.99</div>
    <div class="description">
      This is an <strong>amazing</strong> product that will
      <em>change your life</em>!
    </div>
    <ul class="features">
      <li>Feature 1</li>
      <li>Feature 2</li>
      <li>Feature 3</li>
    </ul>
  </body>
</html>
HTML

doc = Nokogiri::HTML(html_content)

# Extract product information
product = {
  title: doc.css('.product-title').first&.text&.strip,
  price: doc.css('.price').first&.text&.strip,
  description: doc.css('.description').first&.text&.strip,
  features: doc.css('.features li').map(&:text)
}

puts "Product: #{product[:title]}"
puts "Price: #{product[:price]}"
puts "Description: #{product[:description]}"
puts "Features: #{product[:features].join(', ')}"

Best Practices

  1. Always check for element existence before calling .text
  2. Use .strip to remove leading/trailing whitespace
  3. Choose the right selector - CSS for simple selections, XPath for complex logic
  4. Handle encoding issues by specifying encoding when parsing
  5. Use safe navigation (&.) to avoid nil errors

Together, these techniques cover the common cases of text extraction with Nokogiri: selecting elements with CSS or XPath, reading their text, and guarding against missing elements and stray whitespace.
