How do I search for elements containing specific text with Nokogiri?
Searching for elements containing specific text is a fundamental requirement in web scraping. Nokogiri, Ruby's premier HTML and XML parsing library, provides several powerful methods to locate elements based on their text content. This comprehensive guide covers various approaches from basic text matching to advanced pattern searching.
Understanding Text Search in Nokogiri
Nokogiri offers multiple ways to search for elements containing specific text:
- XPath text functions - Most versatile and powerful
- CSS selectors with custom filters - Limited but familiar syntax
- Ruby enumeration methods - Flexible programmatic approach
- Regular expression matching - For pattern-based searching
XPath Text Search Methods
Basic Text Matching with text()
The most straightforward approach uses XPath's text() function:
require 'nokogiri'
require 'open-uri'
# Sample HTML
html = <<~HTML
<div class="container">
<p>Welcome to our website</p>
<p>Contact us for more information</p>
<span>Welcome visitors</span>
<a href="/about">About us</a>
</div>
HTML
doc = Nokogiri::HTML(html)
# Find elements with exact text match
elements = doc.xpath("//p[text()='Welcome to our website']")
puts elements.first.text if elements.any?
# Output: Welcome to our website
Partial Text Matching with contains()
For more flexible searching, use the contains() function:
# Find elements containing specific text
welcome_elements = doc.xpath("//*[contains(text(), 'Welcome')]")
welcome_elements.each do |element|
puts "#{element.name}: #{element.text}"
end
# Output:
# p: Welcome to our website
# span: Welcome visitors
Case-Insensitive Text Search
XPath 1.0 (which Nokogiri uses via libxml2) has no built-in case-insensitive functions such as lower-case(), but you can emulate one with translate():
# Case-insensitive search
search_text = "WELCOME"
elements = doc.xpath("//*[contains(translate(text(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '#{search_text.downcase}')]")
elements.each do |element|
puts "Found: #{element.text}"
end
Advanced XPath Text Queries
# Find elements that start with specific text
starts_with = doc.xpath("//p[starts-with(text(), 'Welcome')]")
# Find elements that end with specific text
# (ends-with() is XPath 2.0 only; libxml2/Nokogiri supports XPath 1.0,
# so emulate it with substring())
ends_with = doc.xpath("//p[substring(text(), string-length(text()) - string-length('website') + 1) = 'website']")
# Find elements with text length constraints
long_text = doc.xpath("//p[string-length(text()) > 20]")
# Combine text search with attribute conditions
specific_links = doc.xpath("//a[contains(text(), 'About') and @href='/about']")
CSS Selectors with Ruby Filters
While CSS selectors don't have native text search capabilities, you can combine them with Ruby's enumeration methods:
# Find all paragraphs, then filter by text content
paragraphs_with_welcome = doc.css('p').select { |p| p.text.include?('Welcome') }
# Case-insensitive filtering
# (note: `text` includes descendant text, so ancestors such as
# html, body and the div match here as well)
case_insensitive = doc.css('*').select { |element|
element.text.downcase.include?('welcome')
}
# Regular expression matching
regex_match = doc.css('p').select { |p|
p.text.match?(/welcome/i)
}
Working with Nested Text Content
When dealing with elements that contain nested tags, consider different text extraction methods:
html_nested = <<~HTML
<div class="article">
<h2>Main <span class="highlight">Title</span> Here</h2>
<p>This is a <strong>sample</strong> paragraph with <em>emphasis</em>.</p>
</div>
HTML
doc_nested = Nokogiri::HTML(html_nested)
# Search in concatenated text content (includes nested elements)
full_text_search = doc_nested.xpath("//h2[contains(., 'Main Title')]")
# Search only in direct text nodes
direct_text_only = doc_nested.xpath("//h2[contains(text(), 'Main')]")
# Get all text content including nested elements
puts doc_nested.css('h2').first.text
# Output: Main Title Here
# Get only direct text content
puts doc_nested.css('h2').first.xpath('text()').text
# Output: Main  Here (note the doubled space where the <span> was)
Regular Expression Text Search
For pattern-based searching, combine Nokogiri with Ruby's regex capabilities:
# Search for email patterns
html_with_emails = <<~HTML
<div>
<p>Contact: john@example.com</p>
<p>Support: support@company.org</p>
<p>Invalid: not-an-email</p>
</div>
HTML
doc_emails = Nokogiri::HTML(html_with_emails)
email_regex = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/
# Find elements containing email addresses
email_elements = doc_emails.css('p').select do |p|
p.text.match?(email_regex)
end
email_elements.each do |element|
email = element.text.match(email_regex)[0]
puts "Found email: #{email}"
end
Performance Considerations
Efficient XPath Queries
# More efficient: specific element targeting
efficient = doc.xpath("//p[contains(text(), 'Welcome')]")
# Less efficient: searching all elements
inefficient = doc.xpath("//*[contains(text(), 'Welcome')]")
# Most efficient: combining with other constraints
optimized = doc.xpath("//div[@class='content']//p[contains(text(), 'Welcome')]")
Caching and Reuse
class TextSearcher
def initialize(document)
@doc = document
@cache = {}
end
def find_by_text(text, element_type = '*')
# Caution: text is interpolated straight into the XPath expression,
# so only pass trusted input here
cache_key = "#{element_type}_#{text}"
@cache[cache_key] ||= @doc.xpath("//#{element_type}[contains(text(), '#{text}')]")
end
end
# Usage
searcher = TextSearcher.new(doc)
results1 = searcher.find_by_text('Welcome', 'p')
results2 = searcher.find_by_text('Welcome', 'p') # Retrieved from cache
Real-World Examples
Scraping Product Information
require 'nokogiri'
require 'net/http'
# Example: Finding products by description keywords
def find_products_by_keyword(html, keyword)
doc = Nokogiri::HTML(html)
# Find product containers with specific text in description
products = doc.xpath("//div[@class='product'][.//p[contains(translate(text(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '#{keyword.downcase}')]]")
products.map do |product|
{
title: product.css('.product-title').text.strip,
description: product.css('.product-description').text.strip,
price: product.css('.price').text.strip
}
end
end
News Article Extraction
def extract_articles_with_topic(html, topic)
doc = Nokogiri::HTML(html)
# Find articles containing specific topic in headline or content
articles = doc.xpath("//article[.//h1[contains(text(), '#{topic}')] or .//h2[contains(text(), '#{topic}')] or .//p[contains(text(), '#{topic}')]]")
articles.map do |article|
{
headline: article.css('h1, h2').first&.text&.strip,
summary: article.css('p').first&.text&.strip,
link: article.css('a').first&.[]('href')
}
end
end
Error Handling and Edge Cases
def xpath_literal(text)
# XPath 1.0 has no escape sequences inside string literals, so pick a
# quote style the text doesn't contain, or fall back to concat() when
# it contains both kinds of quote
if !text.include?("'")
"'#{text}'"
elsif !text.include?('"')
"\"#{text}\""
else
"concat('#{text.split("'", -1).join(%q{', "'", '})}')"
end
end
def safe_text_search(doc, text, selector = '*')
elements = doc.xpath("//#{selector}[contains(text(), #{xpath_literal(text)})]")
# Filter out elements with only whitespace
elements.reject { |el| el.text.strip.empty? }
rescue Nokogiri::XML::XPath::SyntaxError => e
puts "XPath syntax error: #{e.message}"
[]
rescue => e
puts "Unexpected error: #{e.message}"
[]
end
# Usage with error handling
results = safe_text_search(doc, "Welcome to our site")
Best Practices
- Use specific selectors: Combine text search with element type or class constraints for better performance
- Handle encoding: Ensure proper character encoding when dealing with international content
- Normalize whitespace: Use strip and account for multiple spaces and line breaks
- Cache results: Store frequently accessed elements to improve performance
- Validate inputs: Sanitize search text to prevent XPath injection
Integration with Modern Scraping Workflows
When working with JavaScript-heavy sites, you might need to combine Nokogiri with browser automation tools. For complex scenarios involving dynamic content loading, consider using headless browsers first to render the page, then parse the resulting HTML with Nokogiri.
For applications requiring authentication handling, you can use browser automation to log in and obtain the authenticated HTML, then use Nokogiri's powerful text search capabilities for data extraction.
Conclusion
Nokogiri provides robust text searching capabilities through XPath functions, CSS selectors with Ruby filtering, and regular expression matching. The choice of method depends on your specific requirements: use XPath for complex text queries, CSS selectors with Ruby filters for simple cases, and regular expressions for pattern matching. Always consider performance implications and implement proper error handling for production applications.
Understanding these text search techniques enables you to build more sophisticated web scrapers that can reliably extract content based on textual patterns, making your data extraction workflows more flexible and powerful.