How do I search for elements containing specific text with Nokogiri?

Searching for elements containing specific text is a fundamental requirement in web scraping. Nokogiri, Ruby's premier HTML and XML parsing library, provides several powerful methods to locate elements based on their text content. This comprehensive guide covers various approaches from basic text matching to advanced pattern searching.

Understanding Text Search in Nokogiri

Nokogiri offers multiple ways to search for elements containing specific text:

  1. XPath text functions - Most versatile and powerful
  2. CSS selectors with custom filters - Limited but familiar syntax
  3. Ruby enumeration methods - Flexible programmatic approach
  4. Regular expression matching - For pattern-based searching

XPath Text Search Methods

Basic Text Matching with text()

The most straightforward approach uses XPath's text() function:

require 'nokogiri'
require 'open-uri'

# Sample HTML
html = <<~HTML
  <div class="container">
    <p>Welcome to our website</p>
    <p>Contact us for more information</p>
    <span>Welcome visitors</span>
    <a href="/about">About us</a>
  </div>
HTML

doc = Nokogiri::HTML(html)

# Find elements with exact text match
elements = doc.xpath("//p[text()='Welcome to our website']")
puts elements.first.text if elements.any?
# Output: Welcome to our website

Partial Text Matching with contains()

For more flexible searching, use the contains() function:

# Find elements containing specific text
welcome_elements = doc.xpath("//*[contains(text(), 'Welcome')]")
welcome_elements.each do |element|
  puts "#{element.name}: #{element.text}"
end
# Output: 
# p: Welcome to our website
# span: Welcome visitors

Case-Insensitive Text Search

XPath 1.0 (the version Nokogiri's libxml2 backend supports) has no built-in case-insensitive functions, but you can emulate one with the translate() function:

# Case-insensitive search
search_text = "WELCOME"
elements = doc.xpath("//*[contains(translate(text(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '#{search_text.downcase}')]")

elements.each do |element|
  puts "Found: #{element.text}"
end

Advanced XPath Text Queries

# Find elements that start with specific text
starts_with = doc.xpath("//p[starts-with(text(), 'Welcome')]")

# Find elements that end with specific text. XPath 2.0 defines ends-with(),
# but Nokogiri only supports XPath 1.0, so emulate it with substring()
ends_with = doc.xpath("//p[substring(text(), string-length(text()) - string-length('website') + 1) = 'website']")

# Find elements with text length constraints
long_text = doc.xpath("//p[string-length(text()) > 20]")

# Combine text search with attribute conditions
specific_links = doc.xpath("//a[contains(text(), 'About') and @href='/about']")

CSS Selectors with Ruby Filters

While CSS selectors don't have native text search capabilities, you can combine them with Ruby's enumeration methods:

# Find all paragraphs, then filter by text content
paragraphs_with_welcome = doc.css('p').select { |p| p.text.include?('Welcome') }

# Case-insensitive filtering
case_insensitive = doc.css('*').select { |element| 
  element.text.downcase.include?('welcome') 
}

# Regular expression matching
regex_match = doc.css('p').select { |p| 
  p.text.match?(/welcome/i) 
}

Working with Nested Text Content

When dealing with elements that contain nested tags, consider different text extraction methods:

html_nested = <<~HTML
  <div class="article">
    <h2>Main <span class="highlight">Title</span> Here</h2>
    <p>This is a <strong>sample</strong> paragraph with <em>emphasis</em>.</p>
  </div>
HTML

doc_nested = Nokogiri::HTML(html_nested)

# Search in concatenated text content (includes nested elements)
full_text_search = doc_nested.xpath("//h2[contains(., 'Main Title')]")

# Search only in direct text nodes
direct_text_only = doc_nested.xpath("//h2[contains(text(), 'Main')]")

# Get all text content including nested elements
puts doc_nested.css('h2').first.text
# Output: Main Title Here

# Get only direct text content
puts doc_nested.css('h2').first.xpath('text()').text
# Output: Main  Here

Regular Expression Text Search

For pattern-based searching, combine Nokogiri with Ruby's regex capabilities:

# Search for email patterns
html_with_emails = <<~HTML
  <div>
    <p>Contact: john@example.com</p>
    <p>Support: support@company.org</p>
    <p>Invalid: not-an-email</p>
  </div>
HTML

doc_emails = Nokogiri::HTML(html_with_emails)

email_regex = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/

# Find elements containing email addresses
email_elements = doc_emails.css('p').select do |p|
  p.text.match?(email_regex)
end

email_elements.each do |element|
  email = element.text.match(email_regex)[0]
  puts "Found email: #{email}"
end

Performance Considerations

Efficient XPath Queries

# More efficient: specific element targeting
efficient = doc.xpath("//p[contains(text(), 'Welcome')]")

# Less efficient: searching all elements
inefficient = doc.xpath("//*[contains(text(), 'Welcome')]")

# Most efficient: combining with other constraints
optimized = doc.xpath("//div[@class='content']//p[contains(text(), 'Welcome')]")

Caching and Reuse

class TextSearcher
  def initialize(document)
    @doc = document
    @cache = {}
  end

  def find_by_text(text, element_type = '*')
    cache_key = "#{element_type}_#{text}"
    @cache[cache_key] ||= @doc.xpath("//#{element_type}[contains(text(), '#{text}')]")
  end
end

# Usage
searcher = TextSearcher.new(doc)
results1 = searcher.find_by_text('Welcome', 'p')
results2 = searcher.find_by_text('Welcome', 'p') # Retrieved from cache

Real-World Examples

Scraping Product Information

require 'nokogiri'
require 'net/http'

# Example: Finding products by description keywords
def find_products_by_keyword(html, keyword)
  doc = Nokogiri::HTML(html)

  # Find product containers with specific text in description
  products = doc.xpath("//div[@class='product'][.//p[contains(translate(text(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '#{keyword.downcase}')]]")

  products.map do |product|
    {
      title: product.css('.product-title').text.strip,
      description: product.css('.product-description').text.strip,
      price: product.css('.price').text.strip
    }
  end
end

News Article Extraction

def extract_articles_with_topic(html, topic)
  doc = Nokogiri::HTML(html)

  # Find articles containing specific topic in headline or content
  articles = doc.xpath("//article[.//h1[contains(text(), '#{topic}')] or .//h2[contains(text(), '#{topic}')] or .//p[contains(text(), '#{topic}')]]")

  articles.map do |article|
    {
      headline: article.css('h1, h2').first&.text&.strip,
      summary: article.css('p').first&.text&.strip,
      link: article.css('a').first&.[]('href')
    }
  end
end

Error Handling and Edge Cases

def xpath_string_literal(text)
  # XPath 1.0 string literals have no escape sequences, so XML entities
  # like &apos; won't work. Pick a quote style the text doesn't use,
  # or build the string with concat() when it contains both kinds
  return "'#{text}'" unless text.include?("'")
  return "\"#{text}\"" unless text.include?('"')
  "concat('" + text.split("'", -1).join("', \"'\", '") + "')"
end

def safe_text_search(doc, text, selector = '*')
  literal = xpath_string_literal(text)
  elements = doc.xpath("//#{selector}[contains(text(), #{literal})]")

  # Filter out elements with only whitespace
  elements.reject { |el| el.text.strip.empty? }
rescue Nokogiri::XML::XPath::SyntaxError => e
  puts "XPath syntax error: #{e.message}"
  []
end

# Usage with error handling
results = safe_text_search(doc, "Welcome to our site")

Best Practices

  1. Use specific selectors: Combine text search with element type or class constraints for better performance
  2. Handle encoding: Ensure proper character encoding when dealing with international content
  3. Normalize whitespace: Use strip and consider multiple spaces/line breaks
  4. Cache results: Store frequently accessed elements to improve performance
  5. Validate inputs: Sanitize search text to prevent XPath injection

Integration with Modern Scraping Workflows

When working with JavaScript-heavy sites, you might need to combine Nokogiri with browser automation tools. For complex scenarios involving dynamic content loading, consider using headless browsers first to render the page, then parse the resulting HTML with Nokogiri.

For applications requiring authentication handling, you can use browser automation to log in and obtain the authenticated HTML, then use Nokogiri's powerful text search capabilities for data extraction.

Conclusion

Nokogiri provides robust text searching capabilities through XPath functions, CSS selectors with Ruby filtering, and regular expression matching. The choice of method depends on your specific requirements: use XPath for complex text queries, CSS selectors with Ruby filters for simple cases, and regular expressions for pattern matching. Always consider performance implications and implement proper error handling for production applications.

Understanding these text search techniques enables you to build more sophisticated web scrapers that can reliably extract content based on textual patterns, making your data extraction workflows more flexible and powerful.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
