How do I extract text from an element using Nokogiri?

Nokogiri is a Ruby library that makes it easy to parse HTML and XML documents, extract information, and manipulate these documents. To extract text from an element using Nokogiri, you'll first need to parse the HTML document and then use Nokogiri's searching methods to find the desired element.

The steps below walk through an example of extracting text from an element with Nokogiri:

Step 1: Install Nokogiri

If you haven't installed Nokogiri yet, you can do so using the following command:

gem install nokogiri

Step 2: Parse the HTML Document

Assuming you have your HTML content as a string, you can parse it using Nokogiri like this:

require 'nokogiri'

html_content = <<-HTML
<html>
  <body>
    <h1>Welcome to My Website</h1>
    <p>This is a paragraph with <a href="#">a link</a>.</p>
  </body>
</html>
HTML

doc = Nokogiri::HTML(html_content)

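If you're loading the HTML content from a file or a URL, you can use Nokogiri::HTML(File.open("path_to_your_file.html")) or Nokogiri::HTML(URI.open("http://example.com")) respectively. The URL form needs require 'open-uri' first; on Ruby 3.0 and later, a bare open no longer fetches URLs.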

Step 3: Search for the Element

You can use CSS selectors or XPath expressions to find the element from which you want to extract text. Here's an example using both methods:

# Using CSS selectors
h1_element = doc.css('h1').first

# Using XPath expressions
# h1_element = doc.xpath('//h1').first

Step 4: Extract the Text

Once you have the element, you can call the text method to get its text content:

text = h1_element.text
puts text
# Output: Welcome to My Website

The text method returns all of the text inside an element, including the text of nested elements. On a single node, content is simply another name for text, so it returns the same result. If you want only the text that sits directly inside the element and not the text of its child elements, select the element's text-node children instead:

paragraph = doc.css('p').first
text_with_children = paragraph.text
puts text_with_children
# Output: This is a paragraph with a link.

# To get only the first text node that is a direct child of <p>
immediate_text = paragraph.children.find { |child| child.text? }.content
puts immediate_text
# Output: This is a paragraph with 

Full Example

Here's a full example putting all the steps together:

require 'nokogiri'

html_content = <<-HTML
<html>
  <body>
    <h1>Welcome to My Website</h1>
    <p>This is a paragraph with <a href="#">a link</a>.</p>
  </body>
</html>
HTML

# Parse the HTML
doc = Nokogiri::HTML(html_content)

# Find the element
h1_element = doc.css('h1').first

# Extract the text
text = h1_element.text
puts text

This script will output the text contained within the <h1> tag of the HTML content. Remember to handle cases where the element might not exist to avoid nil errors when calling .text on an element that doesn't exist.
