Nokogiri is a Ruby library that makes it easy to parse HTML and XML documents, extract information, and manipulate these documents. To extract text from an element using Nokogiri, you'll first need to parse the HTML document and then use Nokogiri's searching methods to find the desired element.
Below is a step-by-step guide with an example on how to extract text from an element using Nokogiri:
Step 1: Install Nokogiri
If you haven't installed Nokogiri yet, you can do so using the following command:
gem install nokogiri
Step 2: Parse the HTML Document
Assuming you have your HTML content as a string, you can parse it using Nokogiri like this:
require 'nokogiri'
html_content = <<-HTML
<html>
<body>
<h1>Welcome to My Website</h1>
<p>This is a paragraph with <a href="#">a link</a>.</p>
</body>
</html>
HTML
doc = Nokogiri::HTML(html_content)
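If you're loading the HTML from a file, pass an open file handle: Nokogiri::HTML(File.open("path_to_your_file.html")). To load it from a URL, add require 'open-uri' and use Nokogiri::HTML(URI.open("http://example.com")). The bare open("http://example.com") form no longer works for URLs as of Ruby 3.0.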
Step 3: Search for the Element
You can use CSS selectors or XPath expressions to find the element from which you want to extract text. Here's an example using both methods:
# Using CSS selectors
h1_element = doc.css('h1').first
# Using XPath expressions
# h1_element = doc.xpath('//h1').first
Step 4: Extract the Text
Once you have the element, you can call the text method to get its text content:
text = h1_element.text
puts text
# Output: Welcome to My Website
The text method returns all text inside the element, including the text of nested elements. Note that content is simply an alias for text, so it returns the same result. To get only the text that belongs directly to the element and not to its children, select the element's own text-node children instead:
paragraph = doc.css('p').first
text_with_children = paragraph.text
puts text_with_children
# Output: This is a paragraph with a link.
# To get only the first text node directly inside the element (Nokogiri >= 1.4.0)
immediate_text = paragraph.children.find { |child| child.text? }.content
puts immediate_text
# Output: This is a paragraph with
Full Example
Here's a full example putting all the steps together:
require 'nokogiri'
html_content = <<-HTML
<html>
<body>
<h1>Welcome to My Website</h1>
<p>This is a paragraph with <a href="#">a link</a>.</p>
</body>
</html>
HTML
# Parse the HTML
doc = Nokogiri::HTML(html_content)
# Find the element
h1_element = doc.css('h1').first
# Extract the text
text = h1_element.text
puts text
This script will output the text contained within the <h1> tag of the HTML content. Remember to handle the case where the element does not exist: doc.css('h1').first returns nil when nothing matches, and calling .text on nil raises a NoMethodError.