Nokogiri is a Ruby gem for parsing HTML and XML. It provides a simple interface for navigating and manipulating a document's DOM through CSS selectors and XPath expressions.
Here's a step-by-step guide on how to parse an HTML file using Nokogiri:
- Install the Nokogiri Gem
Before you can use Nokogiri, you need to install the gem. You can do this using the following command:
gem install nokogiri
- Require Nokogiri in Your Ruby Script
At the top of your Ruby script, you'll need to require the Nokogiri library so that you can use it:
require 'nokogiri'
- Read the HTML File
You'll need to read the HTML content that you want to parse. This can be done by reading from a file, or by fetching content directly from the web.
# If you have an HTML file, you can read it into a string
html_content = File.read('path/to/your/htmlfile.html')
# Or you can use a string with HTML content directly
# html_content = "<html>...</html>"
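Fetching content from the web is mentioned above but not shown. The following is a minimal sketch using Ruby's standard `net/http` library; the helper name `fetch_html` and the example URL are hypothetical, and this version does not follow redirects.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not part of Nokogiri): download a page and
# return its body as a string.
def fetch_html(url)
  response = Net::HTTP.get_response(URI.parse(url))
  # Raise unless the server answered with a 2xx status
  raise "Request failed with status #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  response.body
end

# Example call (placeholder URL, left commented out so the sketch runs offline):
# html_content = fetch_html('https://example.com/')
```

The returned string can be passed to Nokogiri in the same way as content read from a file.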
- Parse the HTML Content with Nokogiri
Once you have your HTML content in a string, you can parse it using Nokogiri's HTML parser:
doc = Nokogiri::HTML(html_content)
- Navigate and Manipulate the Document
Now that you have a Nokogiri::HTML::Document object, you can use Nokogiri's methods to navigate and manipulate the DOM.
Finding Elements by CSS Selectors
Use the css method to find elements using CSS selectors:
titles = doc.css('h1, h2, h3') # Select all <h1>, <h2>, and <h3> tags
titles.each do |title|
  puts title.content
end
Finding Elements by XPath Selectors
Use the xpath method to find elements using XPath selectors:
paragraphs = doc.xpath('//p') # Select all <p> tags
paragraphs.each do |paragraph|
  puts paragraph.content
end
Extracting Attributes
To get the value of an attribute from an element, you can use the [] method:
links = doc.css('a')
links.each do |link|
  puts link['href'] # Print the 'href' attribute
end
Modifying Elements
You can also change the content of elements:
doc.css('h1').each do |h1|
  h1.content = "Modified Title"
end
- Serialize the Document
After manipulating the document, you may want to convert it back to HTML:
html_output = doc.to_html
Here's a complete example that combines all the steps:
require 'nokogiri'
# Read the HTML file into a string
html_content = File.read('path/to/your/htmlfile.html')
# Parse the HTML content with Nokogiri
doc = Nokogiri::HTML(html_content)
# Find and print titles
titles = doc.css('h1, h2, h3')
titles.each do |title|
  puts title.content
end
# Modify all <h1> tags
doc.css('h1').each do |h1|
  h1.content = "Modified Title"
end
# Serialize the document to HTML
html_output = doc.to_html
Remember to replace 'path/to/your/htmlfile.html' with the actual path to your HTML file. Nokogiri handles a wide variety of HTML parsing and manipulation tasks, which makes it a go-to library for many Ruby developers working with HTML data.