How can I parse an HTML file using Nokogiri?

Nokogiri is a Ruby gem for parsing HTML and XML. It provides a simple interface for navigating and manipulating a document's DOM.

Here's a step-by-step guide on how to parse an HTML file using Nokogiri:

  1. Install the Nokogiri Gem

Before you can use Nokogiri, you need to install the gem. You can do this using the following command:

   gem install nokogiri

  2. Require Nokogiri in Your Ruby Script

At the top of your Ruby script, you'll need to require the Nokogiri library so that you can use it:

   require 'nokogiri'

  3. Read the HTML File

You'll need to read the HTML content that you want to parse. You can read it from a file, or use a string that already holds the HTML (for example, content you fetched from the web).

   # If you have an HTML file, you can read it into a string
   html_content = File.read('path/to/your/htmlfile.html')
   # Or you can use a string with HTML content directly
   # html_content = "<html>...</html>"

  4. Parse the HTML Content with Nokogiri

Once you have your HTML content in a string, you can parse it using Nokogiri's HTML parser:

   doc = Nokogiri::HTML(html_content)

  5. Navigate and Manipulate the Document

Now that you have a Nokogiri::HTML::Document object, you can use Nokogiri's methods to navigate and manipulate the DOM.

  • Finding Elements by CSS Selectors

    Use the css method to find elements using CSS selectors:

     titles = doc.css('h1, h2, h3') # Select all <h1>, <h2>, and <h3> tags
     titles.each do |title|
       puts title.content
     end
    
  • Finding Elements by XPath Selectors

    Use the xpath method to find elements using XPath selectors:

     paragraphs = doc.xpath('//p') # Select all <p> tags
     paragraphs.each do |paragraph|
       puts paragraph.content
     end
    
  • Extracting Attributes

    To get the value of an attribute from an element, you can use the [] method:

     links = doc.css('a')
     links.each do |link|
       puts link['href'] # Print the 'href' attribute
     end
    
  • Modifying Elements

    You can also change the content of elements:

     doc.css('h1').each do |h1|
       h1.content = "Modified Title"
     end
    
  6. Serialize the Document

After manipulating the document, you may want to convert it back to HTML:

   html_output = doc.to_html

Here's a complete example that combines all the steps:

require 'nokogiri'

# Read the HTML file into a string
html_content = File.read('path/to/your/htmlfile.html')

# Parse the HTML content with Nokogiri
doc = Nokogiri::HTML(html_content)

# Find and print titles
titles = doc.css('h1, h2, h3')
titles.each do |title|
  puts title.content
end

# Modify all <h1> tags
doc.css('h1').each do |h1|
  h1.content = "Modified Title"
end

# Serialize the document to HTML
html_output = doc.to_html

Remember to replace 'path/to/your/htmlfile.html' with the actual path to your HTML file. Nokogiri handles a wide range of HTML parsing and manipulation tasks, which makes it a common choice for Ruby developers working with HTML.
