Nokogiri is a Ruby library for parsing HTML and XML; it also provides SAX and Reader interfaces for streaming through large documents. It's a powerful tool for web scraping because it makes HTML documents easy to navigate and search. To extract attributes from HTML elements using Nokogiri, follow these steps:
- Install Nokogiri: If you haven't already installed Nokogiri, you can do so by running the following command in your terminal:
gem install nokogiri
- Parse the HTML Document: You need to parse the HTML content with Nokogiri to create a navigable document object.
- Locate the Element: Use Nokogiri's searching methods, such as css or xpath, to find the HTML element from which you want to extract attributes.
- Extract the Attribute: Once you have located the element, you can access its attributes using methods like [] or attr.
Here's an example in Ruby that demonstrates how to extract attributes from HTML elements using Nokogiri:
require 'nokogiri'

# Sample HTML content
html_content = <<-HTML
<html>
  <body>
    <div id="main" class="content">
      <a href="http://example.com" title="Example Website">Link to Example</a>
    </div>
  </body>
</html>
HTML

# Parse the HTML content
doc = Nokogiri::HTML(html_content)

# Locate the link element using CSS selectors
link = doc.css('a').first

# Extract the href attribute
href_value = link['href']
puts "The href attribute value is: #{href_value}"

# Extract the title attribute
title_value = link.attr('title')
puts "The title attribute value is: #{title_value}"
In this example, we first create a simple HTML structure as a string and parse it using Nokogiri::HTML. Then, we locate the first anchor element (<a>) using the css method. We extract the href attribute by accessing it like a hash key, and we extract the title attribute by using the attr method. Both attribute values are then printed to the console.
If you were to scrape a real web page, you could replace the html_content variable with a string fetched from the web using open-uri or another HTTP library like Net::HTTP, HTTParty, or Faraday. Here's an example of how you could do that:
require 'nokogiri'
require 'open-uri'
# URL of the page to scrape
url = 'http://example.com'
# Fetch and parse the HTML content at the URL
doc = Nokogiri::HTML(URI.open(url))
# Locate the element and extract the attribute as shown earlier
Remember to respect the terms of service and robots.txt of the websites you scrape, and consider the legal and ethical implications of web scraping.