Nokogiri is a powerful Ruby library for parsing HTML and XML documents. Extracting attributes from HTML elements is a common task in web scraping, and Nokogiri provides several methods to accomplish this efficiently.
Installation
First, install Nokogiri by adding it to your Gemfile or installing it directly:
# Gemfile
gem 'nokogiri'
# Direct installation
gem install nokogiri
Basic Attribute Extraction
Method 1: Hash-style Access ([])
The most common way to extract attributes is using hash-style access:
require 'nokogiri'
html_content = <<-HTML
<html>
<body>
<div id="main" class="content">
<a href="https://example.com" title="Example Website" data-id="123">
Click here
</a>
<img src="/images/logo.png" alt="Company Logo" width="200" height="100">
</div>
</body>
</html>
HTML
doc = Nokogiri::HTML(html_content)
# Extract link attributes
link = doc.css('a').first
puts link['href'] # => "https://example.com"
puts link['title'] # => "Example Website"
puts link['data-id'] # => "123"
# Extract image attributes
img = doc.css('img').first
puts img['src'] # => "/images/logo.png"
puts img['alt'] # => "Company Logo"
puts img['width'] # => "200"
Method 2: Using attr() Method
The attr() method provides the same functionality with a more explicit syntax:
link = doc.css('a').first
puts link.attr('href') # => "https://example.com"
puts link.attr('title') # => "Example Website"
puts link.attr('data-id') # => "123"
Advanced Attribute Extraction
Extracting All Attributes
Get all attributes from an element as a hash that maps each attribute name to a Nokogiri::XML::Attr object (call value on it to get the string):
link = doc.css('a').first
attributes = link.attributes
attributes.each do |name, attr|
  puts "#{name}: #{attr.value}"
end
# Output:
# href: https://example.com
# title: Example Website
# data-id: 123
Extracting Attributes from Multiple Elements
Process multiple elements and their attributes:
html_content = <<-HTML
<div class="products">
<div class="product" data-id="1" data-price="29.99">
<h3>Product A</h3>
<a href="/product/1">View Details</a>
</div>
<div class="product" data-id="2" data-price="39.99">
<h3>Product B</h3>
<a href="/product/2">View Details</a>
</div>
</div>
HTML
doc = Nokogiri::HTML(html_content)
# Extract data from all products
products = doc.css('.product')
products.each do |product|
  id = product['data-id']
  price = product['data-price']
  # at_css returns nil when there is no link, so &. avoids a NoMethodError
  link = product.at_css('a')&.[]('href')
  puts "Product ID: #{id}, Price: $#{price}, Link: #{link}"
end
Using XPath for Complex Selections
For more complex attribute extraction, use XPath:
# Extract all href attributes from links within a specific div
hrefs = doc.xpath('//div[@class="products"]//a/@href')
hrefs.each { |href| puts href.value }
# Extract attributes with conditions
expensive_products = doc.xpath('//div[@data-price > 30]/@data-id')
expensive_products.each { |id| puts "Expensive product ID: #{id.value}" }
Web Scraping Example
Here's a complete example of scraping attributes from a live website:
require 'nokogiri'
require 'net/http'
require 'uri'
def scrape_page_attributes(url)
  uri = URI(url)
  response = Net::HTTP.get_response(uri)
  if response.code == '200'
    doc = Nokogiri::HTML(response.body)
    # Extract all image sources and alt texts
    images = doc.css('img')
    puts "Found #{images.count} images:"
    images.each_with_index do |img, index|
      src = img['src']
      alt = img['alt'] || 'No alt text'
      puts "#{index + 1}. #{src} (#{alt})"
    end
    # Extract links whose href starts with "http" (absolute URLs, which may
    # point to the same site as well as to external ones)
    absolute_links = doc.css('a[href^="http"]')
    puts "\nFound #{absolute_links.count} absolute links:"
    absolute_links.each do |link|
      href = link['href']
      title = link['title'] || link.text.strip
      puts "- #{href} (#{title})"
    end
  else
    # Net::HTTP.get_response does not follow redirects, so 3xx responses land here
    puts "Failed to fetch page: #{response.code}"
  end
end
# Usage
scrape_page_attributes('https://example.com')
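Scraped src and href values are often relative (such as "/images/logo.png" in the first example), so they usually need to be resolved against the page URL before they can be fetched. The following is a minimal sketch using the standard library's URI.join; absolute_url is a hypothetical helper name, not part of Nokogiri.

```ruby
require 'uri'

# Hypothetical helper: resolve an attribute value against the page URL.
# Returns nil for missing or unparseable values instead of raising.
def absolute_url(base_url, value)
  return nil if value.nil? || value.strip.empty?

  URI.join(base_url, value.strip).to_s
rescue URI::InvalidURIError
  nil
end

page_url = 'https://example.com/blog/post'
puts absolute_url(page_url, '/images/logo.png')         # root-relative path
puts absolute_url(page_url, 'next')                     # path relative to /blog/
puts absolute_url(page_url, 'https://other.example/a')  # already absolute
```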
Error Handling and Best Practices
Always handle cases where attributes might not exist:
link = doc.css('a').first
# Safe attribute extraction
href = link&.[]('href') || 'No href found'
title = link&.attr('title') || 'No title'
# Check if element exists before accessing attributes
if link
  puts "Link found: #{link['href']}"
else
  puts "No link found"
end
# Handle missing attributes gracefully
def safe_attr(element, attribute)
  element&.[](attribute) || "#{attribute} not found"
end
puts safe_attr(link, 'href')
puts safe_attr(link, 'nonexistent')
Common Use Cases
Form Data Extraction
# Extract form input values and attributes
form_inputs = doc.css('input')
form_inputs.each do |input|
  name = input['name']
  value = input['value']
  type = input['type']
  puts "#{name}: #{value} (#{type})"
end
Meta Tag Information
# Extract meta tags
meta_tags = doc.css('meta')
meta_tags.each do |meta|
  name = meta['name'] || meta['property']
  content = meta['content']
  puts "#{name}: #{content}" if name && content
end
Data Attributes for JavaScript
# Extract all data-* attributes from elements that have a data-id attribute
elements_with_data = doc.css('[data-id]')
elements_with_data.each do |element|
  data_attrs = element.attributes.select { |name, _| name.start_with?('data-') }
  data_attrs.each { |name, attr| puts "#{name}: #{attr.value}" }
end
Remember to respect robots.txt files and website terms of service when scraping. Consider adding delays between requests and handling rate limiting appropriately.
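One simple way to space out requests is to sleep between iterations. The sketch below is an assumption about how that might look, not a complete rate limiter; each_with_delay is a hypothetical helper, and the block is where a call such as scrape_page_attributes would go.

```ruby
# Hypothetical helper: yield each URL, pausing between requests
# (but not before the first one or after the last).
def each_with_delay(urls, delay_seconds)
  urls.each_with_index do |url, index|
    sleep(delay_seconds) if index.positive?
    yield url
  end
end

visited = []
each_with_delay(['https://example.com/a', 'https://example.com/b'], 0.01) do |url|
  visited << url  # replace with the real request in practice
end

puts visited.inspect
```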