Nokogiri is Ruby's most popular gem for parsing HTML and XML documents. Selecting elements by class or ID is fundamental to web scraping, and Nokogiri supports two query languages for the job: CSS selectors and XPath expressions.
Installation and Setup
First, ensure Nokogiri is installed:
gem install nokogiri
# or add to your Gemfile:
# gem 'nokogiri'
Selecting Elements by Class
Basic Class Selection
Use the CSS selector .class-name to select elements with a specific class:
require 'nokogiri'
require 'open-uri'
# Example HTML content
html_content = <<-HTML
<!DOCTYPE html>
<html>
<body>
<div class="user-profile">
<h3>John Doe</h3>
<p>Software Developer</p>
</div>
<div class="user-profile active">
<h3>Jane Smith</h3>
<p>Data Scientist</p>
</div>
<div class="admin-profile">
<h3>Admin User</h3>
</div>
</body>
</html>
HTML
# Parse the HTML content
doc = Nokogiri::HTML(html_content)
# Select all elements with class "user-profile"
user_profiles = doc.css('.user-profile')
user_profiles.each do |profile|
  name = profile.css('h3').text
  role = profile.css('p').text
  puts "#{name}: #{role}"
end
# Output:
# John Doe: Software Developer
# Jane Smith: Data Scientist
Advanced Class Selection
# Select elements with multiple classes
profiles_with_active = doc.css('.user-profile.active')
# Select elements whose class attribute contains the substring "user"
# (a substring match, so it would also match a class such as "superuser")
profiles_containing_user = doc.css('[class*="user"]')
# Select first element with class
first_profile = doc.css('.user-profile').first
# Check if element has specific class (compare whole class names, not substrings)
if first_profile && first_profile['class'].to_s.split.include?('user-profile')
  puts "Element has user-profile class"
end
Selecting Elements by ID
Basic ID Selection
Use the CSS selector #id to select an element with a specific ID:
html_content = <<-HTML
<!DOCTYPE html>
<html>
<body>
<header id="main-header">
<h1>Welcome to My Website</h1>
<nav id="main-nav">
<ul>
<li><a href="/">Home</a></li>
<li><a href="/about">About</a></li>
</ul>
</nav>
</header>
<main id="content">
<p>Main content goes here</p>
</main>
</body>
</html>
HTML
doc = Nokogiri::HTML(html_content)
# Select element by ID
header = doc.css('#main-header')
puts header.css('h1').text
# Output: Welcome to My Website
# Select navigation by ID and extract links
nav = doc.css('#main-nav')
links = nav.css('a')
links.each do |link|
  puts "#{link.text}: #{link['href']}"
end
# Output:
# Home: /
# About: /about
Working with ID Attributes
# Get element by ID using at_css (returns first match)
content = doc.at_css('#content')
puts content.text.strip
# Check if element exists
if doc.at_css('#sidebar')
  puts "Sidebar found"
else
  puts "Sidebar not found"
end
# Get ID attribute value
header_id = doc.css('header').first['id']
puts "Header ID: #{header_id}"
Using XPath Expressions
XPath can express conditions that CSS selectors cannot, such as matching on text content or requiring a particular descendant:
# Select elements whose class attribute is exactly "user-profile"
# (this does not match class="user-profile active")
user_profiles = doc.xpath('//div[@class="user-profile"]')
# Select elements whose class attribute contains the substring "user"
profiles_with_user = doc.xpath('//div[contains(@class, "user")]')
# Select element by ID using XPath
header = doc.xpath('//header[@id="main-header"]')
# More complex XPath: select div with class attribute "user-profile" that contains an h3
profiles_with_h3 = doc.xpath('//div[@class="user-profile"][.//h3]')
# Select by substring match on the class attribute (would also match "inactive")
active_profiles = doc.xpath('//div[contains(@class, "active")]')
# Select by multiple conditions; contains() is used for the class because
# Jane's div has class="user-profile active", which an exact match would miss
specific_profile = doc.xpath('//div[contains(@class, "user-profile") and contains(., "Jane")]')
Real-World Web Scraping Example
Here's a practical example that fetches a page over HTTP and extracts article data (the URL and class names are placeholders):
require 'nokogiri'
require 'open-uri'
begin
  # Scrape from a website (replace with actual URL)
  url = "https://example.com"
  doc = Nokogiri::HTML(URI.open(url))
  # Select articles by class
  articles = doc.css('.article-item')
  articles.each do |article|
    # Extract title from the element with class 'title'
    title = article.css('.title').text.strip
    # Extract author from an element whose ID starts with "author-"
    author_element = article.at_css('[id^="author-"]')
    author = author_element ? author_element.text : "Unknown"
    # Extract date from the element with class 'publish-date'
    date = article.css('.publish-date').text.strip
    puts "Title: #{title}"
    puts "Author: #{author}"
    puts "Date: #{date}"
    puts "---"
  end
rescue OpenURI::HTTPError => e
  puts "Error fetching webpage: #{e.message}"
rescue => e
  # Network failures (DNS, timeouts) and other unexpected errors land here
  puts "Unexpected error: #{e.message}"
end
Performance and Best Practices
Efficient Selection Methods
# Use at_css when you need a single element (returns the first match or nil)
title = doc.at_css('#page-title')
# Use css for multiple elements
all_links = doc.css('a.external-link')
# Combine selectors into one query instead of chaining lookups
user_names = doc.css('.user-profile h3')
# Keep a reference to a node you query repeatedly
main_content = doc.at_css('#main-content')
paragraphs = main_content.css('p') if main_content
Error Handling
# Always check if elements exist
profile = doc.at_css('.user-profile')
if profile
  name = profile.css('h3').text
  puts "User: #{name}"
else
  puts "No user profile found"
end
# Use safe navigation with Ruby 2.3+ (at_css returns nil when nothing matches)
name = doc.at_css('.user-profile h3')&.text
puts "User: #{name}" if name
# Handle missing attributes (link['href'] is nil when the attribute is absent)
link = doc.at_css('a')
href = link && link['href']
CSS Selectors vs XPath: When to Use Which
Use CSS Selectors When:
- You need simple class or ID selection
- You are familiar with CSS syntax
- You want better readability for simple queries
- You want concise queries (Nokogiri translates CSS selectors to XPath internally, so performance is comparable)
Use XPath When:
- Complex conditional logic needed
- Navigating parent/sibling relationships
- Text content matching required
- Advanced filtering capabilities needed
# CSS: Simple and readable
products = doc.css('.product.featured')
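# XPath: More powerful for complex conditions
# (number(.) needs purely numeric price text; a value like "$120" evaluates to NaN and never matches)
expensive_products = doc.xpath('//div[@class="product"][.//span[@class="price" and number(.) > 100]]')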
Common Patterns and Solutions
Multiple Classes
# Element with both classes
doc.css('.product.featured')
# Element with any of the classes
doc.css('.product, .featured')
Partial Matches
# CSS: Attribute contains
doc.css('[class*="user"]')
# XPath: the same substring match, here restricted to div elements
doc.xpath('//div[contains(@class, "user")]')
Nested Selection
# Find products within a specific container
container = doc.at_css('#products-container')
products = container.css('.product') if container
With these techniques, you can efficiently select and extract data from HTML documents using Nokogiri, whether you're building web scrapers, processing HTML content, or analyzing web pages programmatically.