How do I select elements by class or ID using Nokogiri in Ruby?

Nokogiri is one of Ruby's most widely used gems for parsing HTML and XML documents. Selecting elements by class or ID is fundamental for web scraping, and Nokogiri supports two query languages for it: CSS selectors and XPath expressions.

Installation and Setup

First, ensure Nokogiri is installed:

gem install nokogiri
# or add to your Gemfile:
# gem 'nokogiri'

Selecting Elements by Class

Basic Class Selection

Use the CSS selector .class-name to select elements with a specific class:

require 'nokogiri'

# Example HTML content
html_content = <<-HTML
<!DOCTYPE html>
<html>
<body>
  <div class="user-profile">
    <h3>John Doe</h3>
    <p>Software Developer</p>
  </div>
  <div class="user-profile active">
    <h3>Jane Smith</h3>
    <p>Data Scientist</p>
  </div>
  <div class="admin-profile">
    <h3>Admin User</h3>
  </div>
</body>
</html>
HTML

# Parse the HTML content
doc = Nokogiri::HTML(html_content)

# Select all elements with class "user-profile"
user_profiles = doc.css('.user-profile')

user_profiles.each do |profile|
  name = profile.css('h3').text
  role = profile.css('p').text
  puts "#{name}: #{role}"
end
# Output:
# John Doe: Software Developer
# Jane Smith: Data Scientist

Advanced Class Selection

# Select elements with multiple classes
profiles_with_active = doc.css('.user-profile.active')

# Select elements whose class attribute contains a substring
# (note: this also matches unrelated class names such as "superuser")
profiles_containing_user = doc.css('[class*="user"]')

# Select first element with class
first_profile = doc.css('.user-profile').first

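# Check whether the element has a specific class
# (split the attribute so that only whole class names match)
if first_profile && first_profile['class'].to_s.split.include?('user-profile')
  puts "Element has user-profile class"
end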

Selecting Elements by ID

Basic ID Selection

Use the CSS selector #id to select an element with a specific ID:

html_content = <<-HTML
<!DOCTYPE html>
<html>
<body>
  <header id="main-header">
    <h1>Welcome to My Website</h1>
    <nav id="main-nav">
      <ul>
        <li><a href="/">Home</a></li>
        <li><a href="/about">About</a></li>
      </ul>
    </nav>
  </header>
  <main id="content">
    <p>Main content goes here</p>
  </main>
</body>
</html>
HTML

doc = Nokogiri::HTML(html_content)

# Select element by ID
header = doc.css('#main-header')
puts header.css('h1').text
# Output: Welcome to My Website

# Select navigation by ID and extract links
nav = doc.css('#main-nav')
links = nav.css('a')
links.each do |link|
  puts "#{link.text}: #{link['href']}"
end
# Output:
# Home: /
# About: /about

Working with ID Attributes

# Get element by ID using at_css (returns first match)
content = doc.at_css('#content')
puts content.text.strip

# Check if element exists
if doc.at_css('#sidebar')
  puts "Sidebar found"
else
  puts "Sidebar not found"
end

# Get ID attribute value
header_id = doc.css('header').first['id']
puts "Header ID: #{header_id}"

Using XPath Expressions

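XPath can express conditions that CSS selectors cannot, such as matching on text content. The class examples below assume doc holds the user-profile HTML from the class selection section, and the ID example assumes the header HTML from the ID section: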

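# Select elements whose class attribute is exactly "user-profile"
# (this does not match class="user-profile active")
user_profiles = doc.xpath('//div[@class="user-profile"]')

# Select elements whose class attribute contains the substring "user"
profiles_with_user = doc.xpath('//div[contains(@class, "user")]')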

# Select element by ID using XPath
header = doc.xpath('//header[@id="main-header"]')

# More complex XPath: select div with class user-profile that contains an h3
profiles_with_h3 = doc.xpath('//div[@class="user-profile"][.//h3]')

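# Select by partial class match (this would also match a class such as "inactive")
active_profiles = doc.xpath('//div[contains(@class, "active")]')

# Select by multiple conditions; contains() is used for the class because
# Jane's div has class="user-profile active", which an exact match would miss
specific_profile = doc.xpath('//div[contains(@class, "user-profile") and contains(., "Jane")]')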

Real-World Web Scraping Example

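Here's a practical example of scraping a live page. The URL and the class and ID names are placeholders; replace them with those of the page you are scraping: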

require 'nokogiri'
require 'open-uri'

begin
  # Scrape from a website (replace with actual URL)
  url = "https://example.com"
  doc = Nokogiri::HTML(URI.open(url))

  # Select articles by class
  articles = doc.css('.article-item')

  articles.each do |article|
    # Extract title from the element with class 'title'
    title = article.css('.title').text.strip

    # Extract author from element with ID pattern
    author_element = article.at_css('[id^="author-"]')
    author = author_element ? author_element.text : "Unknown"

    # Extract date from class
    date = article.css('.publish-date').text.strip

    puts "Title: #{title}"
    puts "Author: #{author}"
    puts "Date: #{date}"
    puts "---"
  end

rescue OpenURI::HTTPError => e
  puts "Error fetching webpage: #{e.message}"
rescue StandardError => e
  # Covers network failures (DNS, timeouts) and any other unexpected error
  puts "Error scraping page: #{e.message}"
end

Performance and Best Practices

Efficient Selection Methods

# Use at_css for single elements (returns the first match, or nil if there is none)
title = doc.at_css('#page-title')

# Use css for multiple elements
all_links = doc.css('a.external-link')

# Combine selectors into one query instead of nesting separate searches
user_names = doc.css('.user-profile h3')

# Keep a reference to a parent node and scope later queries to it
main_content = doc.at_css('#main-content')
paragraphs = main_content.css('p') if main_content

Error Handling

# Always check if elements exist
profile = doc.at_css('.user-profile')
if profile
  name = profile.css('h3').text
  puts "User: #{name}"
else
  puts "No user profile found"
end

# Use safe navigation with Ruby 2.3+
# (at_css returns nil when nothing matches; css returns an empty NodeSet, never nil)
name = doc.at_css('.user-profile')&.at_css('h3')&.text
puts "User: #{name}" if name

# Handle missing attributes (link['href'] is nil when the attribute is absent)
link = doc.at_css('a')
href = link['href'] if link

CSS Selectors vs XPath: When to Use Which

Use CSS Selectors When:

Simple class/ID selection
Familiar with CSS syntax
Better readability for simple queries
No performance penalty in practice (Nokogiri translates CSS selectors to XPath internally)

Use XPath When:

Complex conditional logic needed
Navigating parent/sibling relationships
Text content matching required
Advanced filtering capabilities needed

# CSS: Simple and readable
products = doc.css('.product.featured')

# XPath: More powerful for complex conditions
# (assumes the price text is a bare number such as "120"; number() yields NaN for "$120")
expensive_products = doc.xpath('//div[@class="product"][.//span[@class="price" and number(.) > 100]]')

Common Patterns and Solutions

Multiple Classes

# Element with both classes
doc.css('.product.featured')

# Element with any of the classes
doc.css('.product, .featured')

Partial Matches

# CSS: Attribute contains
doc.css('[class*="user"]')

# XPath: the equivalent substring match, restricted here to div elements
doc.xpath('//div[contains(@class, "user")]')

Nested Selection

# Find products within a specific container
container = doc.at_css('#products-container')
products = container.css('.product') if container

With these techniques, you can efficiently select and extract data from HTML documents using Nokogiri, whether you're building web scrapers, processing HTML content, or analyzing web pages programmatically.
