What is the syntax for using CSS selectors in Nokogiri?

Nokogiri is a powerful Ruby library for parsing HTML and XML documents. The css method provides an intuitive way to select elements using CSS selector syntax, similar to how you would target elements in web browsers or stylesheets.

Basic CSS Selector Syntax

The fundamental syntax for using CSS selectors in Nokogiri is:

require 'nokogiri'

# Parse HTML document
html = '<div class="content"><p>Hello</p><p class="highlight">World</p></div>'
doc = Nokogiri::HTML(html)

# Basic selector syntax: replace 'selector' with any CSS selector string
elements = doc.css('selector')

Common CSS Selectors

Element Selectors

# Select all paragraph elements
paragraphs = doc.css('p')

# Select all div elements
divs = doc.css('div')

# Select all links
links = doc.css('a')

Class and ID Selectors

# Select elements with specific class
highlighted = doc.css('.highlight')
content_boxes = doc.css('.content-box')

# Select element with specific ID
header = doc.css('#header')
navigation = doc.css('#nav')

# Combine class and element
highlighted_paragraphs = doc.css('p.highlight')

Attribute Selectors

# Select elements with specific attributes
form_inputs = doc.css('input[type="text"]')
external_links = doc.css('a[href^="http"]') # href starts with "http" (any absolute URL)
required_fields = doc.css('input[required]')

# Attribute contains value
partial_match = doc.css('div[class*="card"]')

Advanced CSS Selectors

Descendant and Child Selectors

# Descendant selector (any level)
nested_links = doc.css('div a')

# Direct child selector
direct_children = doc.css('ul > li')

# Adjacent sibling selector
next_siblings = doc.css('h2 + p')

# General sibling selector
all_siblings = doc.css('h2 ~ p')

Pseudo-class Selectors

# Structural pseudo-classes (each call returns every matching node, not a single element)
first_paragraph = doc.css('p:first-child')
last_item = doc.css('li:last-child')
odd_rows = doc.css('tr:nth-child(odd)')
even_rows = doc.css('tr:nth-child(even)')

# Position-based selectors
first_of_type = doc.css('p:first-of-type')
last_of_type = doc.css('h2:last-of-type')
nth_element = doc.css('div:nth-of-type(3)')

# Content-based selectors
empty_elements = doc.css('div:empty')

Working with NodeSet Results

The css method returns a Nokogiri::XML::NodeSet, an enumerable collection that supports array-style indexing and iteration:

# Get all paragraphs
paragraphs = doc.css('p')

# Access individual elements
first_paragraph = paragraphs.first
last_paragraph = paragraphs.last
specific_paragraph = paragraphs[1]

# Iterate through results
paragraphs.each do |paragraph|
  puts paragraph.text
  puts paragraph['class'] # Get attribute value
end

# Check if elements were found
if paragraphs.any?
  puts "Found #{paragraphs.length} paragraphs"
end

# Convert to array if needed
paragraph_array = paragraphs.to_a

Chaining CSS Selectors

You can call css on a node you have already selected to search only within that node:

# Select within a specific container (at_css returns nil if it is missing)
container = doc.at_css('#main-content')
nested_links = container ? container.css('a.external') : []

# Multiple chained selections
sidebar = doc.at_css('.sidebar')
menu_items = sidebar ? sidebar.css('ul.menu li') : []
active_item = sidebar&.at_css('ul.menu li.active')

Practical Examples

Extracting Table Data

html = <<~HTML
  <table>
    <tr><th>Name</th><th>Age</th></tr>
    <tr><td>John</td><td>25</td></tr>
    <tr><td>Jane</td><td>30</td></tr>
  </table>
HTML

doc = Nokogiri::HTML(html)

# Extract table headers
headers = doc.css('th').map(&:text)

# Extract table rows
rows = doc.css('tr:not(:first-child)').map do |row|
  row.css('td').map(&:text)
end

Scraping Navigation Menus

# Select navigation links
nav_links = doc.css('nav ul li a')

# Extract link text and URLs
navigation = nav_links.map do |link|
  {
    text: link.text.strip,
    url: link['href']
  }
end

Form Element Selection

# Find all form inputs
inputs = doc.css('form input, form select, form textarea')

# Get required fields only
required_fields = doc.css('form [required]')

# Select specific input types
email_fields = doc.css('input[type="email"]')
checkboxes = doc.css('input[type="checkbox"]')

Performance Tips

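  1. Use specific selectors: scoping a query to a known container or ID narrows the search and leaves fewer nodes to process afterwards
  2. Cache NodeSet results: store a NodeSet in a variable instead of re-running the same query
  3. Use at_css for single elements: it returns the first match directly, or nil when nothing matches, and is a more concise form of css(...).first

# Single element selection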
first_paragraph = doc.at_css('p')

# Cache frequently accessed elements
navigation = doc.css('nav ul li')

Important Notes

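  • Case sensitivity: class names, IDs, and attribute values in selectors are matched case-sensitively
  • NodeSet behavior: Results behave like arrays but also offer search methods such as css and at_css
  • Empty results: css returns an empty NodeSet and at_css returns nil when nothing matches, so check before accessing elements
  • XML vs HTML: Nokogiri handles both, but the HTML parser lowercases element names while XML keeps them exactly as written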

The css method in Nokogiri provides a powerful and intuitive way to navigate and extract data from HTML and XML documents using familiar CSS selector syntax.
