How can I combine multiple CSS selectors in Nokogiri?
Nokogiri provides powerful CSS selector support that allows you to combine multiple selectors to target specific elements with precision. Understanding how to effectively combine selectors is crucial for efficient web scraping and HTML parsing in Ruby applications.
Understanding CSS Selector Combinators
Nokogiri supports the four standard CSS combinators, the symbols that define the relationship between two selectors. Here are the main types:
Descendant Combinator (Space)
The descendant combinator selects elements that are descendants of another element, regardless of how deeply nested they are.
require 'nokogiri'
html = <<-HTML
<div class="container">
<article>
<h2>Article Title</h2>
<div class="content">
<p>First paragraph</p>
<div class="nested">
<p>Nested paragraph</p>
</div>
</div>
</article>
</div>
HTML
doc = Nokogiri::HTML(html)
# Select all p elements inside container
paragraphs = doc.css('div.container p')
puts paragraphs.length # Output: 2
# Select all p elements inside content divs
content_paragraphs = doc.css('div.content p')
puts content_paragraphs.length # Output: 2
Child Combinator (>)
The child combinator selects direct children only, not deeper descendants.
# Select only direct p children of content div
direct_children = doc.css('div.content > p')
puts direct_children.length # Output: 1 (only "First paragraph")
# Compare with descendant selector
all_descendants = doc.css('div.content p')
puts all_descendants.length # Output: 2 (includes nested paragraph)
Adjacent Sibling Combinator (+)
Selects an element that immediately follows another element and shares the same parent.
html = <<-HTML
<div>
<h2>Heading</h2>
<p>First paragraph after heading</p>
<p>Second paragraph</p>
<span>A span element</span>
</div>
HTML
doc = Nokogiri::HTML(html)
# Select p element immediately following h2
adjacent_p = doc.css('h2 + p')
puts adjacent_p.text # Output: "First paragraph after heading"
General Sibling Combinator (~)
Selects every later sibling of an element, not only the one immediately after it.
# Select all p elements that are siblings after h2
sibling_paragraphs = doc.css('h2 ~ p')
puts sibling_paragraphs.length # Output: 2
Advanced Selector Combinations
Multiple Class Selectors
You can combine multiple class selectors to target elements with specific class combinations.
html = <<-HTML
<div class="card featured">Featured Card</div>
<div class="card">Regular Card</div>
<div class="featured">Featured Content</div>
HTML
doc = Nokogiri::HTML(html)
# Select elements with both 'card' and 'featured' classes
featured_cards = doc.css('div.card.featured')
puts featured_cards.text # Output: "Featured Card"
Attribute and Class Combinations
Combine attribute selectors with class selectors for precise targeting.
html = <<-HTML
<input type="text" class="form-control" name="username">
<input type="password" class="form-control" name="password">
<input type="submit" class="btn primary" value="Login">
HTML
doc = Nokogiri::HTML(html)
# Select text inputs with form-control class
text_inputs = doc.css('input[type="text"].form-control')
puts text_inputs.length # Output: 1
# Select inputs with specific name and class
username_field = doc.css('input[name="username"].form-control')
puts username_field.first['type'] # Output: "text"
Complex Selector Patterns
Pseudo-selectors with Combinators
Nokogiri supports CSS pseudo-selectors that can be combined with other selectors.
html = <<-HTML
<ul class="menu">
<li>Home</li>
<li>About</li>
<li>Services</li>
<li>Contact</li>
</ul>
HTML
doc = Nokogiri::HTML(html)
# Select first list item in menu
first_item = doc.css('ul.menu li:first-child')
puts first_item.text # Output: "Home"
# Select last list item
last_item = doc.css('ul.menu li:last-child')
puts last_item.text # Output: "Contact"
# Select nth item (3rd item, 1-indexed)
third_item = doc.css('ul.menu li:nth-child(3)')
puts third_item.text # Output: "Services"
Negation Pseudo-class
Use the :not() pseudo-class to exclude specific elements.
html = <<-HTML
<div class="content">
<p>Regular paragraph</p>
<p class="highlight">Highlighted paragraph</p>
<p>Another regular paragraph</p>
</div>
HTML
doc = Nokogiri::HTML(html)
# Select all p elements except those with highlight class
regular_paragraphs = doc.css('p:not(.highlight)')
puts regular_paragraphs.length # Output: 2
Practical Examples for Web Scraping
Scraping Product Information
Here's a practical example of combining selectors to scrape product information:
require 'nokogiri'
def scrape_products(html)
doc = Nokogiri::HTML(html)
products = []
# Select product containers with specific class and data attributes
product_elements = doc.css('div.product-card[data-available="true"]')
product_elements.each do |product|
# Combine selectors to extract specific information
title = product.css('h3.product-title a').text.strip
price = product.css('span.price.current').text.strip
rating = product.css('div.rating span.stars').length
# Select image with specific attributes
# at_css returns the first match, or nil when there is none
image = product.at_css('img.product-image[src]')
image_url = image && image['src']
products << {
title: title,
price: price,
rating: rating,
image_url: image_url
}
end
products
end
Extracting Navigation Links
def extract_navigation_links(html)
doc = Nokogiri::HTML(html)
# Select navigation links with multiple criteria
nav_links = doc.css('nav.main-navigation ul li a[href]:not([href="#"])')
links = nav_links.map do |link|
{
text: link.text.strip,
url: link['href'],
# Compare whole class names so that e.g. "inactive" does not count as active
active: link['class'].to_s.split.include?('active')
}
end
links
end
Multiple Selector Grouping
You can group multiple selectors using commas to apply the same operation to different elements:
html = <<-HTML
<div>
<h1>Main Title</h1>
<h2>Subtitle</h2>
<p class="intro">Introduction paragraph</p>
<span class="highlight">Important text</span>
</div>
HTML
doc = Nokogiri::HTML(html)
# Select all headings and highlighted elements
mixed_elements = doc.css('h1, h2, .highlight')
puts mixed_elements.length # Output: 3
# Extract text from multiple element types
important_text = doc.css('h1, h2, p.intro, .highlight').map(&:text)
p important_text # p prints the array literally; puts would print one element per line
# Output: ["Main Title", "Subtitle", "Introduction paragraph", "Important text"]
Performance Considerations
When combining multiple selectors, consider performance implications:
Efficient Selector Strategies
# More efficient: Use specific selectors
efficient = doc.css('div.content > p.summary')
# Less efficient: broad selector, then filtering parents in Ruby
inefficient = doc.css('p.summary').select { |para| para.parent.name == 'div' && para.parent['class'].to_s.split.include?('content') }
# Optimize by caching parent selections
content_div = doc.css('div.content').first
if content_div
summary_paragraphs = content_div.css('> p.summary')
end
Selector Scope Limitation
# Limit scope to improve performance
container = doc.css('#main-content').first
if container
# Search within container only
articles = container.css('article.post')
article_titles = container.css('article.post h2.title')
end
Working with Dynamic Attributes
Combine selectors to target elements with dynamic or partial attributes:
html = <<-HTML
<div>
<button id="btn-primary-123" class="button primary">Primary Button</button>
<button id="btn-secondary-456" class="button secondary">Secondary Button</button>
<input type="text" data-validation="required email" placeholder="Email">
</div>
HTML
doc = Nokogiri::HTML(html)
# Select buttons with IDs starting with "btn-primary"
primary_buttons = doc.css('button[id^="btn-primary"]')
# Select elements with multiple data attributes
email_inputs = doc.css('input[data-validation*="email"][data-validation*="required"]')
# Combine class and attribute selectors
primary_button_elements = doc.css('button.button.primary[id]')
Error Handling and Validation
A selector that Nokogiri cannot parse raises an exception, so guard lookups whose selectors are complex, generated, or user-supplied:
def safe_extract_content(html, selector)
doc = Nokogiri::HTML(html)
# css returns an empty NodeSet when nothing matches, so this yields []
doc.css(selector).map(&:text)
rescue Nokogiri::SyntaxError => e
# Raised when Nokogiri cannot parse the selector
warn "Error with selector '#{selector}': #{e.message}"
[]
end
# Usage with complex selectors
content = safe_extract_content(html, 'div.article-content > p:not(.advertisement)')
Best Practices
- Start Simple: Begin with basic selectors and add complexity incrementally
- Test Selectors: Use browser developer tools to test selectors before implementing
- Use Specific Selectors: Anchor selectors on stable classes, IDs, or attributes so they match only the intended elements and survive small markup changes
- Handle Missing Elements: Always check if elements exist before accessing their properties
- Combine Judiciously: While powerful, overly complex selectors can be hard to maintain
- Document Complex Selectors: Add comments explaining complex selector logic
Integration with Other Tools
Nokogiri parses the HTML it is given and does not execute JavaScript. For pages that render their content in the browser, such as single-page applications, fetch the rendered HTML with a browser automation tool first and then pass it to Nokogiri.
Understanding how to combine CSS selectors effectively in Nokogiri allows you to write more precise and efficient web scraping code. Whether you're extracting structured data from e-commerce sites or parsing complex HTML documents, mastering these selector combination techniques will significantly improve your Ruby-based web scraping projects.
For scenarios requiring more complex DOM interactions or when dealing with heavily dynamic content, consider complementing Nokogiri with tools that can handle JavaScript-heavy websites for a complete web scraping solution.