How can I combine multiple CSS selectors in Nokogiri?

Nokogiri lets you combine CSS selectors (combinators, class and attribute selectors, pseudo-classes, and comma-separated groups) to target exactly the elements you need. Knowing how these pieces fit together keeps web scraping and HTML parsing code in Ruby short and precise.

Understanding CSS Selector Combinators

Nokogiri supports the four standard CSS combinators, the characters that define the relationship between two selectors. Here are the main types:

Descendant Combinator (Space)

The descendant combinator selects elements that are descendants of another element, regardless of how deeply nested they are.

require 'nokogiri'

html = <<-HTML
<div class="container">
  <article>
    <h2>Article Title</h2>
    <div class="content">
      <p>First paragraph</p>
      <div class="nested">
        <p>Nested paragraph</p>
      </div>
    </div>
  </article>
</div>
HTML

doc = Nokogiri::HTML(html)

# Select all p elements inside container
paragraphs = doc.css('div.container p')
puts paragraphs.length  # Output: 2

# Select all p elements inside content divs
content_paragraphs = doc.css('div.content p')
puts content_paragraphs.length  # Output: 2

Child Combinator (>)

The child combinator selects direct children only, not deeper descendants.

# Select only direct p children of content div
direct_children = doc.css('div.content > p')
puts direct_children.length  # Output: 1 (only "First paragraph")

# Compare with descendant selector
all_descendants = doc.css('div.content p')
puts all_descendants.length  # Output: 2 (includes nested paragraph)

Adjacent Sibling Combinator (+)

Selects an element that immediately follows another element and shares the same parent.

html = <<-HTML
<div>
  <h2>Heading</h2>
  <p>First paragraph after heading</p>
  <p>Second paragraph</p>
  <span>A span element</span>
</div>
HTML

doc = Nokogiri::HTML(html)

# Select p element immediately following h2
adjacent_p = doc.css('h2 + p')
puts adjacent_p.text  # Output: "First paragraph after heading"

General Sibling Combinator (~)

Selects every later sibling of an element, not only the one directly after it.

# Select all p elements that are siblings after h2
sibling_paragraphs = doc.css('h2 ~ p')
puts sibling_paragraphs.length  # Output: 2

Advanced Selector Combinations

Multiple Class Selectors

You can combine multiple class selectors to target elements with specific class combinations.

html = <<-HTML
<div class="card featured">Featured Card</div>
<div class="card">Regular Card</div>
<div class="featured">Featured Content</div>
HTML

doc = Nokogiri::HTML(html)

# Select elements with both 'card' and 'featured' classes
featured_cards = doc.css('div.card.featured')
puts featured_cards.text  # Output: "Featured Card"

Attribute and Class Combinations

Combine attribute selectors with class selectors for precise targeting.

html = <<-HTML
<input type="text" class="form-control" name="username">
<input type="password" class="form-control" name="password">
<input type="submit" class="btn primary" value="Login">
HTML

doc = Nokogiri::HTML(html)

# Select text inputs with form-control class
text_inputs = doc.css('input[type="text"].form-control')
puts text_inputs.length  # Output: 1

# Select inputs with specific name and class
username_field = doc.css('input[name="username"].form-control')
puts username_field.first['type']  # Output: "text"

Complex Selector Patterns

Pseudo-selectors with Combinators

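Nokogiri supports structural CSS pseudo-classes such as :first-child, :last-child, and :nth-child(), and they can be combined with other selectors. State-based pseudo-classes such as :hover do not apply to a statically parsed document.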

html = <<-HTML
<ul class="menu">
  <li>Home</li>
  <li>About</li>
  <li>Services</li>
  <li>Contact</li>
</ul>
HTML

doc = Nokogiri::HTML(html)

# Select first list item in menu
first_item = doc.css('ul.menu li:first-child')
puts first_item.text  # Output: "Home"

# Select last list item
last_item = doc.css('ul.menu li:last-child')
puts last_item.text  # Output: "Contact"

# Select nth item (3rd item, 1-indexed)
third_item = doc.css('ul.menu li:nth-child(3)')
puts third_item.text  # Output: "Services"

Negation Pseudo-class

Use the :not() pseudo-class to exclude specific elements.

html = <<-HTML
<div class="content">
  <p>Regular paragraph</p>
  <p class="highlight">Highlighted paragraph</p>
  <p>Another regular paragraph</p>
</div>
HTML

doc = Nokogiri::HTML(html)

# Select all p elements except those with highlight class
regular_paragraphs = doc.css('p:not(.highlight)')
puts regular_paragraphs.length  # Output: 2

Practical Examples for Web Scraping

Scraping Product Information

Here's a practical example of combining selectors to scrape product information:

require 'nokogiri'

def scrape_products(html)
  doc = Nokogiri::HTML(html)
  products = []

  # Select product containers with specific class and data attributes
  product_elements = doc.css('div.product-card[data-available="true"]')

  product_elements.each do |product|
    # Combine selectors to extract specific information
    title = product.css('h3.product-title a').text.strip
    price = product.css('span.price.current').text.strip
    rating = product.css('div.rating span.stars').length

    # Select image with specific attributes
    image_url = product.at_css('img.product-image[src]')&.[]('src')

    products << {
      title: title,
      price: price,
      rating: rating,
      image_url: image_url
    }
  end

  products
end

Extracting Navigation Links

def extract_navigation_links(html)
  doc = Nokogiri::HTML(html)

  # Select navigation links with multiple criteria
  nav_links = doc.css('nav.main-navigation ul li a[href]:not([href="#"])')

  links = nav_links.map do |link|
    {
      text: link.text.strip,
      url: link['href'],
      # Compare whole class tokens so that a class like "inactive" is not treated as active
      active: link['class'].to_s.split.include?('active')
    }
  end

  links
end

Multiple Selector Grouping

You can group multiple selectors using commas to apply the same operation to different elements:

html = <<-HTML
<div>
  <h1>Main Title</h1>
  <h2>Subtitle</h2>
  <p class="intro">Introduction paragraph</p>
  <span class="highlight">Important text</span>
</div>
HTML

doc = Nokogiri::HTML(html)

# Select all headings and highlighted elements
mixed_elements = doc.css('h1, h2, .highlight')
puts mixed_elements.length  # Output: 3

# Extract text from multiple element types
important_text = doc.css('h1, h2, p.intro, .highlight').map(&:text)
p important_text
# Output: ["Main Title", "Subtitle", "Introduction paragraph", "Important text"]

Performance Considerations

When combining multiple selectors, consider performance implications:

Efficient Selector Strategies

# More efficient: Use specific selectors
efficient = doc.css('div.content > p.summary')

# Less efficient: broad selector, then the same parent check done in Ruby
inefficient = doc.css('p.summary').select { |p| p.parent.name == 'div' && p.parent['class'].to_s.split.include?('content') }

# Optimize by caching parent selections
content_div = doc.css('div.content').first
if content_div
  summary_paragraphs = content_div.css('> p.summary')
end

Selector Scope Limitation

# Limit scope to improve performance
container = doc.css('#main-content').first
if container
  # Search within container only
  articles = container.css('article.post')
  article_titles = container.css('article.post h2.title')
end

Working with Dynamic Attributes

Combine selectors to target elements with dynamic or partial attributes:

html = <<-HTML
<div>
  <button id="btn-primary-123" class="button primary">Primary Button</button>
  <button id="btn-secondary-456" class="button secondary">Secondary Button</button>
  <input type="text" data-validation="required email" placeholder="Email">
</div>
HTML

doc = Nokogiri::HTML(html)

# Select buttons with IDs starting with "btn-primary"
primary_buttons = doc.css('button[id^="btn-primary"]')

# Select inputs whose data-validation value contains both "email" and "required"
email_inputs = doc.css('input[data-validation*="email"][data-validation*="required"]')

# Combine class and attribute selectors
primary_button_elements = doc.css('button.button.primary[id]')

Error Handling and Validation

When working with complex selectors, implement proper error handling:

def safe_extract_content(html, selector)
  doc = Nokogiri::HTML(html)

  # A malformed selector raises Nokogiri::CSS::SyntaxError. A selector that
  # matches nothing returns an empty NodeSet, so map already yields [].
  doc.css(selector).map(&:text)
rescue StandardError => e
  warn "Error with selector '#{selector}': #{e.message}"
  []
end

# Usage with complex selectors
content = safe_extract_content(html, 'div.article-content > p:not(.advertisement)')

Best Practices

  1. Start Simple: Begin with basic selectors and add complexity incrementally
  2. Test Selectors: Use browser developer tools to test selectors before implementing
  3. Use Specific Selectors: Selectors anchored to a stable class, id, or attribute are less likely to match unintended elements
  4. Handle Missing Elements: Always check if elements exist before accessing their properties
  5. Combine Judiciously: While powerful, overly complex selectors can be hard to maintain
  6. Document Complex Selectors: Add comments explaining complex selector logic

Integration with Other Tools

Nokogiri parses the HTML it is given and does not execute JavaScript. For content that is rendered client-side, such as single-page applications, you will need a browser automation tool to produce the final HTML before parsing it.

Understanding how to combine CSS selectors effectively in Nokogiri allows you to write more precise and efficient web scraping code. Whether you're extracting structured data from e-commerce sites or parsing complex HTML documents, mastering these selector combination techniques will significantly improve your Ruby-based web scraping projects.

For scenarios requiring more complex DOM interactions or when dealing with heavily dynamic content, consider complementing Nokogiri with tools that can handle JavaScript-heavy websites for a complete web scraping solution.
