
How do you use CSS pseudo-selectors with Cheerio?

CSS pseudo-selectors are powerful tools in Cheerio that allow you to select DOM elements based on their position, state, or relationship to other elements. While Cheerio supports many CSS pseudo-selectors, it's important to understand which ones are available and how to use them effectively for web scraping tasks.

What are CSS Pseudo-selectors?

CSS pseudo-selectors (formally called pseudo-classes) are keywords added to selectors that specify a special state or position of elements. They begin with a colon (:) and can target elements based on their structural position, content, or state. In Cheerio, these selectors help you pinpoint specific elements without relying on classes or IDs.

Supported Pseudo-selectors in Cheerio

Cheerio supports a substantial subset of CSS pseudo-selectors through css-select, the library that handles its selector matching. Here are the most commonly used ones:

Structural Pseudo-selectors

const cheerio = require('cheerio');

const html = `
<ul>
  <li>First item</li>
  <li>Second item</li>
  <li>Third item</li>
  <li>Fourth item</li>
</ul>
`;

const $ = cheerio.load(html);

// Select first child
console.log($('li:first-child').text()); // "First item"

// Select last child
console.log($('li:last-child').text()); // "Fourth item"

// Select nth child (1-indexed)
console.log($('li:nth-child(2)').text()); // "Second item"

// Select nth child with formula
console.log($('li:nth-child(2n)').map((i, el) => $(el).text()).get()); // ["Second item", "Fourth item"]

Type-based Pseudo-selectors

const html = `
<div>
  <p>First paragraph</p>
  <span>A span</span>
  <p>Second paragraph</p>
  <p>Third paragraph</p>
</div>
`;

const $ = cheerio.load(html);

// Select first paragraph of its type
console.log($('p:first-of-type').text()); // "First paragraph"

// Select last paragraph of its type
console.log($('p:last-of-type').text()); // "Third paragraph"

// Select nth paragraph of its type
console.log($('p:nth-of-type(2)').text()); // "Second paragraph"

Content-based Pseudo-selectors

const html = `
<div>
  <p></p>
  <p>Some content</p>
  <p>   </p>
  <input type="text" disabled>
  <input type="text">
</div>
`;

const $ = cheerio.load(html);

// Select empty elements
console.log($('p:empty').length); // 1

// Select elements containing specific text
console.log($('p:contains("Some")').text()); // "Some content"

// Note: recent Cheerio versions match :disabled, :enabled and :checked against
// the attributes in the markup; attribute selectors are an explicit alternative
console.log($('input[disabled]').length); // 1

Advanced Pseudo-selector Techniques

Combining Multiple Pseudo-selectors

const html = `
<table>
  <tr><td>Header 1</td><td>Header 2</td></tr>
  <tr><td>Row 1 Col 1</td><td>Row 1 Col 2</td></tr>
  <tr><td>Row 2 Col 1</td><td>Row 2 Col 2</td></tr>
  <tr><td>Row 3 Col 1</td><td>Row 3 Col 2</td></tr>
</table>
`;

const $ = cheerio.load(html);

// Select every odd row except the first
const oddRows = $('tr:nth-child(odd):not(:first-child)');
console.log(oddRows.length); // 1

// Select last cell of each row
const lastCells = $('tr td:last-child');
console.log(lastCells.map((i, el) => $(el).text()).get());
// ["Header 2", "Row 1 Col 2", "Row 2 Col 2", "Row 3 Col 2"]

Using :not() Pseudo-selector

The :not() pseudo-selector is particularly useful for excluding specific elements:

const html = `
<div class="container">
  <p class="highlight">Important paragraph</p>
  <p>Regular paragraph</p>
  <p class="highlight">Another important paragraph</p>
  <span>A span element</span>
</div>
`;

const $ = cheerio.load(html);

// Select all paragraphs except those with 'highlight' class
const regularParagraphs = $('p:not(.highlight)');
console.log(regularParagraphs.text()); // "Regular paragraph"

// Select all elements except spans
const nonSpanElements = $('.container > :not(span)');
console.log(nonSpanElements.length); // 3

Practical Web Scraping Examples

Scraping Table Data with Pseudo-selectors

const cheerio = require('cheerio');
const axios = require('axios');

async function scrapeTableData(url) {
  try {
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);

    const tableData = [];

    // Skip header row and process data rows
    $('table tr:not(:first-child)').each((index, row) => {
      const rowData = {};

      // Get all cells except the last one (assuming it's an action column)
      $(row).find('td:not(:last-child)').each((cellIndex, cell) => {
        const cellText = $(cell).text().trim();
        rowData[`column_${cellIndex}`] = cellText;
      });

      tableData.push(rowData);
    });

    return tableData;
  } catch (error) {
    console.error('Error scraping table data:', error);
    return [];
  }
}

Extracting Navigation Links

const html = `
<nav>
  <ul>
    <li><a href="/">Home</a></li>
    <li><a href="/about">About</a></li>
    <li><a href="/contact">Contact</a></li>
    <li><a href="/blog">Blog</a></li>
  </ul>
</nav>
`;

const $ = cheerio.load(html);

// Extract all navigation links except the first (home).
// Each <a> is the only child of its <li>, so the position test belongs on the <li>.
const navLinks = $('nav li:not(:first-child) a').map((i, el) => ({
  text: $(el).text(),
  href: $(el).attr('href')
})).get();

console.log(navLinks);
// [
//   { text: 'About', href: '/about' },
//   { text: 'Contact', href: '/contact' },
//   { text: 'Blog', href: '/blog' }
// ]

Limitations and Workarounds

Unsupported Pseudo-selectors

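Some CSS pseudo-selectors describe user interaction or live browser state, which has no meaning when parsing static HTML on the server. Form-state pseudo-selectors such as :checked and :disabled are different: recent Cheerio versions evaluate them against the attributes present in the markup, so attribute selectors are a more explicit way to express roughly the same thing:

// Interaction states have nothing to match in static HTML:
// $('input:focus')     - no element has focus in a parsed document
// $('a:hover')         - not applicable in server-side parsing
//
// Form states reflect markup attributes only, never live user input:
// $('input:checked')   - comparable to $('input[checked]')
// $('button:disabled') - comparable to $('button[disabled]')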

// Use attribute selectors as alternatives:
const $ = cheerio.load(html);

// Instead of :checked
const checkedInputs = $('input[checked]');

// Instead of :disabled
const disabledElements = $('[disabled]');

// Instead of :selected
const selectedOptions = $('option[selected]');

Working with Complex Selectors

For complex selection logic that pseudo-selectors can't handle, combine them with Cheerio's filtering methods:

const html = `
<div>
  <article data-category="tech">Tech Article 1</article>
  <article data-category="science">Science Article 1</article>
  <article data-category="tech">Tech Article 2</article>
  <article data-category="sports">Sports Article 1</article>
</div>
`;

const $ = cheerio.load(html);

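// Select every other tech article (1st, 3rd, ... within the matched set).
// :nth-child counts position among all siblings, not within the match,
// so filter on the match index with a function instead.
const techArticles = $('article[data-category="tech"]')
  .filter((i) => i % 2 === 0)
  .map((i, el) => $(el).text())
  .get();

console.log(techArticles); // ["Tech Article 1"]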

Performance Considerations

When using pseudo-selectors extensively, keep selectors specific and reuse selections instead of re-querying the document:

const $ = cheerio.load(html);

// Specific: only direct children of .container are tested
const specificElements = $('.container > div:first-child');

// Broader: collects every descendant div, then keeps the first one.
// Note that this is not equivalent to the selector above.
const broadSelection = $('.container div').first();

// Cache commonly used selections
const listItems = $('li');
const firstItem = listItems.first();
const lastItem = listItems.last();
const middleItems = listItems.slice(1, -1);

Integration with Modern Scraping Workflows

Cheerio's pseudo-selectors combine well with other scraping tools. For instance, when content is rendered by JavaScript after the initial page load, you can first render the page with Puppeteer, then pass the resulting HTML to Cheerio and use pseudo-selectors for precise element selection:

const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function scrapeWithPseudoSelectors(url) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);

    // Wait for content to load
    await page.waitForSelector('.content');

    const html = await page.content();
    const $ = cheerio.load(html);

    // Use pseudo-selectors for precise extraction
    return $('.content article:nth-of-type(even)')
      .map((i, el) => $(el).text().trim())
      .get();
  } finally {
    // Close the browser even if navigation or parsing fails
    await browser.close();
  }
}

Conclusion

CSS pseudo-selectors in Cheerio provide a concise way to select DOM elements based on their structural relationships and positions. While Cheerio doesn't support every pseudo-selector available in browsers, the ones it does support cover the vast majority of web scraping use cases. By combining selectors like :nth-child(), :first-of-type, and :not() with Cheerio's filtering methods, you can create sophisticated element selection strategies that make your web scraping code more maintainable and precise.

Remember to test your selectors thoroughly. For dynamic content, consider rendering the page with a tool like Puppeteer first, then applying Cheerio's pseudo-selectors for the final data extraction.
