How do you select elements using Cheerio?

Cheerio is a fast, flexible, and lean implementation of core jQuery designed specifically for the server to parse, manipulate, and render HTML. It's a popular tool among Node.js developers for web scraping because it uses a very familiar jQuery syntax.

To select elements using Cheerio, you follow much the same process as you would in a browser with jQuery. Let's go through the basic steps:

Installation

First, you need to install Cheerio via npm or yarn if you haven't already:

npm install cheerio

or

yarn add cheerio

Usage

Once Cheerio is installed, you can load an HTML document and start selecting elements. Here's a basic example in JavaScript:

const cheerio = require('cheerio');

// Sample HTML
const html = `
<html>
<head>
  <title>Test Page</title>
</head>
<body>
  <h1>Welcome to My Test Page</h1>
  <div class="content">
    <p>This is a paragraph inside a content div.</p>
    <ul class="list">
      <li class="item">Item 1</li>
      <li class="item">Item 2</li>
      <li class="item">Item 3</li>
    </ul>
  </div>
</body>
</html>
`;

// Load HTML into Cheerio
const $ = cheerio.load(html);

// Select elements using CSS selectors
const title = $('title').text();
console.log(title); // Output: Test Page

const listItems = $('.list .item');
listItems.each(function (i, el) {
  // 'this' is the current element in the loop
  console.log($(this).text()); // Logs "Item 1", "Item 2", "Item 3", one per line
});

// You can also manipulate elements
$('h1').text('Updated Title');
console.log($('h1').text()); // Output: Updated Title

Selecting Elements

Cheerio uses CSS selectors to target elements, and it provides several methods to traverse and manipulate the DOM:

  • $(selector): This is the primary function to query elements in the DOM. It works just like jQuery's $.
  • .find(selector): Searches for descendant elements that match the selector.
  • .parent(), .parents(selector): Gets the parent or ancestors of each element in the set of matched elements, optionally filtered by a selector.
  • .children(selector): Gets the children of each element in the set of matched elements, optionally filtered by a selector.
  • .siblings(selector): Gets the siblings of each element in the set of matched elements, optionally filtered by a selector.
  • .next(), .prev(): Gets the immediately following or preceding sibling of each element in the set of matched elements.

Manipulation

Cheerio provides methods to manipulate the selected elements:

  • .text([newText]): Gets the combined text contents of each element in the set of matched elements, or sets the text contents of the matched elements.
  • .html([newHtml]): Gets the HTML contents of the first element in the set of matched elements, or sets the HTML contents of every matched element.
  • .attr(attributeName, [value]): Gets the value of an attribute for the first element in the set of matched elements, or sets one or more attributes for every matched element.
  • .addClass(className), .removeClass(className), .toggleClass(className): Adds, removes, or toggles classes on the selected elements.

Example: Scraping a List of Items

// Assuming 'html' contains a webpage with a <ul> element
const cheerio = require('cheerio');
const $ = cheerio.load(html);

// Select all <li> elements within <ul>
const listItems = $('ul > li');

// Iterate over the list items and extract the text
const items = listItems.map((index, el) => {
  return $(el).text().trim();
}).get(); // '.get()' converts the Cheerio object to a regular array

console.log(items); // Outputs an array of list item texts

Cheerio is a powerful library for server-side DOM manipulation. It does not fetch pages itself, so for scraping it is usually paired with an HTTP client such as axios (request-promise was also common for this, but it is now deprecated) to download the HTML before parsing it.
