Cheerio is a fast, flexible, and lean implementation of core jQuery designed specifically for the server to parse, manipulate, and render HTML. It's a popular tool among Node.js developers for web scraping because it uses a very familiar jQuery syntax.
To select elements using Cheerio, you follow much the same process as you would in a browser with jQuery. Let's go through the basic steps:
Installation
First, you need to install Cheerio via npm or yarn if you haven't already:
npm install cheerio
or
yarn add cheerio
Usage
Once Cheerio is installed, you can load an HTML document and start selecting elements. Here's a basic example in JavaScript:
const cheerio = require('cheerio');
// Sample HTML
const html = `
<html>
<head>
<title>Test Page</title>
</head>
<body>
<h1>Welcome to My Test Page</h1>
<div class="content">
<p>This is a paragraph inside a content div.</p>
<ul class="list">
<li class="item">Item 1</li>
<li class="item">Item 2</li>
<li class="item">Item 3</li>
</ul>
</div>
</body>
</html>
`;
// Load HTML into Cheerio
const $ = cheerio.load(html);
// Select elements using CSS selectors
const title = $('title').text();
console.log(title); // Output: Test Page
const listItems = $('.list .item');
listItems.each(function (i, el) {
// 'this' is the current element in the loop (this only works with a regular function, not an arrow function)
console.log($(this).text()); // Logs Item 1, Item 2, Item 3 on separate lines
});
// You can also manipulate elements
$('h1').text('Updated Title');
console.log($('h1').text()); // Output: Updated Title
Selecting Elements
Cheerio uses CSS selectors to target elements, and it provides several methods to traverse and manipulate the DOM:
$(selector): The primary function for querying elements in the DOM. It works just like jQuery's $.
.find(selector): Searches for descendant elements that match the selector.
.parent(), .parents(selector): Gets the parent or ancestors of each element in the set of matched elements, optionally filtered by a selector.
.children(selector): Gets the children of each element in the set of matched elements, optionally filtered by a selector.
.siblings(selector): Gets the siblings of each element in the set of matched elements, optionally filtered by a selector.
.next(), .prev(): Gets the immediately following or preceding sibling of each element in the set of matched elements.
Manipulation
Cheerio provides methods to manipulate the selected elements:
.text([newText]): Gets the combined text contents of each element in the set of matched elements, or sets the text contents of the matched elements.
.html([newHtml]): Gets the HTML contents of the first element in the set of matched elements, or sets the HTML contents of every matched element.
.attr(attributeName, [value]): Gets the value of an attribute for the first element in the set of matched elements, or sets one or more attributes for every matched element.
.addClass(className), .removeClass(className), .toggleClass(className): Adds, removes, or toggles classes on the selected elements.
Example: Scraping a List of Items
const cheerio = require('cheerio');
// Assumes 'html' is a string holding a webpage that contains a <ul> element
const $ = cheerio.load(html);
// Select all <li> elements within <ul>
const listItems = $('ul > li');
// Iterate over the list items and extract the text
const items = listItems.map((index, el) => {
return $(el).text().trim();
}).get(); // '.get()' converts the Cheerio object to a regular array
console.log(items); // Outputs an array of list item texts
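Cheerio is a powerful library for server-side DOM manipulation. It only parses and manipulates markup and does not download pages itself, so it is usually combined with an HTTP client such as axios to fetch the content from the web before scraping. Note that request-promise, although common in older examples, has been deprecated along with the request library it wraps.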