Cheerio is a fast, flexible, and lean implementation of core jQuery designed specifically for the server to parse, manipulate, and render HTML. It is a common choice for Node.js developers who need to scrape and manipulate HTML documents using familiar jQuery syntax.
Installation
First, install Cheerio via npm or yarn:
npm install cheerio
# or
yarn add cheerio
Basic Element Selection
Cheerio uses CSS selectors to target elements, just like jQuery. Here's the fundamental syntax:
const cheerio = require('cheerio');
// Sample HTML
const html = `
<html>
  <head>
    <title>Web Scraping Tutorial</title>
  </head>
  <body>
    <h1 id="main-title">Welcome to My Website</h1>
    <div class="content">
      <p class="intro">This is the introduction paragraph.</p>
      <ul class="navigation">
        <li><a href="/home">Home</a></li>
        <li><a href="/about">About</a></li>
        <li><a href="/contact">Contact</a></li>
      </ul>
      <article data-category="tech">
        <h2>Latest Tech News</h2>
        <p>Technology updates here...</p>
      </article>
    </div>
  </body>
</html>
`;
// Load HTML into Cheerio
const $ = cheerio.load(html);
Common Selection Methods
1. Basic Selectors
// Select by tag name
const title = $('h1').text();
console.log(title); // "Welcome to My Website"
// Select by ID
const mainTitle = $('#main-title').text();
// Select by class
const intro = $('.intro').text();
// Select by attribute
const techArticle = $('[data-category="tech"]').find('h2').text();
console.log(techArticle); // "Latest Tech News"
2. CSS Combinators
// Direct child selector
const navLinks = $('.navigation > li');
// Descendant selector
const allLinks = $('.content a');
// Adjacent sibling selector
const nextElement = $('h1 + div');
// General sibling selector
const allSiblings = $('h1 ~ div');
3. Pseudo-selectors
// First and last elements
const firstLink = $('.navigation li:first-child a').text();
const lastLink = $('.navigation li:last-child a').text();
// Nth child
const secondLink = $('.navigation li:nth-child(2) a').text();
// Contains text
const homeLink = $('a:contains("Home")').attr('href');
DOM Traversal Methods
Cheerio provides powerful methods for navigating the DOM tree:
// Find descendant elements
const links = $('.navigation').find('a');
// Get parent elements
const listParent = $('.navigation li').parent();
// Get children
const navItems = $('.navigation').children('li');
// Get siblings
const allSiblings = $('.intro').siblings();
const nextSibling = $('.intro').next();
const prevSibling = $('.intro').prev();
// Get closest ancestor matching selector
const contentDiv = $('.intro').closest('.content');
Practical Web Scraping Examples
Example 1: Extracting All Links
const $ = cheerio.load(html);
const links = [];
$('a').each((index, element) => {
  const link = {
    text: $(element).text().trim(),
    href: $(element).attr('href'),
    title: $(element).attr('title') || null
  };
  links.push(link);
});
console.log(links);
// Output: Array of link objects with text, href, and title
Example 2: Extracting Table Data
const tableHtml = `
<table class="data-table">
  <thead>
    <tr><th>Name</th><th>Age</th><th>City</th></tr>
  </thead>
  <tbody>
    <tr><td>John</td><td>25</td><td>New York</td></tr>
    <tr><td>Jane</td><td>30</td><td>Los Angeles</td></tr>
  </tbody>
</table>
`;
const $ = cheerio.load(tableHtml);
const tableData = [];
// Column keys, defined once rather than on every cell
const headers = ['name', 'age', 'city'];
$('.data-table tbody tr').each((index, row) => {
  const rowData = {};
  $(row).find('td').each((cellIndex, cell) => {
    rowData[headers[cellIndex]] = $(cell).text().trim();
  });
  tableData.push(rowData);
});
console.log(tableData);
// Output: [{ name: 'John', age: '25', city: 'New York' }, ...]
Example 3: Complex Selector Combinations
// Select elements with multiple conditions
const specificElements = $('div.content p:not(.intro)');
// Select elements by attribute value
const techArticles = $('article[data-category="tech"]');
// Combine multiple selectors
const importantElements = $('.intro, #main-title, .navigation a');
// Filter results
const externalLinks = $('a').filter((index, element) => {
  const href = $(element).attr('href');
  return href && href.startsWith('http');
});
Error Handling and Best Practices
const cheerio = require('cheerio');
function safeSelect(html, selector) {
  try {
    const $ = cheerio.load(html);
    const elements = $(selector);
    if (elements.length === 0) {
      console.warn(`No elements found for selector: ${selector}`);
      return null;
    }
    return elements;
  } catch (error) {
    // Reached when load() rejects its input or the selector is malformed
    console.error('Error selecting elements:', error.message);
    return null;
  }
}
// Usage
const elements = safeSelect(html, '.non-existent-class');
Performance Tips
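- Scope your queries: once a container is selected, calling .find() on it searches only that subtree instead of the whole document
- Cache the Cheerio object: call cheerio.load() once per document and reuse the resulting $ rather than re-parsing the same HTML
- Use .get() when needed: it converts a Cheerio selection to a plain array so you can use native array methods such as filter() or reduce()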
// Efficient way to work with large lists
const $ = cheerio.load(html);
const items = $('.list-item')
  .map((i, el) => $(el).text().trim())
  .get() // Convert to regular array
  .filter(text => text.length > 0);
Cheerio's jQuery-like syntax makes it an excellent choice for server-side HTML parsing and web scraping tasks. Combined with an HTTP library such as axios or node-fetch to download the pages, it covers the fetch-and-parse core of a scraping project.