Cheerio is a server-side implementation of core jQuery for parsing and manipulating HTML documents in Node.js. A common use case is extracting attributes from HTML elements during web scraping.
Installation
First, install Cheerio in your Node.js project:
npm install cheerio
Basic Attribute Extraction
1. Load HTML Content
const cheerio = require('cheerio');
const html = `
<div class="product" data-id="123" data-price="29.99">
<img src="/image.jpg" alt="Product Image" width="300" height="200">
<a href="/product/123" class="product-link" target="_blank">View Product</a>
</div>
`;
const $ = cheerio.load(html);
2. Extract Single Attributes
Use the .attr() method to extract specific attributes:
// Extract src attribute from image
const imageSrc = $('img').attr('src');
console.log(imageSrc); // Output: /image.jpg
// Extract href from link
const productUrl = $('.product-link').attr('href');
console.log(productUrl); // Output: /product/123
// Extract data attributes
const productId = $('.product').attr('data-id');
console.log(productId); // Output: 123
Advanced Techniques
Working with Multiple Elements
When a selector matches multiple elements, .attr() returns the attribute from the first matched element only:
const html = `
<ul>
<li data-category="electronics" data-price="99.99">Laptop</li>
<li data-category="books" data-price="19.99">Novel</li>
<li data-category="clothing" data-price="39.99">T-Shirt</li>
</ul>
`;
const $ = cheerio.load(html);
// This returns only the first element's attribute
const firstCategory = $('li').attr('data-category');
console.log(firstCategory); // Output: electronics
// To get all attributes, iterate through elements
const allCategories = [];
$('li').each((index, element) => {
allCategories.push($(element).attr('data-category'));
});
console.log(allCategories); // Output: ['electronics', 'books', 'clothing']
Extracting Multiple Attributes from One Element
// Assumes html holds the product markup from "Load HTML Content", not the list above
const $ = cheerio.load(html);
const productElement = $('.product');
// Method 1: Individual calls
const attributes = {
id: productElement.attr('data-id'),
price: productElement.attr('data-price'),
class: productElement.attr('class')
};
// Method 2: Read every attribute name from the underlying node's attribs object
const dynamicAttributes = {};
const element = productElement[0];
if (element && element.attribs) {
Object.keys(element.attribs).forEach(attr => {
dynamicAttributes[attr] = productElement.attr(attr);
});
}
console.log(attributes);
// Output: { id: '123', price: '29.99', class: 'product' }
Real-World Example: Scraping Product Information
const cheerio = require('cheerio');
const axios = require('axios'); // For fetching web pages
async function scrapeProductAttributes(url) {
try {
// Fetch the HTML content
const response = await axios.get(url);
const $ = cheerio.load(response.data);
const products = [];
$('.product-item').each((index, element) => {
const $element = $(element);
const product = {
id: $element.attr('data-product-id'),
name: $element.find('.product-name').text().trim(),
price: $element.find('.price').attr('data-price'),
image: $element.find('img').attr('src'),
link: $element.find('a').attr('href'),
inStock: $element.attr('data-in-stock') === 'true',
rating: parseFloat($element.find('.rating').attr('data-rating')) || 0
};
products.push(product);
});
return products;
} catch (error) {
console.error('Error scraping products:', error);
return [];
}
}
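The src and href values collected above are often relative paths, which are of little use outside the page they came from. One way to handle this, sketched here, is to resolve them against the page URL with Node's built-in URL class. The helper name toAbsoluteUrl and the example.com addresses are placeholders chosen for illustration.

```javascript
// Hypothetical helper: resolve a scraped href/src against the page URL.
// Returns null when the attribute is missing or cannot be parsed.
function toAbsoluteUrl(value, pageUrl) {
  if (!value) return null;
  try {
    return new URL(value, pageUrl).href;
  } catch {
    return null;
  }
}

// Placeholder page address, used only for this example
const pageUrl = 'https://example.com/shop/list.html';

const absoluteLink = toAbsoluteUrl('/product/123', pageUrl);
const absoluteImage = toAbsoluteUrl('images/a.jpg', pageUrl);
const missingValue = toAbsoluteUrl(undefined, pageUrl);

console.log(absoluteLink);  // https://example.com/product/123
console.log(absoluteImage); // https://example.com/shop/images/a.jpg
console.log(missingValue);  // null
```

Inside the scraper, the same helper could wrap the image and link fields, with the url argument passed as pageUrl.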
Handling Edge Cases
Checking for Attribute Existence
const $ = cheerio.load('<div class="test">Content</div>');
const element = $('.test');
// Check whether the attribute exists before using it.
// Compare against undefined: an empty attribute (data-id="") is falsy but still present.
if (element.attr('data-id') !== undefined) {
  console.log('ID:', element.attr('data-id'));
} else {
  console.log('No data-id attribute found');
}
// Alternative approach with default values (?? keeps empty strings, unlike ||)
const id = element.attr('data-id') ?? 'default-id';
const title = element.attr('title') ?? 'No title';
Working with Boolean Attributes
const html = '<input type="checkbox" checked disabled>';
const $ = cheerio.load(html);
const checkbox = $('input');
// Boolean attributes return their name if present, undefined if not
const isChecked = checkbox.attr('checked') !== undefined;
const isDisabled = checkbox.attr('disabled') !== undefined;
console.log(isChecked); // Output: true
console.log(isDisabled); // Output: true
Common Attributes to Extract
// Links
const linkData = {
url: $('a').attr('href'),
target: $('a').attr('target'),
title: $('a').attr('title')
};
// Images
const imageData = {
src: $('img').attr('src'),
alt: $('img').attr('alt'),
width: $('img').attr('width'),
height: $('img').attr('height')
};
// Form elements
const inputData = {
type: $('input').attr('type'),
name: $('input').attr('name'),
value: $('input').attr('value'),
placeholder: $('input').attr('placeholder')
};
// Data attributes (commonly used for storing custom data)
const customData = {
userId: $('.user').attr('data-user-id'),
permissions: $('.user').attr('data-permissions'),
timestamp: $('.post').attr('data-timestamp')
};
Best Practices
- Always check for undefined: Attributes may not exist on elements
- Use specific selectors: Avoid overly broad selectors that might match unintended elements
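- Handle errors gracefully: .attr() returns undefined for a missing attribute rather than throwing, so reserve try-catch blocks for fetching and parsing, and check return values for everything else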
- Validate extracted data: Check data types and formats before using extracted attributes
- Respect robots.txt: Always check a website's scraping policies before extracting data
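As an illustration of the validation point, here is a minimal sketch of a validator for a price-like attribute such as data-price. The function name parsePrice and its rules (non-negative finite number, otherwise null) are assumptions chosen for the example. Real data may need currency symbols or locale-specific separators handled as well.

```javascript
// Hypothetical validator: convert a raw attribute value to a
// non-negative number, or return null when it is missing or malformed
function parsePrice(raw) {
  if (typeof raw !== 'string' || raw.trim() === '') return null;
  const value = Number(raw.trim());
  return Number.isFinite(value) && value >= 0 ? value : null;
}

const validPrice = parsePrice('29.99');      // 29.99
const missingPrice = parsePrice(undefined);  // null (attribute absent)
const emptyPrice = parsePrice('');           // null (attribute empty)
const badPrice = parsePrice('N/A');          // null (not a number)

console.log(validPrice, missingPrice, emptyPrice, badPrice);
```

Because .attr() returns either a string or undefined, the result can be passed straight in, for example parsePrice($element.attr('data-price')).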
ES6 Module Syntax
For modern Node.js applications using ES6 modules:
import * as cheerio from 'cheerio';
import axios from 'axios';
const html = '<div class="element" data-value="42"></div>';
const $ = cheerio.load(html);
const value = $('.element').attr('data-value');
Cheerio's .attr() method provides a powerful and intuitive way to extract HTML attributes, making it an essential tool for web scraping and HTML parsing tasks in Node.js applications.