How do I use Cheerio in n8n for HTML Parsing?
Cheerio is a fast, flexible, and lean implementation of jQuery designed for server-side HTML parsing. In n8n, Cheerio is available through the HTML Extract node and Code node, making it an essential tool for extracting structured data from HTML documents within your automation workflows.
What is Cheerio?
Cheerio provides a jQuery-like API for traversing and manipulating HTML documents in Node.js. Unlike browser-based scraping tools like Puppeteer, Cheerio parses static HTML and doesn't execute JavaScript, making it extremely fast and memory-efficient for most web scraping tasks.
Using Cheerio in n8n Workflows
Method 1: HTML Extract Node (Recommended)
The HTML node in n8n (labeled "HTML Extract" in older versions) uses Cheerio under the hood; its "Extract HTML Content" operation provides a user-friendly interface for HTML parsing without writing code.
Setup Steps:
- Add an HTTP Request node to fetch HTML content
- Add an HTML Extract node
- Configure CSS selectors to extract data
- Map the extracted data to output fields
Example Workflow:
HTTP Request → HTML Extract → Process Data
HTML Extract Node Configuration (illustrative; exact parameter names in the n8n UI vary slightly between versions):
{
  "extraction_mode": "HTML",
  "css_selector": "article.product",
  "return_array": true,
  "values": {
    "title": {
      "cssSelector": "h2.product-title",
      "returnValue": "text"
    },
    "price": {
      "cssSelector": ".price",
      "returnValue": "text"
    },
    "url": {
      "cssSelector": "a",
      "returnValue": "attribute",
      "attribute": "href"
    }
  }
}
Method 2: Code Node with Cheerio
For advanced parsing scenarios, you can use Cheerio directly in n8n's Code node. Note that Cheerio is an external module rather than a built-in: on self-hosted n8n you must allow it via the NODE_FUNCTION_ALLOW_EXTERNAL environment variable (for example NODE_FUNCTION_ALLOW_EXTERNAL=cheerio) and load it with require('cheerio'); the Code node on n8n Cloud does not permit external npm modules.
Basic Example:
// Cheerio is an external module: load it with require()
const cheerio = require('cheerio');
// Access HTML from the previous node
// (Code node in "Run Once for All Items" mode, since multiple items are returned)
const html = $input.first().json.data;
// Load HTML with Cheerio
const $ = cheerio.load(html);
// Extract data using CSS selectors
const products = [];
$('article.product').each((i, element) => {
  const $element = $(element);
  products.push({
    title: $element.find('h2.product-title').text().trim(),
    price: $element.find('.price').text().trim(),
    url: $element.find('a').attr('href'),
    image: $element.find('img').attr('src'),
    description: $element.find('.description').text().trim()
  });
});
// Return extracted data in n8n's item format
return products.map(product => ({ json: product }));
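The final map call wraps each plain record in n8n's item shape: every item a Code node returns must expose its data under a json key. A minimal sketch of that wrapping step (the sample records here are hypothetical):

```javascript
// n8n Code nodes return an array of items, each with a `json` property.
// Plain extracted records must be wrapped before returning.
const products = [
  { title: 'Widget', price: '9.99' },  // hypothetical sample data
  { title: 'Gadget', price: '19.99' }
];

const items = products.map(product => ({ json: product }));
```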
Common Cheerio Selectors and Methods
CSS Selectors
Cheerio supports all standard CSS selectors:
// Element selectors
$('div') // All div elements
$('.class-name') // Elements with class
$('#element-id') // Element with specific ID
$('div.class') // Div with specific class
// Descendant selectors
$('div p') // All p inside div
$('ul > li') // Direct children only
$('h2 + p') // Adjacent sibling
$('h2 ~ p') // General sibling
// Attribute selectors
$('[data-id]') // Elements with data-id attribute
$('[href^="https"]') // href starting with https
$('[class*="product"]') // class containing "product"
Data Extraction Methods
// Text content
$('.title').text() // Get text content
$('.title').html() // Get HTML content
// Attributes
$('a').attr('href') // Get single attribute
$('img').attr('src', 'new') // Set attribute
// Data attributes
$('[data-id]').data('id') // Get data-id value
// CSS properties
$('.element').css('color') // Get CSS property
// Form values
$('input').val() // Get input value
Traversal Methods
// Parent/Child navigation
$('.child').parent() // Get parent element
$('.parent').children() // Get all children
$('.parent').find('.child') // Find descendants
// Sibling navigation
$('.element').next() // Next sibling
$('.element').prev() // Previous sibling
$('.element').siblings() // All siblings
// Filtering
$('li').first() // First element
$('li').last() // Last element
$('li').eq(2) // Element at index 2
$('li').filter('.active') // Filter by selector
Advanced n8n Cheerio Examples
Example 1: Scraping Product Listings
const html = $input.first().json.html;
const $ = cheerio.load(html);
const results = [];
$('.product-card').each((index, element) => {
  const $product = $(element);
  // Extract nested data
  const specifications = {};
  $product.find('.specs tr').each((i, row) => {
    const $row = $(row);
    const key = $row.find('th').text().trim();
    const value = $row.find('td').text().trim();
    specifications[key] = value;
  });
  results.push({
    id: $product.attr('data-product-id'),
    name: $product.find('h3.title').text().trim(),
    price: parseFloat($product.find('.price').text().replace(/[^0-9.]/g, '')),
    currency: $product.find('.price').data('currency'),
    availability: $product.find('.stock').hasClass('in-stock'),
    rating: parseFloat($product.find('.rating').attr('data-rating')),
    reviewCount: parseInt($product.find('.reviews').text(), 10),
    imageUrl: $product.find('img').attr('src'),
    specifications: specifications,
    timestamp: new Date().toISOString()
  });
});
return results.map(item => ({ json: item }));
Example 2: Extracting Table Data
const html = $input.first().json.content;
const $ = cheerio.load(html);
const tableData = [];
// Extract headers
const headers = [];
$('table thead th').each((i, element) => {
  headers.push($(element).text().trim());
});
// Extract rows
$('table tbody tr').each((i, row) => {
  const rowData = {};
  $(row).find('td').each((j, cell) => {
    const key = headers[j];
    const value = $(cell).text().trim();
    rowData[key] = value;
  });
  tableData.push(rowData);
});
return tableData.map(row => ({ json: row }));
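The core of the table example is pairing each cell with the header at the same column index. That pairing logic can be sketched without Cheerio (the header and cell strings below are made up for illustration):

```javascript
// Pair each row's cells with the header at the same column index,
// producing one object per row -- the same logic the Cheerio loops
// above perform on real <th>/<td> text.
function zipRows(headers, rows) {
  return rows.map(cells => {
    const rowData = {};
    cells.forEach((value, j) => {
      rowData[headers[j]] = value;
    });
    return rowData;
  });
}

const headers = ['Name', 'Price'];  // hypothetical header texts
const rows = [['Widget', '9.99'], ['Gadget', '19.99']];
const tableData = zipRows(headers, rows);
```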
Example 3: Handling Dynamic Content with Links
const html = $input.first().json.body;
const $ = cheerio.load(html);
const articles = [];
$('article').each((i, article) => {
  const $article = $(article);
  // Convert relative URLs to absolute (guard against a missing href)
  const baseUrl = 'https://example.com';
  const relativeUrl = $article.find('a').attr('href');
  const absoluteUrl = relativeUrl ? new URL(relativeUrl, baseUrl).href : null;
  // Extract metadata
  const publishDate = $article.find('time').attr('datetime');
  const author = $article.find('.author').text().trim();
  // Extract and clean text
  const content = $article.find('.content')
    .text()
    .replace(/\s+/g, ' ')
    .trim();
  articles.push({
    url: absoluteUrl,
    title: $article.find('h2').text().trim(),
    author: author,
    publishDate: publishDate,
    excerpt: content.substring(0, 200) + '...',
    tags: $article.find('.tag').map((i, tag) => $(tag).text()).get()
  });
});
return articles.map(article => ({ json: article }));
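One refinement worth considering: substring(0, 200) can cut a word in half. A small helper that truncates at the last word boundary instead (a stylistic choice introduced here, not something the workflow requires):

```javascript
// Truncate text at the last word boundary before maxLength,
// instead of cutting mid-word like substring(0, 200) does.
function makeExcerpt(text, maxLength = 200) {
  if (text.length <= maxLength) return text;
  const cut = text.slice(0, maxLength);
  const lastSpace = cut.lastIndexOf(' ');
  return (lastSpace > 0 ? cut.slice(0, lastSpace) : cut) + '...';
}
```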
Cheerio vs Puppeteer in n8n
When building n8n workflows, choosing between Cheerio and Puppeteer depends on your scraping requirements:
Use Cheerio when:
- Scraping static HTML content
- Speed and efficiency are priorities
- No JavaScript execution is needed
- Parsing API responses or RSS feeds
- Processing large volumes of pages
Use Puppeteer when:
- Content is loaded dynamically with JavaScript
- Data arrives via AJAX requests after the initial page load
- Interacting with forms or buttons is required
- Taking screenshots is needed
- Working with single-page applications
Error Handling and Best Practices
Robust Error Handling
try {
  const html = $input.first().json.html;
  // Validate HTML exists
  if (!html || typeof html !== 'string') {
    throw new Error('Invalid HTML input');
  }
  const $ = cheerio.load(html);
  const results = [];
  $('.product').each((i, element) => {
    const $element = $(element);
    // Safely extract with fallbacks
    const title = $element.find('h2').text().trim() || 'No title';
    const price = $element.find('.price').text().trim() || '0';
    const url = $element.find('a').attr('href') || null;
    // Validate required fields
    if (url) {
      results.push({
        title: title,
        price: parseFloat(price.replace(/[^0-9.]/g, '')) || 0,
        url: url
      });
    }
  });
  return results.map(item => ({ json: item }));
} catch (error) {
  // Return error information as a single item instead of failing the node
  return [{
    json: {
      error: error.message,
      timestamp: new Date().toISOString()
    }
  }];
}
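The try/catch pattern above generalizes into a small wrapper: run the extraction, and on failure return one item describing the error so downstream nodes can branch on it. A hedged sketch (safeRun is a name introduced here, not an n8n built-in):

```javascript
// Run an extraction function; on failure, return a single error item
// in n8n's { json } shape so the workflow can continue and branch.
function safeRun(extract) {
  try {
    return extract();
  } catch (error) {
    return [{
      json: {
        error: error.message,
        timestamp: new Date().toISOString()
      }
    }];
  }
}

// Usage: wrap any parsing logic that might throw
const ok = safeRun(() => [{ json: { title: 'Widget' } }]);
const failed = safeRun(() => { throw new Error('Invalid HTML input'); });
```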
Performance Optimization
// Parser options (note: available options vary by Cheerio version;
// decodeEntities and normalizeWhitespace apply to older releases, and
// normalizeWhitespace was removed in Cheerio 1.0)
const $ = cheerio.load(html, {
  xml: false, // Parse as HTML, not XML
  decodeEntities: true // Decode HTML entities
});
// Use efficient selectors
// Bad: $('div').filter('.product')
// Good: $('.product')
// Cache frequently used selections instead of re-querying
const $products = $('.product');
const productCount = $products.length;
$products.each((i, element) => {
  // Process elements
});
Data Cleaning
// Clean and normalize extracted data
function cleanText(text) {
  return text
    .replace(/[\n\r\t]/g, ' ') // Replace line breaks and tabs with spaces
    .replace(/\s+/g, ' ') // Collapse repeated whitespace
    .trim(); // Trim edges
}
function cleanPrice(priceText) {
  // Assumes any comma is a decimal separator (e.g. "9,99")
  const cleaned = priceText.replace(/[^0-9.,]/g, '');
  return parseFloat(cleaned.replace(',', '.'));
}
function cleanUrl(url, baseUrl) {
  if (!url) return null;
  try {
    return new URL(url, baseUrl).href;
  } catch {
    return null;
  }
}
// Apply cleaning functions
const product = {
  title: cleanText($element.find('.title').text()),
  price: cleanPrice($element.find('.price').text()),
  url: cleanUrl($element.find('a').attr('href'), 'https://example.com')
};
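cleanPrice treats the first comma as a decimal separator, which misreads prices like "$1,299.99". A more defensive sketch that treats the right-most separator as the decimal point (a heuristic, not a universal solution; truly locale-aware parsing may still be needed):

```javascript
// Parse price strings in either "1,299.99" or "1.299,99" style by
// treating the right-most separator as the decimal point. A lone
// separator followed by exactly three digits is assumed to be a
// thousands separator ("1.299" -> 1299).
function parsePrice(priceText) {
  const digits = priceText.replace(/[^0-9.,]/g, '');
  if (!digits) return NaN;
  const sep = Math.max(digits.lastIndexOf(','), digits.lastIndexOf('.'));
  if (sep === -1) return parseFloat(digits);
  const separators = digits.replace(/[0-9]/g, '');
  const frac = digits.slice(sep + 1);
  if (separators.length === 1 && frac.length === 3) {
    return parseFloat(digits.replace(/[.,]/g, ''));
  }
  const intPart = digits.slice(0, sep).replace(/[.,]/g, '');
  return parseFloat(intPart + '.' + frac);
}
```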
Integration with WebScraping.AI
When Cheerio isn't sufficient for your needs, consider using WebScraping.AI's API within your n8n workflows. This is particularly useful when you need to handle authentication or scrape JavaScript-heavy websites.
n8n HTTP Request Configuration:
// Configure HTTP Request node for WebScraping.AI
{
  "method": "GET",
  "url": "https://api.webscraping.ai/html",
  "qs": {
    "api_key": "{{$credentials.webScrapingAI.apiKey}}",
    "url": "{{$node['Webhook'].json['targetUrl']}}",
    "js": "true"
  }
}
Then process the response with Cheerio in a Code node for maximum flexibility and reliability.
Common Pitfalls and Solutions
Issue 1: Empty Results
Problem: Selectors return no results
Solution:
// Debug by logging the HTML structure (guard against a null .html())
console.log(($('body').html() || '').substring(0, 500));
// Check if element exists
if ($('.selector').length === 0) {
  console.log('Selector not found');
}
// Try alternative selectors
const selectors = ['.price', '[data-price]', '.product-price'];
let price = null;
for (const selector of selectors) {
  price = $(selector).first().text();
  if (price) break;
}
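The fallback-selector loop generalizes to a "first non-empty result wins" helper. It is sketched here with plain functions so the pattern is testable on its own; with Cheerio each extractor would be something like () => $('.price').first().text() (firstNonEmpty is a name introduced for illustration):

```javascript
// Try each extractor in order and return the first non-empty,
// trimmed result; fall back to a default when all come up empty.
function firstNonEmpty(extractors, fallback = null) {
  for (const extract of extractors) {
    const value = (extract() || '').trim();
    if (value) return value;
  }
  return fallback;
}

// Usage with plain functions standing in for Cheerio lookups
const price = firstNonEmpty([
  () => '',          // .price matched nothing
  () => ' $19.99 ',  // [data-price] matched
  () => '$25.00'
]);
```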
Issue 2: Incorrect Text Extraction
Problem: Extra whitespace or special characters
Solution:
// Use proper text extraction and cleaning
const text = $element
  .find('.content')
  .text()
  .replace(/\s+/g, ' ')
  .trim();
// Remove special characters if needed
const cleaned = text.replace(/[^\w\s.,!?-]/g, '');
Issue 3: Missing Attributes
Problem: Attributes return undefined
Solution:
// Always check attribute existence
const href = $element.find('a').attr('href');
const safeHref = href || null;
// Use nullish coalescing in modern Node.js
const dataId = $element.find('[data-id]').attr('data-id') ?? 'unknown';
Conclusion
Cheerio is a powerful and efficient tool for HTML parsing in n8n workflows. By leveraging its jQuery-like syntax, you can quickly extract structured data from HTML documents without the overhead of a full browser. Whether you're using the HTML Extract node for simple tasks or the Code node for complex parsing logic, Cheerio provides the flexibility and performance needed for production web scraping workflows.
For dynamic content that requires JavaScript execution, consider combining Cheerio with Puppeteer nodes or using WebScraping.AI's API to handle the rendering before parsing with Cheerio. This hybrid approach gives you the best of both worlds: the power of browser automation when needed and the speed of static parsing when possible.