How do I use Cheerio in n8n for HTML Parsing?

Cheerio is a fast, flexible, and lean implementation of jQuery designed for server-side HTML parsing. In n8n, Cheerio is available through the HTML Extract node and Code node, making it an essential tool for extracting structured data from HTML documents within your automation workflows.

What is Cheerio?

Cheerio provides a jQuery-like API for traversing and manipulating HTML documents in Node.js. Unlike browser-based scraping tools like Puppeteer, Cheerio parses static HTML and doesn't execute JavaScript, making it extremely fast and memory-efficient for most web scraping tasks.

Using Cheerio in n8n Workflows

Method 1: HTML Extract Node (Recommended)

The HTML Extract node in n8n uses Cheerio under the hood and provides a user-friendly interface for HTML parsing without writing code.

Setup Steps:

  1. Add an HTTP Request node to fetch HTML content
  2. Add an HTML Extract node
  3. Configure CSS selectors to extract data
  4. Map the extracted data to output fields

Example Workflow:

HTTP Request → HTML Extract → Process Data

HTML Extract Node Configuration:

{
  "extractionMode": "HTML",
  "cssSelector": "article.product",
  "returnArray": true,
  "values": {
    "title": {
      "cssSelector": "h2.product-title",
      "returnValue": "innerText"
    },
    "price": {
      "cssSelector": ".price",
      "returnValue": "innerText"
    },
    "url": {
      "cssSelector": "a",
      "returnValue": "attribute",
      "attribute": "href"
    }
  }
}

Method 2: Code Node with Cheerio

For advanced parsing scenarios, you can use Cheerio directly in n8n's Code node. Note that on self-hosted instances, external modules such as Cheerio must be allowed via the NODE_FUNCTION_ALLOW_EXTERNAL environment variable (for example, NODE_FUNCTION_ALLOW_EXTERNAL=cheerio); the Code node on n8n Cloud does not permit external npm modules.

Basic Example:

// Require Cheerio explicitly (see the note above about NODE_FUNCTION_ALLOW_EXTERNAL)
const cheerio = require('cheerio');

// Access HTML from previous node
const html = $input.item.json.data;

// Load HTML with Cheerio
const $ = cheerio.load(html);

// Extract data using CSS selectors
const products = [];

$('article.product').each((i, element) => {
  const $element = $(element);

  products.push({
    title: $element.find('h2.product-title').text().trim(),
    price: $element.find('.price').text().trim(),
    url: $element.find('a').attr('href'),
    image: $element.find('img').attr('src'),
    description: $element.find('.description').text().trim()
  });
});

// Return extracted data
return products.map(product => ({ json: product }));

Common Cheerio Selectors and Methods

CSS Selectors

Cheerio supports all standard CSS selectors:

// Element selectors
$('div')              // All div elements
$('.class-name')      // Elements with class
$('#element-id')      // Element with specific ID
$('div.class')        // Div with specific class

// Descendant selectors
$('div p')            // All p inside div
$('ul > li')          // Direct children only
$('h2 + p')           // Adjacent sibling
$('h2 ~ p')           // General sibling

// Attribute selectors
$('[data-id]')        // Elements with data-id attribute
$('[href^="https"]')  // href starting with https
$('[class*="product"]') // class containing "product"

Data Extraction Methods

// Text content
$('.title').text()           // Get text content
$('.title').html()           // Get HTML content

// Attributes
$('a').attr('href')          // Get single attribute
$('img').attr('src', 'new')  // Set attribute

// Data attributes
$('[data-id]').data('id')    // Get data-id value

// CSS properties (reads the inline style attribute only; Cheerio does not compute styles)
$('.element').css('color')   // Get inline CSS property

// Form values
$('input').val()             // Get input value

Traversal Methods

// Parent/Child navigation
$('.child').parent()         // Get parent element
$('.parent').children()      // Get all children
$('.parent').find('.child')  // Find descendants

// Sibling navigation
$('.element').next()         // Next sibling
$('.element').prev()         // Previous sibling
$('.element').siblings()     // All siblings

// Filtering
$('li').first()              // First element
$('li').last()               // Last element
$('li').eq(2)                // Element at index 2
$('li').filter('.active')    // Filter by selector

Advanced n8n Cheerio Examples

Example 1: Scraping Product Listings

const html = $input.item.json.html;
const $ = cheerio.load(html);
const results = [];

$('.product-card').each((index, element) => {
  const $product = $(element);

  // Extract nested data
  const specifications = {};
  $product.find('.specs tr').each((i, row) => {
    const $row = $(row);
    const key = $row.find('th').text().trim();
    const value = $row.find('td').text().trim();
    specifications[key] = value;
  });

  results.push({
    id: $product.attr('data-product-id'),
    name: $product.find('h3.title').text().trim(),
    price: parseFloat($product.find('.price').text().replace(/[^0-9.]/g, '')),
    currency: $product.find('.price').data('currency'),
    availability: $product.find('.stock').hasClass('in-stock'),
    rating: parseFloat($product.find('.rating').attr('data-rating')),
    reviewCount: parseInt($product.find('.reviews').text().replace(/[^0-9]/g, ''), 10) || 0,
    imageUrl: $product.find('img').attr('src'),
    specifications: specifications,
    timestamp: new Date().toISOString()
  });
});

return results.map(item => ({ json: item }));

Example 2: Extracting Table Data

const html = $input.item.json.content;
const $ = cheerio.load(html);
const tableData = [];

// Extract headers
const headers = [];
$('table thead th').each((i, element) => {
  headers.push($(element).text().trim());
});

// Extract rows
$('table tbody tr').each((i, row) => {
  const rowData = {};
  $(row).find('td').each((j, cell) => {
    const key = headers[j];
    const value = $(cell).text().trim();
    rowData[key] = value;
  });
  tableData.push(rowData);
});

return tableData.map(row => ({ json: row }));
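
The header-to-cell mapping inside the row loop above is essentially a zip. Isolated as a plain helper (names are illustrative, not part of n8n or Cheerio), it is easier to reuse and test:

```javascript
// Map an array of header names onto an array of cell values
function zipRow(headers, cells) {
  const row = {};
  headers.forEach((header, i) => {
    // Missing trailing cells become null rather than undefined
    row[header] = cells[i] !== undefined ? cells[i] : null;
  });
  return row;
}

// Usage: zipRow(['Name', 'Price'], ['Widget', '9.99'])
// yields { Name: 'Widget', Price: '9.99' }
```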

Example 3: Handling Dynamic Content with Links

const html = $input.item.json.body;
const $ = cheerio.load(html);
const articles = [];

$('article').each((i, article) => {
  const $article = $(article);

  // Convert relative URLs to absolute
  const baseUrl = 'https://example.com';
  const relativeUrl = $article.find('a').attr('href');
  // Guard against a missing href: new URL(undefined, base) would throw
  const absoluteUrl = relativeUrl ? new URL(relativeUrl, baseUrl).href : null;

  // Extract metadata
  const publishDate = $article.find('time').attr('datetime');
  const author = $article.find('.author').text().trim();

  // Extract and clean text
  const content = $article.find('.content')
    .text()
    .replace(/\s+/g, ' ')
    .trim();

  articles.push({
    url: absoluteUrl,
    title: $article.find('h2').text().trim(),
    author: author,
    publishDate: publishDate,
    excerpt: content.length > 200 ? content.substring(0, 200) + '...' : content,
    tags: $article.find('.tag').map((i, tag) => $(tag).text()).get()
  });
});

return articles.map(article => ({ json: article }));

Cheerio vs Puppeteer in n8n

When building n8n workflows, choosing between Cheerio and Puppeteer depends on your scraping requirements:

Use Cheerio when:

  • Scraping static HTML content
  • Speed and efficiency are priorities
  • No JavaScript execution is needed
  • Parsing API responses or RSS feeds
  • Processing large volumes of pages

Use Puppeteer when:

  • Content is loaded dynamically with JavaScript
  • Data arrives via AJAX requests after the initial page load
  • Interacting with forms or buttons is required
  • Taking screenshots is needed
  • Working with single-page applications
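
One way to make this choice at runtime is a rough heuristic in a Code node: if the fetched HTML has almost no visible text, or only a bare SPA mount point, the page probably needs JavaScript rendering. This is an illustrative sketch; the text-length threshold and mount-point ids are assumptions, not part of n8n or Cheerio:

```javascript
// Rough heuristic: does this HTML likely need JavaScript rendering?
function looksJsRendered(html) {
  const bodyMatch = html.match(/<body[^>]*>([\s\S]*)<\/body>/i);
  const body = bodyMatch ? bodyMatch[1] : html;

  // Strip scripts and tags to estimate the visible text
  const visibleText = body
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<[^>]+>/g, '')
    .trim();

  // A bare SPA mount point like <div id="root"></div> is a strong signal
  const hasEmptyMount = /<div[^>]*id=["'](root|app)["'][^>]*>\s*<\/div>/i.test(body);

  // Threshold of 50 characters is an arbitrary assumption; tune per site
  return hasEmptyMount || visibleText.length < 50;
}
```

A workflow could route items through an IF node on this flag: Cheerio for static pages, a rendering service for the rest.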

Error Handling and Best Practices

Robust Error Handling

try {
  const html = $input.item.json.html;

  // Validate HTML exists
  if (!html || typeof html !== 'string') {
    throw new Error('Invalid HTML input');
  }

  const $ = cheerio.load(html);
  const results = [];

  $('.product').each((i, element) => {
    const $element = $(element);

    // Safely extract with fallbacks
    const title = $element.find('h2').text().trim() || 'No title';
    const price = $element.find('.price').text().trim() || '0';
    const url = $element.find('a').attr('href') || null;

    // Validate required fields
    if (url) {
      results.push({
        title: title,
        price: parseFloat(price.replace(/[^0-9.]/g, '')) || 0,
        url: url
      });
    }
  });

  return results.map(item => ({ json: item }));

} catch (error) {
  // Return error information
  return [{
    json: {
      error: error.message,
      timestamp: new Date().toISOString()
    }
  }];
}

Performance Optimization

// Load with parser options
const $ = cheerio.load(html, {
  xml: false,           // Parse as HTML, not XML
  decodeEntities: true  // Decode HTML entities (the default)
});

// Note: the old normalizeWhitespace option was removed in Cheerio 1.0;
// collapse whitespace yourself, e.g. text.replace(/\s+/g, ' ')

// Use efficient selectors
// Bad: $('div').filter('.product')
// Good: $('.product')

// Cache frequently used selectors
const $products = $('.product');
const productCount = $products.length;

$products.each((i, element) => {
  // Process elements
});

Data Cleaning

// Clean and normalize extracted data
function cleanText(text) {
  return text
    .replace(/\s+/g, ' ')  // Collapse all whitespace, including line breaks and tabs
    .trim();               // Trim edges
}

function cleanPrice(priceText) {
  // Assumes at most one comma, used as a decimal separator (e.g. "19,99")
  const cleaned = priceText.replace(/[^0-9.,]/g, '');
  return parseFloat(cleaned.replace(',', '.'));
}

function cleanUrl(url, baseUrl) {
  if (!url) return null;
  try {
    return new URL(url, baseUrl).href;
  } catch {
    return null;
  }
}

// Apply cleaning functions
const product = {
  title: cleanText($element.find('.title').text()),
  price: cleanPrice($element.find('.price').text()),
  url: cleanUrl($element.find('a').attr('href'), 'https://example.com')
};
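
The cleanPrice sketch above assumes a single comma acting as a decimal separator. A variant that also copes with thousands separators might look like the following. This is an illustrative extension, not from the original workflow, and it has a known caveat: an ambiguous value like "1,234" with no decimal part is read as 1.234:

```javascript
// Parse prices with either "." or "," as the decimal separator,
// e.g. "$1,234.56" and "€1.234,56" both become 1234.56
function parsePrice(priceText) {
  const cleaned = priceText.replace(/[^0-9.,]/g, '');
  const lastDot = cleaned.lastIndexOf('.');
  const lastComma = cleaned.lastIndexOf(',');

  // No separators at all: plain integer price
  if (lastDot === -1 && lastComma === -1) return parseFloat(cleaned) || 0;

  // Whichever separator appears last is treated as the decimal point
  const decimalSep = lastDot > lastComma ? '.' : ',';
  const thousandsSep = decimalSep === '.' ? ',' : '.';

  const normalized = cleaned
    .split(thousandsSep).join('')   // Drop thousands separators
    .replace(decimalSep, '.');      // Normalize the decimal point
  return parseFloat(normalized) || 0;
}
```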

Integration with WebScraping.AI

When Cheerio isn't sufficient for your needs, consider using WebScraping.AI's API within your n8n workflows. This is particularly useful when you need to handle authentication or scrape JavaScript-heavy websites.

n8n HTTP Request Configuration:

// Configure HTTP Request node for WebScraping.AI
{
  "method": "GET",
  "url": "https://api.webscraping.ai/html",
  "qs": {
    "api_key": "{{$credentials.webScrapingAI.apiKey}}",
    "url": "{{$node['Webhook'].json['targetUrl']}}",
    "js": "true"
  }
}

Then process the response with Cheerio in a Code node for maximum flexibility and reliability.
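
As a small sketch, the same request URL can also be assembled in a Code node before handing it to the HTTP Request node. The parameter names (api_key, url, js) follow the configuration above; treat the exact API surface as something to verify against the WebScraping.AI documentation:

```javascript
// Build a WebScraping.AI request URL with properly encoded query parameters
function buildScrapeUrl(targetUrl, apiKey) {
  const endpoint = new URL('https://api.webscraping.ai/html');
  endpoint.searchParams.set('api_key', apiKey);
  endpoint.searchParams.set('url', targetUrl);
  endpoint.searchParams.set('js', 'true'); // enable JavaScript rendering
  return endpoint.href;
}
```

Using URL and searchParams avoids hand-rolled string concatenation and encoding bugs in the target URL.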

Common Pitfalls and Solutions

Issue 1: Empty Results

Problem: Selectors return no results

Solution:

// Debug by logging part of the HTML structure (guard against a null body)
console.log(($('body').html() || '').substring(0, 500));

// Check if element exists
if ($('.selector').length === 0) {
  console.log('Selector not found');
}

// Try alternative selectors
const selectors = ['.price', '[data-price]', '.product-price'];
let price = null;

for (const selector of selectors) {
  price = $(selector).first().text();
  if (price) break;
}
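
The fallback loop above can be factored into a small reusable helper. The getters are plain functions, so the idea works with any extraction call; the names here are illustrative:

```javascript
// Return the first non-empty, trimmed string produced by a list of getter functions
function firstNonEmpty(getters) {
  for (const get of getters) {
    const value = (get() || '').trim();
    if (value) return value;
  }
  return null;
}

// With Cheerio, each getter would wrap a selector, e.g.
// firstNonEmpty([() => $('.price').text(), () => $('[data-price]').attr('data-price')])
```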

Issue 2: Incorrect Text Extraction

Problem: Extra whitespace or special characters

Solution:

// Use proper text extraction and cleaning
const text = $element
  .find('.content')
  .text()
  .replace(/\s+/g, ' ')
  .trim();

// Remove special characters if needed
const cleaned = text.replace(/[^\w\s.,!?-]/g, '');

Issue 3: Missing Attributes

Problem: Attributes return undefined

Solution:

// Always check attribute existence
const href = $element.find('a').attr('href');
const safeHref = href || null;

// Use nullish coalescing in modern Node.js
const dataId = $element.find('[data-id]').attr('data-id') ?? 'unknown';

Conclusion

Cheerio is a powerful and efficient tool for HTML parsing in n8n workflows. By leveraging its jQuery-like syntax, you can quickly extract structured data from HTML documents without the overhead of a full browser. Whether you're using the HTML Extract node for simple tasks or the Code node for complex parsing logic, Cheerio provides the flexibility and performance needed for production web scraping workflows.

For dynamic content that requires JavaScript execution, consider combining Cheerio with Puppeteer nodes or using WebScraping.AI's API to handle the rendering before parsing with Cheerio. This hybrid approach gives you the best of both worlds: the power of browser automation when needed and the speed of static parsing when possible.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
