What is CheerioCrawler and when is it the best choice?
CheerioCrawler is one of the core crawler classes in the Crawlee web scraping framework. It's a lightweight, HTTP-based crawler that uses Cheerio for parsing HTML content. Unlike browser-based crawlers like PuppeteerCrawler or PlaywrightCrawler, CheerioCrawler doesn't launch a real browser. Instead, it makes plain HTTP requests and parses the HTML responses, making it significantly faster and more resource-efficient for scraping static web pages.
Understanding CheerioCrawler
CheerioCrawler is designed for scraping websites that serve content directly in the initial HTML response without relying on JavaScript to render the page. It's ideal for traditional server-side rendered websites, APIs that return HTML, and any content that doesn't require JavaScript execution to be visible.
How CheerioCrawler Works
When you use CheerioCrawler, it performs the following steps:
- HTTP Request: Makes a standard HTTP GET request to the target URL
- HTML Parsing: Parses the response body using Cheerio, which provides a jQuery-like API
- Data Extraction: Allows you to extract data using CSS selectors or DOM traversal
- Link Discovery: Automatically discovers and enqueues new URLs to crawl
- Queue Management: Manages the request queue with automatic retries and error handling
Basic CheerioCrawler Example
Here's a simple example of using CheerioCrawler to scrape product information:
```javascript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    // Maximum number of concurrent requests
    maxConcurrency: 10,

    // Request handler function
    async requestHandler({ request, $, enqueueLinks }) {
        console.log(`Processing: ${request.url}`);

        // Extract data using Cheerio's jQuery-like syntax
        const title = $('h1.product-title').text().trim();
        const price = $('span.price').text().trim();
        const description = $('div.description').text().trim();

        // Save the extracted data
        await Dataset.pushData({
            url: request.url,
            title,
            price,
            description,
        });

        // Automatically enqueue links matching the selector
        await enqueueLinks({
            selector: 'a.product-link',
            label: 'PRODUCT',
        });
    },

    // Optional: Handle failed requests
    failedRequestHandler({ request }) {
        console.log(`Request ${request.url} failed too many times`);
    },
});

// Start the crawler
await crawler.run(['https://example.com/products']);
```
When to Use CheerioCrawler
CheerioCrawler is the best choice in several scenarios:
1. Static HTML Content
If the website you're scraping serves all content in the initial HTML response without JavaScript rendering, CheerioCrawler is ideal. Most traditional websites, blogs, news sites, and e-commerce platforms with server-side rendering fall into this category.
```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        // Extract article data from static HTML
        const articles = [];
        $('article.post').each((_, element) => {
            articles.push({
                title: $(element).find('h2').text(),
                author: $(element).find('.author').text(),
                date: $(element).find('.date').text(),
                excerpt: $(element).find('.excerpt').text(),
            });
        });
        console.log(`Found ${articles.length} articles on ${request.url}`);
    },
});

await crawler.run(['https://blog.example.com']);
```
2. High-Performance Requirements
When you need to scrape large amounts of data quickly, CheerioCrawler's efficiency is hard to beat. Because it never renders pages, it can handle dozens or even hundreds of concurrent requests without the memory overhead of browser instances.
```javascript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    // High concurrency for maximum throughput
    maxConcurrency: 50,

    // Take system snapshots less often to reduce monitoring overhead
    autoscaledPoolOptions: {
        snapshotterOptions: {
            eventLoopSnapshotIntervalSecs: 2,
        },
    },

    async requestHandler({ $, request }) {
        // Fast data extraction
        const data = {
            url: request.url,
            items: $('div.item').length,
        };
        await Dataset.pushData(data);
    },
});
```
3. Resource-Constrained Environments
If you're running your scraper on limited hardware, in Docker containers, or in serverless functions, CheerioCrawler's minimal resource footprint is crucial. Unlike browser-based crawlers, which need a full browser binary and its system dependencies installed, CheerioCrawler runs with minimal dependencies.
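If memory is the binding constraint, you can also tell Crawlee how much it may use, since the autoscaled pool factors available memory into its scaling decisions. A minimal sketch, assuming the `memoryMbytes` configuration option (also settable via the `CRAWLEE_MEMORY_MBYTES` environment variable) behaves as in current Crawlee releases:

```javascript
import { CheerioCrawler, Configuration } from 'crawlee';

// Cap the memory Crawlee's autoscaling treats as available
// (assumption: memoryMbytes / CRAWLEE_MEMORY_MBYTES as documented)
const config = new Configuration({ memoryMbytes: 512 });

const crawler = new CheerioCrawler({
    // Keep concurrency modest so a small container isn't overwhelmed
    maxConcurrency: 5,
    async requestHandler({ request, $ }) {
        console.log(`${request.url}: ${$('title').text()}`);
    },
}, config);

await crawler.run(['https://example.com']);
```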
4. API-Like HTML Responses
When scraping data from endpoints that return HTML fragments or structured HTML data, CheerioCrawler is perfect:
```javascript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        // The endpoint returns an HTML fragment, and Cheerio has
        // already parsed it into $
        const products = [];
        $('.product-card').each((_, el) => {
            products.push({
                id: $(el).data('product-id'),
                name: $(el).find('.name').text(),
                price: $(el).find('.price').text(),
            });
        });
        await Dataset.pushData(products);
        console.log(`Extracted ${products.length} products from ${request.url}`);
    },
});
```
When NOT to Use CheerioCrawler
CheerioCrawler has limitations that make it unsuitable for certain scenarios:
1. JavaScript-Rendered Content
If the website uses JavaScript frameworks like React, Vue, or Angular to render content dynamically, CheerioCrawler won't be able to extract that data. In these cases, you need PuppeteerCrawler or PlaywrightCrawler.
Examples of content that requires a browser:

- Single Page Applications (SPAs)
- Infinite scroll pages
- Content loaded via AJAX after page load
- Interactive dashboards and web applications
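If you're unsure whether a site falls into this category, a quick check is to fetch the raw HTML and look for the elements you plan to scrape. A rough sketch using `got-scraping` and `cheerio` directly (the `.product-card` selector is a hypothetical stand-in for your own target):

```javascript
import { gotScraping } from 'got-scraping';
import * as cheerio from 'cheerio';

// Fetch the page over plain HTTP, the same way CheerioCrawler would
const { body } = await gotScraping('https://example.com/products');
const $ = cheerio.load(body);

// If the expected elements are missing from the initial HTML,
// the content is likely rendered client-side and needs a browser
if ($('.product-card').length === 0) {
    console.log('Content not in initial HTML - consider PlaywrightCrawler');
}
```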
2. Complex User Interactions
When you need to simulate user interactions like clicking buttons, filling forms, or handling authentication flows, a browser-based crawler is necessary.
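For illustration, here's a minimal PlaywrightCrawler sketch that clicks a hypothetical "load more" button before extracting data, something CheerioCrawler simply cannot do:

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Simulate a user interaction (selectors are hypothetical)
        await page.click('button.load-more');
        await page.waitForSelector('.item');
        const count = await page.locator('.item').count();
        console.log(`${request.url}: ${count} items after clicking`);
    },
});

await crawler.run(['https://example.com/list']);
```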
3. Pages with Bot Detection
Some websites implement sophisticated bot detection that checks for browser-specific features. CheerioCrawler's plain HTTP requests may be blocked, while browser-based crawlers can bypass these checks more effectively.
Advanced CheerioCrawler Features
Custom HTTP Headers
You can customize request headers to mimic real browsers or include authentication tokens:
```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Abort requests that take longer than 60 seconds
    requestHandlerTimeoutSecs: 60,

    // Use pre-navigation hooks to modify outgoing requests
    preNavigationHooks: [
        (crawlingContext, gotOptions) => {
            gotOptions.headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
                'Accept-Language': 'en-US,en;q=0.9',
                'Accept': 'text/html,application/xhtml+xml',
            };
        },
    ],

    async requestHandler({ request, $ }) {
        // Your scraping logic here
    },
});
```
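Headers can also be set per request rather than globally: Crawlee's request objects accept a `headers` field. A short sketch (the URL and token are hypothetical):

```javascript
await crawler.run([
    {
        url: 'https://example.com/private',
        headers: {
            // Hypothetical token, applied to this request only
            Authorization: 'Bearer YOUR_TOKEN_HERE',
        },
    },
]);
```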
Handling Different Content Types
CheerioCrawler can handle various response types:
```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // By default only HTML is accepted; allow JSON responses too
    additionalMimeTypes: ['application/json'],

    async requestHandler({ request, contentType, json, $ }) {
        // contentType is a parsed object with `type` and `encoding`
        if (contentType.type === 'application/json') {
            // For JSON responses, Crawlee parses the body into `json`
            console.log('JSON response:', json);
        } else if (contentType.type === 'text/html') {
            // Parse HTML with Cheerio
            const title = $('title').text();
            console.log('Page title:', title);
        }
    },
});
```
Proxy Support
CheerioCrawler includes built-in proxy rotation support:
```javascript
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8000',
        'http://proxy2.example.com:8000',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    async requestHandler({ request, $, proxyInfo }) {
        console.log(`Using proxy: ${proxyInfo?.url}`);
        // Your scraping logic here
    },
});
```
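Proxy rotation pairs naturally with Crawlee's session pool, which ties cookies to a session and retires sessions that start getting blocked. A brief sketch, assuming the `useSessionPool` and `persistCookiesPerSession` options behave as documented:

```javascript
const crawler = new CheerioCrawler({
    proxyConfiguration,
    // Rotate sessions (and the proxies bound to them) automatically
    useSessionPool: true,
    // Keep cookies consistent within each session
    persistCookiesPerSession: true,
    async requestHandler({ request, session, $ }) {
        console.log(`Session ${session?.id} fetched ${request.url}`);
    },
});
```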
CheerioCrawler vs PuppeteerCrawler
Here's a comparison to help you choose the right crawler:
| Feature | CheerioCrawler | PuppeteerCrawler |
|---------|----------------|------------------|
| Speed | Very fast (50-100 req/s) | Slower (5-20 req/s) |
| Memory Usage | Low (~50 MB) | High (~200-500 MB per browser) |
| JavaScript Support | No | Yes |
| Browser Features | No | Full browser capabilities |
| Setup Complexity | Simple | Requires browser installation |
| Best For | Static HTML sites | JavaScript-heavy sites |
| Concurrent Requests | 50-100+ | Typically 5-20 |
Python Alternative: BeautifulSoup
If you're working in Python, the equivalent approach uses libraries like BeautifulSoup or lxml:
```python
import requests
from bs4 import BeautifulSoup
from typing import List, Dict


def scrape_with_beautifulsoup(url: str) -> List[Dict]:
    """
    Scrape a page using requests + BeautifulSoup
    (equivalent to the CheerioCrawler approach).
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    response = requests.get(url, headers=headers)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, 'html.parser')
    products = []
    for item in soup.select('div.product'):
        products.append({
            'title': item.select_one('h2').get_text(strip=True),
            'price': item.select_one('.price').get_text(strip=True),
            'url': item.select_one('a')['href'],
        })
    return products


# Usage
results = scrape_with_beautifulsoup('https://example.com/products')
print(f'Found {len(results)} products')
```
Best Practices for CheerioCrawler
1. Optimize Concurrency
Start with conservative concurrency and gradually increase:
```javascript
const crawler = new CheerioCrawler({
    // Start low and let the autoscaled pool ramp up as the system keeps up
    minConcurrency: 5,
    maxConcurrency: 50,
    autoscaledPoolOptions: {
        // Concurrency the pool aims for initially
        desiredConcurrency: 10,
    },
    async requestHandler({ request, $ }) {
        // Your scraping logic
    },
});
```
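Beyond concurrency, you may also want to bound the absolute request rate so you don't overload the target server. A small sketch using Crawlee's `maxRequestsPerMinute` option:

```javascript
const crawler = new CheerioCrawler({
    maxConcurrency: 20,
    // Hard ceiling on throughput, independent of concurrency
    maxRequestsPerMinute: 120,
    async requestHandler({ request, $ }) {
        // Your scraping logic
    },
});
```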
2. Implement Proper Error Handling
Always handle errors gracefully to ensure crawler stability:
```javascript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 3,
    async requestHandler({ request, $ }) {
        try {
            // Your scraping logic (extractData is your own helper)
            const data = extractData($);
            await Dataset.pushData(data);
        } catch (error) {
            console.error(`Error processing ${request.url}:`, error);
            throw error; // Rethrow so Crawlee retries the request
        }
    },
    // Called once retries are exhausted; the error is the second argument
    failedRequestHandler({ request }, error) {
        console.log(`Final failure for ${request.url}: ${error.message}`);
    },
});
```
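Crawlee can also run a hook after each failed attempt, before the retry, via `errorHandler`; this is a good place to reset state such as a possibly blocked session. A hedged sketch:

```javascript
const crawler = new CheerioCrawler({
    maxRequestRetries: 3,
    // Runs after every failed attempt, before the request is retried
    errorHandler({ request, session }, error) {
        console.warn(`Retrying ${request.url}: ${error.message}`);
        // Retire a session that may have been blocked
        session?.retire();
    },
    async requestHandler({ request, $ }) {
        // Your scraping logic
    },
});
```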
3. Use Request Labels
Organize your crawling logic with request labels:
```javascript
const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        if (request.label === 'CATEGORY') {
            // Handle category pages
            await enqueueLinks({
                selector: 'a.product-link',
                label: 'PRODUCT',
            });
        } else if (request.label === 'PRODUCT') {
            // Handle product pages (extractProductData is your own helper)
            const productData = extractProductData($);
            await Dataset.pushData(productData);
        }
    },
});

await crawler.run([
    { url: 'https://example.com/category', label: 'CATEGORY' },
]);
```
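As the number of labels grows, if/else chains get unwieldy. Crawlee ships a router for exactly this; a sketch using `createCheerioRouter`:

```javascript
import { CheerioCrawler, createCheerioRouter, Dataset } from 'crawlee';

const router = createCheerioRouter();

router.addHandler('CATEGORY', async ({ enqueueLinks }) => {
    await enqueueLinks({ selector: 'a.product-link', label: 'PRODUCT' });
});

router.addHandler('PRODUCT', async ({ request, $ }) => {
    await Dataset.pushData({ url: request.url, title: $('h1').text() });
});

// The router replaces the inline requestHandler
const crawler = new CheerioCrawler({ requestHandler: router });

await crawler.run([
    { url: 'https://example.com/category', label: 'CATEGORY' },
]);
```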
Conclusion
CheerioCrawler is an excellent choice for scraping static HTML content efficiently. It's fast, lightweight, and perfect for traditional websites that don't rely on JavaScript for content rendering. Use it when you need high performance and the target website serves content in the initial HTML response. For JavaScript-heavy sites or when you need to handle browser events and complex interactions, consider switching to PuppeteerCrawler or PlaywrightCrawler instead.
The key to successful web scraping is choosing the right tool for the job. Start with CheerioCrawler for its speed and efficiency, and only move to browser-based solutions when you encounter JavaScript-rendered content or need advanced browser features.