What is CheerioCrawler and when is it the best choice?
CheerioCrawler is one of the core crawler classes in the Crawlee web scraping framework. It's a lightweight, HTTP-based crawler that uses Cheerio for parsing HTML content. Unlike browser-based crawlers like PuppeteerCrawler or PlaywrightCrawler, CheerioCrawler doesn't launch a real browser. Instead, it makes plain HTTP requests and parses the HTML responses, making it significantly faster and more resource-efficient for scraping static web pages.
Understanding CheerioCrawler
CheerioCrawler is designed for scraping websites that serve content directly in the initial HTML response without relying on JavaScript to render the page. It's ideal for traditional server-side rendered websites, APIs that return HTML, and any content that doesn't require JavaScript execution to be visible.
How CheerioCrawler Works
When you use CheerioCrawler, it performs the following steps:
- HTTP Request: Makes a standard HTTP GET request to the target URL
- HTML Parsing: Parses the response body using Cheerio, which provides a jQuery-like API
- Data Extraction: Allows you to extract data using CSS selectors or DOM traversal
- Link Discovery: Automatically discovers and enqueues new URLs to crawl
- Queue Management: Manages the request queue with automatic retries and error handling
Basic CheerioCrawler Example
Here's a simple example of using CheerioCrawler to scrape product information:
```javascript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    // Maximum number of concurrent requests
    maxConcurrency: 10,

    // Request handler function
    async requestHandler({ request, $, enqueueLinks }) {
        console.log(`Processing: ${request.url}`);

        // Extract data using Cheerio's jQuery-like syntax
        const title = $('h1.product-title').text().trim();
        const price = $('span.price').text().trim();
        const description = $('div.description').text().trim();

        // Save the extracted data
        await Dataset.pushData({
            url: request.url,
            title,
            price,
            description,
        });

        // Automatically enqueue links matching the selector
        await enqueueLinks({
            selector: 'a.product-link',
            label: 'PRODUCT',
        });
    },

    // Optional: Handle failed requests
    failedRequestHandler({ request }) {
        console.log(`Request ${request.url} failed too many times`);
    },
});

// Start the crawler
await crawler.run(['https://example.com/products']);
```
When to Use CheerioCrawler
CheerioCrawler is the best choice in several scenarios:
1. Static HTML Content
If the website you're scraping serves all content in the initial HTML response without JavaScript rendering, CheerioCrawler is ideal. Most traditional websites, blogs, news sites, and e-commerce platforms with server-side rendering fall into this category.
```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        // Extract article data from static HTML
        const articles = [];
        $('article.post').each((_, element) => {
            articles.push({
                title: $(element).find('h2').text(),
                author: $(element).find('.author').text(),
                date: $(element).find('.date').text(),
                excerpt: $(element).find('.excerpt').text(),
            });
        });
        console.log(`Found ${articles.length} articles on ${request.url}`);
    },
});

await crawler.run(['https://blog.example.com']);
```
2. High-Performance Requirements
When you need to scrape large amounts of data quickly, CheerioCrawler's efficiency is hard to beat. Because it never renders pages, it can handle dozens or even hundreds of concurrent requests without the memory overhead of browser instances.
```javascript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    // High concurrency for maximum throughput
    maxConcurrency: 50,

    // Take system snapshots less often to reduce monitoring overhead
    autoscaledPoolOptions: {
        snapshotterOptions: {
            eventLoopSnapshotIntervalSecs: 2,
        },
    },

    async requestHandler({ $, request }) {
        // Fast data extraction
        const data = {
            url: request.url,
            items: $('div.item').length,
        };
        await Dataset.pushData(data);
    },
});
```
3. Resource-Constrained Environments
If you're running your scraper on limited hardware, in Docker containers, or in serverless functions, CheerioCrawler's minimal resource footprint is crucial. Unlike browser-based crawlers, which need a full browser binary and its system dependencies installed, CheerioCrawler runs with minimal dependencies.
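If memory is the binding constraint, you can also tell Crawlee how much it may use, since the autoscaled pool factors available memory into its scaling decisions. A minimal sketch, assuming the `memoryMbytes` configuration option (also settable via the `CRAWLEE_MEMORY_MBYTES` environment variable) behaves as in current Crawlee releases:

```javascript
import { CheerioCrawler, Configuration } from 'crawlee';

// Cap the memory Crawlee's autoscaling treats as available
// (assumption: memoryMbytes / CRAWLEE_MEMORY_MBYTES as documented)
const config = new Configuration({ memoryMbytes: 512 });

const crawler = new CheerioCrawler({
    // Keep concurrency modest so a small container isn't overwhelmed
    maxConcurrency: 5,
    async requestHandler({ request, $ }) {
        console.log(`${request.url}: ${$('title').text()}`);
    },
}, config);

await crawler.run(['https://example.com']);
```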
4. API-Like HTML Responses
When scraping data from endpoints that return HTML fragments or structured HTML data, CheerioCrawler is perfect:
```javascript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        // The endpoint returns an HTML fragment, and Cheerio has
        // already parsed it into $
        const products = [];
        $('.product-card').each((_, el) => {
            products.push({
                id: $(el).data('product-id'),
                name: $(el).find('.name').text(),
                price: $(el).find('.price').text(),
            });
        });
        await Dataset.pushData(products);
        console.log(`Extracted ${products.length} products from ${request.url}`);
    },
});
```
When NOT to Use CheerioCrawler
CheerioCrawler has limitations that make it unsuitable for certain scenarios:
1. JavaScript-Rendered Content
If the website uses JavaScript frameworks like React, Vue, or Angular to render content dynamically, CheerioCrawler won't be able to extract that data. In these cases, you need PuppeteerCrawler or PlaywrightCrawler.
Examples of content that requires a browser:

- Single Page Applications (SPAs)
- Infinite scroll pages
- Content loaded via AJAX after page load
- Interactive dashboards and web applications
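If you're unsure whether a site falls into this category, a quick check is to fetch the raw HTML and look for the elements you plan to scrape. A rough sketch using `got-scraping` and `cheerio` directly (the `.product-card` selector is a hypothetical stand-in for your own target):

```javascript
import { gotScraping } from 'got-scraping';
import * as cheerio from 'cheerio';

// Fetch the page over plain HTTP, the same way CheerioCrawler would
const { body } = await gotScraping('https://example.com/products');
const $ = cheerio.load(body);

// If the expected elements are missing from the initial HTML,
// the content is likely rendered client-side and needs a browser
if ($('.product-card').length === 0) {
    console.log('Content not in initial HTML - consider PlaywrightCrawler');
}
```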
2. Complex User Interactions
When you need to simulate user interactions like clicking buttons, filling forms, or handling authentication flows, a browser-based crawler is necessary.
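For illustration, here's a minimal PlaywrightCrawler sketch that clicks a hypothetical "load more" button before extracting data, something CheerioCrawler simply cannot do:

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Simulate a user interaction (selectors are hypothetical)
        await page.click('button.load-more');
        await page.waitForSelector('.item');
        const count = await page.locator('.item').count();
        console.log(`${request.url}: ${count} items after clicking`);
    },
});

await crawler.run(['https://example.com/list']);
```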
3. Pages with Bot Detection
Some websites implement sophisticated bot detection that checks for browser-specific features. CheerioCrawler's plain HTTP requests may be blocked, while browser-based crawlers can bypass these checks more effectively.
Advanced CheerioCrawler Features
Custom HTTP Headers
You can customize request headers to mimic real browsers or include authentication tokens:
```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Abort requests that take longer than 60 seconds
    requestHandlerTimeoutSecs: 60,

    // Use pre-navigation hooks to modify outgoing requests
    preNavigationHooks: [
        (crawlingContext, gotOptions) => {
            gotOptions.headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
                'Accept-Language': 'en-US,en;q=0.9',
                'Accept': 'text/html,application/xhtml+xml',
            };
        },
    ],

    async requestHandler({ request, $ }) {
        // Your scraping logic here
    },
});
```
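Headers can also be set per request rather than globally: Crawlee's request objects accept a `headers` field. A short sketch (the URL and token are hypothetical):

```javascript
await crawler.run([
    {
        url: 'https://example.com/private',
        headers: {
            // Hypothetical token, applied to this request only
            Authorization: 'Bearer YOUR_TOKEN_HERE',
        },
    },
]);
```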
Handling Different Content Types
CheerioCrawler can handle various response types:
```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // By default only HTML is accepted; allow JSON responses too
    additionalMimeTypes: ['application/json'],

    async requestHandler({ request, contentType, json, $ }) {
        // contentType is a parsed object with `type` and `encoding`
        if (contentType.type === 'application/json') {
            // For JSON responses, Crawlee parses the body into `json`
            console.log('JSON response:', json);
        } else if (contentType.type === 'text/html') {
            // Parse HTML with Cheerio
            const title = $('title').text();
            console.log('Page title:', title);
        }
    },
});
```
Proxy Support
CheerioCrawler includes built-in proxy rotation support:
```javascript
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8000',
        'http://proxy2.example.com:8000',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    async requestHandler({ request, $, proxyInfo }) {
        console.log(`Using proxy: ${proxyInfo?.url}`);
        // Your scraping logic here
    },
});
```
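Proxy rotation pairs naturally with Crawlee's session pool, which ties cookies to a session and retires sessions that start getting blocked. A brief sketch, assuming the `useSessionPool` and `persistCookiesPerSession` options behave as documented:

```javascript
const crawler = new CheerioCrawler({
    proxyConfiguration,
    // Rotate sessions (and the proxies bound to them) automatically
    useSessionPool: true,
    // Keep cookies consistent within each session
    persistCookiesPerSession: true,
    async requestHandler({ request, session, $ }) {
        console.log(`Session ${session?.id} fetched ${request.url}`);
    },
});
```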
CheerioCrawler vs PuppeteerCrawler
Here's a comparison to help you choose the right crawler:
| Feature | CheerioCrawler | PuppeteerCrawler |
|---------|----------------|------------------|
| Speed | Very fast (50-100 req/s) | Slower (5-20 req/s) |
| Memory Usage | Low (~50 MB) | High (~200-500 MB per browser) |
| JavaScript Support | No | Yes |
| Browser Features | No | Full browser capabilities |
| Setup Complexity | Simple | Requires browser installation |
| Best For | Static HTML sites | JavaScript-heavy sites |
| Concurrent Requests | 50-100+ | Typically 5-20 |
Python Alternative: BeautifulSoup
If you're working in Python, the equivalent approach uses libraries like BeautifulSoup or lxml:
```python
import requests
from bs4 import BeautifulSoup
from typing import List, Dict


def scrape_with_beautifulsoup(url: str) -> List[Dict]:
    """
    Scrape a page using requests + BeautifulSoup
    (equivalent to the CheerioCrawler approach).
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    response = requests.get(url, headers=headers)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, 'html.parser')
    products = []
    for item in soup.select('div.product'):
        products.append({
            'title': item.select_one('h2').get_text(strip=True),
            'price': item.select_one('.price').get_text(strip=True),
            'url': item.select_one('a')['href'],
        })
    return products


# Usage
results = scrape_with_beautifulsoup('https://example.com/products')
print(f'Found {len(results)} products')
```
Best Practices for CheerioCrawler
1. Optimize Concurrency
Start with conservative concurrency and gradually increase:
```javascript
const crawler = new CheerioCrawler({
    // Start low and let the autoscaled pool ramp up as the system keeps up
    minConcurrency: 5,
    maxConcurrency: 50,
    autoscaledPoolOptions: {
        // Concurrency the pool aims for initially
        desiredConcurrency: 10,
    },
    async requestHandler({ request, $ }) {
        // Your scraping logic
    },
});
```
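Beyond concurrency, you may also want to bound the absolute request rate so you don't overload the target server. A small sketch using Crawlee's `maxRequestsPerMinute` option:

```javascript
const crawler = new CheerioCrawler({
    maxConcurrency: 20,
    // Hard ceiling on throughput, independent of concurrency
    maxRequestsPerMinute: 120,
    async requestHandler({ request, $ }) {
        // Your scraping logic
    },
});
```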
2. Implement Proper Error Handling
Always handle errors gracefully to ensure crawler stability:
```javascript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 3,
    async requestHandler({ request, $ }) {
        try {
            // Your scraping logic (extractData is your own helper)
            const data = extractData($);
            await Dataset.pushData(data);
        } catch (error) {
            console.error(`Error processing ${request.url}:`, error);
            throw error; // Rethrow so Crawlee retries the request
        }
    },
    // Called once retries are exhausted; the error is the second argument
    failedRequestHandler({ request }, error) {
        console.log(`Final failure for ${request.url}: ${error.message}`);
    },
});
```
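Crawlee can also run a hook after each failed attempt, before the retry, via `errorHandler`; this is a good place to reset state such as a possibly blocked session. A hedged sketch:

```javascript
const crawler = new CheerioCrawler({
    maxRequestRetries: 3,
    // Runs after every failed attempt, before the request is retried
    errorHandler({ request, session }, error) {
        console.warn(`Retrying ${request.url}: ${error.message}`);
        // Retire a session that may have been blocked
        session?.retire();
    },
    async requestHandler({ request, $ }) {
        // Your scraping logic
    },
});
```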
3. Use Request Labels
Organize your crawling logic with request labels:
```javascript
const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        if (request.label === 'CATEGORY') {
            // Handle category pages
            await enqueueLinks({
                selector: 'a.product-link',
                label: 'PRODUCT',
            });
        } else if (request.label === 'PRODUCT') {
            // Handle product pages (extractProductData is your own helper)
            const productData = extractProductData($);
            await Dataset.pushData(productData);
        }
    },
});

await crawler.run([
    { url: 'https://example.com/category', label: 'CATEGORY' },
]);
```
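As the number of labels grows, if/else chains get unwieldy. Crawlee ships a router for exactly this; a sketch using `createCheerioRouter`:

```javascript
import { CheerioCrawler, createCheerioRouter, Dataset } from 'crawlee';

const router = createCheerioRouter();

router.addHandler('CATEGORY', async ({ enqueueLinks }) => {
    await enqueueLinks({ selector: 'a.product-link', label: 'PRODUCT' });
});

router.addHandler('PRODUCT', async ({ request, $ }) => {
    await Dataset.pushData({ url: request.url, title: $('h1').text() });
});

// The router replaces the inline requestHandler
const crawler = new CheerioCrawler({ requestHandler: router });

await crawler.run([
    { url: 'https://example.com/category', label: 'CATEGORY' },
]);
```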
Conclusion
CheerioCrawler is an excellent choice for scraping static HTML content efficiently. It's fast, lightweight, and perfect for traditional websites that don't rely on JavaScript for content rendering. Use it when you need high performance and the target website serves content in the initial HTML response. For JavaScript-heavy sites or when you need to handle browser events and complex interactions, consider switching to PuppeteerCrawler or PlaywrightCrawler instead.
The key to successful web scraping is choosing the right tool for the job. Start with CheerioCrawler for its speed and efficiency, and only move to browser-based solutions when you encounter JavaScript-rendered content or need advanced browser features.