How Do I Choose Between Different Crawler Types in Crawlee?

Crawlee offers three main crawler types—CheerioCrawler, PuppeteerCrawler, and PlaywrightCrawler—each designed for specific web scraping scenarios. Choosing the right crawler depends on factors like website complexity, JavaScript rendering requirements, performance needs, and resource constraints. This guide will help you understand the strengths and use cases of each crawler type to make an informed decision.

Understanding Crawlee's Crawler Types

CheerioCrawler: Lightweight HTML Parsing

CheerioCrawler is the fastest and most resource-efficient option in Crawlee. It uses the Cheerio library to parse static HTML content without rendering JavaScript, making it ideal for simple websites that don't rely heavily on client-side rendering.

Key characteristics:

  • Speed: Extremely fast; can handle thousands of requests per minute
  • Resource usage: Minimal CPU and memory consumption
  • JavaScript: Does not execute JavaScript
  • Best for: Static HTML websites, APIs, and server-rendered content

When to use CheerioCrawler:

  • The target website serves fully-rendered HTML from the server
  • Content is available in the initial HTML response (no dynamic loading)
  • You need maximum speed and efficiency
  • You're scraping large volumes of simple pages
  • The website doesn't use JavaScript for content rendering

Example:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        const title = $('title').text();
        const articles = $('article h2').map((i, el) => $(el).text()).get();

        console.log(`Title: ${title}`);
        console.log(`Articles found: ${articles.length}`);

        // Enqueue pagination links
        await enqueueLinks({
            selector: 'a.next-page',
        });
    },
});

await crawler.run(['https://example.com/blog']);

PuppeteerCrawler: Full Browser Automation with Chromium

PuppeteerCrawler uses Google's Puppeteer library to control a headless Chromium browser. It executes JavaScript, handles dynamic content, and can interact with web pages just like a real user.

Key characteristics:

  • Speed: Moderate; slower than Cheerio but faster than Playwright
  • Resource usage: High CPU and memory consumption (runs a full browser)
  • JavaScript: Full JavaScript execution and DOM manipulation
  • Browser: Chromium only
  • Best for: JavaScript-heavy websites, SPAs, and sites requiring user interactions

When to use PuppeteerCrawler:

  • Content is loaded dynamically via JavaScript
  • You need to interact with the page (clicks, form submissions, scrolling)
  • The website uses AJAX requests to fetch data
  • You need to handle browser sessions or cookies
  • You're comfortable with Chromium-only support

Example:

import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        // Wait for dynamic content to load
        await page.waitForSelector('.product-list');

        // Scroll to trigger lazy loading
        await page.evaluate(() => {
            window.scrollTo(0, document.body.scrollHeight);
        });

        // Give lazy-loaded items time to render; page.waitForTimeout() was
        // removed in newer Puppeteer versions, so use a plain delay instead
        await new Promise((resolve) => setTimeout(resolve, 2000));

        // Extract data from JavaScript-rendered content
        const products = await page.$$eval('.product-item', items => {
            return items.map(item => ({
                name: item.querySelector('h3')?.textContent,
                price: item.querySelector('.price')?.textContent,
            }));
        });

        console.log(`Found ${products.length} products`);

        // Enqueue next pages
        await enqueueLinks({
            selector: 'a.pagination-link',
        });
    },
    headless: true,
});

await crawler.run(['https://example.com/products']);

PlaywrightCrawler: Cross-Browser Automation

PlaywrightCrawler uses Microsoft's Playwright library, which supports multiple browser engines (Chromium, Firefox, and WebKit). It offers the most comprehensive browser automation capabilities.

Key characteristics:

  • Speed: Slowest of the three, reflecting its broader browser support
  • Resource usage: High CPU and memory consumption
  • JavaScript: Full JavaScript execution and DOM manipulation
  • Browsers: Chromium, Firefox, and WebKit (Safari)
  • Best for: Cross-browser testing, complex interactions, and maximum compatibility

When to use PlaywrightCrawler:

  • You need cross-browser compatibility testing
  • The website behaves differently across browsers
  • You need WebKit/Safari-specific features
  • You require advanced browser automation features
  • You want to handle AJAX requests across different browsers

Example:

import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        // Handle authentication modal
        const loginButton = await page.$('button.login');
        if (loginButton) {
            await loginButton.click();
            await page.fill('input[name="email"]', 'user@example.com');
            await page.fill('input[name="password"]', 'password');
            // Start waiting for the navigation before clicking to avoid a race
            await Promise.all([
                page.waitForNavigation(),
                page.click('button[type="submit"]'),
            ]);
        }

        // Extract data after authentication
        const data = await page.evaluate(() => {
            return {
                userInfo: document.querySelector('.user-profile')?.textContent,
                notifications: Array.from(document.querySelectorAll('.notification'))
                    .map(n => n.textContent),
            };
        });

        console.log(`User data:`, data);
    },
    launchContext: {
        // Pass the Playwright browser type object: chromium, firefox, or webkit
        launcher: firefox,
    },
});

await crawler.run(['https://example.com/dashboard']);

Decision Matrix: Choosing the Right Crawler

Performance Comparison

| Crawler Type      | Speed      | Memory Usage | CPU Usage  | JavaScript Support |
|-------------------|------------|--------------|------------|--------------------|
| CheerioCrawler    | ⚡⚡⚡⚡⚡ | 💾           | 🔥         | ❌                 |
| PuppeteerCrawler  | ⚡⚡⚡     | 💾💾💾💾     | 🔥🔥🔥🔥   | ✅                 |
| PlaywrightCrawler | ⚡⚡       | 💾💾💾💾💾   | 🔥🔥🔥🔥🔥 | ✅                 |

Use Case Decision Tree

Does the website require JavaScript to render content?
├─ NO → Use CheerioCrawler
│   └─ Fastest, most efficient option
│
└─ YES → Does the website work only on specific browsers?
    ├─ NO → Use PuppeteerCrawler
    │   └─ Good balance of features and performance
    │
    └─ YES → Use PlaywrightCrawler
        └─ Cross-browser support
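
If it helps to make the tree concrete, here is a minimal sketch of the same logic as code. The chooseCrawler helper and its two boolean flags are hypothetical names, the flags are values you determine by inspecting the target site yourself, and you would still attach a requestHandler before running:

import { CheerioCrawler, PlaywrightCrawler, PuppeteerCrawler } from 'crawlee';

// Hypothetical helper mirroring the decision tree above; both flags are
// assumptions you supply after inspecting the target site.
function chooseCrawler({ needsJavaScript, needsCrossBrowser }) {
    if (!needsJavaScript) return new CheerioCrawler();     // fastest, cheapest
    if (!needsCrossBrowser) return new PuppeteerCrawler(); // Chromium is enough
    return new PlaywrightCrawler();                        // full cross-browser support
}

const crawler = chooseCrawler({ needsJavaScript: true, needsCrossBrowser: false });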

Python Equivalents with Crawlee for Python

Crawlee also has a Python version with similar crawler types. Here's how to choose between them in Python:

CheerioCrawler in Python (BeautifulSoupCrawler)

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context):
        title = context.soup.find('title').text
        articles = [h2.text for h2 in context.soup.select('article h2')]

        print(f'Title: {title}')
        print(f'Articles found: {len(articles)}')

        await context.enqueue_links(selector='a.next-page')

    await crawler.run(['https://example.com/blog'])

if __name__ == '__main__':
    asyncio.run(main())

PuppeteerCrawler/PlaywrightCrawler in Python

Puppeteer has no official Python bindings, so Crawlee for Python covers both browser-based use cases with PlaywrightCrawler:

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler

async def main():
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context):
        page = context.page

        # Wait for dynamic content
        await page.wait_for_selector('.product-list')

        # Extract data
        products = await page.query_selector_all('.product-item')
        product_data = []

        for product in products:
            name = await product.query_selector('h3')
            price = await product.query_selector('.price')
            product_data.append({
                'name': await name.text_content() if name else None,
                'price': await price.text_content() if price else None,
            })

        print(f'Found {len(product_data)} products')

        await context.enqueue_links(selector='a.pagination-link')

    await crawler.run(['https://example.com/products'])

if __name__ == '__main__':
    asyncio.run(main())

Advanced Considerations

Hybrid Approach

You can use multiple crawler types in the same project. Start with CheerioCrawler for fast link discovery, then hand the collected URLs to a browser-based crawler for the JavaScript-heavy pages:

import { CheerioCrawler, PuppeteerCrawler } from 'crawlee';

const productUrls = [];

const cheerioCrawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Collect absolute URLs of product detail pages for the browser crawler
        $('a.product-link').each((i, el) => {
            const href = $(el).attr('href');
            if (href) productUrls.push(new URL(href, request.loadedUrl ?? request.url).href);
        });
    },
});

const puppeteerCrawler = new PuppeteerCrawler({
    async requestHandler({ page }) {
        // Product detail pages rely on JavaScript rendering
        await page.waitForSelector('.product-details');
        const details = await page.evaluate(() => ({
            description: document.querySelector('.description')?.textContent,
            specifications: document.querySelector('.specs')?.innerHTML,
        }));
        console.log('Product details:', details);
    },
});

// Discover cheaply first, then render only the pages that need a browser
await cheerioCrawler.run(['https://example.com/catalog']);
await puppeteerCrawler.run(productUrls);

Resource Management

When using browser-based crawlers, configure resource limits to prevent system overload:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxConcurrency: 5, // Limit concurrent browser instances
    requestHandlerTimeoutSecs: 60, // Prevent hanging requests
    navigationTimeoutSecs: 30, // Timeout for page navigation

    launchContext: {
        launchOptions: {
            headless: true,
            args: [
                '--no-sandbox',
                '--disable-setuid-sandbox',
                '--disable-dev-shm-usage', // Overcome limited resource problems
            ],
        },
    },

    async requestHandler({ request, page }) {
        // Your scraping logic
    },
});

Practical Recommendations

Start Simple, Scale Up

  1. Test with CheerioCrawler first: Always try the lightweight option to see if it works (see the probe sketch after this list)
  2. Upgrade to PuppeteerCrawler: If you need JavaScript rendering
  3. Use PlaywrightCrawler: Only when you need cross-browser support or specific browser features
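
For step 1, a quick way to check whether CheerioCrawler is enough is to probe whether the selector you care about already exists in the static HTML. This is a minimal sketch; the '.article-body' selector and the URL are placeholder assumptions to adjust for your target site:

import { CheerioCrawler } from 'crawlee';

// Probe whether the content you need is present without JavaScript rendering
const probe = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        if ($('.article-body').length > 0) {
            console.log(`${request.url}: content in static HTML, CheerioCrawler suffices`);
        } else {
            console.log(`${request.url}: content missing, likely JavaScript-rendered; try a browser crawler`);
        }
    },
});

await probe.run(['https://example.com/sample-page']);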

Monitor and Optimize

  • Track execution time: Measure how long each crawler takes
  • Monitor memory usage: Browser-based crawlers can consume significant RAM
  • Adjust concurrency: Lower concurrency for resource-intensive crawlers
  • Use request filtering: Don't load unnecessary resources (images, CSS) when not needed; see the sketch after this list
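
For the request-filtering point, here is a minimal sketch with PlaywrightCrawler. It uses Playwright's page.route() inside a Crawlee preNavigationHook to abort asset requests before navigation; the glob pattern is an example you would tune per site:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Abort image, font, and stylesheet requests before the page loads
            await page.route('**/*.{png,jpg,jpeg,gif,svg,woff,woff2,css}', (route) => route.abort());
        },
    ],
    async requestHandler({ request, page }) {
        console.log(`Loaded ${request.url} without heavy assets`);
    },
});

await crawler.run(['https://example.com/products']);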

Cost Considerations

Running browser-based crawlers (Puppeteer/Playwright) costs more in terms of:

  • Server resources: Higher CPU and memory requirements
  • Cloud hosting: More expensive instances needed
  • Execution time: Slower scraping means longer-running jobs

CheerioCrawler can be 10-50x faster and use 90% less memory than browser-based alternatives.
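
These ratios vary by site, so it is worth timing your own runs. A minimal sketch with a wall-clock measurement, using a placeholder URL:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

// Wall-clock timing of the whole run; swap in PuppeteerCrawler or
// PlaywrightCrawler with an equivalent handler to compare costs
const start = Date.now();
await crawler.run(['https://example.com/blog']);
console.log(`Crawl finished in ${((Date.now() - start) / 1000).toFixed(1)}s`);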

Conclusion

Choosing the right Crawlee crawler type is crucial for building efficient web scrapers. Use CheerioCrawler for speed and efficiency with static HTML, PuppeteerCrawler for JavaScript-heavy websites with Chromium, and PlaywrightCrawler when you need cross-browser compatibility. Consider starting with the simplest option and upgrading only when necessary to balance performance with functionality.

For more advanced scenarios involving browser automation, you may also want to explore how to navigate to different pages using Puppeteer or learn about handling browser events for more complex interactions.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
