How do I optimize Crawlee performance for faster scraping?
Optimizing Crawlee performance is crucial for large-scale web scraping projects. By fine-tuning concurrency settings, managing resources efficiently, and implementing smart crawling strategies, you can significantly improve scraping speed while maintaining reliability. This guide covers essential techniques to maximize Crawlee's performance.
Understanding Crawlee's Performance Architecture
Crawlee is built with performance in mind, featuring automatic scaling, request queue management, and intelligent retry mechanisms. However, achieving optimal performance requires understanding how to configure these features for your specific use case.
The main performance factors include:
- Concurrency: Number of parallel requests
- Request queue management: How URLs are prioritized and processed
- Resource utilization: Memory and CPU usage
- Network efficiency: Request throttling and retry strategies
- Data storage: How scraped data is persisted
Optimizing Concurrency Settings
Concurrency is the most impactful factor for scraping speed. Crawlee allows you to control how many requests run simultaneously.
Basic Concurrency Configuration
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
maxConcurrency: 50, // Maximum concurrent requests
minConcurrency: 10, // Minimum concurrent requests
requestHandlerTimeoutSecs: 60,
async requestHandler({ request, $, log }) {
// Your scraping logic
const title = $('h1').text();
log.info(`Title: ${title}`);
},
});
await crawler.run(['https://example.com']);
Dynamic Concurrency with Autoscaling
Crawlee's autoscaling feature automatically adjusts concurrency based on system resources:
import { PuppeteerCrawler, Configuration } from 'crawlee';
const crawler = new PuppeteerCrawler({
maxConcurrency: 100,
minConcurrency: 5,
// Autoscaling configuration
autoscaledPoolOptions: {
    desiredConcurrency: 50, // Starting target; Crawlee scales between min and max
    // Scale down when memory usage exceeds 80% of the available limit.
    // The CPU threshold is a global Configuration option (maxUsedCpuRatio,
    // or the CRAWLEE_MAX_USED_CPU_RATIO environment variable).
    snapshotterOptions: {
        maxUsedMemoryRatio: 0.8,
    },
},
async requestHandler({ page, request, log }) {
await page.waitForSelector('h1');
const title = await page.title();
log.info(`Scraped: ${title}`);
},
});
await crawler.run(['https://example.com']);
Choosing the Right Crawler Type
Different crawler types have different performance characteristics:
CheerioCrawler - Fastest Option
For static websites, CheerioCrawler is the fastest option because it doesn't require a browser:
import { CheerioCrawler, Dataset } from 'crawlee';
const crawler = new CheerioCrawler({
maxConcurrency: 100, // Can handle very high concurrency
async requestHandler({ request, $, enqueueLinks }) {
const data = {
title: $('h1').first().text(),
description: $('meta[name="description"]').attr('content'),
};
await Dataset.pushData(data); // Store the extracted record
await enqueueLinks({
    globs: ['https://example.com/products/*'],
});
},
});
Performance tip: Use CheerioCrawler whenever possible; because it skips browser rendering entirely, it is typically 10-20x faster than browser-based crawlers.
PuppeteerCrawler - For Dynamic Content
When you need to handle AJAX requests or JavaScript-rendered content:
import { PuppeteerCrawler, Dataset } from 'crawlee';
const crawler = new PuppeteerCrawler({
maxConcurrency: 20, // Lower concurrency for browser-based scraping
launchContext: {
launchOptions: {
headless: true,
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
'--disable-accelerated-2d-canvas',
'--disable-gpu',
],
},
},
async requestHandler({ page, request }) {
// Wait for dynamic content
await page.waitForSelector('.product-list', { timeout: 10000 });
const products = await page.$$eval('.product', elements =>
elements.map(el => ({
name: el.querySelector('.name')?.textContent,
price: el.querySelector('.price')?.textContent,
}))
);
await Dataset.pushData(products); // Persist the results instead of discarding them
},
});
Request Queue Optimization
Efficient request queue management prevents bottlenecks and ensures smooth crawling.
Using RequestList for Known URLs
When you have a predefined list of URLs, use RequestList for better performance:
import { CheerioCrawler, RequestList } from 'crawlee';
// Prepare request list
const requestList = await RequestList.open('my-list', [
'https://example.com/page1',
'https://example.com/page2',
// ... thousands of URLs
]);
const crawler = new CheerioCrawler({
requestList,
maxConcurrency: 50,
async requestHandler({ request, $ }) {
// Your scraping logic
},
});
await crawler.run();
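If the crawl also discovers new links as it goes, a RequestList can be combined with a RequestQueue: the crawler drains the list into the queue and processes both. A minimal sketch, assuming seedUrls holds your predefined URLs:
import { CheerioCrawler, RequestList, RequestQueue } from 'crawlee';
const seedUrls = ['https://example.com/page1', 'https://example.com/page2'];
const requestList = await RequestList.open('seed-list', seedUrls);
const requestQueue = await RequestQueue.open();
const crawler = new CheerioCrawler({
    requestList, // Static seed URLs, loaded up front
    requestQueue, // Newly discovered links land here
    async requestHandler({ $, enqueueLinks }) {
        // By default, enqueueLinks() follows same-hostname links
        await enqueueLinks();
    },
});
await crawler.run();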
Request Queue Priority
Crawlee's request queue has no numeric priority field, but you can push important URLs to the front of the queue with the forefront option:
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
    async requestHandler({ request, $, crawler }) {
        // forefront: true places these requests at the front of the queue
        await crawler.addRequests(
            [{ url: 'https://example.com/high-priority' }],
            { forefront: true },
        );
        // Without it, requests are appended to the end of the queue
        await crawler.addRequests([
            { url: 'https://example.com/low-priority' },
        ]);
    },
});
Memory Management and Resource Optimization
Proper resource management prevents memory leaks and ensures stable long-running scrapers.
Configuring Storage Options
import { Configuration, PuppeteerCrawler } from 'crawlee';
const config = new Configuration({
    persistStorage: true, // Persist request queues, datasets, and key-value stores to disk
    purgeOnStart: false, // Keep data between runs
});
// The storage directory itself is set with the CRAWLEE_STORAGE_DIR environment
// variable (it defaults to ./storage).
const crawler = new PuppeteerCrawler({
maxRequestsPerCrawl: 10000, // Limit total requests
maxRequestsPerMinute: 120, // Rate limiting
// Recycle browsers and cap open pages to keep memory in check
browserPoolOptions: {
    maxOpenPagesPerBrowser: 10,
    retireBrowserAfterPageCount: 100,
},
async requestHandler({ page, request }) {
// Scraping logic
},
}, config);
Efficient Data Storage
Use datasets efficiently to avoid memory issues:
import { CheerioCrawler, Dataset } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $ }) {
const data = {
url: request.url,
title: $('h1').text(),
timestamp: new Date().toISOString(),
};
// Push data immediately instead of accumulating in memory
await Dataset.pushData(data);
},
});
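Named datasets keep separate crawls from mixing their results, and the whole dataset can be exported in one pass at the end. A sketch, assuming the hypothetical dataset name products-2024 and a Crawlee version that provides the exportToCSV helper:
import { Dataset } from 'crawlee';
// Open a named dataset so separate crawls don't mix their results
const dataset = await Dataset.open('products-2024');
await dataset.pushData({ url: 'https://example.com', title: 'Example' });
// Write the full dataset to the default key-value store as CSV
await dataset.exportToCSV('products-2024');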
Network Optimization Strategies
Request Throttling
Implement smart throttling to avoid overwhelming target servers:
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
maxRequestsPerMinute: 60, // Limit to 60 requests per minute
maxRequestRetries: 3, // Retry failed requests up to 3 times
requestHandlerTimeoutSecs: 60,
async requestHandler({ request, $ }) {
// Your scraping logic
},
});
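When crawling many domains at once, a per-domain delay is often gentler on each server than a global cap. Newer Crawlee versions (3.3+) expose sameDomainDelaySecs for this; a sketch, assuming your version supports the option:
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
    maxConcurrency: 50,
    // Wait at least 2 seconds between requests to the same domain,
    // while still crawling different domains in parallel
    sameDomainDelaySecs: 2,
    async requestHandler({ request, $ }) {
        // Your scraping logic
    },
});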
Session Management
Use session management for better performance with authenticated sites:
import { PuppeteerCrawler } from 'crawlee';
const crawler = new PuppeteerCrawler({
useSessionPool: true,
sessionPoolOptions: {
maxPoolSize: 20,
sessionOptions: {
maxUsageCount: 50, // Reuse sessions for multiple requests
},
},
async requestHandler({ page, request, session, log }) {
    // Crawlee has already navigated to request.url;
    // session cookies are attached and rotated automatically
    log.info(`Scraped ${request.url} with session ${session?.id}`);
},
});
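Sessions also help you recover from blocking: retiring a session discards its cookies so subsequent requests start fresh. A minimal sketch using session.retire(), assuming the target signals blocking with HTTP 401/403/429:
import { PuppeteerCrawler } from 'crawlee';
const crawler = new PuppeteerCrawler({
    useSessionPool: true,
    async requestHandler({ page, response, session, log }) {
        if (response && [401, 403, 429].includes(response.status())) {
            session?.retire(); // Drop this session's cookies
            throw new Error(`Blocked with status ${response.status()}`); // Triggers a retry
        }
        log.info(`OK: ${await page.title()}`);
    },
});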
Advanced Performance Techniques
Disable Unnecessary Resources
For browser-based scrapers, block images, fonts, and other resources:
import { PuppeteerCrawler } from 'crawlee';
const crawler = new PuppeteerCrawler({
preNavigationHooks: [
async ({ page, request }) => {
await page.setRequestInterception(true);
page.on('request', (req) => {
const resourceType = req.resourceType();
// Block images, fonts, and stylesheets
if (['image', 'font', 'stylesheet'].includes(resourceType)) {
req.abort();
} else {
req.continue();
}
});
},
],
async requestHandler({ page }) {
// Scraping logic - page loads faster without images
},
});
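Crawlee also ships a blockRequests helper on the browser crawling context that covers common static assets with less code. A sketch, assuming the default URL patterns fit your target site:
import { PuppeteerCrawler } from 'crawlee';
const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ blockRequests }) => {
            // Blocks common static assets by URL pattern,
            // plus any extra patterns you pass in
            await blockRequests({ extraUrlPatterns: ['adsbygoogle.js'] });
        },
    ],
    async requestHandler({ page }) {
        // Scraping logic
    },
});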
Parallel Processing with Multiple Crawlers
For very large projects, run multiple crawlers in parallel:
import { CheerioCrawler, RequestQueue } from 'crawlee';
async function crawlCategory(category, urls) {
    // Give each crawler its own named queue; otherwise parallel crawlers
    // in the same process would all share the default request queue
    const requestQueue = await RequestQueue.open(`queue-${category}`);
    const crawler = new CheerioCrawler({
        requestQueue,
        maxConcurrency: 30,
        async requestHandler({ request, $ }) {
            // Category-specific scraping logic
        },
    });
    await crawler.run(urls);
}
// Run multiple crawlers in parallel
await Promise.all([
crawlCategory('electronics', electronicsUrls),
crawlCategory('books', booksUrls),
crawlCategory('clothing', clothingUrls),
]);
Monitoring and Benchmarking
Track performance metrics to identify bottlenecks:
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $, log }) {
const startTime = Date.now();
// Your scraping logic
const title = $('h1').text();
const duration = Date.now() - startTime;
log.info(`Processed ${request.url} in ${duration}ms`);
},
// The error is passed as the second argument to failedRequestHandler
async failedRequestHandler({ request, log }, error) {
    log.error(`Request failed: ${request.url}`, { error: error.message });
},
});
// crawler.run() resolves with the final statistics once the crawl finishes
const stats = await crawler.run(['https://example.com']);
console.log(`Total requests: ${stats.requestsFinished}`);
console.log(`Failed requests: ${stats.requestsFailed}`);
Performance Checklist
To maximize Crawlee performance:
- Use CheerioCrawler for static content when possible
- Tune concurrency based on your system resources and target site
- Enable autoscaling to automatically adjust concurrency
- Implement request throttling to avoid rate limiting
- Use RequestList for known URLs instead of adding them one by one
- Manage sessions efficiently for authenticated scraping
- Disable unnecessary resources (images, fonts) for browser-based scraping
- Limit maxRequestsPerCrawl for memory-constrained environments
- Push data immediately to datasets instead of accumulating in memory
- Monitor performance metrics to identify and resolve bottlenecks
Conclusion
Optimizing Crawlee performance requires a balanced approach between speed and reliability. Start with conservative concurrency settings and gradually increase them while monitoring system resources. Choose the appropriate crawler type for your use case, implement efficient resource management, and use advanced techniques like request filtering and parallel processing for maximum performance.
By following these optimization strategies, you can build fast, reliable, and scalable web scrapers that efficiently handle large-scale data extraction projects.