What are the main features of Crawlee for web scraping?
Crawlee is a modern web scraping and browser automation library developed by Apify. It's designed to make building reliable crawlers and scrapers easier by providing a robust set of features that handle common challenges like rate limiting, proxy rotation, and request management. Originally built for Node.js, Crawlee has become a popular choice for developers who need production-grade web scraping capabilities.
Core Features of Crawlee
1. Unified API for Multiple Crawling Modes
Crawlee provides three main crawler types that share a consistent API:
- CheerioCrawler: Fast HTTP crawler for static HTML content
- PlaywrightCrawler: Full-featured browser crawler using Playwright
- PuppeteerCrawler: Full-featured browser crawler using Puppeteer
This unified approach means you can switch between crawlers with minimal code changes:
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $, enqueueLinks }) {
const title = $('title').text();
console.log(`Title of ${request.url}: ${title}`);
// Automatically enqueue all links found on the page
await enqueueLinks();
},
});
await crawler.run(['https://example.com']);
For JavaScript-heavy sites, switch to PlaywrightCrawler:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page, enqueueLinks }) {
const title = await page.title();
console.log(`Title of ${request.url}: ${title}`);
await enqueueLinks();
},
});
await crawler.run(['https://example.com']);
2. Automatic Request Queue Management
Crawlee includes a sophisticated request queue system that automatically manages URLs to crawl. The queue handles:
- Deduplication: Automatically prevents crawling the same URL multiple times
- Persistence: Saves queue state to disk or cloud storage
- Priority handling: Allows prioritizing certain requests so they are crawled first (see the sketch after the example below)
- Request retries: Automatically retries failed requests with exponential backoff
import { PlaywrightCrawler, Dataset } from 'crawlee';
const crawler = new PlaywrightCrawler({
maxRequestsPerCrawl: 100,
async requestHandler({ request, page, enqueueLinks }) {
// Extract data
const data = await page.evaluate(() => {
return {
title: document.title,
heading: document.querySelector('h1')?.textContent,
description: document.querySelector('meta[name="description"]')?.content,
};
});
// Save data to default dataset
await Dataset.pushData(data);
// Add more URLs to the queue
await enqueueLinks({
globs: ['https://example.com/blog/**'],
});
},
});
await crawler.run(['https://example.com']);
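The priority handling mentioned in the list above is exposed through the request queue's forefront option. A minimal sketch, assuming you open the queue explicitly and pass it to the crawler (the URLs are placeholders):

import { PlaywrightCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open();
// Regular requests are appended to the end of the queue
await requestQueue.addRequest({ url: 'https://example.com/archive' });
// forefront: true pushes a request to the front so it is processed next
await requestQueue.addRequest(
    { url: 'https://example.com/breaking-news' },
    { forefront: true },
);

const crawler = new PlaywrightCrawler({
    requestQueue,
    async requestHandler({ request, page }) {
        console.log(`Processing ${request.url}`);
    },
});
await crawler.run();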
3. Built-in Proxy Rotation and Session Management
Crawlee handles proxy rotation automatically, which is essential for avoiding blocks and bypassing rate limits:
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
proxyUrls: [
'http://proxy1.example.com:8000',
'http://proxy2.example.com:8000',
'http://proxy3.example.com:8000',
],
});
const crawler = new PlaywrightCrawler({
proxyConfiguration,
sessionPoolOptions: {
maxPoolSize: 20,
sessionOptions: {
maxUsageCount: 50, // Retire session after 50 uses
},
},
async requestHandler({ request, page }) {
// Crawlee automatically rotates proxies and manages sessions
const content = await page.content();
console.log(`Fetched ${request.url} through proxy`);
},
});
await crawler.run(['https://example.com']);
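The same ProxyConfiguration object can also be queried directly, which is handy for checking which proxy a given session would receive. A small sketch using its newUrl() method:

import { ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8000',
        'http://proxy2.example.com:8000',
    ],
});

// Without a session ID, newUrl() rotates through the list
console.log(await proxyConfiguration.newUrl());
// With a session ID, the same session keeps getting the same proxy
console.log(await proxyConfiguration.newUrl('session-1'));
console.log(await proxyConfiguration.newUrl('session-1'));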
4. Smart Request Throttling and AutoScaling
Crawlee automatically adjusts concurrency based on system resources and response times:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
autoscaledPoolOptions: {
minConcurrency: 1,
maxConcurrency: 50,
desiredConcurrency: 10,
// Automatically scales based on CPU and memory usage
},
maxRequestsPerMinute: 120,
async requestHandler({ request, page }) {
// Your scraping logic here
},
});
await crawler.run(['https://example.com']);
The AutoScaling feature monitors:
- System CPU usage
- Memory consumption
- Request success rates
- Response times
It automatically adjusts the number of concurrent requests to optimize performance without overwhelming your system or the target website.
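If the defaults are too aggressive or too conservative, the thresholds the autoscaler uses can be tuned through systemStatusOptions. A minimal sketch, assuming current Crawlee option names (the specific ratios are illustrative, not recommendations):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    autoscaledPoolOptions: {
        maxConcurrency: 25,
        systemStatusOptions: {
            // Length of the sliding window of snapshots the autoscaler evaluates
            currentHistorySecs: 10,
            // Scale down when more than 30% of recent memory snapshots are overloaded
            maxMemoryOverloadedRatio: 0.3,
            // Scale down when more than 40% of recent CPU snapshots are overloaded
            maxCpuOverloadedRatio: 0.4,
        },
    },
    async requestHandler({ request, page }) {
        console.log(`Processing ${request.url}`);
    },
});
await crawler.run(['https://example.com']);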
5. Request Interception and Blocking
Crawlee allows you to block unnecessary resources to speed up crawling, similar to handling network requests in Puppeteer:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
preNavigationHooks: [
async ({ page, request }) => {
// Block images, stylesheets, and fonts
await page.route('**/*', (route) => {
const resourceType = route.request().resourceType();
if (['image', 'stylesheet', 'font'].includes(resourceType)) {
route.abort();
} else {
route.continue();
}
});
},
],
async requestHandler({ request, page }) {
// Faster scraping without loading unnecessary resources
},
});
await crawler.run(['https://example.com']);
6. Data Storage Options
Crawlee provides multiple built-in storage options:
import { PlaywrightCrawler, Dataset, KeyValueStore } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page }) {
// Extract product data
const product = await page.evaluate(() => ({
name: document.querySelector('.product-name')?.textContent,
price: document.querySelector('.price')?.textContent,
description: document.querySelector('.description')?.textContent,
}));
// Save to dataset (append-only storage)
await Dataset.pushData(product);
// Save screenshots or files to key-value store
const screenshot = await page.screenshot();
// Key-value store keys may only contain a-zA-Z0-9 and !-_.'(), so derive a safe key from the URL
const key = `screenshot-${request.url.replace(/[^a-zA-Z0-9!\-_.'()]/g, '-')}`;
await KeyValueStore.setValue(key, screenshot, { contentType: 'image/png' });
},
});
await crawler.run(['https://example.com/products']);
// Export data after crawling
const dataset = await Dataset.open();
await dataset.exportToJSON('products');
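Datasets and key-value stores can also be opened under explicit names, and CSV export is available alongside JSON. A small sketch, where the storage names are arbitrary choices:

import { Dataset, KeyValueStore } from 'crawlee';

// Named storages keep results from different crawls separate
const products = await Dataset.open('products');
await products.pushData({ name: 'Example product', price: '9.99' });
// CSV export is available alongside JSON export
await products.exportToCSV('products');

// Key-value stores can also be opened by name and hold arbitrary records
const runInfo = await KeyValueStore.open('run-info');
await runInfo.setValue('last-run', { finishedAt: new Date().toISOString() });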
7. Error Handling and Retry Logic
Crawlee includes robust error handling with automatic retries:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
maxRequestRetries: 3,
requestHandlerTimeoutSecs: 60,
async requestHandler({ request, page, log }) {
try {
// PlaywrightCrawler has already navigated to request.url before this handler runs,
// so there is no need to call page.goto() here
// Your scraping logic
const data = await page.evaluate(() => ({
// Extract data
}));
} catch (error) {
log.error(`Error processing ${request.url}`, { error });
throw error; // Crawlee will retry automatically
}
},
async failedRequestHandler({ request, log }) {
log.error(`Request failed after all retries: ${request.url}`);
},
});
await crawler.run(['https://example.com']);
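Besides failedRequestHandler, Crawlee also accepts an errorHandler that runs before each retry, which is a convenient place to react to blocking. A rough sketch, assuming you want to retire the current session on error so the retry uses a fresh one:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 3,
    async requestHandler({ page }) {
        // Scraping logic that may throw, e.g. when the page shows a block screen
        await page.waitForSelector('.content', { timeout: 10_000 });
    },
    async errorHandler({ session, request, log }, error) {
        // Runs before the request is retried
        log.warning(`Retrying ${request.url} after error: ${error.message}`);
        // Throw away the current session (and its cookies/proxy) before retrying
        session?.retire();
    },
});
await crawler.run(['https://example.com']);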
8. TypeScript Support
Crawlee is written in TypeScript and provides excellent type safety:
import { PlaywrightCrawler, Dataset } from 'crawlee';
interface Product {
name: string;
price: number;
inStock: boolean;
}
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page }) {
const product: Product = await page.evaluate(() => ({
name: document.querySelector('.product-name')?.textContent || '',
price: parseFloat(document.querySelector('.price')?.textContent || '0'),
inStock: document.querySelector('.in-stock') !== null,
}));
await Dataset.pushData<Product>(product);
},
});
await crawler.run(['https://example.com/products']);
9. Hooks and Middleware
Crawlee provides lifecycle hooks for customizing behavior:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
preNavigationHooks: [
async ({ page, request }) => {
// Set custom headers or cookies before navigation
await page.setExtraHTTPHeaders({
'Accept-Language': 'en-US,en;q=0.9',
});
},
],
postNavigationHooks: [
async ({ page, request }) => {
// Wait for specific conditions after page load
await page.waitForSelector('.content-loaded');
},
],
async requestHandler({ request, page }) {
// Main scraping logic
},
});
await crawler.run(['https://example.com']);
10. Sitemap and robots.txt Support
Recent versions of Crawlee can respect robots.txt rules, and sitemap URLs can be loaded with a helper utility (see the sketch after this example). Within the crawler, enqueueLinks gives you fine-grained control over which URLs enter the queue:
import { CheerioCrawler, EnqueueStrategy } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $, enqueueLinks }) {
// Enqueue links while respecting robots.txt
await enqueueLinks({
strategy: EnqueueStrategy.All,
transformRequestFunction: (req) => {
// Modify requests before adding to queue
req.userData = { depth: (request.userData.depth ?? 0) + 1 };
return req;
},
});
},
});
await crawler.run(['https://example.com']);
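To seed a crawl from a sitemap, the Sitemap helper from @crawlee/utils can download and parse it into a plain list of URLs; recent Crawlee versions also expose a respectRobotsTxtFile crawler option. A sketch under those assumptions (check your installed version for availability of both):

import { CheerioCrawler } from 'crawlee';
import { Sitemap } from '@crawlee/utils';

// Download and parse the sitemap into a list of URLs
const { urls } = await Sitemap.load('https://example.com/sitemap.xml');

const crawler = new CheerioCrawler({
    // Skip URLs that the site's robots.txt disallows (recent Crawlee versions)
    respectRobotsTxtFile: true,
    async requestHandler({ request, $ }) {
        console.log(`${request.url}: ${$('title').text()}`);
    },
});
await crawler.run(urls);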
Python Support with Crawlee
While Crawlee was originally Node.js-only, the team has recently released a Python version:
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main():
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=100,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
        }
        await context.push_data(data)
        await context.enqueue_links()

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())
Advanced Features
Fingerprint Generation
Crawlee can generate browser fingerprints to avoid detection when automating browsers:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
useSessionPool: true,
persistCookiesPerSession: true,
// Generates realistic browser fingerprints
launchContext: {
useChrome: true,
launchOptions: {
headless: true,
},
},
async requestHandler({ request, page }) {
// Crawlee automatically rotates fingerprints per session
},
});
await crawler.run(['https://example.com']);
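Fingerprint generation is on by default in the browser crawlers, but it can be steered through browserPoolOptions. A hedged sketch, assuming the option names used by current Crawlee versions (the browser and OS constraints are purely illustrative):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        useFingerprints: true, // default in browser crawlers
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                // Only generate fingerprints matching these constraints
                browsers: [{ name: 'firefox', minVersion: 100 }],
                devices: ['desktop'],
                operatingSystems: ['windows'],
            },
        },
    },
    async requestHandler({ request, page }) {
        console.log(`Fetched ${request.url} with a generated fingerprint`);
    },
});
await crawler.run(['https://example.com']);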
Request Labeling and Routing
Organize different types of requests with labels:
import { PlaywrightCrawler, Dataset } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page, enqueueLinks }) {
if (request.label === 'CATEGORY') {
// Handle category pages
await enqueueLinks({
globs: ['**/products/**'],
label: 'PRODUCT',
});
} else if (request.label === 'PRODUCT') {
// Handle product pages
const product = await page.evaluate(() => ({
// Extract product data
}));
await Dataset.pushData(product);
}
},
});
await crawler.run([
{ url: 'https://example.com/category', label: 'CATEGORY' },
]);
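For larger crawlers, the same label-based routing is usually expressed with Crawlee's Router instead of if/else chains. A brief sketch using createPlaywrightRouter:

import { PlaywrightCrawler, createPlaywrightRouter, Dataset } from 'crawlee';

const router = createPlaywrightRouter();

// Handles requests enqueued with label: 'PRODUCT'
router.addHandler('PRODUCT', async ({ request, page }) => {
    await Dataset.pushData({
        url: request.url,
        title: await page.title(),
    });
});

// Handles everything else, including the start URLs
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({
        globs: ['**/products/**'],
        label: 'PRODUCT',
    });
});

const crawler = new PlaywrightCrawler({ requestHandler: router });
await crawler.run(['https://example.com/category']);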
Conclusion
Crawlee is a comprehensive web scraping framework that combines the power of browser automation tools like Puppeteer with intelligent features like automatic scaling, proxy rotation, and request management. Its unified API makes it easy to switch between different crawling strategies, while built-in features handle common challenges that would otherwise require significant custom code.
Whether you're building a simple scraper or a production-grade crawling system, Crawlee provides the tools and abstractions needed to create reliable, scalable solutions. The framework's automatic handling of proxies, sessions, retries, and resource optimization allows developers to focus on extraction logic rather than infrastructure concerns.
For developers looking for an all-in-one solution that combines ease of use with enterprise-grade features, Crawlee represents a significant advancement in the web scraping ecosystem.