What are the main features of Crawlee for web scraping?

Crawlee is a modern web scraping and browser automation library developed by Apify. It's designed to make building reliable crawlers and scrapers easier by providing a robust set of features that handle common challenges like rate limiting, proxy rotation, and request management. Originally built for Node.js, Crawlee has become a popular choice for developers who need production-grade web scraping capabilities.

Core Features of Crawlee

1. Unified API for Multiple Crawling Modes

Crawlee provides three main crawler types that share a consistent API:

  • CheerioCrawler: Fast HTTP crawler for static HTML content
  • PlaywrightCrawler: Full-featured browser crawler using Playwright
  • PuppeteerCrawler: Full-featured browser crawler using Puppeteer

This unified approach means you can switch between crawlers with minimal code changes:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        const title = $('title').text();
        console.log(`Title of ${request.url}: ${title}`);

        // Automatically enqueue all links found on the page
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);

For JavaScript-heavy sites, switch to PlaywrightCrawler:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        const title = await page.title();
        console.log(`Title of ${request.url}: ${title}`);

        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);

2. Automatic Request Queue Management

Crawlee includes a sophisticated request queue system that automatically manages URLs to crawl. The queue handles:

  • Deduplication: Automatically prevents crawling the same URL multiple times
  • Persistence: Saves queue state to disk or cloud storage
  • Priority handling: Allows prioritizing certain requests
  • Request retries: Automatically retries failed requests with exponential backoff

The example below combines the queue with data extraction and scoped link enqueueing:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 100,
    async requestHandler({ request, page, enqueueLinks }) {
        // Extract data
        const data = await page.evaluate(() => {
            return {
                title: document.title,
                heading: document.querySelector('h1')?.textContent,
                description: document.querySelector('meta[name="description"]')?.content,
            };
        });

        // Save data to default dataset
        await Dataset.pushData(data);

        // Add more URLs to the queue
        await enqueueLinks({
            globs: ['https://example.com/blog/**'],
        });
    },
});

await crawler.run(['https://example.com']);
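
The bullet list above mentions priority handling and retries, which the example doesn't demonstrate. Here's a minimal sketch of both, using RequestQueue.addRequest with the forefront option (the URLs are placeholders):

import { CheerioCrawler, RequestQueue } from 'crawlee';

// Open the default request queue and seed it manually
const requestQueue = await RequestQueue.open();

// A normal request goes to the back of the queue
await requestQueue.addRequest({ url: 'https://example.com/page' });

// forefront: true pushes the request to the front (higher priority)
await requestQueue.addRequest(
    { url: 'https://example.com/important' },
    { forefront: true },
);

const crawler = new CheerioCrawler({
    requestQueue,
    maxRequestRetries: 5, // failed requests are retried with backoff
    async requestHandler({ request }) {
        console.log(`Processing ${request.url}`);
    },
});

await crawler.run();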

3. Built-in Proxy Rotation and Session Management

Crawlee handles proxy rotation automatically, which is essential for avoiding blocks and bypassing rate limits:

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8000',
        'http://proxy2.example.com:8000',
        'http://proxy3.example.com:8000',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    sessionPoolOptions: {
        maxPoolSize: 20,
        sessionOptions: {
            maxUsageCount: 50, // Retire session after 50 uses
        },
    },
    async requestHandler({ request, page }) {
        // Crawlee automatically rotates proxies and manages sessions
        const content = await page.content();
        console.log(`Fetched ${request.url} through proxy`);
    },
});

await crawler.run(['https://example.com']);
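
Recent Crawlee versions also support tiered proxies, where the crawler starts with the cheapest tier and only escalates for domains that keep blocking it. A sketch with placeholder proxy URLs:

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    tieredProxyUrls: [
        [null], // tier 0: no proxy at all
        ['http://datacenter-proxy.example.com:8000'],
        ['http://residential-proxy.example.com:8000'], // last resort
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    async requestHandler({ request }) {
        console.log(`Fetched ${request.url}`);
    },
});

await crawler.run(['https://example.com']);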

4. Smart Request Throttling and AutoScaling

Crawlee automatically adjusts concurrency based on system resources and response times:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    autoscaledPoolOptions: {
        minConcurrency: 1,
        maxConcurrency: 50,
        desiredConcurrency: 10,
        // Automatically scales based on CPU and memory usage
    },
    maxRequestsPerMinute: 120,
    async requestHandler({ request, page }) {
        // Your scraping logic here
    },
});

await crawler.run(['https://example.com']);

The AutoScaling feature monitors:

  • System CPU usage
  • Memory consumption
  • Request success rates
  • Response times

It automatically adjusts the number of concurrent requests to optimize performance without overwhelming your system or the target website.
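
If the defaults are too aggressive or too conservative for your environment, the thresholds behind these scaling decisions can be tuned. A sketch assuming the SystemStatusOptions field names from Crawlee's API docs:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    autoscaledPoolOptions: {
        systemStatusOptions: {
            // Maximum allowed ratio of "overloaded" samples (0..1)
            // before the pool scales concurrency down
            maxCpuOverloadedRatio: 0.4,
            maxMemoryOverloadedRatio: 0.2,
            maxEventLoopOverloadedRatio: 0.6,
        },
    },
    async requestHandler({ page }) {
        // Your scraping logic here
    },
});

await crawler.run(['https://example.com']);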

5. Request Interception and Blocking

Crawlee allows you to block unnecessary resources to speed up crawling, similar to handling network requests in Puppeteer:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page, request }) => {
            // Block images, stylesheets, and fonts
            await page.route('**/*', (route) => {
                const resourceType = route.request().resourceType();
                if (['image', 'stylesheet', 'font'].includes(resourceType)) {
                    route.abort();
                } else {
                    route.continue();
                }
            });
        },
    ],
    async requestHandler({ request, page }) {
        // Faster scraping without loading unnecessary resources
    },
});

await crawler.run(['https://example.com']);
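
Crawlee also ships a helper that covers this common case without a manual route handler. A sketch using playwrightUtils.blockRequests, which blocks typical static assets by URL pattern (it relies on CDP, so it only works with Chromium-based browsers; the extra pattern below is illustrative):

import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Blocks common static assets (images, fonts, CSS, ...);
            // extraUrlPatterns adds custom patterns on top of the defaults
            await playwrightUtils.blockRequests(page, {
                extraUrlPatterns: ['*tracking*'],
            });
        },
    ],
    async requestHandler({ request }) {
        // Faster scraping without the blocked resources
    },
});

await crawler.run(['https://example.com']);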

6. Data Storage Options

Crawlee provides multiple built-in storage options:

import { PlaywrightCrawler, Dataset, KeyValueStore } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page }) {
        // Extract product data
        const product = await page.evaluate(() => ({
            name: document.querySelector('.product-name')?.textContent,
            price: document.querySelector('.price')?.textContent,
            description: document.querySelector('.description')?.textContent,
        }));

        // Save to dataset (append-only storage)
        await Dataset.pushData(product);

        // Save screenshots or files to the key-value store; keys may only
        // contain a-zA-Z0-9 and !-_.'() characters, so sanitize the URL
        const screenshot = await page.screenshot();
        const key = `screenshot-${request.url.replace(/[^a-zA-Z0-9!\-_.'()]/g, '_')}`;
        await KeyValueStore.setValue(key, screenshot, { contentType: 'image/png' });
    },
});

await crawler.run(['https://example.com/products']);

// Export data after crawling
const dataset = await Dataset.open();
await dataset.exportToJSON('products');

7. Error Handling and Retry Logic

Crawlee includes robust error handling with automatic retries:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 3,
    requestHandlerTimeoutSecs: 60,
    async requestHandler({ request, page, log }) {
        try {
            // Crawlee has already navigated to request.url before this
            // handler runs, so just wait for the network to settle
            await page.waitForLoadState('networkidle');

            // Your scraping logic
            const data = await page.evaluate(() => ({
                // Extract data
            }));

        } catch (error) {
            log.error(`Error processing ${request.url}`, { error });
            throw error; // Crawlee will retry automatically
        }
    },
    async failedRequestHandler({ request, log }) {
        log.error(`Request failed after all retries: ${request.url}`);
    },
});

await crawler.run(['https://example.com']);
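
When a failure looks like a block rather than a transient error, you can also retire the current session so the retry runs with a fresh one (and, if a proxy pool is configured, a different proxy). A minimal sketch; the "Access Denied" title check is just an illustrative heuristic:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 3,
    async requestHandler({ request, page, session }) {
        const title = await page.title();
        if (title.includes('Access Denied')) {
            // Retire the session, then throw so Crawlee retries
            // the request with a new one
            session?.retire();
            throw new Error(`Blocked on ${request.url}`);
        }
        // Normal scraping logic
    },
});

await crawler.run(['https://example.com']);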

8. TypeScript Support

Crawlee is written in TypeScript and provides excellent type safety:

import { PlaywrightCrawler, Dataset } from 'crawlee';

interface Product {
    name: string;
    price: number;
    inStock: boolean;
}

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page }) {
        const product: Product = await page.evaluate(() => ({
            name: document.querySelector('.product-name')?.textContent || '',
            price: parseFloat(document.querySelector('.price')?.textContent || '0'),
            inStock: document.querySelector('.in-stock') !== null,
        }));

        await Dataset.pushData<Product>(product);
    },
});

await crawler.run(['https://example.com/products']);

9. Hooks and Middleware

Crawlee provides lifecycle hooks for customizing behavior:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page, request }) => {
            // Set custom headers or cookies before navigation
            await page.setExtraHTTPHeaders({
                'Accept-Language': 'en-US,en;q=0.9',
            });
        },
    ],
    postNavigationHooks: [
        async ({ page, request }) => {
            // Wait for specific conditions after page load
            await page.waitForSelector('.content-loaded');
        },
    ],
    async requestHandler({ request, page }) {
        // Main scraping logic
    },
});

await crawler.run(['https://example.com']);

10. Sitemap and robots.txt Support

Crawlee can respect robots.txt rules (via the respectRobotsTxtFile option shown below, available in recent versions) and can discover URLs from sitemaps:

import { CheerioCrawler, EnqueueStrategy } from 'crawlee';

const crawler = new CheerioCrawler({
    respectRobotsTxtFile: true, // skip URLs disallowed by robots.txt
    async requestHandler({ request, $, enqueueLinks }) {
        // Enqueue every link found on the page
        await enqueueLinks({
            strategy: EnqueueStrategy.All,
            transformRequestFunction: (req) => {
                // Modify requests before queueing, e.g. to track crawl
                // depth (default to 0 when the start page has none set)
                req.userData = { depth: (request.userData.depth ?? 0) + 1 };
                return req;
            },
        });
    },
    },
});

// Start from the site's homepage
await crawler.run(['https://example.com']);
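
To start from a sitemap, extract its URLs first: enqueueLinks only follows anchor tags in HTML, so pointing run() at a sitemap.xml wouldn't discover its entries. A sketch using the Sitemap utility from @crawlee/utils (a dependency of crawlee):

import { CheerioCrawler } from 'crawlee';
import { Sitemap } from '@crawlee/utils';

// Download and parse the sitemap, then feed its URLs to the crawler
const { urls } = await Sitemap.load('https://example.com/sitemap.xml');

const crawler = new CheerioCrawler({
    async requestHandler({ request }) {
        console.log(`Scraping ${request.url}`);
    },
});

await crawler.run(urls);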

Python Support with Crawlee

While Crawlee was originally Node.js-only, the team has recently released a Python version:

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main():
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=100,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
        }
        await context.push_data(data)
        await context.enqueue_links()

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())

Advanced Features

Fingerprint Generation

Crawlee can generate browser fingerprints to avoid detection when automating browsers:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    // Generates realistic browser fingerprints
    launchContext: {
        useChrome: true,
        launchOptions: {
            headless: true,
        },
    },
    async requestHandler({ request, page }) {
        // Crawlee automatically rotates fingerprints per session
    },
});

await crawler.run(['https://example.com']);
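
Fingerprint generation can also be tuned through the browser pool. A sketch based on the fingerprint options documented for Crawlee's browser pool integration (exact option names may vary slightly between versions):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        useFingerprints: true, // the default in recent versions
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['chrome', 'firefox'],
                devices: ['desktop'],
                operatingSystems: ['windows', 'linux'],
            },
        },
    },
    async requestHandler({ page }) {
        // Each session gets a consistent, realistic fingerprint
    },
});

await crawler.run(['https://example.com']);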

Request Labeling and Routing

Organize different types of requests with labels:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        if (request.label === 'CATEGORY') {
            // Handle category pages
            await enqueueLinks({
                globs: ['**/products/**'],
                label: 'PRODUCT',
            });
        } else if (request.label === 'PRODUCT') {
            // Handle product pages
            const product = await page.evaluate(() => ({
                // Extract product data
            }));
            await Dataset.pushData(product);
        }
    },
});

await crawler.run([
    { url: 'https://example.com/category', label: 'CATEGORY' },
]);
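
For larger crawlers, the same routing can be expressed with Crawlee's Router, which replaces the if/else chain with per-label handlers. A minimal sketch (the selectors are illustrative):

import { PlaywrightCrawler, createPlaywrightRouter, Dataset } from 'crawlee';

const router = createPlaywrightRouter();

// Requests labeled CATEGORY enqueue product pages
router.addHandler('CATEGORY', async ({ enqueueLinks }) => {
    await enqueueLinks({ globs: ['**/products/**'], label: 'PRODUCT' });
});

// Requests labeled PRODUCT extract and store data
router.addHandler('PRODUCT', async ({ page }) => {
    const name = await page.locator('.product-name').textContent();
    await Dataset.pushData({ name });
});

// Fallback for requests without a label
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ label: 'CATEGORY' });
});

const crawler = new PlaywrightCrawler({ requestHandler: router });

await crawler.run([{ url: 'https://example.com/category', label: 'CATEGORY' }]);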

Conclusion

Crawlee is a comprehensive web scraping framework that combines the power of browser automation tools like Puppeteer with intelligent features like automatic scaling, proxy rotation, and request management. Its unified API makes it easy to switch between different crawling strategies, while built-in features handle common challenges that would otherwise require significant custom code.

Whether you're building a simple scraper or a production-grade crawling system, Crawlee provides the tools and abstractions needed to create reliable, scalable solutions. The framework's automatic handling of proxies, sessions, retries, and resource optimization allows developers to focus on extraction logic rather than infrastructure concerns.

For developers looking for an all-in-one solution that combines ease of use with enterprise-grade features, Crawlee represents a significant advancement in the web scraping ecosystem.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
