How do I use Crawlee with Node.js for web scraping?
Crawlee is a web scraping and browser automation library for Node.js, developed by Apify. It provides a unified interface for building reliable scrapers, with built-in request queuing and routing, automatic retries, proxy rotation, and autoscaling. This guide shows how to use Crawlee with Node.js, from simple HTTP scraping to full browser automation.
Prerequisites
Before getting started with Crawlee, ensure you have:
- Node.js version 16 or higher installed
- npm or yarn package manager
- Basic understanding of JavaScript and async/await syntax
- Familiarity with HTML and CSS selectors
Installing Crawlee
First, create a new Node.js project and install Crawlee:
# Create a new project directory
mkdir my-crawler
cd my-crawler
# Initialize a new Node.js project
npm init -y
# Install Crawlee and Playwright (Playwright is needed for browser-based scraping)
npm install crawlee playwright
# Download the Playwright browser binaries
npx playwright install
The examples in this guide use ES module syntax and top-level await, so set "type": "module" in your package.json or save your scripts with the .mjs extension.
Crawlee supports multiple HTTP clients and browser automation tools. The main options are:
- CheerioCrawler: Lightweight HTTP requests with Cheerio for HTML parsing
- PuppeteerCrawler: Uses Puppeteer for browser automation
- PlaywrightCrawler: Uses Playwright for modern browser automation
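All three share the same crawler interface, so switching between them is mostly a matter of changing the class name and the page-handling API. As a quick illustration, here is a minimal PuppeteerCrawler; this sketch assumes you have installed puppeteer alongside crawlee (npm install puppeteer):
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, page, pushData }) {
        // Puppeteer's page API is available here
        const title = await page.title();
        await pushData({ url: request.url, title });
    },
});

await crawler.run(['https://example.com']);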
Basic Web Scraping with CheerioCrawler
CheerioCrawler is ideal for scraping static websites that don't require JavaScript execution. Here's a basic example:
import { CheerioCrawler, log } from 'crawlee';

// Create a new CheerioCrawler instance
const crawler = new CheerioCrawler({
    // Maximum number of concurrent requests
    maxConcurrency: 10,

    // Request handler function
    async requestHandler({ request, $, enqueueLinks, pushData }) {
        log.info(`Processing: ${request.url}`);

        // Extract data using Cheerio selectors
        const title = $('h1').text().trim();
        const description = $('meta[name="description"]').attr('content');
        const links = [];
        $('a').each((i, el) => {
            links.push($(el).attr('href'));
        });

        // Store the extracted data
        await pushData({
            url: request.url,
            title,
            description,
            linkCount: links.length,
        });

        // Enqueue additional URLs to crawl
        await enqueueLinks({
            // Only follow links matching this pattern
            globs: ['https://example.com/**'],
            // Exclude certain patterns
            exclude: ['**/archive/**'],
        });
    },

    // Handle failed requests (the error is passed as the second argument)
    failedRequestHandler({ request }, error) {
        log.error(`Request ${request.url} failed: ${error.message}`);
    },
});

// Start the crawler with initial URLs
await crawler.run(['https://example.com']);

// Export the collected data from the default dataset to a file
await crawler.exportData('results.json');
Browser-Based Scraping with PlaywrightCrawler
For websites that render content with JavaScript or require browser interaction, use PlaywrightCrawler:
import { PlaywrightCrawler, log } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Launch browser options
    launchContext: {
        launchOptions: {
            headless: true,
            timeout: 60000,
        },
    },

    // Request handler with page context
    async requestHandler({ request, page, enqueueLinks, pushData }) {
        log.info(`Scraping: ${request.url}`);

        // Wait for dynamic content to load
        await page.waitForSelector('.product-list', { timeout: 10000 });

        // Scroll to load lazy-loaded content
        await page.evaluate(() => {
            window.scrollTo(0, document.body.scrollHeight);
        });

        // Wait for additional content
        await page.waitForTimeout(2000);

        // Extract data from the page
        const products = await page.$$eval('.product-item', (items) => {
            return items.map(item => ({
                name: item.querySelector('.product-name')?.textContent.trim(),
                price: item.querySelector('.product-price')?.textContent.trim(),
                image: item.querySelector('img')?.src,
            }));
        });

        // Save the extracted data
        await pushData({
            url: request.url,
            products,
            scrapedAt: new Date().toISOString(),
        });

        // Find and enqueue pagination links
        await enqueueLinks({
            selector: '.pagination a',
        });
    },

    // Maximum number of requests per crawl
    maxRequestsPerCrawl: 100,
});

await crawler.run(['https://example-shop.com/products']);
Advanced Features
Request Queue Management
Crawlee provides built-in request queue management with automatic deduplication:
import { CheerioCrawler, RequestQueue, log } from 'crawlee';

// Create a named request queue
const requestQueue = await RequestQueue.open('my-queue');

// Add requests manually
await requestQueue.addRequest({
    url: 'https://example.com/page1',
    userData: { category: 'electronics' },
});
await requestQueue.addRequest({
    url: 'https://example.com/page2',
    userData: { category: 'books' },
});

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ request, $ }) {
        const category = request.userData.category;
        log.info(`Scraping ${category} from ${request.url}`);
        // Your scraping logic here
    },
});

await crawler.run();
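Deduplication is keyed on each request's uniqueKey, which defaults to a normalized form of the URL, so adding the same URL twice enqueues it only once. A short sketch of that behaviour, continuing with the requestQueue opened above; the uniqueKey value is just an illustrative assumption:
// Adding the same URL again is a no-op: Crawlee reports it was already present
const info = await requestQueue.addRequest({ url: 'https://example.com/page1' });
console.log(info.wasAlreadyPresent); // true

// To force the same URL to be crawled again, give it a distinct uniqueKey
await requestQueue.addRequest({
    url: 'https://example.com/page1',
    uniqueKey: 'page1-second-pass',
});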
Using Proxy Servers
Crawlee makes it easy to rotate proxies and avoid IP blocks:
import { PlaywrightCrawler, ProxyConfiguration, log } from 'crawlee';

// Configure proxy rotation
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8000',
        'http://proxy2.example.com:8000',
        'http://proxy3.example.com:8000',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    async requestHandler({ request, page, proxyInfo }) {
        log.info(`Using proxy: ${proxyInfo?.url}`);
        // Your scraping logic
    },
});

await crawler.run(['https://example.com']);
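If a static list is not flexible enough, ProxyConfiguration also accepts a newUrlFunction that returns the proxy URL to use for a given session. A minimal round-robin sketch, assuming the same three example proxies as above:
import { ProxyConfiguration } from 'crawlee';

const proxyUrls = [
    'http://proxy1.example.com:8000',
    'http://proxy2.example.com:8000',
    'http://proxy3.example.com:8000',
];

let counter = 0;

// Crawlee calls this whenever it needs a proxy URL; rotate through the list round-robin
const proxyConfiguration = new ProxyConfiguration({
    newUrlFunction: () => proxyUrls[counter++ % proxyUrls.length],
});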
Session Management and Cookie Handling
Maintain sessions across requests with Crawlee's session management:
import { CheerioCrawler, log } from 'crawlee';

const crawler = new CheerioCrawler({
    useSessionPool: true,
    sessionPoolOptions: {
        maxPoolSize: 20,
        sessionOptions: {
            maxUsageCount: 50, // Retire session after 50 uses
            maxErrorScore: 3, // Retire session after 3 errors
        },
    },
    async requestHandler({ request, session, $ }) {
        // Access the cookies this session holds for the current URL
        const cookies = session.getCookies(request.url);
        log.info(`Session ID: ${session.id}, Cookies: ${cookies.length}`);

        // Set custom cookies
        session.setCookies([
            { name: 'user_preference', value: 'dark_mode', domain: '.example.com' },
        ], 'https://example.com');

        // Your scraping logic
    },
});

await crawler.run(['https://example.com']);
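You can also retire a session manually when a site starts blocking it, and Crawlee will rotate to a fresh one. A minimal sketch; the 403 status check is just an illustrative assumption about how the target signals a block:
import { CheerioCrawler, log } from 'crawlee';

const crawler = new CheerioCrawler({
    useSessionPool: true,
    async requestHandler({ request, response, session, $ }) {
        // Treat a 403 as a block page: retire the session and let Crawlee retry with a new one
        if (response.statusCode === 403) {
            session.retire();
            throw new Error(`Blocked on ${request.url}`);
        }
        log.info(`Fetched ${request.url} with session ${session.id}`);
    },
});

await crawler.run(['https://example.com']);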
Handling Rate Limiting and Auto-Scaling
Crawlee automatically adjusts concurrency based on system resources:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Start with 1 concurrent request
    minConcurrency: 1,
    // Scale up to 20 concurrent requests
    maxConcurrency: 20,
    // Fine-tune how the autoscaled pool reacts to system load
    autoscaledPoolOptions: {
        desiredConcurrency: 10,
        maxConcurrency: 20,
        systemStatusOptions: {
            // Treat the system as overloaded when more than 90% of recent CPU snapshots are overloaded
            maxCpuOverloadedRatio: 0.9,
            // The same threshold, applied to memory snapshots
            maxMemoryOverloadedRatio: 0.9,
        },
    },
    async requestHandler({ request, page }) {
        // Your scraping logic
    },
});

await crawler.run(['https://example.com']);
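Auto-scaling controls how far concurrency can grow, but it does not cap throughput against a site's rate limits. For an explicit cap you can set maxRequestsPerMinute; a minimal sketch:
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Never send more than 60 requests per minute, even if more capacity is available
    maxRequestsPerMinute: 60,
    async requestHandler({ request, $ }) {
        // Your scraping logic
    },
});

await crawler.run(['https://example.com']);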
Handling Authentication
For sites requiring login, you can handle authentication before scraping:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Process requests one at a time so the login finishes before protected pages are visited
    maxConcurrency: 1,
    async requestHandler({ request, page, pushData }) {
        if (request.url.includes('login')) {
            // Fill in and submit the login form on the login page itself
            await page.fill('#username', 'your-username');
            await page.fill('#password', 'your-password');
            await page.click('button[type="submit"]');
            await page.waitForLoadState('networkidle');

            // Persist cookies and local storage so later runs can reuse the session
            await page.context().storageState({ path: 'auth.json' });
            return;
        }

        // Scrape authenticated content
        const data = await page.$$eval('.user-content', (elements) => {
            return elements.map(el => el.textContent);
        });
        await pushData({ url: request.url, data });
    },
});

await crawler.run(['https://example.com/login', 'https://example.com/dashboard']);
Data Storage and Export
Crawlee provides flexible data storage options:
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, pushData }) {
        const data = {
            url: request.url,
            title: $('h1').text(),
            timestamp: new Date().toISOString(),
        };
        // Push data to the default dataset
        await pushData(data);
    },
});

await crawler.run(['https://example.com']);

// Export the default dataset to the key-value store as JSON and CSV
// (the argument is a record key, not a file path)
const dataset = await Dataset.open();
await dataset.exportToJSON('results');
await dataset.exportToCSV('results');
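For records that do not fit the tabular dataset model, such as configuration snapshots or raw HTML, Crawlee also provides a key-value store. A minimal sketch:
import { KeyValueStore } from 'crawlee';

// Open the default key-value store and save a single record
const store = await KeyValueStore.open();
await store.setValue('crawl-info', { startedAt: new Date().toISOString() });

// Read the record back later
const info = await store.getValue('crawl-info');
console.log(info);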
Error Handling and Retries
Crawlee automatically retries failed requests up to a configurable limit:
import { PlaywrightCrawler, log } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Retry failed requests up to 3 times
    maxRequestRetries: 3,

    // Custom error handling, called before a failed request is retried
    // (the error is passed as the second argument)
    errorHandler({ request }, error) {
        log.error(`Request ${request.url} failed: ${error.message}`);
        // Mark certain errors as non-retryable
        if (error.message.includes('404')) {
            request.noRetry = true;
        }
    },

    async requestHandler({ request, page }) {
        try {
            // Your scraping logic with a timeout
            await page.waitForSelector('.content', { timeout: 30000 });
        } catch (error) {
            if (error.name === 'TimeoutError') {
                log.warning(`Timeout on ${request.url}, will retry`);
            }
            throw error; // Rethrow so Crawlee can retry the request
        }
    },
});

await crawler.run(['https://example.com']);
Best Practices
1. Use Appropriate Crawler Type
- Use CheerioCrawler for static HTML pages (faster, lower memory)
- Use PlaywrightCrawler for JavaScript-heavy sites
- Use PuppeteerCrawler if you're already familiar with Puppeteer
2. Implement Proper Selectors
// Use specific selectors to avoid brittle scrapers
const products = await page.$$eval('[data-testid="product-item"]', (items) => {
    return items.map(item => ({
        id: item.getAttribute('data-product-id'),
        name: item.querySelector('[data-testid="product-name"]')?.textContent,
    }));
});
3. Handle Dynamic Content
When scraping single-page applications, wait for content to load:
await page.waitForSelector('.dynamic-content', {
    state: 'visible',
    timeout: 10000,
});
// Or wait for network idle
await page.waitForLoadState('networkidle');
4. Monitor and Debug
Enable detailed logging for debugging:
import { log, LogLevel } from 'crawlee';
// Set log level
log.setLevel(LogLevel.DEBUG);
// Add custom logging
log.debug('Debug message');
log.info('Info message');
log.warning('Warning message');
log.error('Error message');
Conclusion
Crawlee provides a robust, production-ready framework for web scraping with Node.js. Its built-in features, including automatic retries, proxy rotation, session management, and autoscaling, make it an excellent choice for both simple and complex scraping projects. By following the examples and best practices in this guide, you can build reliable, efficient scrapers that stand up to real-world challenges.
For more advanced scenarios, explore Crawlee's Playwright and Puppeteer integrations in more depth; they cover dynamic content, authentication, and complex user interactions beyond what this guide has shown.