How do I set up a Node.js web crawler using Crawlee?
Crawlee is a powerful web scraping and browser automation library for Node.js that provides a unified interface for building reliable crawlers. Setting up a crawler with Crawlee is straightforward, and the library gives you built-in features like request queuing, automatic retries, and proxy rotation.
Prerequisites
Before setting up Crawlee, ensure you have:
- Node.js (version 16 or higher)
- npm or yarn package manager
- Basic understanding of JavaScript/TypeScript and async/await
Installing Crawlee
First, create a new Node.js project and install Crawlee:
# Create a new project directory
mkdir my-crawler
cd my-crawler
# Initialize a new Node.js project
npm init -y
# Install Crawlee
npm install crawlee
# Install Playwright for browser automation (optional)
npm install playwright
Crawlee supports multiple HTTP clients and browser automation tools:
- Cheerio - Fast HTML parsing without a browser
- Puppeteer - Chrome/Chromium automation
- Playwright - Multi-browser automation (Chromium, Firefox, and WebKit)
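Puppeteer is not covered in detail below, but all three crawler classes share the same basic shape. Here is a minimal PuppeteerCrawler sketch, assuming you have also run npm install puppeteer:
const { PuppeteerCrawler, Dataset } = require('crawlee');

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request, log }) {
        log.info(`Processing ${request.url}...`);
        // page is a regular Puppeteer Page object
        const title = await page.title();
        await Dataset.pushData({ url: request.url, title });
    },
});

await crawler.run(['https://example.com']);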
Basic Crawlee Setup with CheerioCrawler
For simple HTML scraping without JavaScript rendering, use CheerioCrawler:
const { CheerioCrawler, Dataset } = require('crawlee');
// Create a new crawler instance
const crawler = new CheerioCrawler({
// Maximum number of concurrent requests
maxConcurrency: 10,
// Request handler - processes each page
    async requestHandler({ request, $, log, enqueueLinks }) {
log.info(`Processing ${request.url}...`);
// Extract data using jQuery-like syntax
const title = $('title').text();
const headings = [];
$('h1, h2').each((index, element) => {
headings.push($(element).text().trim());
});
// Save extracted data
await Dataset.pushData({
url: request.url,
title,
headings,
});
        // Enqueue new URLs found on the page; enqueueLinks resolves
        // relative links against the current page URL
        await enqueueLinks({ selector: 'a[href]' });
},
// Handle failed requests
failedRequestHandler({ request, log }) {
log.error(`Request ${request.url} failed too many times.`);
},
});
// Run the crawler
await crawler.run(['https://example.com']);
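Note that top-level await, used by the await crawler.run(...) lines throughout this guide, only works in ES modules. If you keep the CommonJS require() style shown here, either add "type": "module" to package.json and switch require to import, or wrap the code in an async function, for example:
const { CheerioCrawler } = require('crawlee');

async function main() {
    const crawler = new CheerioCrawler({
        async requestHandler({ request, log }) {
            log.info(`Processing ${request.url}...`);
        },
    });
    await crawler.run(['https://example.com']);
}

main().catch(console.error);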
Setting Up Crawlee with Playwright
For JavaScript-heavy websites, use PlaywrightCrawler to render pages in a real browser:
const { PlaywrightCrawler, Dataset } = require('crawlee');
const crawler = new PlaywrightCrawler({
// Launch browser in headless mode
headless: true,
    // The browser defaults to Chromium; realistic fingerprints help avoid bot detection
    browserPoolOptions: {
        useFingerprints: true,
},
async requestHandler({ page, request, log, enqueueLinks }) {
log.info(`Scraping ${request.url}...`);
// Wait for dynamic content to load
await page.waitForLoadState('networkidle');
// Extract data from the page
const data = await page.evaluate(() => {
return {
title: document.querySelector('title')?.textContent,
description: document.querySelector('meta[name="description"]')?.content,
links: Array.from(document.querySelectorAll('a')).map(a => a.href),
};
});
// Save the data
await Dataset.pushData({
url: request.url,
...data,
});
// Find and enqueue new links
await enqueueLinks({
selector: 'a[href]',
strategy: 'same-domain', // Only crawl same domain
});
},
});
await crawler.run(['https://example.com']);
Advanced Configuration
Request Queue and Storage
Crawlee automatically manages request queues and data storage:
const { PlaywrightCrawler, Dataset, RequestQueue } = require('crawlee');
// Initialize a named request queue
const requestQueue = await RequestQueue.open('my-queue');
// Add initial URLs
await requestQueue.addRequest({
url: 'https://example.com',
userData: { depth: 0 } // Custom metadata
});
const crawler = new PlaywrightCrawler({
requestQueue,
async requestHandler({ request, page, log, enqueueLinks }) {
const { depth } = request.userData;
// Limit crawl depth
if (depth >= 3) {
log.info(`Max depth reached for ${request.url}`);
return;
}
// Extract and save data
const data = await page.evaluate(() => ({
title: document.title,
url: window.location.href,
}));
await Dataset.pushData(data);
// Enqueue links with incremented depth
await enqueueLinks({
transformRequestFunction: (req) => {
req.userData = { depth: depth + 1 };
return req;
},
});
},
});
await crawler.run();
Proxy Configuration
Add proxy support to avoid IP blocking:
const { PlaywrightCrawler, ProxyConfiguration } = require('crawlee');
// Configure proxy rotation
const proxyConfiguration = new ProxyConfiguration({
proxyUrls: [
'http://proxy1.example.com:8000',
'http://proxy2.example.com:8000',
],
// Or use proxy services like Apify Proxy
// proxyUrls: ['http://groups-RESIDENTIAL:password@proxy.apify.com:8000'],
});
const crawler = new PlaywrightCrawler({
proxyConfiguration,
useSessionPool: true, // Maintain sessions across requests
async requestHandler({ request, page, log }) {
log.info(`Processing ${request.url} via proxy...`);
// Your scraping logic here
},
});
await crawler.run(['https://example.com']);
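Inside the request handler you can also check which proxy a request went through via the proxyInfo context property. A small sketch, assuming a proxy configuration like the one above:
const { PlaywrightCrawler, ProxyConfiguration } = require('crawlee');

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy1.example.com:8000'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    async requestHandler({ request, proxyInfo, log }) {
        // proxyInfo describes the proxy used for this request
        log.info(`Fetched ${request.url} via ${proxyInfo?.url}`);
    },
});

await crawler.run(['https://example.com']);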
Session Management and Cookies
Handle authentication and sessions:
const { PlaywrightCrawler } = require('crawlee');
const crawler = new PlaywrightCrawler({
useSessionPool: true,
persistCookiesPerSession: true,
    async requestHandler({ page, request, session, log }) {
// Check if session is blocked
if (session.isBlocked()) {
log.warning('Session blocked, retiring...');
session.retire();
return;
}
// Handle login if needed
if (!session.userData.loggedIn) {
await page.goto('https://example.com/login');
await page.fill('#username', 'user@example.com');
await page.fill('#password', 'password');
await page.click('button[type="submit"]');
            session.userData.loggedIn = true;
            // Return to the originally requested page after logging in
            await page.goto(request.url);
        }
// Continue scraping authenticated pages
},
});
Error Handling and Retries
Crawlee provides robust error handling:
const crawler = new PlaywrightCrawler({
maxRequestRetries: 3,
maxRequestsPerMinute: 120,
    // Crawlee navigates to request.url before calling this handler,
    // so there is no need to call page.goto() again here
    async requestHandler({ request, page, log }) {
        try {
            // Wait for the page to settle; a timeout throws and triggers a retry
            await page.waitForLoadState('networkidle', { timeout: 30000 });
            // Your scraping logic
        } catch (error) {
            log.error(`Error processing ${request.url}: ${error.message}`);
            throw error; // Re-throwing lets Crawlee retry the request
        }
    },
failedRequestHandler({ request, log }) {
log.error(`Failed to process ${request.url} after retries`);
},
});
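Crawlee also lets you hook into each failed attempt before the retry happens, not just after all retries are exhausted. A sketch using the errorHandler option, assuming Crawlee 3.x (verify the exact handler signature against your installed version):
const { PlaywrightCrawler } = require('crawlee');

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 3,
    useSessionPool: true,
    async requestHandler({ request, page, log }) {
        // Your scraping logic
    },
    // Runs before each retry of a failed request
    async errorHandler({ request, session, log }, error) {
        log.warning(`Retrying ${request.url}: ${error.message}`);
        // Retire the session so the retry uses a fresh one
        session?.retire();
    },
    failedRequestHandler({ request, log }) {
        log.error(`Giving up on ${request.url}`);
    },
});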
Working with TypeScript
Crawlee has excellent TypeScript support:
import { PlaywrightCrawler, Dataset } from 'crawlee';
interface ProductData {
url: string;
title: string;
price: number;
availability: boolean;
}
const crawler = new PlaywrightCrawler({
async requestHandler({ page, request, log, enqueueLinks }) {
log.info(`Processing ${request.url}`);
const data: ProductData = await page.evaluate(() => {
return {
url: window.location.href,
title: document.querySelector('h1.product-title')?.textContent || '',
price: parseFloat(document.querySelector('.price')?.textContent || '0'),
availability: !!document.querySelector('.in-stock'),
};
});
await Dataset.pushData<ProductData>(data);
await enqueueLinks({
selector: 'a.product-link',
label: 'PRODUCT',
});
},
});
await crawler.run(['https://shop.example.com']);
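The label passed to enqueueLinks becomes request.label, which you can use to route requests to different handlers. Crawlee provides router helpers for this; here is a brief sketch using createPlaywrightRouter (the selectors are placeholders, not from a real site):
import { PlaywrightCrawler, Dataset, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

// Default handler: listing pages that enqueue product links
router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info('Enqueueing product pages...');
    await enqueueLinks({ selector: 'a.product-link', label: 'PRODUCT' });
});

// Handler for requests labelled PRODUCT
router.addHandler('PRODUCT', async ({ page, request }) => {
    await Dataset.pushData({
        url: request.url,
        title: await page.title(),
    });
});

const crawler = new PlaywrightCrawler({ requestHandler: router });
await crawler.run(['https://shop.example.com']);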
Exporting Data
Export scraped data in various formats:
const { Dataset } = require('crawlee');
// After crawling is complete
const dataset = await Dataset.open('my-results');
// Read all items into memory
const data = await dataset.getData();
console.log(data.items);
// Export to CSV (written under this key in the default key-value store,
// e.g. ./storage/key_value_stores/default/results.csv when running locally)
await dataset.exportToCSV('results');
// Export to JSON
await dataset.exportToJSON('results');
// Get data in chunks for large datasets
const { items } = await dataset.getData({
offset: 0,
limit: 100
});
Complete Example: E-commerce Crawler
Here's a production-ready example for crawling an e-commerce site:
const { PlaywrightCrawler, Dataset, ProxyConfiguration } = require('crawlee');
// Only create a proxy configuration when proxy URLs are provided
const proxyConfiguration = process.env.PROXY_URLS
    ? new ProxyConfiguration({ proxyUrls: process.env.PROXY_URLS.split(',') })
    : undefined;
const crawler = new PlaywrightCrawler({
proxyConfiguration,
maxConcurrency: 5,
maxRequestsPerMinute: 60,
useSessionPool: true,
async requestHandler({ page, request, log, enqueueLinks }) {
log.info(`Crawling ${request.url}`);
// Handle different page types
if (request.label === 'CATEGORY') {
await enqueueLinks({
selector: 'a.product-card',
label: 'PRODUCT',
});
await enqueueLinks({
selector: 'a.pagination-next',
label: 'CATEGORY',
});
}
if (request.label === 'PRODUCT') {
// Wait for product details to load
await page.waitForSelector('.product-details', { timeout: 10000 });
const product = await page.evaluate(() => ({
name: document.querySelector('h1.product-name')?.textContent?.trim(),
price: document.querySelector('.price')?.textContent?.trim(),
description: document.querySelector('.description')?.textContent?.trim(),
images: Array.from(document.querySelectorAll('.product-image img'))
.map(img => img.src),
inStock: !!document.querySelector('.in-stock'),
}));
await Dataset.pushData({
url: request.url,
...product,
scrapedAt: new Date().toISOString(),
});
}
},
failedRequestHandler({ request, log }) {
log.error(`Request failed: ${request.url}`);
},
});
// Start crawling
await crawler.run([
{ url: 'https://shop.example.com/categories', label: 'CATEGORY' }
]);
// Export results
const dataset = await Dataset.open();
await dataset.exportToJSON('products'); // written to the default key-value store
Integration with Browser Automation
Crawlee integrates seamlessly with browser automation tools. For handling complex interactions like navigating to different pages or managing browser sessions, you can leverage Crawlee's built-in Puppeteer and Playwright support while benefiting from its queue management and retry logic.
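For example, a PlaywrightCrawler request handler can drive the page through UI interactions before extracting data. A brief sketch, where the .load-more and .item selectors are placeholders:
const { PlaywrightCrawler, Dataset } = require('crawlee');

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Click a hypothetical "load more" button a few times to reveal content
        for (let i = 0; i < 5 && (await page.$('.load-more')); i++) {
            await page.click('.load-more');
            await page.waitForLoadState('networkidle');
        }
        // Collect the text of each revealed item
        const items = await page.$$eval('.item', (els) =>
            els.map((el) => el.textContent?.trim()),
        );
        await Dataset.pushData({ url: request.url, items });
    },
});

await crawler.run(['https://example.com']);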
Best Practices
- Start Simple: Begin with CheerioCrawler for static sites; upgrade to PlaywrightCrawler only when needed
- Respect Robots.txt: Use a robots.txt parser to check allowed paths before crawling
- Use Rate Limiting: Configure maxRequestsPerMinute to avoid overwhelming servers (see the sketch after this list)
- Handle Errors Gracefully: Implement proper error handling and retry logic
- Monitor Performance: Use Crawlee's built-in logging and statistics
- Use Proxies: Rotate proxies to avoid IP bans
- Implement Depth Limits: Prevent infinite crawling with depth checks
- Clean Up Resources: Properly close browser instances and clean temporary files
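Several of these practices map directly to crawler options. A configuration sketch (the limits shown are illustrative examples, not recommendations):
const { PlaywrightCrawler } = require('crawlee');

const crawler = new PlaywrightCrawler({
    maxConcurrency: 10,        // Cap parallel browser pages
    maxRequestsPerMinute: 60,  // Rate limiting
    maxRequestsPerCrawl: 1000, // Hard stop to prevent runaway crawls
    maxRequestRetries: 3,      // Retry failed requests a few times
    useSessionPool: true,      // Rotate sessions (and proxies, if configured)
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});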
Conclusion
Setting up a Node.js web crawler with Crawlee provides a robust foundation for web scraping projects. The library handles complex tasks like request queuing, retries, and proxy rotation automatically, allowing you to focus on extracting the data you need. Whether you're building a simple HTML scraper or a sophisticated browser-based crawler, Crawlee offers the tools and flexibility to handle various scraping scenarios efficiently.
For dynamic websites requiring JavaScript execution, consider combining techniques for handling AJAX requests with Crawlee's browser automation capabilities to ensure all content is properly loaded before extraction.