What Are the Best Practices for Web Scraping with Crawlee?

Crawlee is a powerful web scraping and browser automation library for Node.js and Python that provides built-in features for robust, scalable scraping. Following best practices ensures your scrapers are efficient, maintainable, and respectful of target websites. This guide covers essential practices for building production-ready Crawlee scrapers.

1. Choose the Right Crawler Type

Crawlee offers multiple crawler classes, each optimized for different scenarios:

CheerioCrawler for Static Content

Use CheerioCrawler for fast scraping of static HTML pages. It's the most efficient option when JavaScript execution isn't required.

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $, enqueueLinks, pushData }) => {
        const title = $('h1').text();
        const description = $('.description').text();

        await enqueueLinks({
            globs: ['https://example.com/products/*'],
        });

        // Save the record; requestHandler return values are not stored by Crawlee
        await pushData({ url: request.url, title, description });
    },
});

await crawler.run(['https://example.com']);

PuppeteerCrawler or PlaywrightCrawler for Dynamic Content

When dealing with JavaScript-rendered content, single-page applications, or pages requiring browser automation, use PuppeteerCrawler or PlaywrightCrawler.

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks, pushData }) => {
        // Wait for dynamic content to load
        await page.waitForSelector('.product-list');

        const products = await page.$$eval('.product', items =>
            items.map(item => ({
                name: item.querySelector('.name')?.textContent,
                price: item.querySelector('.price')?.textContent,
            }))
        );

        await enqueueLinks({
            selector: '.pagination a',
        });

        // Save the records; requestHandler return values are not stored by Crawlee
        await pushData({ url: request.url, products });
    },
    headless: true,
    maxRequestRetries: 3,
});

await crawler.run(['https://example.com/products']);

2. Implement Proper Error Handling

Robust error handling prevents scraper crashes and ensures data consistency.

Use Request Error Handlers

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        try {
            // Crawlee navigates to request.url before calling this handler,
            // so there is no need to call page.goto() again here.
            // Scraping logic here
        } catch (error) {
            log.error(`Failed to process ${request.url}`, { error });
            throw error; // Let Crawlee handle retries
        }
    },

    failedRequestHandler: async ({ request, log }) => {
        log.error(`Request ${request.url} failed after ${request.retryCount} retries`);
        // Store failed URLs for later review
    },

    maxRequestRetries: 3,
    maxRequestsPerMinute: 60,
});
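
To act on the "store failed URLs" comment above, one option is to persist failures to a named dataset and review them after the run. A minimal sketch, assuming a dataset named 'failed-requests' (the name is arbitrary):

import { PlaywrightCrawler, Dataset } from 'crawlee';

const failedRequests = await Dataset.open('failed-requests');

const crawler = new PlaywrightCrawler({
    // ...requestHandler as above...

    failedRequestHandler: async ({ request, log }) => {
        log.error(`Request ${request.url} failed after ${request.retryCount} retries`);

        // Record the failure for later review
        await failedRequests.pushData({
            url: request.url,
            errors: request.errorMessages,
            failedAt: new Date().toISOString(),
        });
    },
});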

Handle Timeouts Gracefully

Similar to handling timeouts in Puppeteer, configure appropriate timeout values:

const crawler = new PlaywrightCrawler({
    navigationTimeoutSecs: 30,     // Navigation time limit (30 seconds)
    requestHandlerTimeoutSecs: 60, // Time limit for the whole request handler

    // Adjust Playwright's goto options without re-navigating inside the handler
    preNavigationHooks: [
        async (_crawlingContext, gotoOptions) => {
            gotoOptions.waitUntil = 'domcontentloaded';
        },
    ],

    requestHandler: async ({ page }) => {
        // The page is already navigated; wait only for the content you need
        await page.waitForSelector('.content', {
            timeout: 10000, // 10 seconds
        });
    },
});

3. Optimize Performance and Concurrency

Configure Concurrency Settings

const crawler = new CheerioCrawler({
    maxConcurrency: 10, // Run up to 10 requests simultaneously
    minConcurrency: 2,  // Keep at least 2 workers active
    maxRequestsPerMinute: 120, // Respect rate limits

    requestHandler: async ({ request, $ }) => {
        // Scraping logic
    },
});

Use AutoscaledPool for Dynamic Scaling

Crawlee's autoscaling automatically adjusts concurrency based on system resources:

const crawler = new PlaywrightCrawler({
    autoscaledPoolOptions: {
        minConcurrency: 1,
        maxConcurrency: 50,
        desiredConcurrency: 10,
        systemStatusOptions: {
            maxUsedCpuRatio: 0.8,  // Pause if CPU usage exceeds 80%
            maxUsedMemoryRatio: 0.7, // Pause if memory usage exceeds 70%
        },
    },
});

4. Manage Request Queues Effectively

Use Request Queue for Persistent Storage

import { PlaywrightCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open('my-queue');

// Add initial URLs
await requestQueue.addRequest({ url: 'https://example.com' });

const crawler = new PlaywrightCrawler({
    requestQueue, // Make the crawler consume from this queue

    requestHandler: async ({ request, enqueueLinks }) => {
        await enqueueLinks({
            globs: ['https://example.com/category/*'],
            // The context-aware enqueueLinks automatically targets the crawler's queue
        });
    },
});

await crawler.run();

Filter and Prioritize Requests

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, enqueueLinks }) => {
        await enqueueLinks({
            globs: ['https://example.com/**'],

            // Filter out unwanted URLs
            transformRequestFunction: (req) => {
                if (req.url.includes('login') || req.url.includes('signup')) {
                    return false; // Skip these URLs
                }

                // Tag product pages with custom metadata (Crawlee does not
                // reorder the queue by this; see the note after this example)
                if (req.url.includes('/product/')) {
                    req.userData = { ...req.userData, priority: 10 };
                }

                return req;
            },
        });
    },
});
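
The priority flag above is only metadata; Crawlee will not reorder the queue based on it. If a URL genuinely needs to be processed sooner, the request queue's forefront option puts it at the front. A minimal sketch, assuming a queue opened with RequestQueue.open() as in the earlier example (the URL is illustrative):

import { RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open('my-queue');

// forefront: true places this request at the head of the queue,
// so it is picked up before previously enqueued URLs
await requestQueue.addRequest(
    { url: 'https://example.com/product/featured' },
    { forefront: true },
);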

5. Implement Session and Proxy Management

Use Session Pools for Authentication

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    // sessionPoolOptions expects a plain options object, not a SessionPool instance
    sessionPoolOptions: {
        maxPoolSize: 10,
        sessionOptions: {
            maxUsageCount: 50, // Retire session after 50 uses
            maxErrorScore: 3,  // Retire session after 3 errors
        },
    },

    requestHandler: async ({ page, session }) => {
        // Session cookies are automatically managed
        const isLoggedIn = await page.$('.user-menu');

        if (!isLoggedIn) {
            session.retire(); // Mark session as invalid
        }
    },
});

Configure Proxy Rotation

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8000',
        'http://proxy2.example.com:8000',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,

    requestHandler: async ({ page, proxyInfo }) => {
        console.log(`Using proxy: ${proxyInfo.url}`);
        // Scraping logic
    },
});

6. Store Data Efficiently

Use Datasets for Structured Data

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request }) => {
        const data = await page.evaluate(() => ({
            title: document.querySelector('h1')?.textContent,
            price: document.querySelector('.price')?.textContent,
            description: document.querySelector('.description')?.textContent,
        }));

        // Push data to default dataset
        await Dataset.pushData({
            url: request.url,
            scrapedAt: new Date().toISOString(),
            ...data,
        });
    },
});

await crawler.run(['https://example.com/product']);

// Export data after scraping
const dataset = await Dataset.open();
await dataset.exportToJSON('results');
await dataset.exportToCSV('results');
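
If you want to post-process results in the same script, the stored items can also be read back. A minimal sketch that reuses the dataset handle opened above:

// Read the stored items back (e.g., for validation or a custom export format)
const { items } = await dataset.getData();
for (const item of items) {
    console.log(item.url, item.title);
}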

Use Key-Value Stores for State Management

import { KeyValueStore } from 'crawlee';

const store = await KeyValueStore.open('my-store');

// Save scraping state
await store.setValue('checkpoint', {
    lastProcessedUrl: 'https://example.com/page/100',
    processedCount: 1000,
    timestamp: Date.now(),
});

// Resume from checkpoint
const checkpoint = await store.getValue('checkpoint');
if (checkpoint) {
    console.log(`Resuming from ${checkpoint.lastProcessedUrl}`);
}
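
In a real crawler you would typically update the checkpoint from inside the request handler, so a restarted run can pick up where the previous one stopped. A minimal sketch (the counter logic is illustrative, not a Crawlee API):

import { CheerioCrawler, KeyValueStore } from 'crawlee';

const store = await KeyValueStore.open('my-store');
let processedCount = (await store.getValue('checkpoint'))?.processedCount ?? 0;

const crawler = new CheerioCrawler({
    requestHandler: async ({ request }) => {
        // ...scraping logic...

        // Persist progress after each successfully processed page
        processedCount += 1;
        await store.setValue('checkpoint', {
            lastProcessedUrl: request.url,
            processedCount,
            timestamp: Date.now(),
        });
    },
});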

7. Follow Ethical Scraping Practices

Respect robots.txt

While Crawlee doesn't enforce robots.txt by default, you should respect it:

import { CheerioCrawler } from 'crawlee';
import robotsParser from 'robots-parser';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request }) => {
        // Check robots.txt before scraping (see the caching sketch after this example)
        const robotsUrl = `${new URL(request.url).origin}/robots.txt`;
        const response = await fetch(robotsUrl);
        const robotsTxt = await response.text();
        // robots-parser expects the robots.txt URL as its first argument
        const robots = robotsParser(robotsUrl, robotsTxt);

        if (!robots.isAllowed(request.url, 'Crawlee')) {
            throw new Error('URL disallowed by robots.txt');
        }

        // Continue with scraping
    },
});
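
Fetching robots.txt on every request is wasteful, so it is worth caching the parsed rules per origin. A minimal sketch (the Map-based cache and the isAllowed helper are plain JavaScript, not Crawlee APIs):

import { CheerioCrawler } from 'crawlee';
import robotsParser from 'robots-parser';

const robotsCache = new Map();

async function isAllowed(url, userAgent = 'Crawlee') {
    const origin = new URL(url).origin;

    if (!robotsCache.has(origin)) {
        const robotsUrl = `${origin}/robots.txt`;
        const response = await fetch(robotsUrl);
        const robotsTxt = response.ok ? await response.text() : '';
        robotsCache.set(origin, robotsParser(robotsUrl, robotsTxt));
    }

    // robots-parser returns undefined for URLs it cannot match; treat that as allowed
    return robotsCache.get(origin).isAllowed(url, userAgent) ?? true;
}

const crawler = new CheerioCrawler({
    requestHandler: async ({ request }) => {
        if (!(await isAllowed(request.url))) {
            throw new Error('URL disallowed by robots.txt');
        }
        // Continue with scraping
    },
});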

Implement Rate Limiting

const crawler = new CheerioCrawler({
    maxRequestsPerMinute: 60, // 60 requests per minute
    maxRequestsPerCrawl: 1000, // Limit total requests

    requestHandler: async ({ request, log }) => {
        log.info(`Processing: ${request.url}`);
        // Add delays if needed
        await new Promise(resolve => setTimeout(resolve, 1000));
    },
});

Set Proper User-Agent

const crawler = new PlaywrightCrawler({
    launchContext: {
        // userAgent is a launch-context option, not a browser launch option
        userAgent: 'MyBot/1.0 (https://mywebsite.com/bot-info; contact@mywebsite.com)',
    },
});

8. Monitor and Log Effectively

Use Built-in Logging

import { PlaywrightCrawler, log } from 'crawlee';

log.setLevel(log.LEVELS.DEBUG); // Set global log level

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, log }) => {
        log.info(`Processing ${request.url}`);
        log.debug('Detailed debug information');

        try {
            // Scraping logic
        } catch (error) {
            log.error('Scraping failed', { error, url: request.url });
        }
    },
});

Track Statistics

const crawler = new CheerioCrawler({
    requestHandler: async ({ request }) => {
        // Scraping logic
    },
});

// crawler.run() resolves with the final run statistics
const stats = await crawler.run(['https://example.com']);

console.log(`Requests processed: ${stats.requestsFinished}`);
console.log(`Requests failed: ${stats.requestsFailed}`);
console.log(`Average processing time: ${stats.requestAvgFinishedDurationMillis}ms`);

9. Python-Specific Best Practices

For Python users, Crawlee for Python offers similar functionality:

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=100,
        max_request_retries=3,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Type-safe context with IDE autocomplete
        url = context.request.url
        page = context.page

        await page.wait_for_selector('.content')

        title = await page.query_selector('.title')
        title_text = await title.inner_text() if title else None

        await context.push_data({
            'url': url,
            'title': title_text,
        })

        # Enqueue product links discovered on the page
        await context.enqueue_links(selector='a.product-link')

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())

10. Testing and Debugging

Test with Small Datasets

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 10, // Limit for testing

    requestHandler: async ({ request, log }) => {
        log.debug(`Testing with ${request.url}`);
        // Your scraping logic
    },
});

Use headless: false for Debugging

When debugging browser-based crawlers, you can follow similar approaches to handling errors in Puppeteer:

const crawler = new PlaywrightCrawler({
    headless: false, // Show browser window
    launchContext: {
        launchOptions: {
            slowMo: 100, // Slow down operations by 100ms
            devtools: true, // Open DevTools
        },
    },
});

Conclusion

Following these best practices will help you build robust, efficient, and maintainable web scrapers with Crawlee. Key takeaways include:

  • Choose the appropriate crawler type for your use case
  • Implement comprehensive error handling and retry logic
  • Optimize performance with proper concurrency settings
  • Manage requests, sessions, and proxies effectively
  • Store data efficiently using Datasets and Key-Value Stores
  • Follow ethical scraping practices and respect website policies
  • Monitor and log scraper activity for debugging and optimization

By adhering to these guidelines, you'll create scrapers that are not only effective but also respectful of target websites and maintainable in the long term. Whether you're scraping static HTML or complex JavaScript applications, Crawlee provides the tools you need to implement these best practices successfully.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
