How Much Memory Does Crawlee Use for Large-Scale Scraping?

Memory consumption is a critical factor when running large-scale web scraping operations with Crawlee. Understanding how Crawlee manages memory and implementing proper optimization strategies can mean the difference between a successful scraping project and one that crashes due to resource exhaustion.

Crawlee Memory Usage Overview

Crawlee's memory usage varies significantly depending on the crawler type, concurrency settings, and the complexity of the pages being scraped. Here's a breakdown of typical memory consumption:

Base Memory Requirements

  • CheerioCrawler: 50-200 MB for basic operations, with minimal overhead per request
  • PuppeteerCrawler: 200-500 MB base + 50-150 MB per concurrent browser instance
  • PlaywrightCrawler: 250-600 MB base + 60-180 MB per concurrent browser instance

The actual memory footprint depends heavily on:

  • Number of concurrent requests
  • Page complexity and size
  • Request queue size
  • Dataset storage method
  • Custom middleware and plugins
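
As a rough back-of-the-envelope check before tuning anything, you can turn the figures above into a starting concurrency estimate. This is purely illustrative arithmetic using the upper PlaywrightCrawler estimates; swap in the numbers for your crawler type:

import os from 'node:os';

// (free memory - base footprint) / memory per browser instance,
// using the upper PlaywrightCrawler estimates from the list above
const freeMb = os.freemem() / 1024 / 1024;
const startingConcurrency = Math.max(1, Math.floor((freeMb - 600) / 180));
console.log(`Suggested starting maxConcurrency: ${startingConcurrency}`);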

Memory Consumption by Crawler Type

CheerioCrawler Memory Profile

CheerioCrawler is the most memory-efficient option since it doesn't require a full browser:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxConcurrency: 50, // Memory-efficient: ~100-300 MB total
    requestHandler: async ({ $, request, enqueueLinks }) => {
        const title = $('title').text();
        await enqueueLinks({
            globs: ['https://example.com/**'],
        });
    },
});

await crawler.run(['https://example.com']);

Expected memory usage: 100-300 MB for 50 concurrent requests on typical web pages.

PuppeteerCrawler Memory Profile

PuppeteerCrawler requires significantly more memory due to Chromium browser instances:

import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    maxConcurrency: 10, // Each browser instance: ~50-150 MB
    requestHandler: async ({ page, request, enqueueLinks }) => {
        const title = await page.title();
        await enqueueLinks({
            globs: ['https://example.com/**'],
        });
    },
    launchContext: {
        launchOptions: {
            headless: true,
            args: [
                '--no-sandbox',
                '--disable-setuid-sandbox',
                '--disable-dev-shm-usage', // Reduces memory usage
                '--disable-gpu',
            ],
        },
    },
});

await crawler.run(['https://example.com']);

Expected memory usage: 700-2000 MB for 10 concurrent browser instances, depending on page complexity.

PlaywrightCrawler Memory Profile

PlaywrightCrawler has similar memory requirements to Puppeteer:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxConcurrency: 10,
    requestHandler: async ({ page, request, enqueueLinks }) => {
        const content = await page.content();
        await enqueueLinks({
            globs: ['https://example.com/**'],
        });
    },
    launchContext: {
        launchOptions: {
            headless: true,
            args: ['--disable-dev-shm-usage'],
        },
    },
});

await crawler.run(['https://example.com']);

Expected memory usage: 800-2500 MB for 10 concurrent browser instances.

Memory Optimization Strategies

1. Optimize Concurrency Settings

The most impactful way to control memory usage is through concurrency configuration:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Limit concurrent requests based on available memory
    minConcurrency: 1,
    maxConcurrency: 10,

    // Configure autoscaling for dynamic adjustment
    autoscaledPoolOptions: {
        desiredConcurrency: 5,
        snapshotterOptions: {
            maxUsedMemoryRatio: 0.7, // Throttle when ~70% of available memory is in use
        },
    },

    requestHandler: async ({ page, request }) => {
        // Your scraping logic
    },
});

2. Request Queue Management

With the default memory storage, Crawlee keeps the request queue in memory (mirrored to the ./storage directory), so an unbounded queue steadily grows your memory footprint. For large-scale operations, bound the crawl and what gets enqueued:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 10000, // Limit total requests
    maxConcurrency: 20,

    requestHandler: async ({ $, request, enqueueLinks }) => {
        // Only enqueue specific patterns
        await enqueueLinks({
            globs: ['https://example.com/products/**'],
            limit: 100, // Limit URLs per page
        });
    },
});

3. Dataset Storage Optimization

Accumulating scraped records in your own in-memory arrays keeps them in the JavaScript heap for the entire crawl. Push results to a Dataset instead, which writes them to disk-backed storage as you go:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request }) => {
        const data = await page.evaluate(() => ({
            title: document.title,
            url: window.location.href,
        }));

        // Push to dataset (automatically persists to disk)
        await Dataset.pushData(data);
    },
});

// Export the collected data, then drop the dataset to release its storage
await crawler.run(['https://example.com']);
const dataset = await Dataset.open();
await dataset.exportToJSON('results');
await dataset.drop(); // Clear dataset to free memory

4. Browser Context Reuse

When using browser-based crawlers, reusing browser contexts reduces memory overhead:

import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    maxConcurrency: 5,

    launchContext: {
        useChrome: false, // Use the bundled Chromium (lighter than full Chrome)
        useIncognitoPages: false, // Share a single browser context between pages (the default)
    },

    requestHandler: async ({ page, request }) => {
        // Your scraping logic; Crawlee opens a page per request and closes it
        // automatically when the handler returns, so no manual page.close() is needed
        const title = await page.evaluate(() => document.title);
    },
});

5. Memory Monitoring

Implement memory monitoring to track usage during scraping:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        const memUsage = process.memoryUsage();
        const memUsedMB = memUsage.heapUsed / 1024 / 1024;

        log.info(`Memory: ${memUsedMB.toFixed(2)} MB`, {
            url: request.url,
            rss: (memUsage.rss / 1024 / 1024).toFixed(2),
            heapTotal: (memUsage.heapTotal / 1024 / 1024).toFixed(2),
        });

        // Your scraping logic
    },
});

// Add system status monitoring via Crawlee's event manager
import { Configuration, EventType } from 'crawlee';

Configuration.getGlobalConfig().getEventManager().on(EventType.SYSTEM_INFO, (info) => {
    console.log(`Memory: ${(info.memCurrentBytes / 1024 / 1024).toFixed(2)} MB`);
});
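
If per-request logging is too noisy, a simple periodic sampler around the crawl works as well. This sketch uses only Node's built-in process.memoryUsage() and the `crawler` from the example above, nothing Crawlee-specific:

// Sample process memory every 30 seconds while the crawl runs
const memoryLogger = setInterval(() => {
    const { rss, heapUsed } = process.memoryUsage();
    console.log(
        `rss=${(rss / 1024 / 1024).toFixed(0)} MB, heapUsed=${(heapUsed / 1024 / 1024).toFixed(0)} MB`,
    );
}, 30_000);

await crawler.run(['https://example.com']);
clearInterval(memoryLogger);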

Python Crawlee Memory Considerations

For Python developers using Crawlee, memory management follows similar patterns:

import asyncio
import os

import psutil
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main():
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=5000,
        max_request_retries=2,
        max_session_rotations=3,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Monitor memory usage
        process = psutil.Process(os.getpid())
        mem_mb = process.memory_info().rss / 1024 / 1024
        context.log.info(f'Memory usage: {mem_mb:.2f} MB')

        # Your scraping logic
        data = {'title': await context.page.title()}
        await context.push_data(data)

    await crawler.run(['https://example.com'])


asyncio.run(main())

Large-Scale Deployment Recommendations

Docker Container Memory Limits

When deploying Crawlee in containers, set appropriate memory limits:

FROM node:18-slim

WORKDIR /app
COPY package*.json ./
RUN npm install

# Install the Playwright browser and the system packages it needs
RUN npx playwright install --with-deps chromium

COPY . .

# Set Node.js memory limit
ENV NODE_OPTIONS="--max-old-space-size=4096"

CMD ["node", "scraper.js"]

Docker Compose configuration:

version: '3.8'
services:
  crawler:
    build: .
    mem_limit: 6g
    mem_reservation: 4g
    environment:
      - NODE_OPTIONS=--max-old-space-size=4096

Kubernetes Resource Management

For Kubernetes deployments:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: crawlee-scraper
spec:
  replicas: 3
  selector:
    matchLabels:
      app: crawlee-scraper
  template:
    metadata:
      labels:
        app: crawlee-scraper
    spec:
      containers:
      - name: scraper
        image: crawlee-scraper:latest
        resources:
          requests:
            memory: "2Gi"
            cpu: "500m"
          limits:
            memory: "6Gi"
            cpu: "2000m"
        env:
        - name: NODE_OPTIONS
          value: "--max-old-space-size=4096"

Performance Benchmarks

Based on real-world testing, here are typical memory consumption patterns:

| Crawler Type      | Concurrency | Pages/Hour  | Memory Usage |
|-------------------|-------------|-------------|--------------|
| CheerioCrawler    | 50          | 15,000+     | 200-500 MB   |
| CheerioCrawler    | 100         | 30,000+     | 300-800 MB   |
| PuppeteerCrawler  | 5           | 1,000-2,000 | 1-2 GB       |
| PuppeteerCrawler  | 10          | 2,000-4,000 | 2-4 GB       |
| PlaywrightCrawler | 5           | 1,000-2,000 | 1.2-2.5 GB   |
| PlaywrightCrawler | 10          | 2,000-4,000 | 2.5-5 GB     |

Best Practices for Memory Management

  1. Choose the Right Crawler: Use CheerioCrawler for simple HTML parsing and reserve browser-based crawlers for JavaScript-heavy single page applications.

  2. Implement Autoscaling: Let Crawlee automatically adjust concurrency based on system resources to prevent memory exhaustion.

  3. Monitor System Resources: Regularly check memory usage and adjust concurrency settings accordingly.

  4. Use Persistent Storage: For large datasets, persist data to disk instead of keeping everything in memory.

  5. Set Reasonable Limits: Configure maxRequestsPerCrawl to prevent unbounded memory growth.

  6. Clean Up Resources: Ensure browser pages and contexts are properly closed after use.

  7. Handle Memory Leaks: Catch and handle errors (for example in a failedRequestHandler) so failed requests don't leak pages or other resources; the sketch after this list shows this alongside the other limits.

  8. Optimize Browser Args: Use memory-saving flags like --disable-dev-shm-usage for browser-based crawlers.
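
To tie several of these practices together, here is a minimal sketch that combines a request limit, autoscaled concurrency, disk-backed dataset storage, and a handler for failed requests. The thresholds and limits are illustrative, not recommendations:

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 50000,    // practice 5: bound the crawl
    autoscaledPoolOptions: {
        desiredConcurrency: 10,
        maxConcurrency: 50,        // practice 2: let autoscaling back off under load
    },
    requestHandler: async ({ $, request, enqueueLinks }) => {
        // practice 4: persist results instead of buffering them in memory
        await Dataset.pushData({
            url: request.url,
            title: $('title').text(),
        });
        await enqueueLinks({ globs: ['https://example.com/**'], limit: 50 });
    },
    failedRequestHandler: async ({ request, log }) => {
        // practice 7: log and move on so failed requests do not pile up state
        log.warning(`Request failed after retries: ${request.url}`);
    },
});

await crawler.run(['https://example.com']);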

Troubleshooting Memory Issues

If you encounter out-of-memory errors:

  1. Reduce maxConcurrency to lower concurrent operations
  2. Enable maxUsedMemoryRatio in autoscaling options
  3. Increase the Node.js heap size with --max-old-space-size, or cap the memory Crawlee's autoscaling assumes is available (see the sketch after this list)
  4. Switch to a more memory-efficient crawler type
  5. Implement data export and cleanup between batches
  6. Use Docker containers with explicit memory limits
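
As a sketch of point 3 above: besides raising the Node.js heap limit, you can tell Crawlee's autoscaling how much memory to budget for via the memoryMbytes configuration option (or the CRAWLEE_MEMORY_MBYTES environment variable). The option name comes from Crawlee's Configuration; double-check it against the version you run:

import { CheerioCrawler, Configuration } from 'crawlee';

// Cap the memory Crawlee budgets for, independent of what the host reports
// (equivalent to setting CRAWLEE_MEMORY_MBYTES=4096)
const config = new Configuration({ memoryMbytes: 4096 });

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request }) => {
        // Your scraping logic
    },
}, config);

await crawler.run(['https://example.com']);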

Conclusion

Crawlee's memory usage for large-scale scraping ranges from 200 MB for lightweight CheerioCrawler operations to several gigabytes for concurrent browser-based crawlers. By understanding these patterns and implementing proper optimization strategies, you can build scalable scraping solutions that efficiently manage system resources. Always monitor memory consumption during development and adjust concurrency settings based on your infrastructure capabilities and scraping requirements.

For production deployments, start with conservative concurrency settings and gradually increase them while monitoring system performance. This approach ensures stable, long-running scraping operations that can handle large-scale data extraction tasks without memory-related failures.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
