What is Autoscaling in Crawlee and How Does It Work?

Autoscaling in Crawlee is an intelligent system that automatically adjusts the number of concurrent requests and browser instances based on available system resources. This feature ensures optimal performance while preventing memory overflows and system crashes during web scraping operations.

Understanding Crawlee's Autoscaling Mechanism

Autoscaling dynamically monitors your system's CPU and memory usage to determine how many concurrent tasks can run safely. Instead of manually setting a fixed concurrency limit, Crawlee's AutoscaledPool continuously evaluates system load and adjusts the number of parallel operations in real-time.

Key Components of Autoscaling

The autoscaling system consists of three main components:

System Status Monitor - Tracks CPU usage, memory consumption, and event loop delays
Autoscaled Pool - Manages the pool of concurrent tasks and adjusts their number
Snapshot Tracker - Records historical performance data to make informed scaling decisions

How Autoscaling Works Under the Hood

Crawlee's autoscaling algorithm operates on a feedback loop:

Monitoring Phase: The system continuously checks CPU usage, available memory, and event loop health
Analysis Phase: It compares current metrics against configurable thresholds
Adjustment Phase: Based on the analysis, it either increases or decreases concurrency
Cooldown Period: After each adjustment, the system waits before making another change to avoid oscillation

The autoscaler uses a conservative approach by default, prioritizing system stability over maximum speed.

Configuring Autoscaling in Crawlee

Basic Configuration with PlaywrightCrawler

Here's how to configure autoscaling when using Crawlee with Playwright for handling browser sessions:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Autoscaling configuration
    minConcurrency: 1,
    maxConcurrency: 50,
    autoscaledPoolOptions: {
        desiredConcurrency: 10,
        minConcurrency: 1,
        maxConcurrency: 50,
        // System load thresholds
        systemStatusOptions: {
            maxUsedCpuRatio: 0.90,
            maxUsedMemoryRatio: 0.85,
            maxEventLoopDelayMillis: 250,
        },
        // Scaling behavior
        scaleUpStepRatio: 0.1,
        scaleDownStepRatio: 0.25,
        maybeRunIntervalSecs: 0.5,
    },

    async requestHandler({ page, request }) {
        const title = await page.title();
        console.log(`Scraped: ${title}`);
    },
});

await crawler.run(['https://example.com']);

Configuration with CheerioCrawler

For lightweight HTTP requests that don't require a browser:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxConcurrency: 100, // Higher limits possible without browser overhead
    autoscaledPoolOptions: {
        desiredConcurrency: 20,
        maxConcurrency: 100,
        systemStatusOptions: {
            maxUsedCpuRatio: 0.95,
            maxUsedMemoryRatio: 0.90,
        },
    },

    async requestHandler({ $, request }) {
        const title = $('title').text();
        console.log(`Scraped: ${title}`);
    },
});

await crawler.run(['https://example.com']);

Python Implementation

Crawlee for Python also supports autoscaling with similar configuration options:

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler(
    min_concurrency=1,
    max_concurrency=50,
    # Autoscaling parameters
    autoscaled_pool_options={
        'desired_concurrency': 10,
        'system_status_options': {
            'max_used_cpu_ratio': 0.90,
            'max_used_memory_ratio': 0.85,
        },
    }
)

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    title = await context.page.title()
    print(f'Scraped: {title}')

await crawler.run(['https://example.com'])

Key Autoscaling Parameters Explained

Concurrency Limits

minConcurrency: Minimum number of concurrent tasks (default: 1)
maxConcurrency: Maximum number of concurrent tasks (default: 200)
desiredConcurrency: Target concurrency level the system tries to maintain

System Status Thresholds

maxUsedCpuRatio: Maximum CPU usage (0-1 scale, default: 0.95)
maxUsedMemoryRatio: Maximum memory usage (0-1 scale, default: 0.7)
maxEventLoopDelayMillis: Maximum acceptable event loop delay in milliseconds

Scaling Behavior

scaleUpStepRatio: How aggressively to increase concurrency (default: 0.05)
scaleDownStepRatio: How aggressively to decrease concurrency (default: 0.05)
maybeRunIntervalSecs: How often to check system status (default: 0.5 seconds)

Best Practices for Autoscaling

1. Adjust Based on Target Website

Different websites have different resource requirements. Sites with heavy JavaScript execution need more conservative settings:

// For JavaScript-heavy sites
const heavyJsCrawler = new PlaywrightCrawler({
    maxConcurrency: 10,
    autoscaledPoolOptions: {
        systemStatusOptions: {
            maxUsedMemoryRatio: 0.70, // More conservative
        },
    },
});

// For simple HTML sites
const lightCrawler = new CheerioCrawler({
    maxConcurrency: 200,
    autoscaledPoolOptions: {
        systemStatusOptions: {
            maxUsedMemoryRatio: 0.90, // More aggressive
        },
    },
});

2. Consider Your Infrastructure

Adjust thresholds based on available resources:

// For high-memory environments (16GB+)
const highMemCrawler = new PlaywrightCrawler({
    maxConcurrency: 100,
    autoscaledPoolOptions: {
        systemStatusOptions: {
            maxUsedMemoryRatio: 0.85,
            maxUsedCpuRatio: 0.90,
        },
    },
});

// For limited resources (4GB or less)
const lowMemCrawler = new PlaywrightCrawler({
    maxConcurrency: 5,
    autoscaledPoolOptions: {
        systemStatusOptions: {
            maxUsedMemoryRatio: 0.60,
            maxUsedCpuRatio: 0.70,
        },
    },
});

3. Monitor and Log Performance

Track autoscaling decisions to optimize your configuration:

import { PlaywrightCrawler, log } from 'crawlee';

log.setLevel(log.LEVELS.DEBUG);

const crawler = new PlaywrightCrawler({
    autoscaledPoolOptions: {
        loggingIntervalSecs: 60, // Log status every 60 seconds
    },

    async requestHandler({ page, request, log }) {
        const memUsage = process.memoryUsage();
        log.info('Memory usage', {
            heapUsed: `${Math.round(memUsage.heapUsed / 1024 / 1024)}MB`,
            concurrency: crawler.stats.state.requestsInProgress,
        });

        await page.goto(request.url);
    },
});

Advanced Autoscaling Scenarios

Dynamic Scaling Based on Response Times

You can implement custom scaling logic that considers response times:

import { PlaywrightCrawler } from 'crawlee';

let avgResponseTime = 0;
let requestCount = 0;

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        const startTime = Date.now();

        await page.goto(request.url);

        const responseTime = Date.now() - startTime;
        avgResponseTime = (avgResponseTime * requestCount + responseTime) / (requestCount + 1);
        requestCount++;

        // Adjust concurrency based on response time
        if (avgResponseTime > 5000 && crawler.autoscaledPool) {
            crawler.autoscaledPool.desiredConcurrency = Math.max(1,
                crawler.autoscaledPool.desiredConcurrency - 1);
        }
    },
});

Scaling with Proxies

When monitoring network requests through proxies, you may need different scaling strategies:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    proxyConfiguration: await ProxyConfiguration.create({
        proxyUrls: ['http://proxy1.com', 'http://proxy2.com'],
    }),

    // More conservative with proxies
    maxConcurrency: 30,
    autoscaledPoolOptions: {
        desiredConcurrency: 5,
        systemStatusOptions: {
            maxUsedCpuRatio: 0.80,
        },
    },
});

Troubleshooting Autoscaling Issues

Issue: Memory Leaks

If memory usage keeps growing:

const crawler = new PlaywrightCrawler({
    maxConcurrency: 10,
    autoscaledPoolOptions: {
        systemStatusOptions: {
            maxUsedMemoryRatio: 0.60, // Lower threshold
        },
    },

    // Close browser contexts regularly
    async requestHandler({ page, request }) {
        try {
            await page.goto(request.url);
        } finally {
            await page.context().close();
        }
    },
});

Issue: Too Conservative Scaling

If your crawler isn't utilizing available resources:

const crawler = new PlaywrightCrawler({
    autoscaledPoolOptions: {
        scaleUpStepRatio: 0.2, // Scale up faster
        scaleDownStepRatio: 0.05, // Scale down slower
        systemStatusOptions: {
            maxUsedMemoryRatio: 0.90,
            maxUsedCpuRatio: 0.95,
        },
    },
});

Performance Monitoring

Track autoscaling performance with Crawlee's statistics:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        await page.goto(request.url);
    },
});

setInterval(() => {
    const stats = crawler.stats.state;
    console.log('Crawler Stats:', {
        requestsInProgress: stats.requestsInProgress,
        requestsFinished: stats.requestsFinished,
        requestsFailed: stats.requestsFailed,
        avgRequestDuration: stats.requestAvgFinishedDurationMillis,
    });
}, 10000);

await crawler.run(['https://example.com']);

Conclusion

Autoscaling in Crawlee is a powerful feature that optimizes web scraping performance while maintaining system stability. By understanding and properly configuring autoscaling parameters, you can achieve the best balance between speed and resource utilization. Start with conservative settings and gradually adjust based on your specific use case, infrastructure, and target websites.

Remember that autoscaling works best when combined with other Crawlee features like request queues, retry logic, and proper error handling. The key is to monitor your crawler's performance and fine-tune the autoscaling parameters to match your operational requirements.

Table of contents