What is Autoscaling in Crawlee and How Does It Work?

Autoscaling in Crawlee is an intelligent system that automatically adjusts the number of concurrent requests and browser instances based on available system resources. This feature ensures optimal performance while preventing memory overflows and system crashes during web scraping operations.

Understanding Crawlee's Autoscaling Mechanism

Autoscaling dynamically monitors your system's CPU and memory usage to determine how many concurrent tasks can run safely. Instead of manually setting a fixed concurrency limit, Crawlee's AutoscaledPool continuously evaluates system load and adjusts the number of parallel operations in real-time.
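
In practice you only set the bounds and let the pool pick the actual level within them; making the bounds equal effectively pins concurrency to a fixed value. A minimal sketch, using the same options covered in detail below:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Autoscaling will choose a value between these bounds at runtime.
    minConcurrency: 1,
    maxConcurrency: 50,
    // Setting minConcurrency === maxConcurrency would pin a fixed level instead.
    async requestHandler({ page }) {
        console.log(await page.title());
    },
});

await crawler.run(['https://example.com']);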

Key Components of Autoscaling

The autoscaling system consists of three main components:

  1. Snapshotter - Periodically records snapshots of CPU usage, memory consumption, event loop delay, and API client state
  2. System Status - Aggregates recent snapshots to judge whether the system is currently overloaded or has been overloaded over a longer window
  3. Autoscaled Pool - Manages the pool of concurrent tasks and raises or lowers their number based on the reported system status

How Autoscaling Works Under the Hood

Crawlee's autoscaling algorithm operates on a feedback loop:

  1. Monitoring Phase: The system continuously checks CPU usage, available memory, and event loop health
  2. Analysis Phase: It compares current metrics against configurable thresholds
  3. Adjustment Phase: Based on the analysis, it either increases or decreases concurrency
  4. Cooldown Period: After each adjustment, the system waits before making another change to avoid oscillation

The autoscaler uses a conservative approach by default, prioritizing system stability over maximum speed.
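
The same feedback loop is also available on its own: the AutoscaledPool class can drive arbitrary asynchronous tasks outside of a crawler. A rough sketch, assuming the pool's standard task callbacks (runTaskFunction, isTaskReadyFunction, isFinishedFunction):

import { AutoscaledPool } from 'crawlee';

const urls = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c'];

const pool = new AutoscaledPool({
    minConcurrency: 1,
    maxConcurrency: 20,
    // Called whenever the pool decides it has capacity for another task.
    runTaskFunction: async () => {
        const url = urls.shift();
        if (!url) return;
        const response = await fetch(url);
        console.log(url, response.status);
    },
    // Tells the pool whether a task is ready to run right now.
    isTaskReadyFunction: async () => urls.length > 0,
    // Tells the pool when to resolve its run() promise.
    isFinishedFunction: async () => urls.length === 0,
});

await pool.run();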

Configuring Autoscaling in Crawlee

Basic Configuration with PlaywrightCrawler

Here's how to configure autoscaling when using Crawlee with Playwright for browser-based scraping:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Autoscaling configuration
    minConcurrency: 1,
    maxConcurrency: 50,
    autoscaledPoolOptions: {
        desiredConcurrency: 10,
        minConcurrency: 1,
        maxConcurrency: 50,
        // System load thresholds
        systemStatusOptions: {
            maxUsedCpuRatio: 0.90,
            maxUsedMemoryRatio: 0.85,
            maxEventLoopDelayMillis: 250,
        },
        // Scaling behavior
        scaleUpStepRatio: 0.1,
        scaleDownStepRatio: 0.25,
        maybeRunIntervalSecs: 0.5,
    },

    async requestHandler({ page, request }) {
        const title = await page.title();
        console.log(`Scraped: ${title}`);
    },
});

await crawler.run(['https://example.com']);

Configuration with CheerioCrawler

For lightweight HTTP requests that don't require a browser:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxConcurrency: 100, // Higher limits possible without browser overhead
    autoscaledPoolOptions: {
        desiredConcurrency: 20,
        maxConcurrency: 100,
        systemStatusOptions: {
            maxUsedCpuRatio: 0.95,
            maxUsedMemoryRatio: 0.90,
        },
    },

    async requestHandler({ $, request }) {
        const title = $('title').text();
        console.log(`Scraped: ${title}`);
    },
});

await crawler.run(['https://example.com']);

Python Implementation

Crawlee for Python also supports autoscaling; concurrency bounds are passed through a ConcurrencySettings object rather than as individual keyword arguments:

import asyncio

from crawlee import ConcurrencySettings
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Autoscaling parameters; CPU and memory thresholds are handled
    # internally by the library's Snapshotter and SystemStatus components.
    crawler = PlaywrightCrawler(
        concurrency_settings=ConcurrencySettings(
            min_concurrency=1,
            max_concurrency=50,
            desired_concurrency=10,
        ),
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        title = await context.page.title()
        print(f'Scraped: {title}')

    await crawler.run(['https://example.com'])


asyncio.run(main())

Key Autoscaling Parameters Explained

Concurrency Limits

  • minConcurrency: Minimum number of concurrent tasks (default: 1)
  • maxConcurrency: Maximum number of concurrent tasks (default: 200)
  • desiredConcurrency: Target concurrency level the system tries to maintain

System Status Thresholds

  • maxUsedCpuRatio: Maximum CPU usage (0-1 scale, default: 0.95)
  • maxUsedMemoryRatio: Maximum memory usage (0-1 scale, default: 0.7)
  • maxEventLoopDelayMillis: Maximum acceptable event loop delay in milliseconds

Scaling Behavior

  • scaleUpStepRatio: How aggressively to increase concurrency (default: 0.05)
  • scaleDownStepRatio: How aggressively to decrease concurrency (default: 0.05)
  • maybeRunIntervalSecs: How often the pool checks whether it has capacity to start another task (default: 0.5 seconds)
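
To make the step ratios concrete, here is a simplified sketch of the arithmetic behind one scaling decision (the real implementation also factors in system status, bounds checks, and cooldown intervals):

// Simplified sketch of one scale-up and one scale-down step.
const minConcurrency = 1;
const maxConcurrency = 50;
let desiredConcurrency = 20;

const scaleUpStepRatio = 0.1;
const scaleDownStepRatio = 0.25;

// Scale up: the step is a fraction of the current target, rounded up to at least 1.
const upStep = Math.ceil(desiredConcurrency * scaleUpStepRatio);            // ceil(20 * 0.1) = 2
desiredConcurrency = Math.min(maxConcurrency, desiredConcurrency + upStep); // 22

// Scale down: a larger ratio means the pool backs off faster when overloaded.
const downStep = Math.ceil(desiredConcurrency * scaleDownStepRatio);          // ceil(22 * 0.25) = 6
desiredConcurrency = Math.max(minConcurrency, desiredConcurrency - downStep); // 16

console.log(desiredConcurrency);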

Best Practices for Autoscaling

1. Adjust Based on Target Website

Different websites have different resource requirements. Sites with heavy JavaScript execution need more conservative settings:

// For JavaScript-heavy sites
const heavyJsCrawler = new PlaywrightCrawler({
    maxConcurrency: 10,
    autoscaledPoolOptions: {
        systemStatusOptions: {
            maxUsedMemoryRatio: 0.70, // More conservative
        },
    },
});

// For simple HTML sites
const lightCrawler = new CheerioCrawler({
    maxConcurrency: 200,
    autoscaledPoolOptions: {
        systemStatusOptions: {
            maxUsedMemoryRatio: 0.90, // More aggressive
        },
    },
});

2. Consider Your Infrastructure

Adjust thresholds based on available resources:

// For high-memory environments (16GB+)
const highMemCrawler = new PlaywrightCrawler({
    maxConcurrency: 100,
    autoscaledPoolOptions: {
        systemStatusOptions: {
            maxUsedMemoryRatio: 0.85,
            maxUsedCpuRatio: 0.90,
        },
    },
});

// For limited resources (4GB or less)
const lowMemCrawler = new PlaywrightCrawler({
    maxConcurrency: 5,
    autoscaledPoolOptions: {
        systemStatusOptions: {
            maxUsedMemoryRatio: 0.60,
            maxUsedCpuRatio: 0.70,
        },
    },
});
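
The memory ratios above are applied against the amount of memory Crawlee believes it can use, so on shared machines or in containers it can help to cap that figure explicitly. A sketch using the global Configuration (memoryMbytes corresponds to the CRAWLEE_MEMORY_MBYTES environment variable):

import { Configuration, PlaywrightCrawler } from 'crawlee';

// Tell Crawlee how much memory it may count on (in megabytes);
// the autoscaler's memory thresholds are then applied against this figure.
Configuration.getGlobalConfig().set('memoryMbytes', 4096);

const crawler = new PlaywrightCrawler({
    maxConcurrency: 20,
    async requestHandler({ page }) {
        console.log(await page.title());
    },
});

await crawler.run(['https://example.com']);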

3. Monitor and Log Performance

Track autoscaling decisions to optimize your configuration:

import { PlaywrightCrawler, log } from 'crawlee';

log.setLevel(log.LEVELS.DEBUG);

const crawler = new PlaywrightCrawler({
    autoscaledPoolOptions: {
        loggingIntervalSecs: 60, // Log status every 60 seconds
    },

    async requestHandler({ request, log }) {
        // The page is already navigated by the time the handler runs.
        const memUsage = process.memoryUsage();
        log.info('Memory usage', {
            url: request.url,
            heapUsed: `${Math.round(memUsage.heapUsed / 1024 / 1024)}MB`,
            concurrency: crawler.autoscaledPool?.currentConcurrency,
        });
    },
});

Advanced Autoscaling Scenarios

Dynamic Scaling Based on Response Times

You can implement custom scaling logic that considers response times, for example by timestamping each request in a pre-navigation hook:

import { PlaywrightCrawler } from 'crawlee';

let avgResponseTime = 0;
let requestCount = 0;

const crawler = new PlaywrightCrawler({
    // Record a timestamp just before Crawlee navigates to the page.
    preNavigationHooks: [
        async ({ request }) => {
            request.userData.navigationStartedAt = Date.now();
        },
    ],

    async requestHandler({ request }) {
        // Navigation has already happened, so the elapsed time approximates the response time.
        const responseTime = Date.now() - request.userData.navigationStartedAt;
        avgResponseTime = (avgResponseTime * requestCount + responseTime) / (requestCount + 1);
        requestCount++;

        // Lower the target concurrency when pages get slow (e.g. the site is throttling us)
        if (avgResponseTime > 5000 && crawler.autoscaledPool) {
            crawler.autoscaledPool.desiredConcurrency = Math.max(1,
                crawler.autoscaledPool.desiredConcurrency - 1);
        }
    },
});

Scaling with Proxies

When routing requests through proxies, you may want a different scaling strategy, since proxies add latency and can fail under load:

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const crawler = new PlaywrightCrawler({
    proxyConfiguration: new ProxyConfiguration({
        proxyUrls: ['http://proxy1.com', 'http://proxy2.com'],
    }),

    // More conservative with proxies
    maxConcurrency: 30,
    autoscaledPoolOptions: {
        desiredConcurrency: 5,
        systemStatusOptions: {
            maxUsedCpuRatio: 0.80,
        },
    },
});

Troubleshooting Autoscaling Issues

Issue: Memory Leaks

If memory usage keeps growing:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxConcurrency: 10,
    autoscaledPoolOptions: {
        systemStatusOptions: {
            maxUsedMemoryRatio: 0.60, // Lower threshold
        },
    },

    // Recycle browsers regularly instead of closing contexts by hand;
    // Crawlee manages page and context lifecycles itself.
    browserPoolOptions: {
        retireBrowserAfterPageCount: 50,
    },

    async requestHandler({ page }) {
        console.log(await page.title());
    },
});

Issue: Too Conservative Scaling

If your crawler isn't utilizing available resources:

const crawler = new PlaywrightCrawler({
    autoscaledPoolOptions: {
        scaleUpStepRatio: 0.2, // Scale up faster
        scaleDownStepRatio: 0.05, // Scale down slower
        systemStatusOptions: {
            maxUsedMemoryRatio: 0.90,
            maxUsedCpuRatio: 0.95,
        },
    },
});

Performance Monitoring

Track autoscaling performance with Crawlee's statistics:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page }) {
        console.log(await page.title());
    },
});

// Log a stats snapshot every 10 seconds while the crawler runs.
const statsInterval = setInterval(() => {
    console.log('Crawler Stats:', {
        currentConcurrency: crawler.autoscaledPool?.currentConcurrency,
        desiredConcurrency: crawler.autoscaledPool?.desiredConcurrency,
        requestsFinished: crawler.stats.state.requestsFinished,
        requestsFailed: crawler.stats.state.requestsFailed,
        avgRequestDurationMillis: crawler.stats.calculate().requestAvgFinishedDurationMillis,
    });
}, 10000);

await crawler.run(['https://example.com']);
clearInterval(statsInterval);

Conclusion

Autoscaling in Crawlee is a powerful feature that optimizes web scraping performance while maintaining system stability. By understanding and properly configuring autoscaling parameters, you can achieve the best balance between speed and resource utilization. Start with conservative settings and gradually adjust based on your specific use case, infrastructure, and target websites.

Remember that autoscaling works best when combined with other Crawlee features like request queues, retry logic, and proper error handling. The key is to monitor your crawler's performance and fine-tune the autoscaling parameters to match your operational requirements.
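
As a closing sketch, here is autoscaling combined with Crawlee's built-in retry handling (the retry count and handler bodies are illustrative, not prescriptive):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Autoscaling bounds
    minConcurrency: 2,
    maxConcurrency: 30,

    // Retry failed requests a few times before giving up.
    maxRequestRetries: 3,

    async requestHandler({ page, request }) {
        console.log(`Scraped: ${await page.title()} (${request.url})`);
    },

    // Runs after all retries for a request have been exhausted.
    async failedRequestHandler({ request }, error) {
        console.error(`Request ${request.url} failed: ${error.message}`);
    },
});

await crawler.run(['https://example.com']);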

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
