What is Autoscaling in Crawlee and How Does It Work?
Autoscaling in Crawlee is an intelligent system that automatically adjusts the number of concurrent requests and browser instances based on available system resources. This feature ensures optimal performance while preventing memory overflows and system crashes during web scraping operations.
Understanding Crawlee's Autoscaling Mechanism
Autoscaling dynamically monitors your system's CPU and memory usage to determine how many concurrent tasks can run safely. Instead of manually setting a fixed concurrency limit, Crawlee's AutoscaledPool
continuously evaluates system load and adjusts the number of parallel operations in real-time.
Key Components of Autoscaling
The autoscaling system consists of three main components:
- System Status Monitor - Tracks CPU usage, memory consumption, and event loop delays
- Autoscaled Pool - Manages the pool of concurrent tasks and adjusts their number
- Snapshot Tracker - Records historical performance data to make informed scaling decisions
How Autoscaling Works Under the Hood
Crawlee's autoscaling algorithm operates on a feedback loop:
- Monitoring Phase: The system continuously checks CPU usage, available memory, and event loop health
- Analysis Phase: It compares current metrics against configurable thresholds
- Adjustment Phase: Based on the analysis, it either increases or decreases concurrency
- Cooldown Period: After each adjustment, the system waits before making another change to avoid oscillation
The autoscaler uses a conservative approach by default, prioritizing system stability over maximum speed.
Configuring Autoscaling in Crawlee
Basic Configuration with PlaywrightCrawler
Here's how to configure autoscaling when using Crawlee with Playwright for handling browser sessions:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
// Autoscaling configuration
minConcurrency: 1,
maxConcurrency: 50,
autoscaledPoolOptions: {
desiredConcurrency: 10,
minConcurrency: 1,
maxConcurrency: 50,
// System load thresholds
systemStatusOptions: {
maxUsedCpuRatio: 0.90,
maxUsedMemoryRatio: 0.85,
maxEventLoopDelayMillis: 250,
},
// Scaling behavior
scaleUpStepRatio: 0.1,
scaleDownStepRatio: 0.25,
maybeRunIntervalSecs: 0.5,
},
async requestHandler({ page, request }) {
const title = await page.title();
console.log(`Scraped: ${title}`);
},
});
await crawler.run(['https://example.com']);
Configuration with CheerioCrawler
For lightweight HTTP requests that don't require a browser:
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
maxConcurrency: 100, // Higher limits possible without browser overhead
autoscaledPoolOptions: {
desiredConcurrency: 20,
maxConcurrency: 100,
systemStatusOptions: {
maxUsedCpuRatio: 0.95,
maxUsedMemoryRatio: 0.90,
},
},
async requestHandler({ $, request }) {
const title = $('title').text();
console.log(`Scraped: ${title}`);
},
});
await crawler.run(['https://example.com']);
Python Implementation
Crawlee for Python also supports autoscaling with similar configuration options:
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
crawler = PlaywrightCrawler(
min_concurrency=1,
max_concurrency=50,
# Autoscaling parameters
autoscaled_pool_options={
'desired_concurrency': 10,
'system_status_options': {
'max_used_cpu_ratio': 0.90,
'max_used_memory_ratio': 0.85,
},
}
)
@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
title = await context.page.title()
print(f'Scraped: {title}')
await crawler.run(['https://example.com'])
Key Autoscaling Parameters Explained
Concurrency Limits
minConcurrency
: Minimum number of concurrent tasks (default: 1)maxConcurrency
: Maximum number of concurrent tasks (default: 200)desiredConcurrency
: Target concurrency level the system tries to maintain
System Status Thresholds
maxUsedCpuRatio
: Maximum CPU usage (0-1 scale, default: 0.95)maxUsedMemoryRatio
: Maximum memory usage (0-1 scale, default: 0.7)maxEventLoopDelayMillis
: Maximum acceptable event loop delay in milliseconds
Scaling Behavior
scaleUpStepRatio
: How aggressively to increase concurrency (default: 0.05)scaleDownStepRatio
: How aggressively to decrease concurrency (default: 0.05)maybeRunIntervalSecs
: How often to check system status (default: 0.5 seconds)
Best Practices for Autoscaling
1. Adjust Based on Target Website
Different websites have different resource requirements. Sites with heavy JavaScript execution need more conservative settings:
// For JavaScript-heavy sites
const heavyJsCrawler = new PlaywrightCrawler({
maxConcurrency: 10,
autoscaledPoolOptions: {
systemStatusOptions: {
maxUsedMemoryRatio: 0.70, // More conservative
},
},
});
// For simple HTML sites
const lightCrawler = new CheerioCrawler({
maxConcurrency: 200,
autoscaledPoolOptions: {
systemStatusOptions: {
maxUsedMemoryRatio: 0.90, // More aggressive
},
},
});
2. Consider Your Infrastructure
Adjust thresholds based on available resources:
// For high-memory environments (16GB+)
const highMemCrawler = new PlaywrightCrawler({
maxConcurrency: 100,
autoscaledPoolOptions: {
systemStatusOptions: {
maxUsedMemoryRatio: 0.85,
maxUsedCpuRatio: 0.90,
},
},
});
// For limited resources (4GB or less)
const lowMemCrawler = new PlaywrightCrawler({
maxConcurrency: 5,
autoscaledPoolOptions: {
systemStatusOptions: {
maxUsedMemoryRatio: 0.60,
maxUsedCpuRatio: 0.70,
},
},
});
3. Monitor and Log Performance
Track autoscaling decisions to optimize your configuration:
import { PlaywrightCrawler, log } from 'crawlee';
log.setLevel(log.LEVELS.DEBUG);
const crawler = new PlaywrightCrawler({
autoscaledPoolOptions: {
loggingIntervalSecs: 60, // Log status every 60 seconds
},
async requestHandler({ page, request, log }) {
const memUsage = process.memoryUsage();
log.info('Memory usage', {
heapUsed: `${Math.round(memUsage.heapUsed / 1024 / 1024)}MB`,
concurrency: crawler.stats.state.requestsInProgress,
});
await page.goto(request.url);
},
});
Advanced Autoscaling Scenarios
Dynamic Scaling Based on Response Times
You can implement custom scaling logic that considers response times:
import { PlaywrightCrawler } from 'crawlee';
let avgResponseTime = 0;
let requestCount = 0;
const crawler = new PlaywrightCrawler({
async requestHandler({ page, request }) {
const startTime = Date.now();
await page.goto(request.url);
const responseTime = Date.now() - startTime;
avgResponseTime = (avgResponseTime * requestCount + responseTime) / (requestCount + 1);
requestCount++;
// Adjust concurrency based on response time
if (avgResponseTime > 5000 && crawler.autoscaledPool) {
crawler.autoscaledPool.desiredConcurrency = Math.max(1,
crawler.autoscaledPool.desiredConcurrency - 1);
}
},
});
Scaling with Proxies
When monitoring network requests through proxies, you may need different scaling strategies:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
proxyConfiguration: await ProxyConfiguration.create({
proxyUrls: ['http://proxy1.com', 'http://proxy2.com'],
}),
// More conservative with proxies
maxConcurrency: 30,
autoscaledPoolOptions: {
desiredConcurrency: 5,
systemStatusOptions: {
maxUsedCpuRatio: 0.80,
},
},
});
Troubleshooting Autoscaling Issues
Issue: Memory Leaks
If memory usage keeps growing:
const crawler = new PlaywrightCrawler({
maxConcurrency: 10,
autoscaledPoolOptions: {
systemStatusOptions: {
maxUsedMemoryRatio: 0.60, // Lower threshold
},
},
// Close browser contexts regularly
async requestHandler({ page, request }) {
try {
await page.goto(request.url);
} finally {
await page.context().close();
}
},
});
Issue: Too Conservative Scaling
If your crawler isn't utilizing available resources:
const crawler = new PlaywrightCrawler({
autoscaledPoolOptions: {
scaleUpStepRatio: 0.2, // Scale up faster
scaleDownStepRatio: 0.05, // Scale down slower
systemStatusOptions: {
maxUsedMemoryRatio: 0.90,
maxUsedCpuRatio: 0.95,
},
},
});
Performance Monitoring
Track autoscaling performance with Crawlee's statistics:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ page, request }) {
await page.goto(request.url);
},
});
setInterval(() => {
const stats = crawler.stats.state;
console.log('Crawler Stats:', {
requestsInProgress: stats.requestsInProgress,
requestsFinished: stats.requestsFinished,
requestsFailed: stats.requestsFailed,
avgRequestDuration: stats.requestAvgFinishedDurationMillis,
});
}, 10000);
await crawler.run(['https://example.com']);
Conclusion
Autoscaling in Crawlee is a powerful feature that optimizes web scraping performance while maintaining system stability. By understanding and properly configuring autoscaling parameters, you can achieve the best balance between speed and resource utilization. Start with conservative settings and gradually adjust based on your specific use case, infrastructure, and target websites.
Remember that autoscaling works best when combined with other Crawlee features like request queues, retry logic, and proper error handling. The key is to monitor your crawler's performance and fine-tune the autoscaling parameters to match your operational requirements.