How Do I Control Concurrency and Parallelism in Crawlee?
Controlling concurrency and parallelism in Crawlee is essential for optimizing web scraping performance, managing server load, and preventing rate limiting or IP bans. Crawlee provides powerful built-in mechanisms to control how many requests are processed simultaneously, allowing you to balance speed and resource consumption effectively.
Understanding Concurrency in Crawlee
Concurrency in Crawlee refers to the number of requests or pages being processed at the same time. Higher concurrency means faster scraping but also increases resource usage (CPU, memory, network bandwidth) and the risk of being detected or blocked by target websites.
Crawlee uses an AutoscaledPool under the hood that automatically manages concurrency based on available system resources, but you can also configure explicit limits to control the behavior precisely.
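To see what that pool does in isolation, here is a minimal sketch that drives AutoscaledPool directly; the URL list and the fetch-based task are illustrative, and the global fetch assumes Node 18+. Crawlers wire up the same three callbacks for you internally:
import { AutoscaledPool } from 'crawlee';

const urls = ['https://example.com/1', 'https://example.com/2', 'https://example.com/3'];

const pool = new AutoscaledPool({
  minConcurrency: 1,
  maxConcurrency: 5,
  // Called whenever the pool decides it has capacity for another task
  runTaskFunction: async () => {
    const url = urls.shift();
    if (!url) return;
    const response = await fetch(url);
    console.log(`${url} -> ${response.status}`);
  },
  // Tells the pool whether another task can start right now
  isTaskReadyFunction: async () => urls.length > 0,
  // Tells the pool when all work is done, so run() can resolve
  isFinishedFunction: async () => urls.length === 0,
});

await pool.run();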
Basic Concurrency Configuration
Setting Maximum Concurrency
The most straightforward way to control concurrency is by setting the maxConcurrency option when creating a crawler:
JavaScript/TypeScript:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
maxConcurrency: 10, // Process up to 10 pages simultaneously
requestHandler: async ({ page, request }) => {
console.log(`Processing: ${request.url}`);
// Your scraping logic here
},
});
await crawler.run(['https://example.com']);
Python:
import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import PlaywrightCrawler

async def main():
    crawler = PlaywrightCrawler(
        # In Crawlee for Python, concurrency is configured via ConcurrencySettings
        concurrency_settings=ConcurrencySettings(max_concurrency=10),  # Process up to 10 pages simultaneously
    )

    @crawler.router.default_handler
    async def request_handler(context):
        print(f'Processing: {context.request.url}')
        # Your scraping logic here

    await crawler.run(['https://example.com'])

asyncio.run(main())
Setting Minimum Concurrency
You can also set a minimum concurrency to guarantee a baseline level of parallelism:
JavaScript/TypeScript:
const crawler = new PlaywrightCrawler({
minConcurrency: 2, // Always maintain at least 2 concurrent requests
maxConcurrency: 20, // But never exceed 20
requestHandler: async ({ page, request }) => {
// Your scraping logic
},
});
Python:
crawler = PlaywrightCrawler(
    concurrency_settings=ConcurrencySettings(
        min_concurrency=2,   # Always maintain at least 2 concurrent requests
        max_concurrency=20,  # But never exceed 20
    ),
)
AutoscaledPool Configuration
Crawlee's AutoscaledPool automatically adjusts concurrency based on current system load. You can fine-tune its behavior with additional options:
JavaScript/TypeScript:
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
maxConcurrency: 50,
minConcurrency: 5,
// AutoscaledPool options
autoscaledPoolOptions: {
    // How often to take resource snapshots (values are in seconds, not milliseconds)
    snapshotterOptions: {
      eventLoopSnapshotIntervalSecs: 0.5,
      clientSnapshotIntervalSecs: 1,
    },
    // System load thresholds
    systemStatusOptions: {
      maxUsedCpuRatio: 0.9, // Scale down when CPU usage exceeds 90%
      maxUsedMemoryRatio: 0.7, // Scale down when memory usage exceeds 70%
      maxClientErrors: 5, // Rate-limit errors tolerated before the client is considered overloaded
    },
    // Concurrency the pool starts at before autoscaling takes over
    // (min/max are already set at the crawler level above)
    desiredConcurrency: 10,
},
requestHandler: async ({ $, request }) => {
// Your scraping logic
},
});
Python:
from crawlee import ConcurrencySettings
# Crawlee for Python has no CheerioCrawler; BeautifulSoupCrawler is its static-HTML counterpart
from crawlee.crawlers import BeautifulSoupCrawler

crawler = BeautifulSoupCrawler(
    # In Python, the autoscaled pool is configured through ConcurrencySettings;
    # CPU and memory thresholds are managed by the autoscaler internally
    concurrency_settings=ConcurrencySettings(
        min_concurrency=5,
        max_concurrency=50,
        desired_concurrency=10,  # Concurrency to start from before autoscaling takes over
    ),
)
Request Rate Limiting
To prevent overwhelming target servers or triggering rate limits, you can cap the request rate or add delays between requests:
JavaScript/TypeScript:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
maxConcurrency: 5,
maxRequestsPerMinute: 60, // Limit to 60 requests per minute
requestHandler: async ({ page, request }) => {
// Your scraping logic
},
});
You can also add custom delays:
const crawler = new PlaywrightCrawler({
maxConcurrency: 5,
requestHandler: async ({ page, request }) => {
// Your scraping logic
// Add a random delay between 1-3 seconds
await page.waitForTimeout(1000 + Math.random() * 2000);
},
});
Python:
import asyncio
import random

from crawlee import ConcurrencySettings
from crawlee.crawlers import PlaywrightCrawler

crawler = PlaywrightCrawler(
    concurrency_settings=ConcurrencySettings(
        max_concurrency=5,
        max_tasks_per_minute=60,  # Limit to 60 requests per minute
    ),
)

@crawler.router.default_handler
async def request_handler(context):
    # Your scraping logic
    # Add a random delay of 1-3 seconds
    await asyncio.sleep(1 + random.random() * 2)
Per-Domain Concurrency Limits
When scraping multiple domains, you might want to limit concurrency per domain so that no single server is overwhelmed. Crawlee doesn't have built-in per-domain limits, but you can combine the session pool with your own per-domain bookkeeping: the example below tags sessions by domain, and a sketch of an explicit per-domain limiter follows it.
JavaScript/TypeScript:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
maxConcurrency: 20, // Total concurrency across all domains
useSessionPool: true,
sessionPoolOptions: {
maxPoolSize: 20,
sessionOptions: {
maxUsageCount: 50, // Rotate sessions after 50 requests
},
},
requestHandler: async ({ page, request, session }) => {
// Tag sessions by domain for better tracking
const domain = new URL(request.url).hostname;
session.userData.domain = domain;
// Your scraping logic
},
});
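If you need a hard per-domain cap, one option is to gate navigation with your own counters in preNavigationHooks and release them in postNavigationHooks. This is a minimal sketch, not a built-in Crawlee feature: MAX_PER_DOMAIN and the counter map are illustrative, a request waiting for a slot still occupies one of the global concurrency slots, and a failed navigation would need extra handling to release its slot.
import { PlaywrightCrawler } from 'crawlee';

const MAX_PER_DOMAIN = 3; // illustrative per-domain ceiling
const active = new Map(); // hostname -> number of requests currently navigating
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const crawler = new PlaywrightCrawler({
  maxConcurrency: 20, // global ceiling still applies
  preNavigationHooks: [
    async ({ request }) => {
      const domain = new URL(request.url).hostname;
      // Wait until this domain has a free slot before navigating
      while ((active.get(domain) ?? 0) >= MAX_PER_DOMAIN) {
        await sleep(250);
      }
      active.set(domain, (active.get(domain) ?? 0) + 1);
    },
  ],
  postNavigationHooks: [
    async ({ request }) => {
      // Release the slot once navigation has completed
      const domain = new URL(request.url).hostname;
      active.set(domain, Math.max(0, (active.get(domain) ?? 0) - 1));
    },
  ],
  requestHandler: async ({ page, request }) => {
    // Your scraping logic
  },
});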
Monitoring and Adjusting Concurrency
You can monitor crawler performance and adjust concurrency dynamically:
JavaScript/TypeScript:
import { PlaywrightCrawler, log } from 'crawlee';

const crawler = new PlaywrightCrawler({
  maxConcurrency: 20,
  requestHandler: async ({ page, request, crawler }) => {
    // Live counters are available on the crawler's Statistics instance
    const { requestsFinished, requestsFailed } = crawler.stats.state;
    const { requestAvgFinishedDurationMillis } = crawler.stats.calculate();
    log.info('Crawler stats', {
      requestsFinished,
      requestsFailed,
      requestAvgFinishedDurationMillis,
    });
    // Your scraping logic
  },
});
Python:
import asyncio
from crawlee import ConcurrencySettings
from crawlee.crawlers import PlaywrightCrawler

async def main():
    crawler = PlaywrightCrawler(concurrency_settings=ConcurrencySettings(max_concurrency=20))

    @crawler.router.default_handler
    async def request_handler(context):
        print(f'Processing: {context.request.url}')  # Your scraping logic

    # In Python, run() returns the crawl's final statistics
    stats = await crawler.run(['https://example.com'])
    print(f'Requests finished: {stats.requests_finished}')
    print(f'Requests failed: {stats.requests_failed}')
    print(f'Average duration: {stats.request_avg_finished_duration}')

asyncio.run(main())
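To adjust concurrency while the crawl is running, you can reach the underlying pool through crawler.autoscaledPool and change its desiredConcurrency at runtime. The sketch below backs off when too many requests fail; the 20% failure threshold and the halving step are arbitrary choices for illustration, not Crawlee defaults:
import { PlaywrightCrawler, log } from 'crawlee';

const crawler = new PlaywrightCrawler({
  maxConcurrency: 20,
  requestHandler: async ({ page, request, crawler }) => {
    const { requestsFinished, requestsFailed } = crawler.stats.state;
    const total = requestsFinished + requestsFailed;

    // Halve the desired concurrency once more than 20% of requests have failed
    if (total > 20 && requestsFailed / total > 0.2 && crawler.autoscaledPool) {
      const pool = crawler.autoscaledPool;
      pool.desiredConcurrency = Math.max(2, Math.floor(pool.desiredConcurrency / 2));
      log.warning(`High failure ratio, lowering desired concurrency to ${pool.desiredConcurrency}`);
    }

    // Your scraping logic
  },
});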
Practical Concurrency Strategies
Strategy 1: Conservative Approach
For websites that are sensitive to scraping or have strict rate limits:
const crawler = new PlaywrightCrawler({
maxConcurrency: 2,
maxRequestsPerMinute: 30,
requestHandler: async ({ page, request }) => {
await page.waitForTimeout(2000); // 2-second delay per request
// Scraping logic
},
});
Strategy 2: Balanced Approach
For most general-purpose scraping tasks:
const crawler = new CheerioCrawler({
maxConcurrency: 10,
minConcurrency: 3,
maxRequestsPerMinute: 120,
requestHandler: async ({ $, request }) => {
// Scraping logic
},
});
Strategy 3: Aggressive Approach
For high-performance scraping with robust infrastructure:
const crawler = new PlaywrightCrawler({
maxConcurrency: 50,
minConcurrency: 10,
autoscaledPoolOptions: {
systemStatusOptions: {
maxUsedCpuRatio: 0.95,
maxUsedMemoryRatio: 0.85,
},
},
requestHandler: async ({ page, request }) => {
// Scraping logic
},
});
Combining with Proxy Rotation
When you use proxy rotation (with either browser-based or HTTP crawlers), higher concurrency becomes more viable because requests are distributed across multiple IP addresses:
JavaScript/TypeScript:
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const crawler = new PlaywrightCrawler({
  maxConcurrency: 30, // Higher concurrency with proxies
  proxyConfiguration: new ProxyConfiguration({
    proxyUrls: [
      'http://proxy1.example.com:8000',
      'http://proxy2.example.com:8000',
      'http://proxy3.example.com:8000',
    ],
  }),
requestHandler: async ({ page, request }) => {
// Scraping logic
},
});
Performance Optimization Tips
- Match Crawler Type to Content: Use CheerioCrawler for static HTML (it supports the highest concurrency) and PlaywrightCrawler for JavaScript-rendered content (lower concurrency due to browser resource usage).
- Monitor Resource Usage: Keep an eye on CPU and memory usage. If you're hitting system limits, reduce maxConcurrency.
- Respect Target Servers: Start with conservative settings and gradually increase concurrency while monitoring for errors or blocks.
- Use Request Queues Efficiently: Crawlee's request queue management works best when you let the autoscaler handle concurrency automatically.
- Test Different Settings: Run benchmarks with different concurrency settings to find the optimal balance for your specific use case; a sketch of one way to do this follows this list.
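One way to benchmark settings is to run the same small crawl under different concurrency limits and compare the final statistics that run() returns. This is a minimal sketch: the benchmarkConcurrency helper and the queue names are illustrative, and each run uses its own named request queue so that already-handled requests are not skipped.
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Hypothetical helper: crawl the same URLs with a given maxConcurrency and report throughput
async function benchmarkConcurrency(maxConcurrency, urls) {
  // A separate named queue per run so previously handled requests are not skipped
  const requestQueue = await RequestQueue.open(`benchmark-${maxConcurrency}`);

  const crawler = new CheerioCrawler({
    requestQueue,
    maxConcurrency,
    requestHandler: async ({ $, request }) => {
      console.log(`${request.url}: ${$('title').text()}`); // minimal work per page
    },
  });

  // run() resolves with the crawl's final statistics
  const stats = await crawler.run(urls);
  console.log(
    `maxConcurrency=${maxConcurrency}: ${stats.requestsFinished} finished, `
      + `${stats.requestsFailed} failed, ${stats.requestsFinishedPerMinute} req/min`,
  );
}

for (const concurrency of [5, 10, 20]) {
  await benchmarkConcurrency(concurrency, ['https://example.com']);
}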
Common Pitfalls to Avoid
- Too High Concurrency: Can lead to memory exhaustion, especially with browser-based crawlers
- Too Low Concurrency: Wastes resources and makes scraping unnecessarily slow
- Ignoring Rate Limits: Can result in IP bans or temporary blocks
- Not Using Sessions: Makes it harder to manage cookies and handle authentication across concurrent requests
Conclusion
Controlling concurrency and parallelism in Crawlee is crucial for building efficient, respectful, and reliable web scrapers. Start with moderate settings, monitor your crawler's performance, and adjust based on your specific requirements and the target website's tolerance. Crawlee's autoscaling features make it easy to achieve optimal performance with minimal configuration, while still providing fine-grained control when needed.
Remember to always respect robots.txt, implement appropriate delays, and monitor your scraping activities to ensure you're being a good web citizen while maximizing efficiency.