How Do I Control Concurrency and Parallelism in Crawlee?
Controlling concurrency and parallelism in Crawlee is essential for optimizing web scraping performance, managing server load, and preventing rate limiting or IP bans. Crawlee provides powerful built-in mechanisms to control how many requests are processed simultaneously, allowing you to balance speed and resource consumption effectively.
Understanding Concurrency in Crawlee
Concurrency in Crawlee refers to the number of requests or pages being processed at the same time. Higher concurrency means faster scraping but also increases resource usage (CPU, memory, network bandwidth) and the risk of being detected or blocked by target websites.
Crawlee uses an AutoscaledPool under the hood that automatically manages concurrency based on available system resources, but you can also configure explicit limits to control the behavior precisely.
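To see what that pool does in isolation, here is a minimal sketch that drives AutoscaledPool directly; the URL list and the fetch-based task are illustrative, and the global fetch assumes Node 18+. Crawlers wire up the same three callbacks for you internally:
import { AutoscaledPool } from 'crawlee';

const urls = ['https://example.com/1', 'https://example.com/2', 'https://example.com/3'];

const pool = new AutoscaledPool({
  minConcurrency: 1,
  maxConcurrency: 5,
  // Called whenever the pool decides it has capacity for another task
  runTaskFunction: async () => {
    const url = urls.shift();
    if (!url) return;
    const response = await fetch(url);
    console.log(`${url} -> ${response.status}`);
  },
  // Tells the pool whether another task can start right now
  isTaskReadyFunction: async () => urls.length > 0,
  // Tells the pool when all work is done, so run() can resolve
  isFinishedFunction: async () => urls.length === 0,
});

await pool.run();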
Basic Concurrency Configuration
Setting Maximum Concurrency
The most straightforward way to control concurrency is by setting the maxConcurrency option when creating a crawler:
JavaScript/TypeScript:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
maxConcurrency: 10, // Process up to 10 pages simultaneously
requestHandler: async ({ page, request }) => {
console.log(`Processing: ${request.url}`);
// Your scraping logic here
},
});
await crawler.run(['https://example.com']);
Python:
import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import PlaywrightCrawler

async def main():
    crawler = PlaywrightCrawler(
        # In Crawlee for Python, concurrency is configured via ConcurrencySettings
        concurrency_settings=ConcurrencySettings(max_concurrency=10),  # Process up to 10 pages simultaneously
    )

    @crawler.router.default_handler
    async def request_handler(context):
        print(f'Processing: {context.request.url}')
        # Your scraping logic here

    await crawler.run(['https://example.com'])

asyncio.run(main())
Setting Minimum Concurrency
You can also set a minimum concurrency to guarantee a baseline level of parallelism:
JavaScript/TypeScript:
const crawler = new PlaywrightCrawler({
minConcurrency: 2, // Always maintain at least 2 concurrent requests
maxConcurrency: 20, // But never exceed 20
requestHandler: async ({ page, request }) => {
// Your scraping logic
},
});
Python:
crawler = PlaywrightCrawler(
    concurrency_settings=ConcurrencySettings(
        min_concurrency=2,   # Always maintain at least 2 concurrent requests
        max_concurrency=20,  # But never exceed 20
    ),
)
AutoscaledPool Configuration
Crawlee's AutoscaledPool automatically adjusts concurrency based on current system load. You can fine-tune its behavior with additional options:
JavaScript/TypeScript:
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
maxConcurrency: 50,
minConcurrency: 5,
// AutoscaledPool options
autoscaledPoolOptions: {
    // How often to take resource snapshots (values are in seconds, not milliseconds)
    snapshotterOptions: {
      eventLoopSnapshotIntervalSecs: 0.5,
      clientSnapshotIntervalSecs: 1,
    },
    // System load thresholds
    systemStatusOptions: {
      maxUsedCpuRatio: 0.9, // Scale down when CPU usage exceeds 90%
      maxUsedMemoryRatio: 0.7, // Scale down when memory usage exceeds 70%
      maxClientErrors: 5, // Rate-limit errors tolerated before the client is considered overloaded
    },
    // Concurrency the pool starts at before autoscaling takes over
    // (min/max are already set at the crawler level above)
    desiredConcurrency: 10,
},
requestHandler: async ({ $, request }) => {
// Your scraping logic
},
});
Python:
from crawlee import ConcurrencySettings
# Crawlee for Python has no CheerioCrawler; BeautifulSoupCrawler is its static-HTML counterpart
from crawlee.crawlers import BeautifulSoupCrawler

crawler = BeautifulSoupCrawler(
    # In Python, the autoscaled pool is configured through ConcurrencySettings;
    # CPU and memory thresholds are managed by the autoscaler internally
    concurrency_settings=ConcurrencySettings(
        min_concurrency=5,
        max_concurrency=50,
        desired_concurrency=10,  # Concurrency to start from before autoscaling takes over
    ),
)
Request Rate Limiting
To prevent overwhelming target servers or triggering rate limits, you can cap the request rate or add delays between requests:
JavaScript/TypeScript:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
maxConcurrency: 5,
maxRequestsPerMinute: 60, // Limit to 60 requests per minute
requestHandler: async ({ page, request }) => {
// Your scraping logic
},
});
You can also add custom delays:
const crawler = new PlaywrightCrawler({
maxConcurrency: 5,
requestHandler: async ({ page, request }) => {
// Your scraping logic
// Add a random delay between 1-3 seconds
await page.waitForTimeout(1000 + Math.random() * 2000);
},
});
Python:
import asyncio
import random

from crawlee import ConcurrencySettings
from crawlee.crawlers import PlaywrightCrawler

crawler = PlaywrightCrawler(
    concurrency_settings=ConcurrencySettings(
        max_concurrency=5,
        max_tasks_per_minute=60,  # Limit to 60 requests per minute
    ),
)

@crawler.router.default_handler
async def request_handler(context):
    # Your scraping logic
    # Add a random delay of 1-3 seconds
    await asyncio.sleep(1 + random.random() * 2)
Per-Domain Concurrency Limits
When scraping multiple domains, you might want to limit concurrency per domain so that no single server is overwhelmed. Crawlee doesn't have built-in per-domain limits, but you can combine the session pool with your own per-domain bookkeeping: the example below tags sessions by domain, and a sketch of an explicit per-domain limiter follows it.
JavaScript/TypeScript:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
maxConcurrency: 20, // Total concurrency across all domains
useSessionPool: true,
sessionPoolOptions: {
maxPoolSize: 20,
sessionOptions: {
maxUsageCount: 50, // Rotate sessions after 50 requests
},
},
requestHandler: async ({ page, request, session }) => {
// Tag sessions by domain for better tracking
const domain = new URL(request.url).hostname;
session.userData.domain = domain;
// Your scraping logic
},
});
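If you need a hard per-domain cap, one option is to gate navigation with your own counters in preNavigationHooks and release them in postNavigationHooks. This is a minimal sketch, not a built-in Crawlee feature: MAX_PER_DOMAIN and the counter map are illustrative, a request waiting for a slot still occupies one of the global concurrency slots, and a failed navigation would need extra handling to release its slot.
import { PlaywrightCrawler } from 'crawlee';

const MAX_PER_DOMAIN = 3; // illustrative per-domain ceiling
const active = new Map(); // hostname -> number of requests currently navigating
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const crawler = new PlaywrightCrawler({
  maxConcurrency: 20, // global ceiling still applies
  preNavigationHooks: [
    async ({ request }) => {
      const domain = new URL(request.url).hostname;
      // Wait until this domain has a free slot before navigating
      while ((active.get(domain) ?? 0) >= MAX_PER_DOMAIN) {
        await sleep(250);
      }
      active.set(domain, (active.get(domain) ?? 0) + 1);
    },
  ],
  postNavigationHooks: [
    async ({ request }) => {
      // Release the slot once navigation has completed
      const domain = new URL(request.url).hostname;
      active.set(domain, Math.max(0, (active.get(domain) ?? 0) - 1));
    },
  ],
  requestHandler: async ({ page, request }) => {
    // Your scraping logic
  },
});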
Monitoring and Adjusting Concurrency
You can monitor crawler performance and adjust concurrency dynamically:
JavaScript/TypeScript:
import { PlaywrightCrawler, log } from 'crawlee';

const crawler = new PlaywrightCrawler({
  maxConcurrency: 20,
  requestHandler: async ({ page, request, crawler }) => {
    // Live counters are available on the crawler's Statistics instance
    const { requestsFinished, requestsFailed } = crawler.stats.state;
    const { requestAvgFinishedDurationMillis } = crawler.stats.calculate();
    log.info('Crawler stats', {
      requestsFinished,
      requestsFailed,
      requestAvgFinishedDurationMillis,
    });
    // Your scraping logic
  },
});
Python:
import asyncio
from crawlee import ConcurrencySettings
from crawlee.crawlers import PlaywrightCrawler

async def main():
    crawler = PlaywrightCrawler(concurrency_settings=ConcurrencySettings(max_concurrency=20))

    @crawler.router.default_handler
    async def request_handler(context):
        print(f'Processing: {context.request.url}')  # Your scraping logic

    # In Python, run() returns the crawl's final statistics
    stats = await crawler.run(['https://example.com'])
    print(f'Requests finished: {stats.requests_finished}')
    print(f'Requests failed: {stats.requests_failed}')
    print(f'Average duration: {stats.request_avg_finished_duration}')

asyncio.run(main())
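To adjust concurrency while the crawl is running, you can reach the underlying pool through crawler.autoscaledPool and change its desiredConcurrency at runtime. The sketch below backs off when too many requests fail; the 20% failure threshold and the halving step are arbitrary choices for illustration, not Crawlee defaults:
import { PlaywrightCrawler, log } from 'crawlee';

const crawler = new PlaywrightCrawler({
  maxConcurrency: 20,
  requestHandler: async ({ page, request, crawler }) => {
    const { requestsFinished, requestsFailed } = crawler.stats.state;
    const total = requestsFinished + requestsFailed;

    // Halve the desired concurrency once more than 20% of requests have failed
    if (total > 20 && requestsFailed / total > 0.2 && crawler.autoscaledPool) {
      const pool = crawler.autoscaledPool;
      pool.desiredConcurrency = Math.max(2, Math.floor(pool.desiredConcurrency / 2));
      log.warning(`High failure ratio, lowering desired concurrency to ${pool.desiredConcurrency}`);
    }

    // Your scraping logic
  },
});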
Practical Concurrency Strategies
Strategy 1: Conservative Approach
For websites that are sensitive to scraping or have strict rate limits:
const crawler = new PlaywrightCrawler({
maxConcurrency: 2,
maxRequestsPerMinute: 30,
requestHandler: async ({ page, request }) => {
await page.waitForTimeout(2000); // 2-second delay per request
// Scraping logic
},
});
Strategy 2: Balanced Approach
For most general-purpose scraping tasks:
const crawler = new CheerioCrawler({
maxConcurrency: 10,
minConcurrency: 3,
maxRequestsPerMinute: 120,
requestHandler: async ({ $, request }) => {
// Scraping logic
},
});
Strategy 3: Aggressive Approach
For high-performance scraping with robust infrastructure:
const crawler = new PlaywrightCrawler({
maxConcurrency: 50,
minConcurrency: 10,
autoscaledPoolOptions: {
systemStatusOptions: {
maxUsedCpuRatio: 0.95,
maxUsedMemoryRatio: 0.85,
},
},
requestHandler: async ({ page, request }) => {
// Scraping logic
},
});
Combining with Proxy Rotation
When you use proxy rotation (with either browser-based or HTTP crawlers), higher concurrency becomes more viable because requests are distributed across multiple IP addresses:
JavaScript/TypeScript:
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const crawler = new PlaywrightCrawler({
  maxConcurrency: 30, // Higher concurrency with proxies
  proxyConfiguration: new ProxyConfiguration({
    proxyUrls: [
      'http://proxy1.example.com:8000',
      'http://proxy2.example.com:8000',
      'http://proxy3.example.com:8000',
    ],
  }),
requestHandler: async ({ page, request }) => {
// Scraping logic
},
});
Performance Optimization Tips
- Match Crawler Type to Content: Use CheerioCrawler for static HTML (it supports the highest concurrency) and PlaywrightCrawler for JavaScript-rendered content (lower concurrency due to browser resource usage).
- Monitor Resource Usage: Keep an eye on CPU and memory usage. If you're hitting system limits, reduce maxConcurrency.
- Respect Target Servers: Start with conservative settings and gradually increase concurrency while monitoring for errors or blocks.
- Use Request Queues Efficiently: Crawlee's request queue management works best when you let the autoscaler handle concurrency automatically.
- Test Different Settings: Run benchmarks with different concurrency settings to find the optimal balance for your specific use case; a sketch of one way to do this follows this list.
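One way to benchmark settings is to run the same small crawl under different concurrency limits and compare the final statistics that run() returns. This is a minimal sketch: the benchmarkConcurrency helper and the queue names are illustrative, and each run uses its own named request queue so that already-handled requests are not skipped.
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Hypothetical helper: crawl the same URLs with a given maxConcurrency and report throughput
async function benchmarkConcurrency(maxConcurrency, urls) {
  // A separate named queue per run so previously handled requests are not skipped
  const requestQueue = await RequestQueue.open(`benchmark-${maxConcurrency}`);

  const crawler = new CheerioCrawler({
    requestQueue,
    maxConcurrency,
    requestHandler: async ({ $, request }) => {
      console.log(`${request.url}: ${$('title').text()}`); // minimal work per page
    },
  });

  // run() resolves with the crawl's final statistics
  const stats = await crawler.run(urls);
  console.log(
    `maxConcurrency=${maxConcurrency}: ${stats.requestsFinished} finished, `
      + `${stats.requestsFailed} failed, ${stats.requestsFinishedPerMinute} req/min`,
  );
}

for (const concurrency of [5, 10, 20]) {
  await benchmarkConcurrency(concurrency, ['https://example.com']);
}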
Common Pitfalls to Avoid
- Too High Concurrency: Can lead to memory exhaustion, especially with browser-based crawlers
- Too Low Concurrency: Wastes resources and makes scraping unnecessarily slow
- Ignoring Rate Limits: Can result in IP bans or temporary blocks
- Not Using Sessions: Makes it harder to manage cookies and handle authentication across concurrent requests
Conclusion
Controlling concurrency and parallelism in Crawlee is crucial for building efficient, respectful, and reliable web scrapers. Start with moderate settings, monitor your crawler's performance, and adjust based on your specific requirements and the target website's tolerance. Crawlee's autoscaling features make it easy to achieve optimal performance with minimal configuration, while still providing fine-grained control when needed.
Remember to always respect robots.txt, implement appropriate delays, and monitor your scraping activities to ensure you're being a good web citizen while maximizing efficiency.