How do I manage request queues in Crawlee?
Request queues are a fundamental component of Crawlee's architecture, enabling efficient management of URLs to be crawled. Understanding how to properly manage request queues is essential for building scalable and reliable web scrapers. This guide covers everything from basic queue operations to advanced patterns for handling complex scraping scenarios.
What is a Request Queue in Crawlee?
A request queue in Crawlee is a data structure that stores and manages URLs (requests) to be processed by your crawler. It handles deduplication, persistence, and provides methods for adding, retrieving, and marking requests as processed. Crawlee automatically manages the queue lifecycle, ensuring that each URL is processed only once and handling retries for failed requests.
The request queue provides several key benefits:
- Automatic deduplication: Prevents processing the same URL multiple times (see the short sketch after this list)
- Persistence: Stores requests on disk or in cloud storage for resumable crawls
- Prioritization: Lets you push important requests to the front of the queue
- Concurrency management: Coordinates parallel request processing
- Retry handling: Automatically retries failed requests with configurable policies
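As a minimal sketch of the deduplication behaviour (assuming the default uniqueKey, which Crawlee derives from the normalized URL), adding the same URL twice only enqueues it once:
import { RequestQueue } from 'crawlee';
const queue = await RequestQueue.open();
// Deduplication is keyed on the request's uniqueKey, derived from the URL by default
const first = await queue.addRequest({ url: 'https://example.com/page' });
const second = await queue.addRequest({ url: 'https://example.com/page' });
console.log(first.wasAlreadyPresent);  // false
console.log(second.wasAlreadyPresent); // true – the duplicate was not enqueued again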
Basic Request Queue Usage
Creating and Adding Requests
Crawlee automatically creates a default request queue when you instantiate a crawler. Here's how to add requests to the queue:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        console.log(`Processing: ${request.url}`);
        // Add new requests to the queue
        await enqueueLinks({
            selector: 'a[href]',
            strategy: 'same-domain',
        });
    },
});
// Add initial URLs to the queue
await crawler.addRequests([
    'https://example.com',
    'https://example.com/products',
    { url: 'https://example.com/about', label: 'about-page' },
]);
await crawler.run();
Python Implementation
Crawlee for Python follows the same pattern:
from crawlee import Request
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def request_handler(context: PlaywrightCrawlingContext) -> None:
    print(f'Processing: {context.request.url}')
    # Add new requests to the queue
    await context.enqueue_links(
        selector='a[href]',
        strategy='same-domain',
    )

crawler = PlaywrightCrawler(request_handler=request_handler)

# Add initial URLs; use Request.from_url() to attach a label
await crawler.add_requests([
    'https://example.com',
    'https://example.com/products',
    Request.from_url('https://example.com/about', label='about-page'),
])

await crawler.run()
Advanced Queue Management Techniques
Using Request Labels for Routing
Labels allow you to categorize requests and apply different processing logic based on the request type:
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        if (request.label === 'category') {
            // Process category page
            console.log(`Processing category: ${request.url}`);
            // Enqueue product links
            await enqueueLinks({
                selector: '.product-link',
                label: 'product',
            });
        } else if (request.label === 'product') {
            // Process product page
            const title = $('h1.product-title').text();
            const price = $('.product-price').text();
            console.log({ title, price });
        }
    },
});
await crawler.addRequests([
    { url: 'https://example.com/category/electronics', label: 'category' },
    { url: 'https://example.com/category/books', label: 'category' },
]);
await crawler.run();
Custom Request Filtering
You can implement custom logic to filter which requests should be added to the queue:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        await enqueueLinks({
            // Only enqueue links matching specific patterns
            transformRequestFunction: (req) => {
                // Skip static resources
                if (req.url.includes('/static/') || req.url.includes('/assets/')) {
                    return false;
                }
                // Add custom data to the request
                req.userData = {
                    depth: (request.userData.depth || 0) + 1,
                    parentUrl: request.url,
                };
                // Skip if depth exceeds limit
                if (req.userData.depth > 3) {
                    return false;
                }
                return req;
            },
        });
    },
});
await crawler.run(['https://example.com']);
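For straightforward URL filtering, enqueueLinks also accepts globs and exclude patterns, which can often replace a custom transform function; a brief sketch (inside the requestHandler, with illustrative patterns):
await enqueueLinks({
    // Only follow product URLs, and skip static assets entirely
    globs: ['https://example.com/products/**'],
    exclude: ['**/static/**', '**/assets/**'],
});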
Prioritizing Requests
Crawlee requests do not carry a numeric priority value. Instead, the forefront option places a request at the head of the queue so it is processed before anything already waiting:
import { RequestQueue } from 'crawlee';
const requestQueue = await RequestQueue.open();
// forefront: true puts the request at the head of the queue
await requestQueue.addRequest(
    { url: 'https://example.com/important-page' },
    { forefront: true },
);
// Without forefront, requests are appended to the tail and processed in order
await requestQueue.addRequest({ url: 'https://example.com/normal-page' });
Working with RequestQueue Directly
For advanced use cases, you can interact with the request queue directly:
import { RequestQueue } from 'crawlee';
// Get or create a named queue
const requestQueue = await RequestQueue.open('my-queue');
// Add a single request
await requestQueue.addRequest({
    url: 'https://example.com',
    userData: { category: 'electronics' },
});
// Add multiple requests in batch
await requestQueue.addRequestsBatched([
    { url: 'https://example.com/page1' },
    { url: 'https://example.com/page2' },
    { url: 'https://example.com/page3' },
]);
// Check if queue is empty
const isEmpty = await requestQueue.isEmpty();
console.log(`Queue is empty: ${isEmpty}`);
// Get queue info
const info = await requestQueue.getInfo();
console.log(`Pending requests: ${info.pendingRequestCount}`);
console.log(`Handled requests: ${info.handledRequestCount}`);
// Fetch a request to process
const request = await requestQueue.fetchNextRequest();
if (request) {
    console.log(`Processing: ${request.url}`);
    // Mark request as handled
    await requestQueue.markRequestHandled(request);
}
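If your own processing of a fetched request fails, it should not be marked as handled; you can return it to the queue with reclaimRequest so it gets retried. A short sketch (the forefront option is optional):
const next = await requestQueue.fetchNextRequest();
if (next) {
    try {
        // ... process the request ...
        await requestQueue.markRequestHandled(next);
    } catch (error) {
        // Put the request back; forefront: true retries it before the rest of the queue
        await requestQueue.reclaimRequest(next, { forefront: true });
    }
}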
Request Queue Persistence and Storage
Crawlee stores request queues persistently, allowing you to resume interrupted crawls:
import { PlaywrightCrawler, Configuration } from 'crawlee';
// The storage location itself is controlled by the CRAWLEE_STORAGE_DIR
// environment variable (./storage by default). Keep the stored data and
// skip the default purge at startup so an interrupted run can resume.
const config = new Configuration({
    persistStorage: true,
    purgeOnStart: false,
});
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page }) => {
        console.log(`Processing: ${request.url}`);
        // Your scraping logic
    },
}, config);
// With purging disabled, this resumes from where it left off if interrupted
await crawler.run([
    'https://example.com',
]);
Using Named Queues
Named queues allow you to maintain multiple independent queues:
import { CheerioCrawler, RequestQueue } from 'crawlee';
// Create separate queues for different tasks
const highPriorityQueue = await RequestQueue.open('high-priority');
const lowPriorityQueue = await RequestQueue.open('low-priority');
// Add requests to specific queues
await highPriorityQueue.addRequest({
    url: 'https://example.com/urgent',
});
await lowPriorityQueue.addRequest({
    url: 'https://example.com/batch-job',
});
// Use a specific queue with a crawler
const crawler = new CheerioCrawler({
    requestQueue: highPriorityQueue,
    requestHandler: async ({ request, $ }) => {
        console.log(`Processing high priority: ${request.url}`);
    },
});
await crawler.run();
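Unlike the default queue, named queues are kept between runs rather than purged at startup; once a queue's work is finished you can remove it explicitly, as in this short sketch:
// Delete the named queue and all of its stored requests when it is no longer needed
await lowPriorityQueue.drop();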
Handling Dynamic URL Discovery
When crawling sites with dynamically generated links, proper queue management becomes crucial. Similar to how you might handle AJAX requests using Puppeteer to wait for dynamic content, Crawlee provides built-in mechanisms to discover and queue URLs that appear after page interactions:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        // Wait for initial content
        await page.waitForSelector('.product-list', { timeout: 5000 });
        // Scroll to load more content
        let previousHeight = 0;
        let currentHeight = await page.evaluate(() => document.body.scrollHeight);
        while (previousHeight !== currentHeight) {
            previousHeight = currentHeight;
            await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
            await page.waitForTimeout(2000);
            currentHeight = await page.evaluate(() => document.body.scrollHeight);
        }
        // Enqueue all discovered links
        await enqueueLinks({
            selector: 'a.product-link',
            strategy: 'same-domain',
        });
    },
});
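Crawlee also bundles an infiniteScroll helper for Playwright (exposed via playwrightUtils) that can stand in for the manual scroll loop; a minimal sketch using its default options:
import { PlaywrightCrawler, playwrightUtils } from 'crawlee';
const crawler = new PlaywrightCrawler({
    async requestHandler({ page, enqueueLinks }) {
        await page.waitForSelector('.product-list', { timeout: 5000 });
        // Keeps scrolling until no new content appears
        await playwrightUtils.infiniteScroll(page);
        await enqueueLinks({
            selector: 'a.product-link',
            strategy: 'same-domain',
        });
    },
});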
Best Practices for Queue Management
1. Use Request Labels for Complex Workflows
Organize your crawling logic by categorizing requests with labels:
const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        switch (request.label) {
            case 'START':
                await enqueueLinks({
                    selector: '.category-link',
                    label: 'CATEGORY',
                });
                break;
            case 'CATEGORY':
                await enqueueLinks({
                    selector: '.product-link',
                    label: 'PRODUCT',
                });
                break;
            case 'PRODUCT':
                // Extract and store product data (saveProductData is sketched below)
                await saveProductData($);
                break;
        }
    },
});
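The saveProductData call above is a placeholder for your own storage logic; one possible sketch pushes the extracted fields into Crawlee's default Dataset (the selectors are illustrative):
import { Dataset } from 'crawlee';
// Hypothetical helper: extract fields with Cheerio and store them in the default Dataset
async function saveProductData($) {
    await Dataset.pushData({
        title: $('h1.product-title').text().trim(),
        price: $('.product-price').text().trim(),
    });
}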
2. Implement Depth Limiting
Prevent infinite crawling by tracking and limiting crawl depth:
const MAX_DEPTH = 3;
const crawler = new PlaywrightCrawler({
    async requestHandler({ request, enqueueLinks }) {
        const currentDepth = request.userData.depth || 0;
        if (currentDepth < MAX_DEPTH) {
            await enqueueLinks({
                transformRequestFunction: (req) => {
                    req.userData = { depth: currentDepth + 1 };
                    return req;
                },
            });
        }
    },
});
// Initialize with depth 0
await crawler.addRequests([
    { url: 'https://example.com', userData: { depth: 0 } },
]);
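As an extra safety net alongside depth tracking, crawlers accept a maxRequestsPerCrawl option that hard-caps the total number of processed requests; a short sketch:
const crawler = new PlaywrightCrawler({
    // Stop after 500 requests, no matter how many URLs were discovered
    maxRequestsPerCrawl: 500,
    async requestHandler({ request, enqueueLinks }) {
        await enqueueLinks();
    },
});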
3. Monitor Queue Statistics
Keep track of queue performance and progress:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
    async requestHandler({ request, crawler }) {
        // Your scraping logic
        // Periodically log queue statistics
        const stats = await crawler.requestQueue.getInfo();
        if (stats.handledRequestCount % 100 === 0) {
            console.log(`Progress: ${stats.handledRequestCount} handled, ${stats.pendingRequestCount} pending`);
        }
    },
});
4. Handle Failed Requests Gracefully
Configure retry policies and failure handling:
const crawler = new PlaywrightCrawler({
    maxRequestRetries: 3,
    requestHandlerTimeoutSecs: 60,
    // The error is passed as the second argument, after the crawling context
    async failedRequestHandler({ request }, error) {
        console.error(`Request ${request.url} failed: ${error.message}`);
        // Log failed URLs for later processing (logFailedRequest is sketched below)
        await logFailedRequest(request);
    },
});
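The logFailedRequest call above is likewise a placeholder; one way to implement it is to record failures in a named Dataset so they can be inspected or re-queued later:
import { Dataset } from 'crawlee';
// Hypothetical helper: store failed requests in a dedicated dataset
async function logFailedRequest(request) {
    const failedDataset = await Dataset.open('failed-requests');
    await failedDataset.pushData({
        url: request.url,
        label: request.label,
        retryCount: request.retryCount,
        errorMessages: request.errorMessages,
    });
}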
Integration with Browser Automation
When working with browser-based crawlers, queue management often involves coordination with page navigation and session handling. Just as you need to understand how to handle browser sessions in Puppeteer, Crawlee provides similar mechanisms for managing browser contexts across queued requests:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    async requestHandler({ request, page, session }) {
        console.log(`Processing ${request.url} with session ${session.id}`);
        // Session state is maintained across requests
        if (request.label === 'login') {
            await page.fill('#username', 'user');
            await page.fill('#password', 'pass');
            await page.click('button[type="submit"]');
            // Subsequent requests will use the authenticated session
            await crawler.addRequests([
                { url: 'https://example.com/dashboard', label: 'authenticated' },
            ]);
        }
    },
});
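If a site starts blocking, retiring the session ensures retried and subsequent requests draw a fresh session from the pool; a minimal sketch keyed off the response status:
const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    async requestHandler({ request, response, session }) {
        if (response?.status() === 403) {
            // Drop this session; the retried request will get a new one from the pool
            session.retire();
            throw new Error(`Blocked on ${request.url}, retiring session`);
        }
    },
});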
Exporting and Importing Queue State
For debugging or migration purposes, you can export queue state:
import fs from 'node:fs/promises';
import { RequestQueue } from 'crawlee';
const queue = await RequestQueue.open('my-queue');
// List pending requests at the head of the queue via the storage client
const { items } = await queue.client.listHead({ limit: 1000 });
console.log(`Exporting ${items.length} requests`);
// Head items are lightweight, so fetch each full request for its label and userData
const exportData = [];
for (const item of items) {
    const request = await queue.getRequest(item.id);
    if (!request) continue;
    exportData.push({
        url: request.url,
        label: request.label,
        userData: request.userData,
    });
}
// Export to JSON
await fs.writeFile('queue-export.json', JSON.stringify(exportData, null, 2));
// Import from JSON
const importData = JSON.parse(await fs.readFile('queue-export.json', 'utf8'));
await queue.addRequestsBatched(importData);
Conclusion
Effective request queue management is crucial for building robust web scrapers with Crawlee. By understanding the fundamentals of queue operations, implementing proper filtering and prioritization, and following best practices for depth limiting and error handling, you can create scalable crawlers that efficiently process thousands or millions of URLs.
The key takeaways are:
- Use request labels to organize complex crawling workflows
- Implement depth limiting to prevent infinite crawls
- Use the forefront option to process important requests first
- Monitor queue statistics to track progress
- Configure appropriate retry policies for reliability
- Use named queues for managing multiple crawling tasks
With these techniques, you'll be well-equipped to handle any web scraping scenario, from simple single-page extractions to complex multi-stage crawling operations across entire websites.