How do I manage request queues in Crawlee?
Request queues are a fundamental component of Crawlee's architecture, enabling efficient management of URLs to be crawled. Understanding how to properly manage request queues is essential for building scalable and reliable web scrapers. This guide covers everything from basic queue operations to advanced patterns for handling complex scraping scenarios.
What is a Request Queue in Crawlee?
A request queue in Crawlee is a data structure that stores and manages URLs (requests) to be processed by your crawler. It handles deduplication, persistence, and provides methods for adding, retrieving, and marking requests as processed. Crawlee automatically manages the queue lifecycle, ensuring that each URL is processed only once and handling retries for failed requests.
The request queue provides several key benefits:
- Automatic deduplication: Prevents processing the same URL multiple times (see the short sketch after this list)
- Persistence: Stores requests on disk or in cloud storage for resumable crawls
- Prioritization: Lets you push important requests to the front of the queue
- Concurrency management: Coordinates parallel request processing
- Retry handling: Automatically retries failed requests with configurable policies
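As a minimal sketch of the deduplication behaviour (assuming the default uniqueKey, which Crawlee derives from the normalized URL), adding the same URL twice only enqueues it once:
import { RequestQueue } from 'crawlee';
const queue = await RequestQueue.open();
// Deduplication is keyed on the request's uniqueKey, derived from the URL by default
const first = await queue.addRequest({ url: 'https://example.com/page' });
const second = await queue.addRequest({ url: 'https://example.com/page' });
console.log(first.wasAlreadyPresent);  // false
console.log(second.wasAlreadyPresent); // true – the duplicate was not enqueued again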
Basic Request Queue Usage
Creating and Adding Requests
Crawlee automatically creates a default request queue when you instantiate a crawler. Here's how to add requests to the queue:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        console.log(`Processing: ${request.url}`);
        // Add new requests to the queue
        await enqueueLinks({
            selector: 'a[href]',
            strategy: 'same-domain',
        });
    },
});
// Add initial URLs to the queue
await crawler.addRequests([
    'https://example.com',
    'https://example.com/products',
    { url: 'https://example.com/about', label: 'about-page' },
]);
await crawler.run();
Python Implementation
Crawlee for Python follows the same pattern:
from crawlee import Request
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def request_handler(context: PlaywrightCrawlingContext) -> None:
    print(f'Processing: {context.request.url}')
    # Add new requests to the queue
    await context.enqueue_links(
        selector='a[href]',
        strategy='same-domain',
    )

crawler = PlaywrightCrawler(request_handler=request_handler)

# Add initial URLs; use Request.from_url() to attach a label
await crawler.add_requests([
    'https://example.com',
    'https://example.com/products',
    Request.from_url('https://example.com/about', label='about-page'),
])

await crawler.run()
Advanced Queue Management Techniques
Using Request Labels for Routing
Labels allow you to categorize requests and apply different processing logic based on the request type:
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        if (request.label === 'category') {
            // Process category page
            console.log(`Processing category: ${request.url}`);
            // Enqueue product links
            await enqueueLinks({
                selector: '.product-link',
                label: 'product',
            });
        } else if (request.label === 'product') {
            // Process product page
            const title = $('h1.product-title').text();
            const price = $('.product-price').text();
            console.log({ title, price });
        }
    },
});
await crawler.addRequests([
    { url: 'https://example.com/category/electronics', label: 'category' },
    { url: 'https://example.com/category/books', label: 'category' },
]);
await crawler.run();
Custom Request Filtering
You can implement custom logic to filter which requests should be added to the queue:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        await enqueueLinks({
            // Only enqueue links matching specific patterns
            transformRequestFunction: (req) => {
                // Skip static resources
                if (req.url.includes('/static/') || req.url.includes('/assets/')) {
                    return false;
                }
                // Add custom data to the request
                req.userData = {
                    depth: (request.userData.depth || 0) + 1,
                    parentUrl: request.url,
                };
                // Skip if depth exceeds limit
                if (req.userData.depth > 3) {
                    return false;
                }
                return req;
            },
        });
    },
});
await crawler.run(['https://example.com']);
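For straightforward URL filtering, enqueueLinks also accepts globs and exclude patterns, which can often replace a custom transform function; a brief sketch (inside the requestHandler, with illustrative patterns):
await enqueueLinks({
    // Only follow product URLs, and skip static assets entirely
    globs: ['https://example.com/products/**'],
    exclude: ['**/static/**', '**/assets/**'],
});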
Prioritizing Requests
Crawlee requests do not carry a numeric priority value. Instead, the forefront option places a request at the head of the queue so it is processed before anything already waiting:
import { RequestQueue } from 'crawlee';
const requestQueue = await RequestQueue.open();
// forefront: true puts the request at the head of the queue
await requestQueue.addRequest(
    { url: 'https://example.com/important-page' },
    { forefront: true },
);
// Without forefront, requests are appended to the tail and processed in order
await requestQueue.addRequest({ url: 'https://example.com/normal-page' });
Working with RequestQueue Directly
For advanced use cases, you can interact with the request queue directly:
import { RequestQueue } from 'crawlee';
// Get or create a named queue
const requestQueue = await RequestQueue.open('my-queue');
// Add a single request
await requestQueue.addRequest({
    url: 'https://example.com',
    userData: { category: 'electronics' },
});
// Add multiple requests in batch
await requestQueue.addRequestsBatched([
    { url: 'https://example.com/page1' },
    { url: 'https://example.com/page2' },
    { url: 'https://example.com/page3' },
]);
// Check if queue is empty
const isEmpty = await requestQueue.isEmpty();
console.log(`Queue is empty: ${isEmpty}`);
// Get queue info
const info = await requestQueue.getInfo();
console.log(`Pending requests: ${info.pendingRequestCount}`);
console.log(`Handled requests: ${info.handledRequestCount}`);
// Fetch a request to process
const request = await requestQueue.fetchNextRequest();
if (request) {
    console.log(`Processing: ${request.url}`);
    // Mark request as handled
    await requestQueue.markRequestHandled(request);
}
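If your own processing of a fetched request fails, it should not be marked as handled; you can return it to the queue with reclaimRequest so it gets retried. A short sketch (the forefront option is optional):
const next = await requestQueue.fetchNextRequest();
if (next) {
    try {
        // ... process the request ...
        await requestQueue.markRequestHandled(next);
    } catch (error) {
        // Put the request back; forefront: true retries it before the rest of the queue
        await requestQueue.reclaimRequest(next, { forefront: true });
    }
}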
Request Queue Persistence and Storage
Crawlee stores request queues persistently, allowing you to resume interrupted crawls:
import { PlaywrightCrawler, Configuration } from 'crawlee';
// The storage location itself is controlled by the CRAWLEE_STORAGE_DIR
// environment variable (./storage by default). Keep the stored data and
// skip the default purge at startup so an interrupted run can resume.
const config = new Configuration({
    persistStorage: true,
    purgeOnStart: false,
});
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page }) => {
        console.log(`Processing: ${request.url}`);
        // Your scraping logic
    },
}, config);
// With purging disabled, this resumes from where it left off if interrupted
await crawler.run([
    'https://example.com',
]);
Using Named Queues
Named queues allow you to maintain multiple independent queues:
import { CheerioCrawler, RequestQueue } from 'crawlee';
// Create separate queues for different tasks
const highPriorityQueue = await RequestQueue.open('high-priority');
const lowPriorityQueue = await RequestQueue.open('low-priority');
// Add requests to specific queues
await highPriorityQueue.addRequest({
    url: 'https://example.com/urgent',
});
await lowPriorityQueue.addRequest({
    url: 'https://example.com/batch-job',
});
// Use a specific queue with a crawler
const crawler = new CheerioCrawler({
    requestQueue: highPriorityQueue,
    requestHandler: async ({ request, $ }) => {
        console.log(`Processing high priority: ${request.url}`);
    },
});
await crawler.run();
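Unlike the default queue, named queues are kept between runs rather than purged at startup; once a queue's work is finished you can remove it explicitly, as in this short sketch:
// Delete the named queue and all of its stored requests when it is no longer needed
await lowPriorityQueue.drop();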
Handling Dynamic URL Discovery
When crawling sites with dynamically generated links, proper queue management becomes crucial. Similar to how you might handle AJAX requests using Puppeteer to wait for dynamic content, Crawlee provides built-in mechanisms to discover and queue URLs that appear after page interactions:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        // Wait for initial content
        await page.waitForSelector('.product-list', { timeout: 5000 });
        // Scroll to load more content
        let previousHeight = 0;
        let currentHeight = await page.evaluate(() => document.body.scrollHeight);
        while (previousHeight !== currentHeight) {
            previousHeight = currentHeight;
            await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
            await page.waitForTimeout(2000);
            currentHeight = await page.evaluate(() => document.body.scrollHeight);
        }
        // Enqueue all discovered links
        await enqueueLinks({
            selector: 'a.product-link',
            strategy: 'same-domain',
        });
    },
});
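Crawlee also bundles an infiniteScroll helper for Playwright (exposed via playwrightUtils) that can stand in for the manual scroll loop; a minimal sketch using its default options:
import { PlaywrightCrawler, playwrightUtils } from 'crawlee';
const crawler = new PlaywrightCrawler({
    async requestHandler({ page, enqueueLinks }) {
        await page.waitForSelector('.product-list', { timeout: 5000 });
        // Keeps scrolling until no new content appears
        await playwrightUtils.infiniteScroll(page);
        await enqueueLinks({
            selector: 'a.product-link',
            strategy: 'same-domain',
        });
    },
});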
Best Practices for Queue Management
1. Use Request Labels for Complex Workflows
Organize your crawling logic by categorizing requests with labels:
const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        switch (request.label) {
            case 'START':
                await enqueueLinks({
                    selector: '.category-link',
                    label: 'CATEGORY',
                });
                break;
            case 'CATEGORY':
                await enqueueLinks({
                    selector: '.product-link',
                    label: 'PRODUCT',
                });
                break;
            case 'PRODUCT':
                // Extract and store product data (saveProductData is sketched below)
                await saveProductData($);
                break;
        }
    },
});
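The saveProductData call above is a placeholder for your own storage logic; one possible sketch pushes the extracted fields into Crawlee's default Dataset (the selectors are illustrative):
import { Dataset } from 'crawlee';
// Hypothetical helper: extract fields with Cheerio and store them in the default Dataset
async function saveProductData($) {
    await Dataset.pushData({
        title: $('h1.product-title').text().trim(),
        price: $('.product-price').text().trim(),
    });
}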
2. Implement Depth Limiting
Prevent infinite crawling by tracking and limiting crawl depth:
const MAX_DEPTH = 3;
const crawler = new PlaywrightCrawler({
    async requestHandler({ request, enqueueLinks }) {
        const currentDepth = request.userData.depth || 0;
        if (currentDepth < MAX_DEPTH) {
            await enqueueLinks({
                transformRequestFunction: (req) => {
                    req.userData = { depth: currentDepth + 1 };
                    return req;
                },
            });
        }
    },
});
// Initialize with depth 0
await crawler.addRequests([
    { url: 'https://example.com', userData: { depth: 0 } },
]);
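As an extra safety net alongside depth tracking, crawlers accept a maxRequestsPerCrawl option that hard-caps the total number of processed requests; a short sketch:
const crawler = new PlaywrightCrawler({
    // Stop after 500 requests, no matter how many URLs were discovered
    maxRequestsPerCrawl: 500,
    async requestHandler({ request, enqueueLinks }) {
        await enqueueLinks();
    },
});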
3. Monitor Queue Statistics
Keep track of queue performance and progress:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
    async requestHandler({ request, crawler }) {
        // Your scraping logic
        // Periodically log queue statistics
        const stats = await crawler.requestQueue.getInfo();
        if (stats.handledRequestCount % 100 === 0) {
            console.log(`Progress: ${stats.handledRequestCount} handled, ${stats.pendingRequestCount} pending`);
        }
    },
});
4. Handle Failed Requests Gracefully
Configure retry policies and failure handling:
const crawler = new PlaywrightCrawler({
    maxRequestRetries: 3,
    requestHandlerTimeoutSecs: 60,
    // The error is passed as the second argument, after the crawling context
    async failedRequestHandler({ request }, error) {
        console.error(`Request ${request.url} failed: ${error.message}`);
        // Log failed URLs for later processing (logFailedRequest is sketched below)
        await logFailedRequest(request);
    },
});
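The logFailedRequest call above is likewise a placeholder; one way to implement it is to record failures in a named Dataset so they can be inspected or re-queued later:
import { Dataset } from 'crawlee';
// Hypothetical helper: store failed requests in a dedicated dataset
async function logFailedRequest(request) {
    const failedDataset = await Dataset.open('failed-requests');
    await failedDataset.pushData({
        url: request.url,
        label: request.label,
        retryCount: request.retryCount,
        errorMessages: request.errorMessages,
    });
}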
Integration with Browser Automation
When working with browser-based crawlers, queue management often involves coordination with page navigation and session handling. Just as you need to understand how to handle browser sessions in Puppeteer, Crawlee provides similar mechanisms for managing browser contexts across queued requests:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    async requestHandler({ request, page, session }) {
        console.log(`Processing ${request.url} with session ${session.id}`);
        // Session state is maintained across requests
        if (request.label === 'login') {
            await page.fill('#username', 'user');
            await page.fill('#password', 'pass');
            await page.click('button[type="submit"]');
            // Subsequent requests will use the authenticated session
            await crawler.addRequests([
                { url: 'https://example.com/dashboard', label: 'authenticated' },
            ]);
        }
    },
});
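If a site starts blocking, retiring the session ensures retried and subsequent requests draw a fresh session from the pool; a minimal sketch keyed off the response status:
const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    async requestHandler({ request, response, session }) {
        if (response?.status() === 403) {
            // Drop this session; the retried request will get a new one from the pool
            session.retire();
            throw new Error(`Blocked on ${request.url}, retiring session`);
        }
    },
});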
Exporting and Importing Queue State
For debugging or migration purposes, you can export queue state:
import fs from 'node:fs/promises';
import { RequestQueue } from 'crawlee';
const queue = await RequestQueue.open('my-queue');
// List pending requests at the head of the queue via the storage client
const { items } = await queue.client.listHead({ limit: 1000 });
console.log(`Exporting ${items.length} requests`);
// Head items are lightweight, so fetch each full request for its label and userData
const exportData = [];
for (const item of items) {
    const request = await queue.getRequest(item.id);
    if (!request) continue;
    exportData.push({
        url: request.url,
        label: request.label,
        userData: request.userData,
    });
}
// Export to JSON
await fs.writeFile('queue-export.json', JSON.stringify(exportData, null, 2));
// Import from JSON
const importData = JSON.parse(await fs.readFile('queue-export.json', 'utf8'));
await queue.addRequestsBatched(importData);
Conclusion
Effective request queue management is crucial for building robust web scrapers with Crawlee. By understanding the fundamentals of queue operations, implementing proper filtering and prioritization, and following best practices for depth limiting and error handling, you can create scalable crawlers that efficiently process thousands or millions of URLs.
The key takeaways are:
- Use request labels to organize complex crawling workflows
- Implement depth limiting to prevent infinite crawls
- Use the forefront option to process important requests first
- Monitor queue statistics to track progress
- Configure appropriate retry policies for reliability
- Use named queues for managing multiple crawling tasks
With these techniques, you'll be well-equipped to handle any web scraping scenario, from simple single-page extractions to complex multi-stage crawling operations across entire websites.