How do I optimize Crawlee performance for faster scraping?
Optimizing Crawlee performance is crucial for large-scale web scraping projects. By fine-tuning concurrency settings, managing resources efficiently, and implementing smart crawling strategies, you can significantly improve scraping speed while maintaining reliability. This guide covers essential techniques to maximize Crawlee's performance.
Understanding Crawlee's Performance Architecture
Crawlee is built with performance in mind, featuring automatic scaling, request queue management, and intelligent retry mechanisms. However, achieving optimal performance requires understanding how to configure these features for your specific use case.
The main performance factors include:
- Concurrency: Number of parallel requests
- Request queue management: How URLs are prioritized and processed
- Resource utilization: Memory and CPU usage
- Network efficiency: Request throttling and retry strategies
- Data storage: How scraped data is persisted
Optimizing Concurrency Settings
Concurrency is the most impactful factor for scraping speed. Crawlee allows you to control how many requests run simultaneously.
Basic Concurrency Configuration
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
maxConcurrency: 50, // Maximum concurrent requests
minConcurrency: 10, // Minimum concurrent requests
requestHandlerTimeoutSecs: 60,
async requestHandler({ request, $, log }) {
// Your scraping logic
const title = $('h1').text();
log.info(`Title: ${title}`);
},
});
await crawler.run(['https://example.com']);
Dynamic Concurrency with Autoscaling
Crawlee's autoscaling feature automatically adjusts concurrency based on system resources:
import { PuppeteerCrawler, Configuration } from 'crawlee';
const crawler = new PuppeteerCrawler({
maxConcurrency: 100,
minConcurrency: 5,
// Autoscaling configuration
autoscaledPoolOptions: {
    desiredConcurrency: 50, // Starting target; Crawlee scales between min and max
    // Scale down when memory usage exceeds 80% of the available limit.
    // The CPU threshold is a global Configuration option (maxUsedCpuRatio,
    // or the CRAWLEE_MAX_USED_CPU_RATIO environment variable).
    snapshotterOptions: {
        maxUsedMemoryRatio: 0.8,
    },
},
async requestHandler({ page, request, log }) {
await page.waitForSelector('h1');
const title = await page.title();
log.info(`Scraped: ${title}`);
},
});
await crawler.run(['https://example.com']);
Choosing the Right Crawler Type
Different crawler types have different performance characteristics:
CheerioCrawler - Fastest Option
For static websites, CheerioCrawler is the fastest option because it doesn't require a browser:
import { CheerioCrawler, Dataset } from 'crawlee';
const crawler = new CheerioCrawler({
maxConcurrency: 100, // Can handle very high concurrency
async requestHandler({ request, $, enqueueLinks }) {
const data = {
title: $('h1').first().text(),
description: $('meta[name="description"]').attr('content'),
};
await Dataset.pushData(data); // Store the extracted record
await enqueueLinks({
    globs: ['https://example.com/products/*'],
});
},
});
Performance tip: Use CheerioCrawler whenever possible; because it skips browser rendering entirely, it is typically 10-20x faster than browser-based crawlers.
PuppeteerCrawler - For Dynamic Content
When you need to handle AJAX requests or JavaScript-rendered content:
import { PuppeteerCrawler, Dataset } from 'crawlee';
const crawler = new PuppeteerCrawler({
maxConcurrency: 20, // Lower concurrency for browser-based scraping
launchContext: {
launchOptions: {
headless: true,
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
'--disable-accelerated-2d-canvas',
'--disable-gpu',
],
},
},
async requestHandler({ page, request }) {
// Wait for dynamic content
await page.waitForSelector('.product-list', { timeout: 10000 });
const products = await page.$$eval('.product', elements =>
elements.map(el => ({
name: el.querySelector('.name')?.textContent,
price: el.querySelector('.price')?.textContent,
}))
);
await Dataset.pushData(products); // Persist the results instead of discarding them
},
});
Request Queue Optimization
Efficient request queue management prevents bottlenecks and ensures smooth crawling.
Using RequestList for Known URLs
When you have a predefined list of URLs, use RequestList for better performance:
import { CheerioCrawler, RequestList } from 'crawlee';
// Prepare request list
const requestList = await RequestList.open('my-list', [
'https://example.com/page1',
'https://example.com/page2',
// ... thousands of URLs
]);
const crawler = new CheerioCrawler({
requestList,
maxConcurrency: 50,
async requestHandler({ request, $ }) {
// Your scraping logic
},
});
await crawler.run();
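If the crawl also discovers new links as it goes, a RequestList can be combined with a RequestQueue: the crawler drains the list into the queue and processes both. A minimal sketch, assuming seedUrls holds your predefined URLs:
import { CheerioCrawler, RequestList, RequestQueue } from 'crawlee';
const seedUrls = ['https://example.com/page1', 'https://example.com/page2'];
const requestList = await RequestList.open('seed-list', seedUrls);
const requestQueue = await RequestQueue.open();
const crawler = new CheerioCrawler({
    requestList, // Static seed URLs, loaded up front
    requestQueue, // Newly discovered links land here
    async requestHandler({ $, enqueueLinks }) {
        // By default, enqueueLinks() follows same-hostname links
        await enqueueLinks();
    },
});
await crawler.run();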
Request Queue Priority
Crawlee's request queue has no numeric priority field, but you can push important URLs to the front of the queue with the forefront option:
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
    async requestHandler({ request, $, crawler }) {
        // forefront: true places these requests at the front of the queue
        await crawler.addRequests(
            [{ url: 'https://example.com/high-priority' }],
            { forefront: true },
        );
        // Without it, requests are appended to the end of the queue
        await crawler.addRequests([
            { url: 'https://example.com/low-priority' },
        ]);
    },
});
Memory Management and Resource Optimization
Proper resource management prevents memory leaks and ensures stable long-running scrapers.
Configuring Storage Options
import { Configuration, PuppeteerCrawler } from 'crawlee';
const config = new Configuration({
    persistStorage: true, // Persist request queues, datasets, and key-value stores to disk
    purgeOnStart: false, // Keep data between runs
});
// The storage directory itself is set with the CRAWLEE_STORAGE_DIR environment
// variable (it defaults to ./storage).
const crawler = new PuppeteerCrawler({
maxRequestsPerCrawl: 10000, // Limit total requests
maxRequestsPerMinute: 120, // Rate limiting
// Recycle browsers and cap open pages to keep memory in check
browserPoolOptions: {
    maxOpenPagesPerBrowser: 10,
    retireBrowserAfterPageCount: 100,
},
async requestHandler({ page, request }) {
// Scraping logic
},
}, config);
Efficient Data Storage
Use datasets efficiently to avoid memory issues:
import { CheerioCrawler, Dataset } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $ }) {
const data = {
url: request.url,
title: $('h1').text(),
timestamp: new Date().toISOString(),
};
// Push data immediately instead of accumulating in memory
await Dataset.pushData(data);
},
});
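Named datasets keep separate crawls from mixing their results, and the whole dataset can be exported in one pass at the end. A sketch, assuming the hypothetical dataset name products-2024 and a Crawlee version that provides the exportToCSV helper:
import { Dataset } from 'crawlee';
// Open a named dataset so separate crawls don't mix their results
const dataset = await Dataset.open('products-2024');
await dataset.pushData({ url: 'https://example.com', title: 'Example' });
// Write the full dataset to the default key-value store as CSV
await dataset.exportToCSV('products-2024');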
Network Optimization Strategies
Request Throttling
Implement smart throttling to avoid overwhelming target servers:
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
maxRequestsPerMinute: 60, // Limit to 60 requests per minute
maxRequestRetries: 3, // Retry failed requests up to 3 times
requestHandlerTimeoutSecs: 60,
async requestHandler({ request, $ }) {
// Your scraping logic
},
});
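When crawling many domains at once, a per-domain delay is often gentler on each server than a global cap. Newer Crawlee versions (3.3+) expose sameDomainDelaySecs for this; a sketch, assuming your version supports the option:
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
    maxConcurrency: 50,
    // Wait at least 2 seconds between requests to the same domain,
    // while still crawling different domains in parallel
    sameDomainDelaySecs: 2,
    async requestHandler({ request, $ }) {
        // Your scraping logic
    },
});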
Session Management
Use session management for better performance with authenticated sites:
import { PuppeteerCrawler } from 'crawlee';
const crawler = new PuppeteerCrawler({
useSessionPool: true,
sessionPoolOptions: {
maxPoolSize: 20,
sessionOptions: {
maxUsageCount: 50, // Reuse sessions for multiple requests
},
},
async requestHandler({ page, request, session, log }) {
    // Crawlee has already navigated to request.url;
    // session cookies are attached and rotated automatically
    log.info(`Scraped ${request.url} with session ${session?.id}`);
},
});
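Sessions also help you recover from blocking: retiring a session discards its cookies so subsequent requests start fresh. A minimal sketch using session.retire(), assuming the target signals blocking with HTTP 401/403/429:
import { PuppeteerCrawler } from 'crawlee';
const crawler = new PuppeteerCrawler({
    useSessionPool: true,
    async requestHandler({ page, response, session, log }) {
        if (response && [401, 403, 429].includes(response.status())) {
            session?.retire(); // Drop this session's cookies
            throw new Error(`Blocked with status ${response.status()}`); // Triggers a retry
        }
        log.info(`OK: ${await page.title()}`);
    },
});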
Advanced Performance Techniques
Disable Unnecessary Resources
For browser-based scrapers, block images, fonts, and other resources:
import { PuppeteerCrawler } from 'crawlee';
const crawler = new PuppeteerCrawler({
preNavigationHooks: [
async ({ page, request }) => {
await page.setRequestInterception(true);
page.on('request', (req) => {
const resourceType = req.resourceType();
// Block images, fonts, and stylesheets
if (['image', 'font', 'stylesheet'].includes(resourceType)) {
req.abort();
} else {
req.continue();
}
});
},
],
async requestHandler({ page }) {
// Scraping logic - page loads faster without images
},
});
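Crawlee also ships a blockRequests helper on the browser crawling context that covers common static assets with less code. A sketch, assuming the default URL patterns fit your target site:
import { PuppeteerCrawler } from 'crawlee';
const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ blockRequests }) => {
            // Blocks common static assets by URL pattern,
            // plus any extra patterns you pass in
            await blockRequests({ extraUrlPatterns: ['adsbygoogle.js'] });
        },
    ],
    async requestHandler({ page }) {
        // Scraping logic
    },
});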
Parallel Processing with Multiple Crawlers
For very large projects, run multiple crawlers in parallel:
import { CheerioCrawler, RequestQueue } from 'crawlee';
async function crawlCategory(category, urls) {
    // Give each crawler its own named queue; otherwise parallel crawlers
    // in the same process would all share the default request queue
    const requestQueue = await RequestQueue.open(`queue-${category}`);
    const crawler = new CheerioCrawler({
        requestQueue,
        maxConcurrency: 30,
        async requestHandler({ request, $ }) {
            // Category-specific scraping logic
        },
    });
    await crawler.run(urls);
}
// Run multiple crawlers in parallel
await Promise.all([
crawlCategory('electronics', electronicsUrls),
crawlCategory('books', booksUrls),
crawlCategory('clothing', clothingUrls),
]);
Monitoring and Benchmarking
Track performance metrics to identify bottlenecks:
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $, log }) {
const startTime = Date.now();
// Your scraping logic
const title = $('h1').text();
const duration = Date.now() - startTime;
log.info(`Processed ${request.url} in ${duration}ms`);
},
// The error is passed as the second argument to failedRequestHandler
async failedRequestHandler({ request, log }, error) {
    log.error(`Request failed: ${request.url}`, { error: error.message });
},
});
// crawler.run() resolves with the final statistics once the crawl finishes
const stats = await crawler.run(['https://example.com']);
console.log(`Total requests: ${stats.requestsFinished}`);
console.log(`Failed requests: ${stats.requestsFailed}`);
Performance Checklist
To maximize Crawlee performance:
- Use CheerioCrawler for static content when possible
- Tune concurrency based on your system resources and target site
- Enable autoscaling to automatically adjust concurrency
- Implement request throttling to avoid rate limiting
- Use RequestList for known URLs instead of adding them one by one
- Manage sessions efficiently for authenticated scraping
- Disable unnecessary resources (images, fonts) for browser-based scraping
- Limit maxRequestsPerCrawl for memory-constrained environments
- Push data immediately to datasets instead of accumulating in memory
- Monitor performance metrics to identify and resolve bottlenecks
Conclusion
Optimizing Crawlee performance requires a balanced approach between speed and reliability. Start with conservative concurrency settings and gradually increase them while monitoring system resources. Choose the appropriate crawler type for your use case, implement efficient resource management, and use advanced techniques like request filtering and parallel processing for maximum performance.
By following these optimization strategies, you can build fast, reliable, and scalable web scrapers that efficiently handle large-scale data extraction projects.