What Are the Best Practices for Web Scraping with Crawlee?
Crawlee is a powerful web scraping and browser automation library for Node.js and Python that provides built-in features for robust, scalable scraping. Following best practices ensures your scrapers are efficient, maintainable, and respectful of target websites. This guide covers essential practices for building production-ready Crawlee scrapers.
1. Choose the Right Crawler Type
Crawlee offers multiple crawler classes, each optimized for different scenarios:
CheerioCrawler for Static Content
Use CheerioCrawler for fast scraping of static HTML pages. It's the most efficient option when JavaScript execution isn't required.
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
requestHandler: async ({ request, $, enqueueLinks, pushData }) => {
const title = $('h1').text();
const description = $('.description').text();
await enqueueLinks({
globs: ['https://example.com/products/*'],
});
// Values returned from the handler are ignored; push the data instead
await pushData({ url: request.url, title, description });
},
});
await crawler.run(['https://example.com']);
PuppeteerCrawler or PlaywrightCrawler for Dynamic Content
When dealing with JavaScript-rendered content, single-page applications, or pages requiring browser automation, use PuppeteerCrawler or PlaywrightCrawler.
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
requestHandler: async ({ page, request, enqueueLinks, pushData }) => {
// Wait for dynamic content to load
await page.waitForSelector('.product-list');
const products = await page.$$eval('.product', items =>
items.map(item => ({
name: item.querySelector('.name')?.textContent,
price: item.querySelector('.price')?.textContent,
}))
);
await enqueueLinks({
selector: '.pagination a',
});
// Store the results instead of returning them
await pushData({ url: request.url, products });
},
headless: true,
maxRequestRetries: 3,
});
await crawler.run(['https://example.com/products']);
2. Implement Proper Error Handling
Robust error handling prevents scraper crashes and ensures data consistency.
Use Request Error Handlers
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
  requestHandler: async ({ page, request, log }) => {
    try {
      // Crawlee has already navigated the page before the handler runs,
      // so there is no need to call page.goto() again.
      await page.waitForLoadState('networkidle');
      // Scraping logic here
    } catch (error) {
      log.error(`Failed to process ${request.url}`, { error });
      throw error; // Let Crawlee handle retries
    }
  },
},
failedRequestHandler: async ({ request, log }) => {
log.error(`Request ${request.url} failed after ${request.retryCount} retries`);
// Store failed URLs for later review
},
maxRequestRetries: 3,
maxRequestsPerMinute: 60,
});
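The failedRequestHandler above only logs the failure. If you also want to keep failed URLs for later review, one option is to push them into a dedicated dataset; a minimal sketch (the 'failed-requests' name is just a choice):
import { PlaywrightCrawler, Dataset } from 'crawlee';

const failedRequests = await Dataset.open('failed-requests');

const crawler = new PlaywrightCrawler({
  requestHandler: async ({ page, request }) => {
    // Scraping logic
  },
  failedRequestHandler: async ({ request }, error) => {
    // Record enough context to retry or debug the URL later
    await failedRequests.pushData({
      url: request.url,
      retryCount: request.retryCount,
      errorMessage: error.message,
    });
  },
  maxRequestRetries: 3,
});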
Handle Timeouts Gracefully
Similar to handling timeouts in Puppeteer, configure appropriate timeout values. In Crawlee, navigation is handled for you, so set timeouts on the crawler rather than calling page.goto() yourself:
const crawler = new PlaywrightCrawler({
  navigationTimeoutSecs: 30, // Timeout for the built-in page navigation
  requestHandlerTimeoutSecs: 60, // Timeout for the whole request handler
  preNavigationHooks: [
    async (_crawlingContext, gotoOptions) => {
      gotoOptions.waitUntil = 'domcontentloaded'; // Don't wait for every subresource
    },
  ],
  requestHandler: async ({ page }) => {
    await page.waitForSelector('.content', {
      timeout: 10000, // Fail fast if the expected content never appears
    });
  },
});
3. Optimize Performance and Concurrency
Configure Concurrency Settings
const crawler = new CheerioCrawler({
maxConcurrency: 10, // Run up to 10 requests simultaneously
minConcurrency: 2, // Keep at least 2 workers active
maxRequestsPerMinute: 120, // Respect rate limits
requestHandler: async ({ request, $ }) => {
// Scraping logic
},
});
Use AutoscaledPool for Dynamic Scaling
Crawlee's autoscaling automatically adjusts concurrency based on system resources:
const crawler = new PlaywrightCrawler({
  autoscaledPoolOptions: {
    minConcurrency: 1,
    maxConcurrency: 50,
    desiredConcurrency: 10, // Starting point; Crawlee scales up or down from here
    // based on measured CPU, memory, and event-loop load. Finer thresholds can be
    // tuned via snapshotterOptions and systemStatusOptions if you need them.
  },
});
4. Manage Request Queues Effectively
Use Request Queue for Persistent Storage
import { PlaywrightCrawler, RequestQueue } from 'crawlee';
const requestQueue = await RequestQueue.open('my-queue');
// Add initial URLs
await requestQueue.addRequest({ url: 'https://example.com' });
const crawler = new PlaywrightCrawler({
  requestQueue, // The crawler reads from and writes to the named queue
  requestHandler: async ({ enqueueLinks }) => {
    await enqueueLinks({
      globs: ['https://example.com/category/*'], // New links land in the same queue
    });
  },
});
await crawler.run();
Filter and Transform Requests
const crawler = new CheerioCrawler({
  requestHandler: async ({ request, enqueueLinks }) => {
    await enqueueLinks({
      globs: ['https://example.com/**'],
      // Filter out unwanted URLs and tag the interesting ones
      transformRequestFunction: (req) => {
        if (req.url.includes('login') || req.url.includes('signup')) {
          return false; // Skip these URLs
        }
        // Label product pages so they can be routed to a dedicated handler
        if (req.url.includes('/product/')) {
          req.userData = { ...req.userData, label: 'PRODUCT' };
        }
        return req;
      },
    });
  },
});
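Once requests carry labels, Crawlee's router can send them to dedicated handlers instead of one large requestHandler. A minimal sketch (the 'PRODUCT' label and selectors are illustrative):
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// Unlabelled requests (e.g. category pages) fall through to the default handler
router.addDefaultHandler(async ({ enqueueLinks }) => {
  await enqueueLinks({
    globs: ['https://example.com/product/**'],
    label: 'PRODUCT', // Routed to the handler below
  });
});

// Requests labelled PRODUCT get their own scraping logic
router.addHandler('PRODUCT', async ({ request, $, pushData }) => {
  await pushData({
    url: request.url,
    title: $('h1').text(),
  });
});

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run(['https://example.com']);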
5. Implement Session and Proxy Management
Use Session Pools for Authentication
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
  // The crawler creates and rotates the session pool for you;
  // sessionPoolOptions takes a plain options object, not a SessionPool instance.
  useSessionPool: true,
  persistCookiesPerSession: true, // Keep cookies tied to each session
  sessionPoolOptions: {
    maxPoolSize: 10,
    sessionOptions: {
      maxUsageCount: 50, // Retire a session after 50 uses
      maxErrorScore: 3, // Retire a session after 3 errors
    },
  },
  requestHandler: async ({ page, session }) => {
    // Session cookies are automatically managed
    const isLoggedIn = await page.$('.user-menu');
    if (!isLoggedIn) {
      session.retire(); // Mark the session as invalid so it isn't reused
    }
  },
});
Configure Proxy Rotation
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
proxyUrls: [
'http://proxy1.example.com:8000',
'http://proxy2.example.com:8000',
],
});
const crawler = new PlaywrightCrawler({
proxyConfiguration,
requestHandler: async ({ page, proxyInfo }) => {
console.log(`Using proxy: ${proxyInfo.url}`);
// Scraping logic
},
});
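If your proxy list lives outside the code, for example in an environment variable, ProxyConfiguration also accepts a newUrlFunction. A minimal sketch, assuming a comma-separated PROXY_URLS variable (a hypothetical name):
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Hypothetical setup: proxy URLs supplied as a comma-separated environment variable
const proxyUrls = (process.env.PROXY_URLS ?? '').split(',').filter(Boolean);

let counter = 0;
const proxyConfiguration = new ProxyConfiguration({
  // Called whenever Crawlee needs a proxy URL for a request
  newUrlFunction: () => proxyUrls[counter++ % proxyUrls.length],
});

const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  requestHandler: async ({ proxyInfo, log }) => {
    log.info(`Using proxy: ${proxyInfo?.url}`);
    // Scraping logic
  },
});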
6. Store Data Efficiently
Use Datasets for Structured Data
import { PlaywrightCrawler, Dataset } from 'crawlee';
const crawler = new PlaywrightCrawler({
requestHandler: async ({ page, request }) => {
const data = await page.evaluate(() => ({
title: document.querySelector('h1')?.textContent,
price: document.querySelector('.price')?.textContent,
description: document.querySelector('.description')?.textContent,
}));
// Push data to default dataset
await Dataset.pushData({
url: request.url,
scrapedAt: new Date().toISOString(),
...data,
});
},
});
await crawler.run(['https://example.com/product']);
// Export data after scraping
const dataset = await Dataset.open();
await dataset.exportToJSON('results');
await dataset.exportToCSV('results');
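Besides exporting, you can read the stored items back for post-processing; a brief sketch (the price field matches the example above):
import { Dataset } from 'crawlee';

const results = await Dataset.open(); // Same default dataset used by pushData above
const { items } = await results.getData(); // Stored records plus paging info

// Example post-processing: count how many scraped items include a price
const withPrice = items.filter((item) => item.price).length;
console.log(`${withPrice} of ${items.length} items include a price`);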
Use Key-Value Stores for State Management
import { KeyValueStore } from 'crawlee';
const store = await KeyValueStore.open('my-store');
// Save scraping state
await store.setValue('checkpoint', {
lastProcessedUrl: 'https://example.com/page/100',
processedCount: 1000,
timestamp: Date.now(),
});
// Resume from checkpoint
const checkpoint = await store.getValue('checkpoint');
if (checkpoint) {
console.log(`Resuming from ${checkpoint.lastProcessedUrl}`);
}
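Key-value stores are not limited to JSON; they can also hold binary data such as screenshots, as long as you pass a content type. A minimal sketch (the key names are arbitrary):
import { PlaywrightCrawler, KeyValueStore } from 'crawlee';

const store = await KeyValueStore.open('my-store');

const crawler = new PlaywrightCrawler({
  requestHandler: async ({ page, request }) => {
    // Save a full-page screenshot for later inspection
    const screenshot = await page.screenshot({ fullPage: true });
    await store.setValue(`screenshot-${request.id}`, screenshot, { contentType: 'image/png' });
  },
});
await crawler.run(['https://example.com']);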
7. Follow Ethical Scraping Practices
Respect robots.txt
While Crawlee doesn't enforce robots.txt by default, you should respect it:
import { CheerioCrawler } from 'crawlee';
import robotsParser from 'robots-parser';
const crawler = new CheerioCrawler({
  requestHandler: async ({ request }) => {
    // Check robots.txt before processing the page further
    const robotsUrl = new URL('/robots.txt', request.url).href;
    const response = await fetch(robotsUrl);
    const robotsTxt = await response.text();
    const robots = robotsParser(robotsUrl, robotsTxt); // robots-parser expects the robots.txt URL first
    if (!robots.isAllowed(request.url, 'Crawlee')) {
      throw new Error('URL disallowed by robots.txt');
    }
    // Continue with scraping
  },
});
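Fetching robots.txt for every single request wastes bandwidth. A simple per-host cache, sketched here with the same robots-parser package, avoids the repeated downloads:
import robotsParser from 'robots-parser';

const robotsCache = new Map();

// Fetch and cache the robots.txt rules for the host of a given URL
async function getRobots(url) {
  const robotsUrl = new URL('/robots.txt', url).href;
  if (!robotsCache.has(robotsUrl)) {
    const response = await fetch(robotsUrl);
    const body = response.ok ? await response.text() : '';
    robotsCache.set(robotsUrl, robotsParser(robotsUrl, body));
  }
  return robotsCache.get(robotsUrl);
}

// Usage inside a requestHandler:
// const robots = await getRobots(request.url);
// if (!robots.isAllowed(request.url, 'Crawlee')) throw new Error('Disallowed by robots.txt');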
Implement Rate Limiting
const crawler = new CheerioCrawler({
maxRequestsPerMinute: 60, // 60 requests per minute
maxRequestsPerCrawl: 1000, // Limit total requests
requestHandler: async ({ request, log }) => {
log.info(`Processing: ${request.url}`);
// Add delays if needed
await new Promise(resolve => setTimeout(resolve, 1000));
},
});
Set Proper User-Agent
const crawler = new PlaywrightCrawler({
  launchContext: {
    // Identify your bot and give site owners a way to reach you
    userAgent: 'MyBot/1.0 (https://mywebsite.com/bot-info; contact@mywebsite.com)',
  },
});
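The launchContext approach only applies to browser crawlers. For HTTP-based crawlers such as CheerioCrawler, one option is to set the header on the requests themselves; a minimal sketch, assuming explicitly supplied headers take precedence over the auto-generated ones:
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  requestHandler: async ({ request, $ }) => {
    // Scraping logic
  },
});

// Supply the User-Agent header per request
await crawler.run([
  {
    url: 'https://example.com',
    headers: {
      'User-Agent': 'MyBot/1.0 (https://mywebsite.com/bot-info; contact@mywebsite.com)',
    },
  },
]);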
8. Monitor and Log Effectively
Use Built-in Logging
import { PlaywrightCrawler, log } from 'crawlee';
log.setLevel(log.LEVELS.DEBUG); // Set global log level
const crawler = new PlaywrightCrawler({
requestHandler: async ({ request, log }) => {
log.info(`Processing ${request.url}`);
log.debug('Detailed debug information');
try {
// Scraping logic
} catch (error) {
log.error('Scraping failed', { error, url: request.url });
}
},
});
Track Statistics
const crawler = new CheerioCrawler({
requestHandler: async ({ request }) => {
// Scraping logic
},
});
// crawler.run() resolves with the final statistics for the whole run
const stats = await crawler.run(['https://example.com']);
console.log(`Requests processed: ${stats.requestsFinished}`);
console.log(`Requests failed: ${stats.requestsFailed}`);
console.log(`Average processing time: ${stats.requestAvgFinishedDurationMillis}ms`);
9. Python-Specific Best Practices
For Python users, Crawlee for Python offers similar functionality:
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=100,
        max_request_retries=3,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # The typed context gives IDE autocomplete
        url = context.request.url
        page = context.page
        await page.wait_for_selector('.content')
        title = await page.query_selector('.title')
        title_text = await title.inner_text() if title else None
        await context.push_data({
            'url': url,
            'title': title_text,
        })
        # Enqueue product links for further crawling
        await context.enqueue_links(selector='a.product-link')

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
10. Testing and Debugging
Test with Small Datasets
const crawler = new CheerioCrawler({
maxRequestsPerCrawl: 10, // Limit for testing
requestHandler: async ({ request, log }) => {
log.debug(`Testing with ${request.url}`);
// Your scraping logic
},
});
Use headless: false for Debugging
When debugging browser-based crawlers, you can follow similar approaches to handling errors in Puppeteer:
const crawler = new PlaywrightCrawler({
headless: false, // Show browser window
launchContext: {
launchOptions: {
slowMo: 100, // Slow down operations by 100ms
devtools: true, // Open DevTools (Chromium only)
},
},
});
Conclusion
Following these best practices will help you build robust, efficient, and maintainable web scrapers with Crawlee. Key takeaways include:
- Choose the appropriate crawler type for your use case
- Implement comprehensive error handling and retry logic
- Optimize performance with proper concurrency settings
- Manage requests, sessions, and proxies effectively
- Store data efficiently using Datasets and Key-Value Stores
- Follow ethical scraping practices and respect website policies
- Monitor and log scraper activity for debugging and optimization
By adhering to these guidelines, you'll create scrapers that are not only effective but also respectful of target websites and maintainable in the long term. Whether you're scraping static HTML or complex JavaScript applications, Crawlee provides the tools you need to implement these best practices successfully.