What are the main features of Crawlee for web scraping?
Crawlee is a modern web scraping and browser automation library developed by Apify. It's designed to make building reliable crawlers and scrapers easier by providing a robust set of features that handle common challenges like rate limiting, proxy rotation, and request management. Originally built for Node.js, Crawlee has become a popular choice for developers who need production-grade web scraping capabilities.
Core Features of Crawlee
1. Unified API for Multiple Crawling Modes
Crawlee provides three main crawler types that share a consistent API:
- CheerioCrawler: Fast HTTP crawler for static HTML content
- PlaywrightCrawler: Full-featured browser crawler using Playwright
- PuppeteerCrawler: Full-featured browser crawler using Puppeteer
This unified approach means you can switch between crawlers with minimal code changes:
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $, enqueueLinks }) {
const title = $('title').text();
console.log(`Title of ${request.url}: ${title}`);
// Automatically enqueue all links found on the page
await enqueueLinks();
},
});
await crawler.run(['https://example.com']);
For JavaScript-heavy sites, switch to PlaywrightCrawler:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page, enqueueLinks }) {
const title = await page.title();
console.log(`Title of ${request.url}: ${title}`);
await enqueueLinks();
},
});
await crawler.run(['https://example.com']);
2. Automatic Request Queue Management
Crawlee includes a sophisticated request queue system that automatically manages URLs to crawl. The queue handles:
- Deduplication: Automatically prevents crawling the same URL multiple times
- Persistence: Saves queue state to disk or cloud storage
- Priority handling: Allows prioritizing certain requests so they are crawled first (see the sketch after the example below)
- Request retries: Automatically retries failed requests with exponential backoff
import { PlaywrightCrawler, Dataset } from 'crawlee';
const crawler = new PlaywrightCrawler({
maxRequestsPerCrawl: 100,
async requestHandler({ request, page, enqueueLinks }) {
// Extract data
const data = await page.evaluate(() => {
return {
title: document.title,
heading: document.querySelector('h1')?.textContent,
description: document.querySelector('meta[name="description"]')?.content,
};
});
// Save data to default dataset
await Dataset.pushData(data);
// Add more URLs to the queue
await enqueueLinks({
globs: ['https://example.com/blog/**'],
});
},
});
await crawler.run(['https://example.com']);
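The priority handling mentioned in the list above is exposed through the request queue's forefront option. A minimal sketch, assuming you open the queue explicitly and pass it to the crawler (the URLs are placeholders):

import { PlaywrightCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open();
// Regular requests are appended to the end of the queue
await requestQueue.addRequest({ url: 'https://example.com/archive' });
// forefront: true pushes a request to the front so it is processed next
await requestQueue.addRequest(
    { url: 'https://example.com/breaking-news' },
    { forefront: true },
);

const crawler = new PlaywrightCrawler({
    requestQueue,
    async requestHandler({ request, page }) {
        console.log(`Processing ${request.url}`);
    },
});
await crawler.run();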
3. Built-in Proxy Rotation and Session Management
Crawlee handles proxy rotation automatically, which is essential for avoiding blocks and bypassing rate limits:
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
proxyUrls: [
'http://proxy1.example.com:8000',
'http://proxy2.example.com:8000',
'http://proxy3.example.com:8000',
],
});
const crawler = new PlaywrightCrawler({
proxyConfiguration,
sessionPoolOptions: {
maxPoolSize: 20,
sessionOptions: {
maxUsageCount: 50, // Retire session after 50 uses
},
},
async requestHandler({ request, page }) {
// Crawlee automatically rotates proxies and manages sessions
const content = await page.content();
console.log(`Fetched ${request.url} through proxy`);
},
});
await crawler.run(['https://example.com']);
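The same ProxyConfiguration object can also be queried directly, which is handy for checking which proxy a given session would receive. A small sketch using its newUrl() method:

import { ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8000',
        'http://proxy2.example.com:8000',
    ],
});

// Without a session ID, newUrl() rotates through the list
console.log(await proxyConfiguration.newUrl());
// With a session ID, the same session keeps getting the same proxy
console.log(await proxyConfiguration.newUrl('session-1'));
console.log(await proxyConfiguration.newUrl('session-1'));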
4. Smart Request Throttling and AutoScaling
Crawlee automatically adjusts concurrency based on system resources and response times:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
autoscaledPoolOptions: {
minConcurrency: 1,
maxConcurrency: 50,
desiredConcurrency: 10,
// Automatically scales based on CPU and memory usage
},
maxRequestsPerMinute: 120,
async requestHandler({ request, page }) {
// Your scraping logic here
},
});
await crawler.run(['https://example.com']);
The AutoScaling feature monitors:
- System CPU usage
- Memory consumption
- Request success rates
- Response times
It automatically adjusts the number of concurrent requests to optimize performance without overwhelming your system or the target website.
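If the defaults are too aggressive or too conservative, the thresholds the autoscaler uses can be tuned through systemStatusOptions. A minimal sketch, assuming current Crawlee option names (the specific ratios are illustrative, not recommendations):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    autoscaledPoolOptions: {
        maxConcurrency: 25,
        systemStatusOptions: {
            // Length of the sliding window of snapshots the autoscaler evaluates
            currentHistorySecs: 10,
            // Scale down when more than 30% of recent memory snapshots are overloaded
            maxMemoryOverloadedRatio: 0.3,
            // Scale down when more than 40% of recent CPU snapshots are overloaded
            maxCpuOverloadedRatio: 0.4,
        },
    },
    async requestHandler({ request, page }) {
        console.log(`Processing ${request.url}`);
    },
});
await crawler.run(['https://example.com']);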
5. Request Interception and Blocking
Crawlee allows you to block unnecessary resources to speed up crawling, similar to handling network requests in Puppeteer:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
preNavigationHooks: [
async ({ page, request }) => {
// Block images, stylesheets, and fonts
await page.route('**/*', (route) => {
const resourceType = route.request().resourceType();
if (['image', 'stylesheet', 'font'].includes(resourceType)) {
route.abort();
} else {
route.continue();
}
});
},
],
async requestHandler({ request, page }) {
// Faster scraping without loading unnecessary resources
},
});
await crawler.run(['https://example.com']);
6. Data Storage Options
Crawlee provides multiple built-in storage options:
import { PlaywrightCrawler, Dataset, KeyValueStore } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page }) {
// Extract product data
const product = await page.evaluate(() => ({
name: document.querySelector('.product-name')?.textContent,
price: document.querySelector('.price')?.textContent,
description: document.querySelector('.description')?.textContent,
}));
// Save to dataset (append-only storage)
await Dataset.pushData(product);
// Save screenshots or files to key-value store
const screenshot = await page.screenshot();
// Key-value store keys may only contain a-zA-Z0-9 and !-_.'(), so derive a safe key from the URL
const key = `screenshot-${request.url.replace(/[^a-zA-Z0-9!\-_.'()]/g, '-')}`;
await KeyValueStore.setValue(key, screenshot, { contentType: 'image/png' });
},
});
await crawler.run(['https://example.com/products']);
// Export data after crawling
const dataset = await Dataset.open();
await dataset.exportToJSON('products');
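Datasets and key-value stores can also be opened under explicit names, and CSV export is available alongside JSON. A small sketch, where the storage names are arbitrary choices:

import { Dataset, KeyValueStore } from 'crawlee';

// Named storages keep results from different crawls separate
const products = await Dataset.open('products');
await products.pushData({ name: 'Example product', price: '9.99' });
// CSV export is available alongside JSON export
await products.exportToCSV('products');

// Key-value stores can also be opened by name and hold arbitrary records
const runInfo = await KeyValueStore.open('run-info');
await runInfo.setValue('last-run', { finishedAt: new Date().toISOString() });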
7. Error Handling and Retry Logic
Crawlee includes robust error handling with automatic retries:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
maxRequestRetries: 3,
requestHandlerTimeoutSecs: 60,
async requestHandler({ request, page, log }) {
try {
// PlaywrightCrawler has already navigated to request.url before this handler runs,
// so there is no need to call page.goto() here
// Your scraping logic
const data = await page.evaluate(() => ({
// Extract data
}));
} catch (error) {
log.error(`Error processing ${request.url}`, { error });
throw error; // Crawlee will retry automatically
}
},
async failedRequestHandler({ request, log }) {
log.error(`Request failed after all retries: ${request.url}`);
},
});
await crawler.run(['https://example.com']);
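Besides failedRequestHandler, Crawlee also accepts an errorHandler that runs before each retry, which is a convenient place to react to blocking. A rough sketch, assuming you want to retire the current session on error so the retry uses a fresh one:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 3,
    async requestHandler({ page }) {
        // Scraping logic that may throw, e.g. when the page shows a block screen
        await page.waitForSelector('.content', { timeout: 10_000 });
    },
    async errorHandler({ session, request, log }, error) {
        // Runs before the request is retried
        log.warning(`Retrying ${request.url} after error: ${error.message}`);
        // Throw away the current session (and its cookies/proxy) before retrying
        session?.retire();
    },
});
await crawler.run(['https://example.com']);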
8. TypeScript Support
Crawlee is written in TypeScript and provides excellent type safety:
import { PlaywrightCrawler, Dataset } from 'crawlee';
interface Product {
name: string;
price: number;
inStock: boolean;
}
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page }) {
const product: Product = await page.evaluate(() => ({
name: document.querySelector('.product-name')?.textContent || '',
price: parseFloat(document.querySelector('.price')?.textContent || '0'),
inStock: document.querySelector('.in-stock') !== null,
}));
await Dataset.pushData<Product>(product);
},
});
await crawler.run(['https://example.com/products']);
9. Hooks and Middleware
Crawlee provides lifecycle hooks for customizing behavior:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
preNavigationHooks: [
async ({ page, request }) => {
// Set custom headers or cookies before navigation
await page.setExtraHTTPHeaders({
'Accept-Language': 'en-US,en;q=0.9',
});
},
],
postNavigationHooks: [
async ({ page, request }) => {
// Wait for specific conditions after page load
await page.waitForSelector('.content-loaded');
},
],
async requestHandler({ request, page }) {
// Main scraping logic
},
});
await crawler.run(['https://example.com']);
10. Sitemap and robots.txt Support
Recent versions of Crawlee can respect robots.txt rules, and sitemap URLs can be loaded with a helper utility (see the sketch after this example). Within the crawler, enqueueLinks gives you fine-grained control over which URLs enter the queue:
import { CheerioCrawler, EnqueueStrategy } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $, enqueueLinks }) {
// Enqueue links while respecting robots.txt
await enqueueLinks({
strategy: EnqueueStrategy.All,
transformRequestFunction: (req) => {
// Modify requests before adding to queue
req.userData = { depth: (request.userData.depth ?? 0) + 1 };
return req;
},
});
},
});
await crawler.run(['https://example.com']);
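To seed a crawl from a sitemap, the Sitemap helper from @crawlee/utils can download and parse it into a plain list of URLs; recent Crawlee versions also expose a respectRobotsTxtFile crawler option. A sketch under those assumptions (check your installed version for availability of both):

import { CheerioCrawler } from 'crawlee';
import { Sitemap } from '@crawlee/utils';

// Download and parse the sitemap into a list of URLs
const { urls } = await Sitemap.load('https://example.com/sitemap.xml');

const crawler = new CheerioCrawler({
    // Skip URLs that the site's robots.txt disallows (recent Crawlee versions)
    respectRobotsTxtFile: true,
    async requestHandler({ request, $ }) {
        console.log(`${request.url}: ${$('title').text()}`);
    },
});
await crawler.run(urls);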
Python Support with Crawlee
While Crawlee was originally Node.js-only, the team has recently released a Python version:
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main():
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=100,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
        }
        await context.push_data(data)
        await context.enqueue_links()

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())
Advanced Features
Fingerprint Generation
Crawlee can generate browser fingerprints to avoid detection when automating browsers:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
useSessionPool: true,
persistCookiesPerSession: true,
// Generates realistic browser fingerprints
launchContext: {
useChrome: true,
launchOptions: {
headless: true,
},
},
async requestHandler({ request, page }) {
// Crawlee automatically rotates fingerprints per session
},
});
await crawler.run(['https://example.com']);
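Fingerprint generation is on by default in the browser crawlers, but it can be steered through browserPoolOptions. A hedged sketch, assuming the option names used by current Crawlee versions (the browser and OS constraints are purely illustrative):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        useFingerprints: true, // default in browser crawlers
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                // Only generate fingerprints matching these constraints
                browsers: [{ name: 'firefox', minVersion: 100 }],
                devices: ['desktop'],
                operatingSystems: ['windows'],
            },
        },
    },
    async requestHandler({ request, page }) {
        console.log(`Fetched ${request.url} with a generated fingerprint`);
    },
});
await crawler.run(['https://example.com']);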
Request Labeling and Routing
Organize different types of requests with labels:
import { PlaywrightCrawler, Dataset } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page, enqueueLinks }) {
if (request.label === 'CATEGORY') {
// Handle category pages
await enqueueLinks({
globs: ['**/products/**'],
label: 'PRODUCT',
});
} else if (request.label === 'PRODUCT') {
// Handle product pages
const product = await page.evaluate(() => ({
// Extract product data
}));
await Dataset.pushData(product);
}
},
});
await crawler.run([
{ url: 'https://example.com/category', label: 'CATEGORY' },
]);
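For larger crawlers, the same label-based routing is usually expressed with Crawlee's Router instead of if/else chains. A brief sketch using createPlaywrightRouter:

import { PlaywrightCrawler, createPlaywrightRouter, Dataset } from 'crawlee';

const router = createPlaywrightRouter();

// Handles requests enqueued with label: 'PRODUCT'
router.addHandler('PRODUCT', async ({ request, page }) => {
    await Dataset.pushData({
        url: request.url,
        title: await page.title(),
    });
});

// Handles everything else, including the start URLs
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({
        globs: ['**/products/**'],
        label: 'PRODUCT',
    });
});

const crawler = new PlaywrightCrawler({ requestHandler: router });
await crawler.run(['https://example.com/category']);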
Conclusion
Crawlee is a comprehensive web scraping framework that combines the power of browser automation tools like Puppeteer with intelligent features like automatic scaling, proxy rotation, and request management. Its unified API makes it easy to switch between different crawling strategies, while built-in features handle common challenges that would otherwise require significant custom code.
Whether you're building a simple scraper or a production-grade crawling system, Crawlee provides the tools and abstractions needed to create reliable, scalable solutions. The framework's automatic handling of proxies, sessions, retries, and resource optimization allows developers to focus on extraction logic rather than infrastructure concerns.
For developers looking for an all-in-one solution that combines ease of use with enterprise-grade features, Crawlee represents a significant advancement in the web scraping ecosystem.