How do I use Crawlee with Node.js for web scraping?
Crawlee is a web scraping and browser automation library for Node.js, developed by Apify. It provides a unified interface for building reliable scrapers, with built-in request queuing and routing, automatic retries, proxy rotation, and autoscaling. This guide shows how to use Crawlee with Node.js, from simple HTTP scraping to full browser automation.
Prerequisites
Before getting started with Crawlee, ensure you have:
- Node.js version 16 or higher installed
- npm or yarn package manager
- Basic understanding of JavaScript and async/await syntax
- Familiarity with HTML and CSS selectors
Installing Crawlee
First, create a new Node.js project and install Crawlee:
# Create a new project directory
mkdir my-crawler
cd my-crawler
# Initialize a new Node.js project
npm init -y
# Install Crawlee and Playwright (Playwright is needed for browser-based scraping)
npm install crawlee playwright
# Download the Playwright browser binaries
npx playwright install
The examples in this guide use ES module syntax and top-level await, so set "type": "module" in your package.json or save your scripts with the .mjs extension.
Crawlee supports multiple HTTP clients and browser automation tools. The main options are:
- CheerioCrawler: Lightweight HTTP requests with Cheerio for HTML parsing
- PuppeteerCrawler: Uses Puppeteer for browser automation
- PlaywrightCrawler: Uses Playwright for modern browser automation
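All three share the same crawler interface, so switching between them is mostly a matter of changing the class name and the page-handling API. As a quick illustration, here is a minimal PuppeteerCrawler; this sketch assumes you have installed puppeteer alongside crawlee (npm install puppeteer):
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, page, pushData }) {
        // Puppeteer's page API is available here
        const title = await page.title();
        await pushData({ url: request.url, title });
    },
});

await crawler.run(['https://example.com']);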
Basic Web Scraping with CheerioCrawler
CheerioCrawler is ideal for scraping static websites that don't require JavaScript execution. Here's a basic example:
import { CheerioCrawler, log } from 'crawlee';

// Create a new CheerioCrawler instance
const crawler = new CheerioCrawler({
    // Maximum number of concurrent requests
    maxConcurrency: 10,

    // Request handler function
    async requestHandler({ request, $, enqueueLinks, pushData }) {
        log.info(`Processing: ${request.url}`);

        // Extract data using Cheerio selectors
        const title = $('h1').text().trim();
        const description = $('meta[name="description"]').attr('content');
        const links = [];
        $('a').each((i, el) => {
            links.push($(el).attr('href'));
        });

        // Store the extracted data
        await pushData({
            url: request.url,
            title,
            description,
            linkCount: links.length,
        });

        // Enqueue additional URLs to crawl
        await enqueueLinks({
            // Only follow links matching this pattern
            globs: ['https://example.com/**'],
            // Exclude certain patterns
            exclude: ['**/archive/**'],
        });
    },

    // Handle failed requests (the error is passed as the second argument)
    failedRequestHandler({ request }, error) {
        log.error(`Request ${request.url} failed: ${error.message}`);
    },
});

// Start the crawler with initial URLs
await crawler.run(['https://example.com']);

// Export the collected data from the default dataset to a file
await crawler.exportData('results.json');
Browser-Based Scraping with PlaywrightCrawler
For websites that render content with JavaScript or require browser interaction, use PlaywrightCrawler:
import { PlaywrightCrawler, log } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Launch browser options
    launchContext: {
        launchOptions: {
            headless: true,
            timeout: 60000,
        },
    },

    // Request handler with page context
    async requestHandler({ request, page, enqueueLinks, pushData }) {
        log.info(`Scraping: ${request.url}`);

        // Wait for dynamic content to load
        await page.waitForSelector('.product-list', { timeout: 10000 });

        // Scroll to load lazy-loaded content
        await page.evaluate(() => {
            window.scrollTo(0, document.body.scrollHeight);
        });

        // Wait for additional content
        await page.waitForTimeout(2000);

        // Extract data from the page
        const products = await page.$$eval('.product-item', (items) => {
            return items.map(item => ({
                name: item.querySelector('.product-name')?.textContent.trim(),
                price: item.querySelector('.product-price')?.textContent.trim(),
                image: item.querySelector('img')?.src,
            }));
        });

        // Save the extracted data
        await pushData({
            url: request.url,
            products,
            scrapedAt: new Date().toISOString(),
        });

        // Find and enqueue pagination links
        await enqueueLinks({
            selector: '.pagination a',
        });
    },

    // Maximum number of requests per crawl
    maxRequestsPerCrawl: 100,
});

await crawler.run(['https://example-shop.com/products']);
Advanced Features
Request Queue Management
Crawlee provides built-in request queue management with automatic deduplication:
import { CheerioCrawler, RequestQueue, log } from 'crawlee';

// Create a named request queue
const requestQueue = await RequestQueue.open('my-queue');

// Add requests manually
await requestQueue.addRequest({
    url: 'https://example.com/page1',
    userData: { category: 'electronics' },
});
await requestQueue.addRequest({
    url: 'https://example.com/page2',
    userData: { category: 'books' },
});

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ request, $ }) {
        const category = request.userData.category;
        log.info(`Scraping ${category} from ${request.url}`);
        // Your scraping logic here
    },
});

await crawler.run();
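Deduplication is keyed on each request's uniqueKey, which defaults to a normalized form of the URL, so adding the same URL twice enqueues it only once. A short sketch of that behaviour, continuing with the requestQueue opened above; the uniqueKey value is just an illustrative assumption:
// Adding the same URL again is a no-op: Crawlee reports it was already present
const info = await requestQueue.addRequest({ url: 'https://example.com/page1' });
console.log(info.wasAlreadyPresent); // true

// To force the same URL to be crawled again, give it a distinct uniqueKey
await requestQueue.addRequest({
    url: 'https://example.com/page1',
    uniqueKey: 'page1-second-pass',
});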
Using Proxy Servers
Crawlee makes it easy to rotate proxies and avoid IP blocks:
import { PlaywrightCrawler, ProxyConfiguration, log } from 'crawlee';

// Configure proxy rotation
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8000',
        'http://proxy2.example.com:8000',
        'http://proxy3.example.com:8000',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    async requestHandler({ request, page, proxyInfo }) {
        log.info(`Using proxy: ${proxyInfo?.url}`);
        // Your scraping logic
    },
});

await crawler.run(['https://example.com']);
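If a static list is not flexible enough, ProxyConfiguration also accepts a newUrlFunction that returns the proxy URL to use for a given session. A minimal round-robin sketch, assuming the same three example proxies as above:
import { ProxyConfiguration } from 'crawlee';

const proxyUrls = [
    'http://proxy1.example.com:8000',
    'http://proxy2.example.com:8000',
    'http://proxy3.example.com:8000',
];

let counter = 0;

// Crawlee calls this whenever it needs a proxy URL; rotate through the list round-robin
const proxyConfiguration = new ProxyConfiguration({
    newUrlFunction: () => proxyUrls[counter++ % proxyUrls.length],
});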
Session Management and Cookie Handling
Maintain sessions across requests with Crawlee's session management:
import { CheerioCrawler, log } from 'crawlee';

const crawler = new CheerioCrawler({
    useSessionPool: true,
    sessionPoolOptions: {
        maxPoolSize: 20,
        sessionOptions: {
            maxUsageCount: 50, // Retire session after 50 uses
            maxErrorScore: 3, // Retire session after 3 errors
        },
    },
    async requestHandler({ request, session, $ }) {
        // Access the cookies this session holds for the current URL
        const cookies = session.getCookies(request.url);
        log.info(`Session ID: ${session.id}, Cookies: ${cookies.length}`);

        // Set custom cookies
        session.setCookies([
            { name: 'user_preference', value: 'dark_mode', domain: '.example.com' },
        ], 'https://example.com');

        // Your scraping logic
    },
});

await crawler.run(['https://example.com']);
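You can also retire a session manually when a site starts blocking it, and Crawlee will rotate to a fresh one. A minimal sketch; the 403 status check is just an illustrative assumption about how the target signals a block:
import { CheerioCrawler, log } from 'crawlee';

const crawler = new CheerioCrawler({
    useSessionPool: true,
    async requestHandler({ request, response, session, $ }) {
        // Treat a 403 as a block page: retire the session and let Crawlee retry with a new one
        if (response.statusCode === 403) {
            session.retire();
            throw new Error(`Blocked on ${request.url}`);
        }
        log.info(`Fetched ${request.url} with session ${session.id}`);
    },
});

await crawler.run(['https://example.com']);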
Handling Rate Limiting and Auto-Scaling
Crawlee automatically adjusts concurrency based on system resources:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Start with 1 concurrent request
    minConcurrency: 1,
    // Scale up to 20 concurrent requests
    maxConcurrency: 20,
    // Fine-tune how the autoscaled pool reacts to system load
    autoscaledPoolOptions: {
        desiredConcurrency: 10,
        maxConcurrency: 20,
        systemStatusOptions: {
            // Treat the system as overloaded when more than 90% of recent CPU snapshots are overloaded
            maxCpuOverloadedRatio: 0.9,
            // The same threshold, applied to memory snapshots
            maxMemoryOverloadedRatio: 0.9,
        },
    },
    async requestHandler({ request, page }) {
        // Your scraping logic
    },
});

await crawler.run(['https://example.com']);
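Auto-scaling controls how far concurrency can grow, but it does not cap throughput against a site's rate limits. For an explicit cap you can set maxRequestsPerMinute; a minimal sketch:
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Never send more than 60 requests per minute, even if more capacity is available
    maxRequestsPerMinute: 60,
    async requestHandler({ request, $ }) {
        // Your scraping logic
    },
});

await crawler.run(['https://example.com']);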
Handling Authentication
For sites requiring login, you can handle authentication before scraping:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Process requests one at a time so the login finishes before protected pages are visited
    maxConcurrency: 1,
    async requestHandler({ request, page, pushData }) {
        if (request.url.includes('login')) {
            // Fill in and submit the login form on the login page itself
            await page.fill('#username', 'your-username');
            await page.fill('#password', 'your-password');
            await page.click('button[type="submit"]');
            await page.waitForLoadState('networkidle');

            // Persist cookies and local storage so later runs can reuse the session
            await page.context().storageState({ path: 'auth.json' });
            return;
        }

        // Scrape authenticated content
        const data = await page.$$eval('.user-content', (elements) => {
            return elements.map(el => el.textContent);
        });
        await pushData({ url: request.url, data });
    },
});

await crawler.run(['https://example.com/login', 'https://example.com/dashboard']);
Data Storage and Export
Crawlee provides flexible data storage options:
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, pushData }) {
        const data = {
            url: request.url,
            title: $('h1').text(),
            timestamp: new Date().toISOString(),
        };
        // Push data to the default dataset
        await pushData(data);
    },
});

await crawler.run(['https://example.com']);

// Export the default dataset to the key-value store as JSON and CSV
// (the argument is a record key, not a file path)
const dataset = await Dataset.open();
await dataset.exportToJSON('results');
await dataset.exportToCSV('results');
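For records that do not fit the tabular dataset model, such as configuration snapshots or raw HTML, Crawlee also provides a key-value store. A minimal sketch:
import { KeyValueStore } from 'crawlee';

// Open the default key-value store and save a single record
const store = await KeyValueStore.open();
await store.setValue('crawl-info', { startedAt: new Date().toISOString() });

// Read the record back later
const info = await store.getValue('crawl-info');
console.log(info);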
Error Handling and Retries
Crawlee automatically retries failed requests up to a configurable limit:
import { PlaywrightCrawler, log } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Retry failed requests up to 3 times
    maxRequestRetries: 3,

    // Custom error handling, called before a failed request is retried
    // (the error is passed as the second argument)
    errorHandler({ request }, error) {
        log.error(`Request ${request.url} failed: ${error.message}`);
        // Mark certain errors as non-retryable
        if (error.message.includes('404')) {
            request.noRetry = true;
        }
    },

    async requestHandler({ request, page }) {
        try {
            // Your scraping logic with a timeout
            await page.waitForSelector('.content', { timeout: 30000 });
        } catch (error) {
            if (error.name === 'TimeoutError') {
                log.warning(`Timeout on ${request.url}, will retry`);
            }
            throw error; // Rethrow so Crawlee can retry the request
        }
    },
});

await crawler.run(['https://example.com']);
Best Practices
1. Use Appropriate Crawler Type
- Use CheerioCrawler for static HTML pages (faster, lower memory)
- Use PlaywrightCrawler for JavaScript-heavy sites
- Use PuppeteerCrawler if you're already familiar with Puppeteer
2. Implement Proper Selectors
// Use specific selectors to avoid brittle scrapers
const products = await page.$$eval('[data-testid="product-item"]', (items) => {
    return items.map(item => ({
        id: item.getAttribute('data-product-id'),
        name: item.querySelector('[data-testid="product-name"]')?.textContent,
    }));
});
3. Handle Dynamic Content
When scraping single-page applications, wait for content to load:
await page.waitForSelector('.dynamic-content', {
    state: 'visible',
    timeout: 10000,
});
// Or wait for network idle
await page.waitForLoadState('networkidle');
4. Monitor and Debug
Enable detailed logging for debugging:
import { log, LogLevel } from 'crawlee';
// Set log level
log.setLevel(LogLevel.DEBUG);
// Add custom logging
log.debug('Debug message');
log.info('Info message');
log.warning('Warning message');
log.error('Error message');
Conclusion
Crawlee provides a robust, production-ready framework for web scraping with Node.js. Its built-in features, including automatic retries, proxy rotation, session management, and autoscaling, make it an excellent choice for both simple and complex scraping projects. By following the examples and best practices in this guide, you can build reliable, efficient scrapers that stand up to real-world challenges.
For more advanced scenarios, explore Crawlee's Playwright and Puppeteer integrations in more depth; they cover dynamic content, authentication, and complex user interactions beyond what this guide has shown.