How do I set up a Node.js web crawler using Crawlee?
Crawlee is a powerful web scraping and browser automation library for Node.js that provides a unified interface for building reliable crawlers. Setting up a crawler with Crawlee is straightforward, and the library gives you built-in features like request queuing, automatic retries, and proxy rotation.
Prerequisites
Before setting up Crawlee, ensure you have:
- Node.js (version 16 or higher)
- npm or yarn package manager
- Basic understanding of JavaScript/TypeScript and async/await
Installing Crawlee
First, create a new Node.js project and install Crawlee:
# Create a new project directory
mkdir my-crawler
cd my-crawler
# Initialize a new Node.js project
npm init -y
# Install Crawlee
npm install crawlee
# Install Playwright for browser automation (optional)
npm install playwright
Crawlee supports multiple HTTP clients and browser automation tools:
- Cheerio - Fast HTML parsing without a browser
- Puppeteer - Chrome/Chromium automation
- Playwright - Multi-browser automation (Chromium, Firefox, and WebKit)
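Puppeteer is not covered in detail below, but all three crawler classes share the same basic shape. Here is a minimal PuppeteerCrawler sketch, assuming you have also run npm install puppeteer:
const { PuppeteerCrawler, Dataset } = require('crawlee');

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request, log }) {
        log.info(`Processing ${request.url}...`);
        // page is a regular Puppeteer Page object
        const title = await page.title();
        await Dataset.pushData({ url: request.url, title });
    },
});

await crawler.run(['https://example.com']);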
Basic Crawlee Setup with CheerioCrawler
For simple HTML scraping without JavaScript rendering, use CheerioCrawler:
const { CheerioCrawler, Dataset } = require('crawlee');
// Create a new crawler instance
const crawler = new CheerioCrawler({
// Maximum number of concurrent requests
maxConcurrency: 10,
// Request handler - processes each page
    async requestHandler({ request, $, log, enqueueLinks }) {
log.info(`Processing ${request.url}...`);
// Extract data using jQuery-like syntax
const title = $('title').text();
const headings = [];
$('h1, h2').each((index, element) => {
headings.push($(element).text().trim());
});
// Save extracted data
await Dataset.pushData({
url: request.url,
title,
headings,
});
        // Enqueue new URLs found on the page; enqueueLinks resolves
        // relative links against the current page URL
        await enqueueLinks({ selector: 'a[href]' });
},
// Handle failed requests
failedRequestHandler({ request, log }) {
log.error(`Request ${request.url} failed too many times.`);
},
});
// Run the crawler
await crawler.run(['https://example.com']);
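Note that top-level await, used by the await crawler.run(...) lines throughout this guide, only works in ES modules. If you keep the CommonJS require() style shown here, either add "type": "module" to package.json and switch require to import, or wrap the code in an async function, for example:
const { CheerioCrawler } = require('crawlee');

async function main() {
    const crawler = new CheerioCrawler({
        async requestHandler({ request, log }) {
            log.info(`Processing ${request.url}...`);
        },
    });
    await crawler.run(['https://example.com']);
}

main().catch(console.error);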
Setting Up Crawlee with Playwright
For JavaScript-heavy websites, use PlaywrightCrawler to render pages in a real browser:
const { PlaywrightCrawler, Dataset } = require('crawlee');
const crawler = new PlaywrightCrawler({
// Launch browser in headless mode
headless: true,
    // The browser defaults to Chromium; realistic fingerprints help avoid bot detection
    browserPoolOptions: {
        useFingerprints: true,
},
async requestHandler({ page, request, log, enqueueLinks }) {
log.info(`Scraping ${request.url}...`);
// Wait for dynamic content to load
await page.waitForLoadState('networkidle');
// Extract data from the page
const data = await page.evaluate(() => {
return {
title: document.querySelector('title')?.textContent,
description: document.querySelector('meta[name="description"]')?.content,
links: Array.from(document.querySelectorAll('a')).map(a => a.href),
};
});
// Save the data
await Dataset.pushData({
url: request.url,
...data,
});
// Find and enqueue new links
await enqueueLinks({
selector: 'a[href]',
strategy: 'same-domain', // Only crawl same domain
});
},
});
await crawler.run(['https://example.com']);
Advanced Configuration
Request Queue and Storage
Crawlee automatically manages request queues and data storage:
const { PlaywrightCrawler, Dataset, RequestQueue } = require('crawlee');
// Initialize a named request queue
const requestQueue = await RequestQueue.open('my-queue');
// Add initial URLs
await requestQueue.addRequest({
url: 'https://example.com',
userData: { depth: 0 } // Custom metadata
});
const crawler = new PlaywrightCrawler({
requestQueue,
async requestHandler({ request, page, log, enqueueLinks }) {
const { depth } = request.userData;
// Limit crawl depth
if (depth >= 3) {
log.info(`Max depth reached for ${request.url}`);
return;
}
// Extract and save data
const data = await page.evaluate(() => ({
title: document.title,
url: window.location.href,
}));
await Dataset.pushData(data);
// Enqueue links with incremented depth
await enqueueLinks({
transformRequestFunction: (req) => {
req.userData = { depth: depth + 1 };
return req;
},
});
},
});
await crawler.run();
Proxy Configuration
Add proxy support to avoid IP blocking:
const { PlaywrightCrawler, ProxyConfiguration } = require('crawlee');
// Configure proxy rotation
const proxyConfiguration = new ProxyConfiguration({
proxyUrls: [
'http://proxy1.example.com:8000',
'http://proxy2.example.com:8000',
],
// Or use proxy services like Apify Proxy
// proxyUrls: ['http://groups-RESIDENTIAL:password@proxy.apify.com:8000'],
});
const crawler = new PlaywrightCrawler({
proxyConfiguration,
useSessionPool: true, // Maintain sessions across requests
async requestHandler({ request, page, log }) {
log.info(`Processing ${request.url} via proxy...`);
// Your scraping logic here
},
});
await crawler.run(['https://example.com']);
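Inside the request handler you can also check which proxy a request went through via the proxyInfo context property. A small sketch, assuming a proxy configuration like the one above:
const { PlaywrightCrawler, ProxyConfiguration } = require('crawlee');

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy1.example.com:8000'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    async requestHandler({ request, proxyInfo, log }) {
        // proxyInfo describes the proxy used for this request
        log.info(`Fetched ${request.url} via ${proxyInfo?.url}`);
    },
});

await crawler.run(['https://example.com']);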
Session Management and Cookies
Handle authentication and sessions:
const { PlaywrightCrawler } = require('crawlee');
const crawler = new PlaywrightCrawler({
useSessionPool: true,
persistCookiesPerSession: true,
    async requestHandler({ page, request, session, log }) {
// Check if session is blocked
if (session.isBlocked()) {
log.warning('Session blocked, retiring...');
session.retire();
return;
}
// Handle login if needed
if (!session.userData.loggedIn) {
await page.goto('https://example.com/login');
await page.fill('#username', 'user@example.com');
await page.fill('#password', 'password');
await page.click('button[type="submit"]');
            session.userData.loggedIn = true;
            // Return to the originally requested page after logging in
            await page.goto(request.url);
        }
// Continue scraping authenticated pages
},
});
Error Handling and Retries
Crawlee provides robust error handling:
const crawler = new PlaywrightCrawler({
maxRequestRetries: 3,
maxRequestsPerMinute: 120,
    // Crawlee navigates to request.url before calling this handler,
    // so there is no need to call page.goto() again here
    async requestHandler({ request, page, log }) {
        try {
            // Wait for the page to settle; a timeout throws and triggers a retry
            await page.waitForLoadState('networkidle', { timeout: 30000 });
            // Your scraping logic
        } catch (error) {
            log.error(`Error processing ${request.url}: ${error.message}`);
            throw error; // Re-throwing lets Crawlee retry the request
        }
    },
failedRequestHandler({ request, log }) {
log.error(`Failed to process ${request.url} after retries`);
},
});
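Crawlee also lets you hook into each failed attempt before the retry happens, not just after all retries are exhausted. A sketch using the errorHandler option, assuming Crawlee 3.x (verify the exact handler signature against your installed version):
const { PlaywrightCrawler } = require('crawlee');

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 3,
    useSessionPool: true,
    async requestHandler({ request, page, log }) {
        // Your scraping logic
    },
    // Runs before each retry of a failed request
    async errorHandler({ request, session, log }, error) {
        log.warning(`Retrying ${request.url}: ${error.message}`);
        // Retire the session so the retry uses a fresh one
        session?.retire();
    },
    failedRequestHandler({ request, log }) {
        log.error(`Giving up on ${request.url}`);
    },
});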
Working with TypeScript
Crawlee has excellent TypeScript support:
import { PlaywrightCrawler, Dataset } from 'crawlee';
interface ProductData {
url: string;
title: string;
price: number;
availability: boolean;
}
const crawler = new PlaywrightCrawler({
async requestHandler({ page, request, log, enqueueLinks }) {
log.info(`Processing ${request.url}`);
const data: ProductData = await page.evaluate(() => {
return {
url: window.location.href,
title: document.querySelector('h1.product-title')?.textContent || '',
price: parseFloat(document.querySelector('.price')?.textContent || '0'),
availability: !!document.querySelector('.in-stock'),
};
});
await Dataset.pushData<ProductData>(data);
await enqueueLinks({
selector: 'a.product-link',
label: 'PRODUCT',
});
},
});
await crawler.run(['https://shop.example.com']);
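The label passed to enqueueLinks becomes request.label, which you can use to route requests to different handlers. Crawlee provides router helpers for this; here is a brief sketch using createPlaywrightRouter (the selectors are placeholders, not from a real site):
import { PlaywrightCrawler, Dataset, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

// Default handler: listing pages that enqueue product links
router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info('Enqueueing product pages...');
    await enqueueLinks({ selector: 'a.product-link', label: 'PRODUCT' });
});

// Handler for requests labelled PRODUCT
router.addHandler('PRODUCT', async ({ page, request }) => {
    await Dataset.pushData({
        url: request.url,
        title: await page.title(),
    });
});

const crawler = new PlaywrightCrawler({ requestHandler: router });
await crawler.run(['https://shop.example.com']);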
Exporting Data
Export scraped data in various formats:
const { Dataset } = require('crawlee');
// After crawling is complete
const dataset = await Dataset.open('my-results');
// Read all items into memory
const data = await dataset.getData();
console.log(data.items);
// Export to CSV (written under this key in the default key-value store,
// e.g. ./storage/key_value_stores/default/results.csv when running locally)
await dataset.exportToCSV('results');
// Export to JSON
await dataset.exportToJSON('results');
// Get data in chunks for large datasets
const { items } = await dataset.getData({
offset: 0,
limit: 100
});
Complete Example: E-commerce Crawler
Here's a production-ready example for crawling an e-commerce site:
const { PlaywrightCrawler, Dataset, ProxyConfiguration } = require('crawlee');
// Only create a proxy configuration when proxy URLs are provided
const proxyConfiguration = process.env.PROXY_URLS
    ? new ProxyConfiguration({ proxyUrls: process.env.PROXY_URLS.split(',') })
    : undefined;
const crawler = new PlaywrightCrawler({
proxyConfiguration,
maxConcurrency: 5,
maxRequestsPerMinute: 60,
useSessionPool: true,
async requestHandler({ page, request, log, enqueueLinks }) {
log.info(`Crawling ${request.url}`);
// Handle different page types
if (request.label === 'CATEGORY') {
await enqueueLinks({
selector: 'a.product-card',
label: 'PRODUCT',
});
await enqueueLinks({
selector: 'a.pagination-next',
label: 'CATEGORY',
});
}
if (request.label === 'PRODUCT') {
// Wait for product details to load
await page.waitForSelector('.product-details', { timeout: 10000 });
const product = await page.evaluate(() => ({
name: document.querySelector('h1.product-name')?.textContent?.trim(),
price: document.querySelector('.price')?.textContent?.trim(),
description: document.querySelector('.description')?.textContent?.trim(),
images: Array.from(document.querySelectorAll('.product-image img'))
.map(img => img.src),
inStock: !!document.querySelector('.in-stock'),
}));
await Dataset.pushData({
url: request.url,
...product,
scrapedAt: new Date().toISOString(),
});
}
},
failedRequestHandler({ request, log }) {
log.error(`Request failed: ${request.url}`);
},
});
// Start crawling
await crawler.run([
{ url: 'https://shop.example.com/categories', label: 'CATEGORY' }
]);
// Export results
const dataset = await Dataset.open();
await dataset.exportToJSON('products'); // written to the default key-value store
Integration with Browser Automation
Crawlee integrates seamlessly with browser automation tools. For handling complex interactions like navigating to different pages or managing browser sessions, you can leverage Crawlee's built-in Puppeteer and Playwright support while benefiting from its queue management and retry logic.
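For example, a PlaywrightCrawler request handler can drive the page through UI interactions before extracting data. A brief sketch, where the .load-more and .item selectors are placeholders:
const { PlaywrightCrawler, Dataset } = require('crawlee');

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Click a hypothetical "load more" button a few times to reveal content
        for (let i = 0; i < 5 && (await page.$('.load-more')); i++) {
            await page.click('.load-more');
            await page.waitForLoadState('networkidle');
        }
        // Collect the text of each revealed item
        const items = await page.$$eval('.item', (els) =>
            els.map((el) => el.textContent?.trim()),
        );
        await Dataset.pushData({ url: request.url, items });
    },
});

await crawler.run(['https://example.com']);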
Best Practices
- Start Simple: Begin with CheerioCrawler for static sites; upgrade to PlaywrightCrawler only when needed
- Respect Robots.txt: Use a robots.txt parser to check allowed paths before crawling
- Use Rate Limiting: Configure maxRequestsPerMinute to avoid overwhelming servers (see the sketch after this list)
- Handle Errors Gracefully: Implement proper error handling and retry logic
- Monitor Performance: Use Crawlee's built-in logging and statistics
- Use Proxies: Rotate proxies to avoid IP bans
- Implement Depth Limits: Prevent infinite crawling with depth checks
- Clean Up Resources: Properly close browser instances and clean temporary files
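Several of these practices map directly to crawler options. A configuration sketch (the limits shown are illustrative examples, not recommendations):
const { PlaywrightCrawler } = require('crawlee');

const crawler = new PlaywrightCrawler({
    maxConcurrency: 10,        // Cap parallel browser pages
    maxRequestsPerMinute: 60,  // Rate limiting
    maxRequestsPerCrawl: 1000, // Hard stop to prevent runaway crawls
    maxRequestRetries: 3,      // Retry failed requests a few times
    useSessionPool: true,      // Rotate sessions (and proxies, if configured)
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});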
Conclusion
Setting up a Node.js web crawler with Crawlee provides a robust foundation for web scraping projects. The library handles complex tasks like request queuing, retries, and proxy rotation automatically, allowing you to focus on extracting the data you need. Whether you're building a simple HTML scraper or a sophisticated browser-based crawler, Crawlee offers the tools and flexibility to handle various scraping scenarios efficiently.
For dynamic websites requiring JavaScript execution, consider combining techniques for handling AJAX requests with Crawlee's browser automation capabilities to ensure all content is properly loaded before extraction.