What is the best way to build a JavaScript web scraper with Crawlee?
Building a JavaScript web scraper with Crawlee involves following best practices for architecture, configuration, data extraction, and error handling. Crawlee is a modern web scraping framework that provides powerful features like automatic scaling, proxy rotation, and intelligent request management. This comprehensive guide covers the best approaches to building production-ready web scrapers with Crawlee in JavaScript.
Choosing the Right Crawler Type
Crawlee offers three main crawler types, each optimized for different scenarios:
CheerioCrawler for Static Content
Best for websites that serve pre-rendered HTML without JavaScript:
import { CheerioCrawler, Dataset } from 'crawlee';
const crawler = new CheerioCrawler({
requestHandler: async ({ request, $, enqueueLinks }) => {
const title = $('title').text();
const articles = $('article')
.map((i, el) => ({
heading: $(el).find('h2').text(),
description: $(el).find('p').text(),
url: $(el).find('a').attr('href'),
}))
.get();
await Dataset.pushData({
url: request.url,
title,
articles,
});
await enqueueLinks({
selector: 'a[href]',
strategy: 'same-domain',
});
},
});
await crawler.run(['https://example.com/blog']);
CheerioCrawler is the fastest option, consuming minimal resources since it doesn't launch a browser. Use this when websites don't rely on JavaScript for content rendering.
PlaywrightCrawler for Modern Web Applications
Recommended for JavaScript-heavy sites and single-page applications:
import { PlaywrightCrawler, Dataset } from 'crawlee';
const crawler = new PlaywrightCrawler({
requestHandler: async ({ page, request, enqueueLinks, log }) => {
log.info(`Processing ${request.url}...`);
// Wait for dynamic content to load
await page.waitForSelector('article', { timeout: 5000 });
// Extract data using page.evaluate for complex operations
const data = await page.evaluate(() => {
const articles = Array.from(document.querySelectorAll('article'));
return articles.map(article => ({
heading: article.querySelector('h2')?.textContent.trim(),
description: article.querySelector('p')?.textContent.trim(),
imageUrl: article.querySelector('img')?.src,
author: article.querySelector('.author')?.textContent.trim(),
publishDate: article.querySelector('.date')?.textContent.trim(),
}));
});
await Dataset.pushData({
url: request.url,
scrapedAt: new Date().toISOString(),
articles: data,
});
// Enqueue pagination links
await enqueueLinks({
selector: 'a.pagination-link',
strategy: 'same-domain',
});
},
});
await crawler.run(['https://example.com']);
PlaywrightCrawler provides cross-browser support (Chromium, Firefox, WebKit) and advanced features like network interception and mobile emulation. Much as you would manage browser sessions in Puppeteer, Crawlee manages browser contexts for you.
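For instance, here is a minimal sketch (the URL is a placeholder, and it assumes the playwright package is installed alongside Crawlee) that runs the crawl in Firefox and uses network interception to block image downloads:
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';
const crawler = new PlaywrightCrawler({
launchContext: {
launcher: firefox, // Cross-browser: swap in chromium or webkit as needed
launchOptions: { headless: true },
},
preNavigationHooks: [
async ({ page }) => {
// Network interception: skip image downloads to speed up crawling
await page.route('**/*.{png,jpg,jpeg,webp}', (route) => route.abort());
},
],
requestHandler: async ({ page, request, log }) => {
log.info(`Loaded ${request.url}: ${await page.title()}`);
},
});
await crawler.run(['https://example.com']);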
PuppeteerCrawler as an Alternative
If you prefer Puppeteer's API or need Chrome DevTools Protocol features:
import { PuppeteerCrawler, Dataset } from 'crawlee';
const crawler = new PuppeteerCrawler({
requestHandler: async ({ page, request }) => {
await page.waitForSelector('.content');
const data = await page.evaluate(() => ({
// Extraction logic, for example:
title: document.title,
content: document.querySelector('.content')?.textContent,
}));
// Process and save data
await Dataset.pushData({ url: request.url, ...data });
},
});
Project Structure Best Practices
Organize your Crawlee project for maintainability and scalability:
my-scraper/
├── src/
│ ├── main.js # Entry point
│ ├── routes.js # Request handlers and routing logic
│ ├── extractors/
│ │ ├── productExtractor.js
│ │ └── categoryExtractor.js
│ ├── utils/
│ │ ├── validation.js
│ │ └── transforms.js
│ └── config/
│ └── crawler.config.js
├── storage/ # Auto-managed by Crawlee
├── logs/
├── package.json
└── .env
Modular Route Handlers
Separate different page types into dedicated handlers:
// src/routes.js
import { createPlaywrightRouter, Dataset } from 'crawlee';
export const router = createPlaywrightRouter();
// Handler for product listing pages
router.addHandler('LISTING', async ({ page, enqueueLinks, log }) => {
log.info('Processing listing page');
await enqueueLinks({
selector: 'a.product-link',
label: 'PRODUCT',
});
await enqueueLinks({
selector: 'a.next-page',
label: 'LISTING',
});
});
// Handler for individual product pages
router.addHandler('PRODUCT', async ({ page, request, log }) => {
log.info(`Scraping product: ${request.url}`);
const product = await page.evaluate(() => ({
name: document.querySelector('h1.product-name')?.textContent,
price: document.querySelector('.price')?.textContent,
description: document.querySelector('.description')?.textContent,
specifications: Array.from(document.querySelectorAll('.spec-item'))
.map(item => ({
key: item.querySelector('.spec-key')?.textContent,
value: item.querySelector('.spec-value')?.textContent,
})),
images: Array.from(document.querySelectorAll('.product-image'))
.map(img => img.src),
}));
await Dataset.pushData(product);
});
// Default handler for unlabeled requests
router.addDefaultHandler(async ({ enqueueLinks }) => {
await enqueueLinks({
selector: 'a.category-link',
label: 'LISTING',
});
});
Main Crawler Configuration
// src/main.js
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
import { router } from './routes.js';
const proxyConfiguration = new ProxyConfiguration({
proxyUrls: process.env.PROXY_URLS?.split(','),
});
const crawler = new PlaywrightCrawler({
requestHandler: router,
proxyConfiguration,
// Performance optimization
maxConcurrency: 10,
maxRequestsPerCrawl: 1000,
maxRequestRetries: 3,
// Browser configuration
launchContext: {
launchOptions: {
headless: true,
args: ['--no-sandbox', '--disable-setuid-sandbox'],
},
},
// Session management for better reliability
useSessionPool: true,
sessionPoolOptions: {
maxPoolSize: 100,
sessionOptions: {
maxUsageCount: 50,
},
},
// Timeouts
requestHandlerTimeoutSecs: 60,
navigationTimeoutSecs: 30,
});
await crawler.run([
{ url: 'https://example.com/products', label: 'LISTING' },
]);
Advanced Data Extraction Techniques
Waiting for Dynamic Content
When scraping JavaScript-heavy sites, wait for content explicitly, much as you would with Puppeteer's waitFor helpers:
requestHandler: async ({ page, request }) => {
// Wait for specific selector
await page.waitForSelector('.product-list', { timeout: 10000 });
// Wait for network to be idle
await page.waitForLoadState('networkidle');
// Wait for custom condition
await page.waitForFunction(() => {
return document.querySelectorAll('.product-item').length > 0;
});
// Custom wait with retry logic
const waitForContent = async (selector, maxAttempts = 5) => {
for (let i = 0; i < maxAttempts; i++) {
const element = await page.$(selector);
if (element) return element;
await page.waitForTimeout(1000);
}
throw new Error(`Element ${selector} not found after ${maxAttempts} attempts`);
};
await waitForContent('.dynamic-content');
}
Handling Pagination
Implement robust pagination strategies:
// Method 1: Using enqueueLinks (inside a requestHandler, where `request` is the current request)
await enqueueLinks({
selector: 'a.next-page',
transformRequestFunction: (req) => {
req.userData = { pageNumber: (request.userData.pageNumber || 1) + 1 };
return req;
},
});
// Method 2: Manual pagination with page numbers
requestHandler: async ({ page, request, crawler }) => {
const maxPages = 50;
const currentPage = request.userData.pageNumber || 1;
// Scrape current page...
if (currentPage < maxPages) {
const nextPageUrl = `${request.loadedUrl}?page=${currentPage + 1}`;
await crawler.addRequests([{
url: nextPageUrl,
userData: { pageNumber: currentPage + 1 },
}]);
}
}
// Method 3: Infinite scroll handling
requestHandler: async ({ page }) => {
let previousHeight = 0;
let currentHeight = await page.evaluate(() => document.body.scrollHeight);
while (previousHeight !== currentHeight) {
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
await page.waitForTimeout(2000);
previousHeight = currentHeight;
currentHeight = await page.evaluate(() => document.body.scrollHeight);
}
// Now extract all loaded content
const allItems = await page.$$eval('.item', items =>
items.map(item => ({
title: item.querySelector('.title')?.textContent,
price: item.querySelector('.price')?.textContent,
}))
);
}
Error Handling and Retry Logic
Implement comprehensive error handling:
import { PlaywrightCrawler, Dataset, log } from 'crawlee';
const crawler = new PlaywrightCrawler({
requestHandler: async ({ page, request }) => {
try {
// Crawlee has already navigated to request.url before invoking the handler,
// so there is no need to call page.goto() here; just wait for the page to settle
await page.waitForLoadState('networkidle');
// Check if page loaded correctly
const isBlocked = await page.$('.captcha, .access-denied');
if (isBlocked) {
throw new Error('Page blocked or CAPTCHA detected');
}
// Extract data with error handling
const data = await page.evaluate(() => {
try {
return {
title: document.querySelector('h1')?.textContent || 'N/A',
// More extraction logic
};
} catch (error) {
return { error: error.message };
}
});
await Dataset.pushData(data);
} catch (error) {
log.error(`Error processing ${request.url}: ${error.message}`);
// Mark request for retry
if (request.retryCount < 3) {
throw error; // Crawlee will retry automatically
} else {
// Log failed request
await Dataset.pushData({
url: request.url,
error: error.message,
failed: true,
});
}
}
},
failedRequestHandler: async ({ request }, error) => {
log.error(`Request ${request.url} failed after all retries: ${error.message}`);
},
});
Performance Optimization
Concurrency Configuration
Balance speed and resource usage:
const crawler = new PlaywrightCrawler({
maxConcurrency: 10, // Maximum parallel requests
minConcurrency: 1, // Minimum parallel requests
maxRequestsPerMinute: 120, // Rate limiting
// Auto-scaling based on system resources (the options above are shortcuts
// for the underlying AutoscaledPool, so don't set maxConcurrency again here)
autoscaledPoolOptions: {
desiredConcurrency: 10,
systemStatusOptions: {
maxUsedCpuRatio: 0.90,
maxUsedMemoryRatio: 0.85,
},
},
});
Request Caching and Deduplication
Crawlee automatically deduplicates requests, but you can customize this behavior:
import { RequestQueue } from 'crawlee';
const requestQueue = await RequestQueue.open();
// Add requests with custom uniqueKey for deduplication
await requestQueue.addRequest({
url: 'https://example.com/product?id=123',
uniqueKey: 'product-123', // Custom deduplication key
userData: { productId: 123 },
});
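To have a crawler consume this queue instead of the default one, pass it in explicitly. A minimal sketch (the handler body is illustrative):
import { CheerioCrawler, RequestQueue } from 'crawlee';
const requestQueue = await RequestQueue.open();
await requestQueue.addRequest({
url: 'https://example.com/product?id=123',
uniqueKey: 'product-123', // Requests sharing a uniqueKey are only processed once
});
// The crawler pulls requests from the explicitly provided queue
const crawler = new CheerioCrawler({
requestQueue,
requestHandler: async ({ request, log }) => {
log.info(`Processing ${request.url} (uniqueKey: ${request.uniqueKey})`);
},
});
await crawler.run();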
Memory Management
For large-scale scraping, implement proper memory management:
import { Dataset } from 'crawlee';
requestHandler: async ({ page, request }) => {
// Extract data in chunks
const chunkSize = 50;
const totalItems = await page.$$eval('.item', items => items.length);
for (let i = 0; i < totalItems; i += chunkSize) {
// Playwright's $$eval accepts a single extra argument, so pass an object
const items = await page.$$eval('.item', (elements, { start, size }) => {
return elements.slice(start, start + size).map(el => ({
// extraction logic, e.g.:
title: el.querySelector('.title')?.textContent,
}));
}, { start: i, size: chunkSize });
// Save data immediately to free memory
await Dataset.pushData(items);
}
}
Data Storage and Export
Crawlee provides flexible data storage options:
import { Dataset, KeyValueStore } from 'crawlee';
// Save to dataset (default storage)
await Dataset.pushData({ title: 'Example', price: 29.99 });
// Export to JSON
const dataset = await Dataset.open();
const data = await dataset.getData();
console.log(data.items);
// Export the whole dataset to CSV (written to the default Key-Value Store)
await Dataset.exportToCSV('OUTPUT');
// Use Key-Value Store for metadata
const store = await KeyValueStore.open();
await store.setValue('scraping-stats', {
startTime: Date.now(),
itemsProcessed: 0,
});
// Retrieve and update
const stats = await store.getValue('scraping-stats');
stats.itemsProcessed += 1;
await store.setValue('scraping-stats', stats);
Custom Export Pipeline
import { Dataset } from 'crawlee';
import fs from 'fs';
// After crawling completes
const dataset = await Dataset.open();
const { items } = await dataset.getData();
// Transform and save
const transformed = items.map(item => ({
...item,
scrapedDate: new Date(item.scrapedAt).toLocaleDateString(),
priceNumeric: parseFloat(item.price.replace(/[^0-9.]/g, '')),
}));
fs.writeFileSync('output.json', JSON.stringify(transformed, null, 2));
Proxy and Session Management
For reliable large-scale scraping:
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
// Configure proxy rotation (provide either a static list OR a newUrlFunction, not both)
const proxyConfiguration = new ProxyConfiguration({
proxyUrls: [
'http://proxy1.example.com:8080',
'http://proxy2.example.com:8080',
],
// Alternative: fetch proxies dynamically instead of a static list
// newUrlFunction: async () => 'http://dynamic-proxy.com:8080',
});
// Session management for consistent scraping
const crawler = new PlaywrightCrawler({
proxyConfiguration,
useSessionPool: true,
sessionPoolOptions: {
maxPoolSize: 100,
sessionOptions: {
maxUsageCount: 50,
maxErrorScore: 5,
},
},
persistCookiesPerSession: true,
});
Monitoring and Logging
Implement comprehensive monitoring:
import { PlaywrightCrawler, log } from 'crawlee';
// Configure logging
log.setLevel(log.LEVELS.DEBUG);
// Crawlee already tracks per-crawler statistics (exposed as crawler.stats and
// logged periodically); keep simple custom counters alongside them
const metrics = { productsScraped: 0 };
const crawler = new PlaywrightCrawler({
requestHandler: async ({ page, request }) => {
log.info(`Processing: ${request.url}`);
// Time operations
const startTime = Date.now();
// ... scraping logic ...
// Track custom metrics
metrics.productsScraped += 1;
const duration = Date.now() - startTime;
log.debug(`Extraction took ${duration}ms`);
},
});
// Periodic reporting of the custom metrics
const reporter = setInterval(() => {
log.info('Current statistics', metrics);
}, 60000);
await crawler.run(['https://example.com']);
clearInterval(reporter);
Best Practices Summary
- Choose the appropriate crawler type: Use CheerioCrawler for static sites, PlaywrightCrawler for dynamic content
- Implement proper error handling: Handle retries, log failures, and gracefully degrade
- Use route handlers: Organize logic by page type for maintainability
- Optimize concurrency: Balance speed with resource consumption
- Implement rate limiting: Respect target websites and avoid bans
- Use sessions and proxies: Rotate IPs and maintain session state
- Wait for dynamic content: Ensure JavaScript-rendered content loads completely
- Structure your data: Define clear schemas for extracted data (see the sketch after this list)
- Monitor and log: Track performance and errors for debugging
- Test incrementally: Start with small limits, then scale up
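As a concrete illustration of structuring data and testing incrementally, here is a minimal sketch (the selectors, the normalizeProduct helper, and the URL are hypothetical) that normalizes scraped records into a fixed schema before saving them and caps the crawl at a small number of requests while testing:
import { CheerioCrawler, Dataset } from 'crawlee';
// Hypothetical helper: coerce raw scraped fields into a fixed, predictable schema
const normalizeProduct = (raw) => ({
name: (raw.name ?? '').trim(),
price: Number.parseFloat(String(raw.price ?? '').replace(/[^0-9.]/g, '')) || null,
url: raw.url ?? null,
scrapedAt: new Date().toISOString(),
});
const crawler = new CheerioCrawler({
maxRequestsPerCrawl: 20, // Small limit while testing; raise it once the output looks right
requestHandler: async ({ request, $ }) => {
const raw = {
name: $('h1.product-name').text(),
price: $('.price').text(),
url: request.url,
};
await Dataset.pushData(normalizeProduct(raw));
},
});
await crawler.run(['https://example.com/products']);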
By following these best practices and leveraging Crawlee's built-in features, you can build robust, scalable JavaScript web scrapers that handle modern websites efficiently. For complex navigation flows, the same techniques you would use to navigate between pages in Puppeteer apply within Crawlee as well.