How do I use PuppeteerCrawler in Crawlee for browser automation?
PuppeteerCrawler is one of the most powerful crawler classes in Crawlee, designed specifically for browser automation tasks. It combines the capabilities of Puppeteer with Crawlee's robust crawling infrastructure, providing features like request queue management, automatic retries, rate limiting, and intelligent session handling.
What is PuppeteerCrawler?
PuppeteerCrawler is a specialized crawler in Crawlee that uses Puppeteer under the hood to control a headless Chrome browser. Unlike simpler HTTP-based crawlers, PuppeteerCrawler can execute JavaScript, interact with dynamic content, handle complex authentication flows, and scrape modern web applications that rely heavily on client-side rendering.
Basic PuppeteerCrawler Setup
Here's a simple example to get started with PuppeteerCrawler:
import { PuppeteerCrawler } from 'crawlee';
const crawler = new PuppeteerCrawler({
async requestHandler({ page, request, enqueueLinks }) {
console.log(`Processing: ${request.url}`);
// Wait for content to load
await page.waitForSelector('h1');
// Extract data from the page
const title = await page.$eval('h1', (el) => el.textContent);
console.log(`Title: ${title}`);
// Enqueue additional links found on the page
await enqueueLinks({
selector: 'a[href]',
label: 'detail',
});
},
maxRequestsPerCrawl: 50,
});
// Add initial URLs to crawl
await crawler.addRequests([
'https://example.com',
]);
// Start the crawler
await crawler.run();
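A note on running the example: it uses top-level await, so it must be executed as an ES module, either by naming the file with an .mjs extension or by setting "type": "module" in package.json. The label passed to enqueueLinks travels with each enqueued request and can later be used to route handling, as the advanced example below shows.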
Core Configuration Options
PuppeteerCrawler offers extensive configuration options to control browser behavior and crawling performance:
import { PuppeteerCrawler, Dataset } from 'crawlee';
const crawler = new PuppeteerCrawler({
// Maximum number of pages to crawl
maxRequestsPerCrawl: 100,
// Maximum concurrency (number of parallel browser tabs)
maxConcurrency: 5,
// Browser launch options
launchContext: {
launchOptions: {
headless: true,
args: ['--no-sandbox', '--disable-setuid-sandbox'],
},
},
// Pre-navigation hooks
preNavigationHooks: [
async ({ page, request }) => {
// Set custom headers
await page.setExtraHTTPHeaders({
'Accept-Language': 'en-US,en;q=0.9',
});
},
],
// Post-navigation hooks
postNavigationHooks: [
async ({ page }) => {
// Wait briefly after navigation (page.waitForTimeout was removed in recent Puppeteer versions)
await new Promise((resolve) => setTimeout(resolve, 2000));
},
],
// Main request handler
async requestHandler({ page, request, log }) {
log.info(`Processing ${request.url}`);
// Your scraping logic here
const data = await page.evaluate(() => {
return {
title: document.title,
bodyText: document.body.innerText,
};
});
await Dataset.pushData(data);
},
// Error handler
async failedRequestHandler({ request, log }) {
log.error(`Request ${request.url} failed too many times`);
},
});
Working with Page Interactions
PuppeteerCrawler excels at handling browser events and complex interactions:
import { PuppeteerCrawler } from 'crawlee';
const crawler = new PuppeteerCrawler({
async requestHandler({ page, request }) {
// Wait for specific elements
await page.waitForSelector('.product-list');
// Click buttons and interact with elements
await page.click('.load-more-button');
// Give the new items a moment to render (page.waitForTimeout was removed in recent Puppeteer versions)
await new Promise((resolve) => setTimeout(resolve, 1000));
// Fill forms
await page.type('#search-input', 'laptops');
// Start waiting for the navigation before clicking, so the event is not missed
await Promise.all([
page.waitForNavigation({ waitUntil: 'networkidle2' }),
page.click('#search-button'),
]);
// Scroll to load lazy-loaded content
await page.evaluate(() => {
window.scrollTo(0, document.body.scrollHeight);
});
// Take screenshots for debugging
await page.screenshot({
path: `screenshot-${Date.now()}.png`,
fullPage: true,
});
},
});
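For pages that keep loading items as you scroll, a single scrollTo call is often not enough. Recent Crawlee versions expose an infiniteScroll helper on the Puppeteer crawling context that keeps scrolling until the page stops growing; a minimal sketch, assuming your Crawlee version provides the helper:
import { PuppeteerCrawler } from 'crawlee';
const crawler = new PuppeteerCrawler({
async requestHandler({ page, infiniteScroll }) {
// Scroll until no new content appears, giving up after 30 seconds
await infiniteScroll({ timeoutSecs: 30 });
const items = await page.$$eval('.product-card', (els) => els.map((el) => el.textContent));
console.log(`Found ${items.length} items after scrolling`);
},
});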
Advanced Data Extraction
Here's a more complex example showing how to extract structured data:
import { PuppeteerCrawler, Dataset } from 'crawlee';
const crawler = new PuppeteerCrawler({
async requestHandler({ page, request, enqueueLinks }) {
const url = request.url;
if (request.label === 'LIST') {
// Extract product links from listing page
await enqueueLinks({
selector: '.product-card a',
label: 'DETAIL',
});
// Handle pagination
const nextPageExists = await page.$('.pagination .next');
if (nextPageExists) {
await enqueueLinks({
selector: '.pagination .next',
label: 'LIST',
});
}
}
if (request.label === 'DETAIL') {
// Extract detailed product information
const product = await page.evaluate(() => {
const getTextContent = (selector) => {
const element = document.querySelector(selector);
return element ? element.textContent.trim() : null;
};
return {
name: getTextContent('.product-name'),
price: getTextContent('.product-price'),
description: getTextContent('.product-description'),
images: Array.from(document.querySelectorAll('.product-image img'))
.map(img => img.src),
specifications: Array.from(document.querySelectorAll('.spec-item'))
.map(item => ({
key: item.querySelector('.spec-key')?.textContent.trim(),
value: item.querySelector('.spec-value')?.textContent.trim(),
})),
};
});
product.url = url;
product.scrapedAt = new Date().toISOString();
await Dataset.pushData(product);
}
},
});
await crawler.addRequests([
{ url: 'https://example-shop.com/products', label: 'LIST' },
]);
await crawler.run();
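As the number of labels grows, the if (request.label === ...) branching can be replaced with Crawlee's router, which maps each label to a dedicated handler. A brief sketch of the same LIST/DETAIL split:
import { PuppeteerCrawler, createPuppeteerRouter } from 'crawlee';
const router = createPuppeteerRouter();
router.addHandler('LIST', async ({ enqueueLinks }) => {
await enqueueLinks({ selector: '.product-card a', label: 'DETAIL' });
});
router.addHandler('DETAIL', async ({ page, request }) => {
// Extract and push product data here, as in the example above
});
const crawler = new PuppeteerCrawler({ requestHandler: router });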
Handling Authentication and Sessions
PuppeteerCrawler makes it easy to handle authentication:
import { PuppeteerCrawler, Dataset } from 'crawlee';
const crawler = new PuppeteerCrawler({
preNavigationHooks: [
async ({ page, request, session }) => {
// Set cookies from session
if (session?.userData?.cookies) {
await page.setCookie(...session.userData.cookies);
}
},
],
async requestHandler({ page, request, session }) {
// Check if we need to login
const isLoginPage = await page.$('#login-form');
if (isLoginPage) {
// Perform login
await page.type('#username', 'your-username');
await page.type('#password', 'your-password');
// Wait for the post-login navigation and the click together to avoid a race
await Promise.all([
page.waitForNavigation(),
page.click('#login-button'),
]);
// Save cookies to session
const cookies = await page.cookies();
session.userData.cookies = cookies;
}
// Continue with regular scraping
const data = await page.evaluate(() => ({
title: document.title,
content: document.body.innerText,
}));
await Dataset.pushData(data);
},
});
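Two practical refinements for real projects, shown as a sketch (the environment variable names and the .login-error selector are placeholders, not part of any API): read credentials from the environment instead of hardcoding them, and retire the session when a login fails so Crawlee rotates to a fresh one.
// Inside the requestHandler shown above
await page.type('#username', process.env.SHOP_USERNAME);
await page.type('#password', process.env.SHOP_PASSWORD);
// ...after attempting the login...
const loginFailed = await page.$('.login-error');
if (loginFailed) {
session.retire();
throw new Error('Login failed, retrying with a fresh session');
}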
Request Queue Management
Crawlee automatically manages the request queue, but you can control it explicitly:
import { PuppeteerCrawler, RequestQueue } from 'crawlee';
// Create or open a named request queue
const requestQueue = await RequestQueue.open('my-queue');
const crawler = new PuppeteerCrawler({
requestQueue,
async requestHandler({ page, request, crawler, enqueueLinks }) {
// Add new requests programmatically
await crawler.addRequests([
{ url: 'https://example.com/page1', label: 'PAGE' },
{ url: 'https://example.com/page2', label: 'PAGE' },
]);
// Or use the enqueueLinks helper from the handler context
await enqueueLinks({
selector: 'a.product-link',
label: 'PRODUCT',
transformRequestFunction: (req) => {
// Modify requests before adding to queue
req.userData = { category: 'electronics' };
return req;
},
});
},
});
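Crawlee deduplicates queued requests by their uniqueKey, which defaults to a normalized form of the URL, so adding the same URL twice enqueues it only once. If you deliberately need to process a URL more than once, override uniqueKey explicitly; continuing the example above:
await crawler.addRequests([
{ url: 'https://example.com/page1', uniqueKey: 'page1-first-pass' },
{ url: 'https://example.com/page1', uniqueKey: 'page1-second-pass' },
]);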
Performance Optimization
To optimize PuppeteerCrawler performance:
import { PuppeteerCrawler } from 'crawlee';
const crawler = new PuppeteerCrawler({
// Control concurrency based on system resources
maxConcurrency: 10,
minConcurrency: 2,
// Adjust request timeouts
requestHandlerTimeoutSecs: 60,
navigationTimeoutSecs: 30,
launchContext: {
launchOptions: {
headless: true,
// Reduce memory usage
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
'--disable-accelerated-2d-canvas',
'--no-first-run',
'--no-zygote',
'--disable-gpu',
],
},
// Use Chrome instead of Chromium for better performance
useChrome: true,
},
// Block unnecessary resources
preNavigationHooks: [
async ({ page }) => {
await page.setRequestInterception(true);
page.on('request', (req) => {
const resourceType = req.resourceType();
if (resourceType === 'image' || resourceType === 'stylesheet' || resourceType === 'font') {
req.abort();
} else {
req.continue();
}
});
},
],
});
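Manual interception works, but recent Crawlee versions also expose a blockRequests helper on the Puppeteer crawling context that blocks requests by URL pattern; a minimal sketch, assuming your version provides it (check the API reference for your release):
preNavigationHooks: [
async ({ blockRequests }) => {
// Assumed helper: blocks any request whose URL contains one of these patterns
await blockRequests({
urlPatterns: ['.jpg', '.png', '.css', '.woff2'],
});
},
],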
Handling Dynamic Content and AJAX
When working with AJAX requests and dynamic content:
import { PuppeteerCrawler, Dataset } from 'crawlee';
const crawler = new PuppeteerCrawler({
async requestHandler({ page, request }) {
// Wait for AJAX content to load
await page.waitForSelector('.ajax-loaded-content', {
visible: true,
timeout: 10000,
});
// Monitor network requests
const responses = [];
page.on('response', async (response) => {
const url = response.url();
if (url.includes('/api/')) {
const data = await response.json().catch(() => null);
if (data) responses.push(data);
}
});
// Trigger AJAX by clicking a button
await page.click('.load-data-button');
// Wait for network to be idle
await page.waitForNetworkIdle({ timeout: 5000 });
// Extract data rendered by AJAX
const dynamicData = await page.evaluate(() => {
return Array.from(document.querySelectorAll('.dynamic-item'))
.map(item => item.textContent.trim());
});
await Dataset.pushData({
url: request.url,
dynamicData,
apiResponses: responses,
});
},
});
Error Handling and Retries
PuppeteerCrawler includes built-in retry mechanisms:
import { PuppeteerCrawler, Dataset } from 'crawlee';
const crawler = new PuppeteerCrawler({
// Maximum retries for failed requests
maxRequestRetries: 3,
async requestHandler({ page, request, log }) {
// The crawler has already navigated to request.url before this handler runs,
// so there is no need to call page.goto() here
try {
// Your scraping logic
const data = await page.evaluate(() => ({
title: document.title,
}));
await Dataset.pushData(data);
} catch (error) {
log.error(`Error processing ${request.url}: ${error.message}`);
throw error; // Re-throw to trigger a retry
}
},
async failedRequestHandler({ request, log }) {
// This runs after all retries are exhausted
log.error(`Request failed after ${request.retryCount} retries: ${request.url}`);
// Save failed URLs for later review
await Dataset.pushData({
url: request.url,
failed: true,
error: request.errorMessages,
});
},
});
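Crawlee also accepts an errorHandler option that runs after each failed attempt, before the request is retried. It is a good place to reset state such as sessions; a brief sketch:
import { PuppeteerCrawler } from 'crawlee';
const crawler = new PuppeteerCrawler({
maxRequestRetries: 3,
// Runs between a failed attempt and the next retry
async errorHandler({ request, session, log }, error) {
log.warning(`Retrying ${request.url}: ${error.message}`);
// Retire the session so the retry starts with a fresh one
session?.retire();
},
async requestHandler({ page }) {
// Scraping logic as above
},
});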
TypeScript Support
Crawlee is written in TypeScript and ships its own type definitions, so crawler options and handler contexts are fully typed:
import { PuppeteerCrawler, Dataset } from 'crawlee';
interface ProductData {
name: string;
price: number;
url: string;
}
const crawler = new PuppeteerCrawler({
async requestHandler({ page, request }): Promise<void> {
const product: ProductData = await page.evaluate((): ProductData => {
return {
name: document.querySelector('.product-name')?.textContent?.trim() || '',
price: parseFloat(document.querySelector('.price')?.textContent?.replace(/[^0-9.]/g, '') || '0'),
url: window.location.href,
};
});
await Dataset.pushData(product);
},
});
Conclusion
PuppeteerCrawler in Crawlee provides a powerful, production-ready solution for browser automation and web scraping. It combines Puppeteer's browser control capabilities with Crawlee's robust infrastructure for queue management, request handling, and error recovery. Whether you're scraping simple websites or complex single-page applications, PuppeteerCrawler offers the flexibility and reliability needed for professional web scraping projects.
For simpler scraping tasks that don't require JavaScript execution, consider Crawlee's CheerioCrawler, which is much faster because it skips the browser entirely and parses static HTML. If you want the same browser-based API with support for Chromium, Firefox, and WebKit, explore PlaywrightCrawler as an alternative to PuppeteerCrawler.
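Migrating between the two is usually a small change, since both share the same crawling infrastructure; a minimal PlaywrightCrawler sketch:
import { PlaywrightCrawler, Dataset } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ page, request, enqueueLinks }) {
// Playwright's Page API differs slightly from Puppeteer's
const title = await page.textContent('h1');
await Dataset.pushData({ url: request.url, title });
await enqueueLinks({ selector: 'a[href]' });
},
});
await crawler.run(['https://example.com']);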