How do I use PuppeteerCrawler in Crawlee for browser automation?
PuppeteerCrawler is one of the most powerful crawler classes in Crawlee, designed specifically for browser automation tasks. It combines the capabilities of Puppeteer with Crawlee's robust crawling infrastructure, providing features like request queue management, automatic retries, rate limiting, and intelligent session handling.
What is PuppeteerCrawler?
PuppeteerCrawler is a specialized crawler in Crawlee that uses Puppeteer under the hood to control a headless Chrome browser. Unlike simpler HTTP-based crawlers, PuppeteerCrawler can execute JavaScript, interact with dynamic content, handle complex authentication flows, and scrape modern web applications that rely heavily on client-side rendering.
Basic PuppeteerCrawler Setup
Here's a simple example to get started with PuppeteerCrawler:
import { PuppeteerCrawler } from 'crawlee';
const crawler = new PuppeteerCrawler({
async requestHandler({ page, request, enqueueLinks }) {
console.log(`Processing: ${request.url}`);
// Wait for content to load
await page.waitForSelector('h1');
// Extract data from the page
const title = await page.$eval('h1', (el) => el.textContent);
console.log(`Title: ${title}`);
// Enqueue additional links found on the page
await enqueueLinks({
selector: 'a[href]',
label: 'detail',
});
},
maxRequestsPerCrawl: 50,
});
// Add initial URLs to crawl
await crawler.addRequests([
'https://example.com',
]);
// Start the crawler
await crawler.run();
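A note on running the example: it uses top-level await, so it must be executed as an ES module, either by naming the file with an .mjs extension or by setting "type": "module" in package.json. The label passed to enqueueLinks travels with each enqueued request and can later be used to route handling, as the advanced example below shows.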
Core Configuration Options
PuppeteerCrawler offers extensive configuration options to control browser behavior and crawling performance:
import { PuppeteerCrawler, Dataset } from 'crawlee';
const crawler = new PuppeteerCrawler({
// Maximum number of pages to crawl
maxRequestsPerCrawl: 100,
// Maximum concurrency (number of parallel browser tabs)
maxConcurrency: 5,
// Browser launch options
launchContext: {
launchOptions: {
headless: true,
args: ['--no-sandbox', '--disable-setuid-sandbox'],
},
},
// Pre-navigation hooks
preNavigationHooks: [
async ({ page, request }) => {
// Set custom headers
await page.setExtraHTTPHeaders({
'Accept-Language': 'en-US,en;q=0.9',
});
},
],
// Post-navigation hooks
postNavigationHooks: [
async ({ page }) => {
// Wait briefly after navigation (page.waitForTimeout was removed in recent Puppeteer versions)
await new Promise((resolve) => setTimeout(resolve, 2000));
},
],
// Main request handler
async requestHandler({ page, request, log }) {
log.info(`Processing ${request.url}`);
// Your scraping logic here
const data = await page.evaluate(() => {
return {
title: document.title,
bodyText: document.body.innerText,
};
});
await Dataset.pushData(data);
},
// Error handler
async failedRequestHandler({ request, log }) {
log.error(`Request ${request.url} failed too many times`);
},
});
Working with Page Interactions
PuppeteerCrawler excels at handling browser events and complex interactions:
import { PuppeteerCrawler } from 'crawlee';
const crawler = new PuppeteerCrawler({
async requestHandler({ page, request }) {
// Wait for specific elements
await page.waitForSelector('.product-list');
// Click buttons and interact with elements
await page.click('.load-more-button');
// Give the new items a moment to render (page.waitForTimeout was removed in recent Puppeteer versions)
await new Promise((resolve) => setTimeout(resolve, 1000));
// Fill forms
await page.type('#search-input', 'laptops');
// Start waiting for the navigation before clicking, so the event is not missed
await Promise.all([
page.waitForNavigation({ waitUntil: 'networkidle2' }),
page.click('#search-button'),
]);
// Scroll to load lazy-loaded content
await page.evaluate(() => {
window.scrollTo(0, document.body.scrollHeight);
});
// Take screenshots for debugging
await page.screenshot({
path: `screenshot-${Date.now()}.png`,
fullPage: true,
});
},
});
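For pages that keep loading items as you scroll, a single scrollTo call is often not enough. Recent Crawlee versions expose an infiniteScroll helper on the Puppeteer crawling context that keeps scrolling until the page stops growing; a minimal sketch, assuming your Crawlee version provides the helper:
import { PuppeteerCrawler } from 'crawlee';
const crawler = new PuppeteerCrawler({
async requestHandler({ page, infiniteScroll }) {
// Scroll until no new content appears, giving up after 30 seconds
await infiniteScroll({ timeoutSecs: 30 });
const items = await page.$$eval('.product-card', (els) => els.map((el) => el.textContent));
console.log(`Found ${items.length} items after scrolling`);
},
});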
Advanced Data Extraction
Here's a more complex example showing how to extract structured data:
import { PuppeteerCrawler, Dataset } from 'crawlee';
const crawler = new PuppeteerCrawler({
async requestHandler({ page, request, enqueueLinks }) {
const url = request.url;
if (request.label === 'LIST') {
// Extract product links from listing page
await enqueueLinks({
selector: '.product-card a',
label: 'DETAIL',
});
// Handle pagination
const nextPageExists = await page.$('.pagination .next');
if (nextPageExists) {
await enqueueLinks({
selector: '.pagination .next',
label: 'LIST',
});
}
}
if (request.label === 'DETAIL') {
// Extract detailed product information
const product = await page.evaluate(() => {
const getTextContent = (selector) => {
const element = document.querySelector(selector);
return element ? element.textContent.trim() : null;
};
return {
name: getTextContent('.product-name'),
price: getTextContent('.product-price'),
description: getTextContent('.product-description'),
images: Array.from(document.querySelectorAll('.product-image img'))
.map(img => img.src),
specifications: Array.from(document.querySelectorAll('.spec-item'))
.map(item => ({
key: item.querySelector('.spec-key')?.textContent.trim(),
value: item.querySelector('.spec-value')?.textContent.trim(),
})),
};
});
product.url = url;
product.scrapedAt = new Date().toISOString();
await Dataset.pushData(product);
}
},
});
await crawler.addRequests([
{ url: 'https://example-shop.com/products', label: 'LIST' },
]);
await crawler.run();
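As the number of labels grows, the if (request.label === ...) branching can be replaced with Crawlee's router, which maps each label to a dedicated handler. A brief sketch of the same LIST/DETAIL split:
import { PuppeteerCrawler, createPuppeteerRouter } from 'crawlee';
const router = createPuppeteerRouter();
router.addHandler('LIST', async ({ enqueueLinks }) => {
await enqueueLinks({ selector: '.product-card a', label: 'DETAIL' });
});
router.addHandler('DETAIL', async ({ page, request }) => {
// Extract and push product data here, as in the example above
});
const crawler = new PuppeteerCrawler({ requestHandler: router });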
Handling Authentication and Sessions
PuppeteerCrawler makes it easy to handle authentication:
import { PuppeteerCrawler, Dataset } from 'crawlee';
const crawler = new PuppeteerCrawler({
preNavigationHooks: [
async ({ page, request, session }) => {
// Set cookies from session
if (session?.userData?.cookies) {
await page.setCookie(...session.userData.cookies);
}
},
],
async requestHandler({ page, request, session }) {
// Check if we need to login
const isLoginPage = await page.$('#login-form');
if (isLoginPage) {
// Perform login
await page.type('#username', 'your-username');
await page.type('#password', 'your-password');
// Wait for the post-login navigation and the click together to avoid a race
await Promise.all([
page.waitForNavigation(),
page.click('#login-button'),
]);
// Save cookies to session
const cookies = await page.cookies();
session.userData.cookies = cookies;
}
// Continue with regular scraping
const data = await page.evaluate(() => ({
title: document.title,
content: document.body.innerText,
}));
await Dataset.pushData(data);
},
});
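Two practical refinements for real projects, shown as a sketch (the environment variable names and the .login-error selector are placeholders, not part of any API): read credentials from the environment instead of hardcoding them, and retire the session when a login fails so Crawlee rotates to a fresh one.
// Inside the requestHandler shown above
await page.type('#username', process.env.SHOP_USERNAME);
await page.type('#password', process.env.SHOP_PASSWORD);
// ...after attempting the login...
const loginFailed = await page.$('.login-error');
if (loginFailed) {
session.retire();
throw new Error('Login failed, retrying with a fresh session');
}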
Request Queue Management
Crawlee automatically manages the request queue, but you can control it explicitly:
import { PuppeteerCrawler, RequestQueue } from 'crawlee';
// Create or open a named request queue
const requestQueue = await RequestQueue.open('my-queue');
const crawler = new PuppeteerCrawler({
requestQueue,
async requestHandler({ page, request, crawler, enqueueLinks }) {
// Add new requests programmatically
await crawler.addRequests([
{ url: 'https://example.com/page1', label: 'PAGE' },
{ url: 'https://example.com/page2', label: 'PAGE' },
]);
// Or use the enqueueLinks helper from the handler context
await enqueueLinks({
selector: 'a.product-link',
label: 'PRODUCT',
transformRequestFunction: (req) => {
// Modify requests before adding to queue
req.userData = { category: 'electronics' };
return req;
},
});
},
});
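Crawlee deduplicates queued requests by their uniqueKey, which defaults to a normalized form of the URL, so adding the same URL twice enqueues it only once. If you deliberately need to process a URL more than once, override uniqueKey explicitly; continuing the example above:
await crawler.addRequests([
{ url: 'https://example.com/page1', uniqueKey: 'page1-first-pass' },
{ url: 'https://example.com/page1', uniqueKey: 'page1-second-pass' },
]);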
Performance Optimization
To optimize PuppeteerCrawler performance:
import { PuppeteerCrawler } from 'crawlee';
const crawler = new PuppeteerCrawler({
// Control concurrency based on system resources
maxConcurrency: 10,
minConcurrency: 2,
// Adjust request timeouts
requestHandlerTimeoutSecs: 60,
navigationTimeoutSecs: 30,
launchContext: {
launchOptions: {
headless: true,
// Reduce memory usage
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
'--disable-accelerated-2d-canvas',
'--no-first-run',
'--no-zygote',
'--disable-gpu',
],
},
// Use Chrome instead of Chromium for better performance
useChrome: true,
},
// Block unnecessary resources
preNavigationHooks: [
async ({ page }) => {
await page.setRequestInterception(true);
page.on('request', (req) => {
const resourceType = req.resourceType();
if (resourceType === 'image' || resourceType === 'stylesheet' || resourceType === 'font') {
req.abort();
} else {
req.continue();
}
});
},
],
});
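Manual interception works, but recent Crawlee versions also expose a blockRequests helper on the Puppeteer crawling context that blocks requests by URL pattern; a minimal sketch, assuming your version provides it (check the API reference for your release):
preNavigationHooks: [
async ({ blockRequests }) => {
// Assumed helper: blocks any request whose URL contains one of these patterns
await blockRequests({
urlPatterns: ['.jpg', '.png', '.css', '.woff2'],
});
},
],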
Handling Dynamic Content and AJAX
When working with AJAX requests and dynamic content:
import { PuppeteerCrawler, Dataset } from 'crawlee';
const crawler = new PuppeteerCrawler({
async requestHandler({ page, request }) {
// Wait for AJAX content to load
await page.waitForSelector('.ajax-loaded-content', {
visible: true,
timeout: 10000,
});
// Monitor network requests
const responses = [];
page.on('response', async (response) => {
const url = response.url();
if (url.includes('/api/')) {
const data = await response.json().catch(() => null);
if (data) responses.push(data);
}
});
// Trigger AJAX by clicking a button
await page.click('.load-data-button');
// Wait for network to be idle
await page.waitForNetworkIdle({ timeout: 5000 });
// Extract data rendered by AJAX
const dynamicData = await page.evaluate(() => {
return Array.from(document.querySelectorAll('.dynamic-item'))
.map(item => item.textContent.trim());
});
await Dataset.pushData({
url: request.url,
dynamicData,
apiResponses: responses,
});
},
});
Error Handling and Retries
PuppeteerCrawler includes built-in retry mechanisms:
import { PuppeteerCrawler, Dataset } from 'crawlee';
const crawler = new PuppeteerCrawler({
// Maximum retries for failed requests
maxRequestRetries: 3,
async requestHandler({ page, request, log }) {
// The crawler has already navigated to request.url before this handler runs,
// so there is no need to call page.goto() here
try {
// Your scraping logic
const data = await page.evaluate(() => ({
title: document.title,
}));
await Dataset.pushData(data);
} catch (error) {
log.error(`Error processing ${request.url}: ${error.message}`);
throw error; // Re-throw to trigger a retry
}
},
async failedRequestHandler({ request, log }) {
// This runs after all retries are exhausted
log.error(`Request failed after ${request.retryCount} retries: ${request.url}`);
// Save failed URLs for later review
await Dataset.pushData({
url: request.url,
failed: true,
error: request.errorMessages,
});
},
});
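Crawlee also accepts an errorHandler option that runs after each failed attempt, before the request is retried. It is a good place to reset state such as sessions; a brief sketch:
import { PuppeteerCrawler } from 'crawlee';
const crawler = new PuppeteerCrawler({
maxRequestRetries: 3,
// Runs between a failed attempt and the next retry
async errorHandler({ request, session, log }, error) {
log.warning(`Retrying ${request.url}: ${error.message}`);
// Retire the session so the retry starts with a fresh one
session?.retire();
},
async requestHandler({ page }) {
// Scraping logic as above
},
});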
TypeScript Support
Crawlee is written in TypeScript and ships its own type definitions, so crawler options and handler contexts are fully typed:
import { PuppeteerCrawler, Dataset } from 'crawlee';
interface ProductData {
name: string;
price: number;
url: string;
}
const crawler = new PuppeteerCrawler({
async requestHandler({ page, request }): Promise<void> {
const product: ProductData = await page.evaluate((): ProductData => {
return {
name: document.querySelector('.product-name')?.textContent?.trim() || '',
price: parseFloat(document.querySelector('.price')?.textContent?.replace(/[^0-9.]/g, '') || '0'),
url: window.location.href,
};
});
await Dataset.pushData(product);
},
});
Conclusion
PuppeteerCrawler in Crawlee provides a powerful, production-ready solution for browser automation and web scraping. It combines Puppeteer's browser control capabilities with Crawlee's robust infrastructure for queue management, request handling, and error recovery. Whether you're scraping simple websites or complex single-page applications, PuppeteerCrawler offers the flexibility and reliability needed for professional web scraping projects.
For simpler scraping tasks that don't require JavaScript execution, consider Crawlee's CheerioCrawler, which is much faster because it skips the browser entirely and parses static HTML. If you want the same browser-based API with support for Chromium, Firefox, and WebKit, explore PlaywrightCrawler as an alternative to PuppeteerCrawler.
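Migrating between the two is usually a small change, since both share the same crawling infrastructure; a minimal PlaywrightCrawler sketch:
import { PlaywrightCrawler, Dataset } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ page, request, enqueueLinks }) {
// Playwright's Page API differs slightly from Puppeteer's
const title = await page.textContent('h1');
await Dataset.pushData({ url: request.url, title });
await enqueueLinks({ selector: 'a[href]' });
},
});
await crawler.run(['https://example.com']);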