What is PlaywrightCrawler in Crawlee and when should I use it?
PlaywrightCrawler is one of the core crawler classes in Crawlee, a powerful web scraping and browser automation library. It leverages Playwright, a modern browser automation framework developed by Microsoft, to crawl and extract data from websites that require JavaScript rendering, complex interactions, or browser-like behavior.
Understanding PlaywrightCrawler
PlaywrightCrawler is designed for crawling web pages that cannot be scraped with simple HTTP requests. It launches real browser instances (Chromium, Firefox, or WebKit) to execute JavaScript, handle dynamic content, and interact with web pages just like a human user would.
Key Features
- Full Browser Automation: Executes JavaScript and renders dynamic content
- Multi-Browser Support: Works with Chromium, Firefox, and WebKit
- Built-in Request Management: Automatic request queuing and retry logic
- Smart Crawling: Intelligent link extraction and URL management
- Session Management: Handles cookies, local storage, and authentication
- Resource Optimization: Automatic browser instance pooling and management
- Error Handling: Robust error recovery and retry mechanisms
Basic Usage
Here's a simple example of using PlaywrightCrawler to scrape a website:
JavaScript/TypeScript

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Maximum number of pages to crawl
    maxRequestsPerCrawl: 50,

    // Request handler - called for each page
    async requestHandler({ request, page, enqueueLinks, pushData, log }) {
        log.info(`Processing: ${request.url}`);

        // Wait for content to load
        await page.waitForSelector('.product-title');

        // Extract data
        const title = await page.$eval('.product-title', (el) => el.textContent);
        const price = await page.$eval('.product-price', (el) => el.textContent);
        log.info(`Found product: ${title} - ${price}`);

        // Save data to the default dataset
        await pushData({
            url: request.url,
            title,
            price,
        });

        // Enqueue additional links
        await enqueueLinks({
            selector: 'a.product-link',
            label: 'PRODUCT',
        });
    },

    // Handle failed requests
    async failedRequestHandler({ request, log }) {
        log.error(`Request ${request.url} failed too many times`);
    },
});

// Start crawling
await crawler.run(['https://example-shop.com/products']);
```
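The example above enqueues links with the label `PRODUCT` but processes every page in a single handler. Crawlee also ships a router that dispatches requests by label, which keeps listing and detail logic separate. A minimal sketch, with illustrative selectors and URL:

```javascript
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

// Runs only for requests enqueued with label: 'PRODUCT'
router.addHandler('PRODUCT', async ({ request, page, pushData, log }) => {
    log.info(`Scraping product: ${request.url}`);
    const title = await page.locator('.product-title').textContent();
    await pushData({ url: request.url, title });
});

// Fallback for all unlabeled requests, e.g. listing pages
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ selector: 'a.product-link', label: 'PRODUCT' });
});

const crawler = new PlaywrightCrawler({ requestHandler: router });
await crawler.run(['https://example-shop.com/products']);
```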
Python

```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=50,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page
        log = context.log
        log.info(f'Processing: {context.request.url}')

        # Wait for content to load
        await page.wait_for_selector('.product-title')

        # Extract data
        title = await page.locator('.product-title').inner_text()
        price = await page.locator('.product-price').inner_text()
        log.info(f'Found product: {title} - {price}')

        # Save data
        await context.push_data({
            'url': context.request.url,
            'title': title,
            'price': price,
        })

        # Enqueue additional links
        await context.enqueue_links(
            selector='a.product-link',
            label='PRODUCT',
        )

    # Start crawling
    await crawler.run(['https://example-shop.com/products'])


if __name__ == '__main__':
    asyncio.run(main())
```
When to Use PlaywrightCrawler
Use PlaywrightCrawler When:
1. JavaScript-Rendered Content
Modern single-page applications (SPAs) built with React, Vue, Angular, or similar frameworks render content dynamically in the browser, so the data you need often never appears in the initial HTML response. For scraping them, PlaywrightCrawler is essential.
```javascript
const crawler = new PlaywrightCrawler({
    async requestHandler({ page, log }) {
        // Wait for the React/Vue app to render
        await page.waitForSelector('[data-testid="content-loaded"]');

        // Content is now fully rendered
        const data = await page.evaluate(() => {
            return window.__INITIAL_STATE__;
        });
    },
});
```
2. Complex User Interactions
When you need to click buttons, fill forms, scroll pages, or handle pop-ups and modals:
```javascript
const crawler = new PlaywrightCrawler({
    async requestHandler({ page }) {
        // Click the "Load More" button
        await page.click('button.load-more');

        // Wait for new content (a fixed timeout is fragile;
        // prefer waiting for a selector when possible)
        await page.waitForTimeout(2000);

        // Fill the search form
        await page.fill('input[name="search"]', 'laptop');
        await page.click('button[type="submit"]');

        // Wait for results
        await page.waitForSelector('.search-results');
    },
});
```
3. AJAX Requests and Dynamic Loading
Websites that load data via AJAX requests after initial page load:
```javascript
const crawler = new PlaywrightCrawler({
    async requestHandler({ page }) {
        // Wait for a specific network request to complete.
        // Caveat: if the response can arrive before the handler runs,
        // set this listener up in a preNavigationHook instead.
        await page.waitForResponse(
            (response) => response.url().includes('/api/products') && response.status() === 200,
        );

        // Extract data after the AJAX load
        const products = await page.$$eval('.product-item', (items) =>
            items.map((item) => ({
                name: item.querySelector('.name').textContent,
                price: item.querySelector('.price').textContent,
            })),
        );
    },
});
```
4. Authentication and Session Management
When you need to log in or maintain authenticated sessions:
```javascript
const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page, session }) => {
            // Log in only once per session
            if (!session.userData.loggedIn) {
                await page.goto('https://example.com/login');
                await page.fill('input[name="email"]', 'user@example.com');
                await page.fill('input[name="password"]', 'password123');
                await page.click('button[type="submit"]');
                await page.waitForSelector('.user-dashboard');
                session.userData.loggedIn = true;
            }
        },
    ],
    async requestHandler({ page }) {
        // All requests are now authenticated
        const userData = await page.$eval('.user-profile', (el) => el.textContent);
    },
});
```
5. Websites with Anti-Bot Detection
PlaywrightCrawler with Crawlee's built-in browser fingerprinting helps avoid detection:
```javascript
const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
        },
    },
    // Crawlee automatically handles fingerprinting
    useSessionPool: true,
    persistCookiesPerSession: true,
});
```
When NOT to Use PlaywrightCrawler
Use CheerioCrawler Instead When:
- Static HTML Content: If the website serves all content in the initial HTML response
- High-Speed Scraping: When you need to scrape thousands of pages quickly
- Resource Constraints: When running on limited memory or CPU
- Simple Data Extraction: When basic CSS selectors or XPath are sufficient
Example with CheerioCrawler for simple static pages:
```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        // Much faster for static content
        const title = $('h1').text();
        const links = $('a').map((i, el) => $(el).attr('href')).get();
    },
});
```
Advanced PlaywrightCrawler Configuration
Browser Selection
Choose between browser engines by passing the corresponding Playwright launcher:

```javascript
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: {
        // Use Firefox instead of the default Chromium
        launcher: firefox,
        // Or WebKit:
        // launcher: webkit, // also imported from 'playwright'
    },
});
```
Request Interception and Blocking
Optimize performance by blocking unnecessary resources:
```javascript
const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Block images, stylesheets, and fonts
            await page.route('**/*', (route) => {
                const resourceType = route.request().resourceType();
                if (['image', 'stylesheet', 'font'].includes(resourceType)) {
                    route.abort();
                } else {
                    route.continue();
                }
            });
        },
    ],
});
```
Concurrent Crawling
Control how many pages are processed in parallel:

```javascript
const crawler = new PlaywrightCrawler({
    // Start with one concurrent page and scale up to ten
    minConcurrency: 1,
    maxConcurrency: 10,
});
```
Custom Browser Context Options
Configure browser behavior:
```javascript
const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
            args: ['--no-sandbox', '--disable-setuid-sandbox'],
        },
    },
    browserPoolOptions: {
        useFingerprints: true,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                devices: ['mobile'],
                operatingSystems: ['android'],
            },
        },
    },
});
```
Performance Considerations
Memory Usage
PlaywrightCrawler is memory-intensive because it runs full browser instances. Monitor and limit resource usage:
```javascript
const crawler = new PlaywrightCrawler({
    // Limit concurrent browsers
    maxConcurrency: 3,
    browserPoolOptions: {
        // Retire each browser after it has processed this many pages
        retireBrowserAfterPageCount: 100,
    },
});
```

Note that Crawlee opens and closes pages for you, so there is no need to call `page.close()` in a hook.
Speed Optimization
```javascript
const crawler = new PlaywrightCrawler({
    // Reuse sessions and their browser state
    useSessionPool: true,
    // Disable unnecessary features
    preNavigationHooks: [
        async ({ page }) => {
            // Block images and CSS
            await page.route('**/*.{png,jpg,jpeg,gif,svg,css}', (route) => route.abort());
        },
    ],
});
```
Error Handling and Retries
PlaywrightCrawler includes built-in retry logic:
```javascript
const crawler = new PlaywrightCrawler({
    maxRequestRetries: 3,
    requestHandlerTimeoutSecs: 60, // Timeout for each request
    // Note: the error is passed as a second argument, not on the context
    async failedRequestHandler({ request, log }, error) {
        log.error(`Request ${request.url} failed: ${error.message}`);
        // Custom error handling logic
        if (error.message.includes('timeout')) {
            // Maybe increase the timeout for this URL
        }
    },
});
```
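`failedRequestHandler` fires only after all retries are exhausted. Crawlee also accepts an `errorHandler` option that runs after each failed attempt, before the retry, which is the place to react between attempts. A sketch, with session retirement shown as one possible reaction:

```javascript
const crawler = new PlaywrightCrawler({
    maxRequestRetries: 3,
    // Runs after each failed attempt, before the request is retried
    async errorHandler({ request, session, log }, error) {
        log.warning(`Attempt on ${request.url} failed: ${error.message}`);
        // e.g. rotate the session if the site appears to be blocking us
        if (error.message.includes('403')) {
            session?.retire();
        }
    },
});
```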
Comparison with Other Crawlers
| Feature | PlaywrightCrawler | CheerioCrawler | PuppeteerCrawler |
|---------|-------------------|----------------|------------------|
| JavaScript Execution | ✅ Yes | ❌ No | ✅ Yes |
| Speed | ⚡ Moderate | ⚡⚡⚡ Fast | ⚡ Moderate |
| Memory Usage | 💾💾💾 High | 💾 Low | 💾💾💾 High |
| Browser Support | Chromium, Firefox, WebKit | N/A | Chrome only |
| Best For | Modern SPAs, complex sites | Static HTML | Chrome-specific features |
Conclusion
PlaywrightCrawler is the ideal choice when you need full browser automation capabilities, JavaScript execution, or complex user interactions. It's particularly well-suited for:
- Single-page applications and modern JavaScript frameworks
- Websites with dynamic content loading
- Complex authentication flows
- Sites requiring user interactions (clicks, scrolling, form submissions)
- Situations where you need cross-browser compatibility
However, if you're scraping static HTML content or need maximum speed and efficiency, consider using CheerioCrawler instead. The choice between crawlers should be based on the specific requirements of your scraping project, balancing functionality needs with resource constraints.
For production deployments, always test your crawler configuration with realistic traffic patterns and monitor resource usage to optimize performance and costs.
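For that kind of monitoring, one convenient hook is that `crawler.run()` resolves with the crawler's final statistics, which can be logged or shipped to a metrics system. A sketch, assuming the v3 JavaScript API and its statistics field names:

```javascript
const stats = await crawler.run(['https://example-shop.com/products']);

// Final statistics: finished/failed request counts and average durations
console.log(`Finished: ${stats.requestsFinished}, failed: ${stats.requestsFailed}`);
console.log(`Avg request duration: ${stats.requestAvgFinishedDurationMillis} ms`);
```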