What is PlaywrightCrawler in Crawlee and when should I use it?

PlaywrightCrawler is one of the core crawler classes in Crawlee, a powerful web scraping and browser automation library. It leverages Playwright, a modern browser automation framework developed by Microsoft, to crawl and extract data from websites that require JavaScript rendering, complex interactions, or browser-like behavior.

Understanding PlaywrightCrawler

PlaywrightCrawler is designed for crawling web pages that cannot be scraped with simple HTTP requests. It launches real browser instances (Chromium, Firefox, or WebKit) to execute JavaScript, handle dynamic content, and interact with web pages just like a human user would.

Key Features

  1. Full Browser Automation: Executes JavaScript and renders dynamic content
  2. Multi-Browser Support: Works with Chromium, Firefox, and WebKit
  3. Built-in Request Management: Automatic request queuing and retry logic
  4. Smart Crawling: Intelligent link extraction and URL management
  5. Session Management: Handles cookies, local storage, and authentication
  6. Resource Optimization: Automatic browser instance pooling and management
  7. Error Handling: Robust error recovery and retry mechanisms

Basic Usage

Here's a simple example of using PlaywrightCrawler to scrape a website:

JavaScript/TypeScript

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Maximum number of pages to crawl
    maxRequestsPerCrawl: 50,

    // Request handler - called for each page
    async requestHandler({ request, page, enqueueLinks, pushData, log }) {
        log.info(`Processing: ${request.url}`);

        // Wait for content to load
        await page.waitForSelector('.product-title');

        // Extract data
        const title = await page.$eval('.product-title', el => el.textContent);
        const price = await page.$eval('.product-price', el => el.textContent);

        log.info(`Found product: ${title} - ${price}`);

        // Save data
        await pushData({
            url: request.url,
            title,
            price,
        });

        // Enqueue additional links
        await enqueueLinks({
            selector: 'a.product-link',
            label: 'PRODUCT',
        });
    },

    // Handle failed requests
    async failedRequestHandler({ request, log }) {
        log.error(`Request ${request.url} failed too many times`);
    },
});

// Start crawling
await crawler.run(['https://example-shop.com/products']);

Python

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=50,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page
        log = context.log

        log.info(f'Processing: {context.request.url}')

        # Wait for content to load
        await page.wait_for_selector('.product-title')

        # Extract data
        title = await page.locator('.product-title').inner_text()
        price = await page.locator('.product-price').inner_text()

        log.info(f'Found product: {title} - {price}')

        # Save data
        await context.push_data({
            'url': context.request.url,
            'title': title,
            'price': price,
        })

        # Enqueue additional links
        await context.enqueue_links(
            selector='a.product-link',
            label='PRODUCT',
        )

    # Start crawling
    await crawler.run(['https://example-shop.com/products'])


if __name__ == '__main__':
    asyncio.run(main())

When to Use PlaywrightCrawler

Use PlaywrightCrawler When:

1. JavaScript-Rendered Content

Modern single-page applications (SPAs) built with React, Vue, Angular, or other frameworks render their content dynamically in the browser. If you need to scrape such applications, PlaywrightCrawler is essential.

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, log }) {
        // Wait for React/Vue app to render
        await page.waitForSelector('[data-testid="content-loaded"]');

        // Content is now fully rendered; read app state if the page exposes it
        // (window.__INITIAL_STATE__ is app-specific and may not be present)
        const data = await page.evaluate(() => {
            return window.__INITIAL_STATE__;
        });
    },
});

2. Complex User Interactions

When you need to click buttons, fill forms, scroll pages, or handle pop-ups and modals:

const crawler = new PlaywrightCrawler({
    async requestHandler({ page }) {
        // Click "Load More" button
        await page.click('button.load-more');

        // Wait for new content (a fixed delay; prefer waiting
        // for a selector over a timeout where possible)
        await page.waitForTimeout(2000);

        // Fill search form
        await page.fill('input[name="search"]', 'laptop');
        await page.click('button[type="submit"]');

        // Wait for results
        await page.waitForSelector('.search-results');
    },
});

3. AJAX Requests and Dynamic Loading

Websites that load data via AJAX requests after initial page load:

const crawler = new PlaywrightCrawler({
    async requestHandler({ page }) {
        // Wait for specific network request to complete
        await page.waitForResponse(
            response => response.url().includes('/api/products') && response.status() === 200
        );

        // Extract data after AJAX load
        const products = await page.$$eval('.product-item', items => {
            return items.map(item => ({
                name: item.querySelector('.name').textContent,
                price: item.querySelector('.price').textContent,
            }));
        });
    },
});

4. Authentication and Session Management

When you need to log in or maintain authenticated sessions:

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page, session }) => {
            // Login only once per session
            if (!session.userData.loggedIn) {
                await page.goto('https://example.com/login');
                await page.fill('input[name="email"]', 'user@example.com');
                await page.fill('input[name="password"]', 'password123');
                await page.click('button[type="submit"]');
                await page.waitForSelector('.user-dashboard');

                session.userData.loggedIn = true;
            }
        },
    ],
    async requestHandler({ page }) {
        // Now all requests are authenticated
        const userData = await page.$eval('.user-profile', el => el.textContent);
    },
});

5. Websites with Anti-Bot Detection

PlaywrightCrawler with Crawlee's built-in browser fingerprinting helps avoid detection:

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
        },
    },
    // Crawlee automatically handles fingerprinting
    useSessionPool: true,
    persistCookiesPerSession: true,
});

When NOT to Use PlaywrightCrawler

Use CheerioCrawler Instead When:

  1. Static HTML Content: If the website serves all content in the initial HTML response
  2. High-Speed Scraping: When you need to scrape thousands of pages quickly
  3. Resource Constraints: When running on limited memory or CPU
  4. Simple Data Extraction: When basic CSS selectors or XPath are sufficient

Example with CheerioCrawler for simple static pages:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        // Much faster for static content
        const title = $('h1').text();
        const links = $('a').map((i, el) => $(el).attr('href')).get();
    },
});

Advanced PlaywrightCrawler Configuration

Browser Selection

Choose between different browser engines:

import { PlaywrightCrawler } from 'crawlee';
import { firefox, webkit } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: {
        // Use Firefox instead of the default Chromium
        launcher: firefox,
        // Or WebKit:
        // launcher: webkit,
    },
});

Request Interception and Blocking

Optimize performance by blocking unnecessary resources:

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Block images, stylesheets, and fonts
            await page.route('**/*', (route) => {
                const resourceType = route.request().resourceType();
                if (['image', 'stylesheet', 'font'].includes(resourceType)) {
                    route.abort();
                } else {
                    route.continue();
                }
            });
        },
    ],
});

Concurrent Crawling

Control how many pages are processed in parallel:

const crawler = new PlaywrightCrawler({
    // Minimum and maximum number of requests processed in parallel;
    // the autoscaled pool scales between these bounds based on system load
    minConcurrency: 1,
    maxConcurrency: 10,
});

Custom Browser Context Options

Configure browser behavior:

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
            args: ['--no-sandbox', '--disable-setuid-sandbox'],
        },
    },
    browserPoolOptions: {
        useFingerprints: true,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                devices: ['mobile'],
                operatingSystems: ['android'],
            },
        },
    },
});

Performance Considerations

Memory Usage

PlaywrightCrawler is memory-intensive because it runs full browser instances. Monitor and limit resource usage:

const crawler = new PlaywrightCrawler({
    maxConcurrency: 3, // Limit concurrent browsers

    browserPoolOptions: {
        // Retire a browser after it has served this many pages to
        // release memory that long-lived instances accumulate
        retireBrowserAfterPageCount: 100,
    },
});

Speed Optimization

const crawler = new PlaywrightCrawler({
    // Reuse browser contexts
    useSessionPool: true,

    // Disable unnecessary features
    preNavigationHooks: [
        async ({ page }) => {
            // Disable images and CSS
            await page.route('**/*.{png,jpg,jpeg,gif,svg,css}', route => route.abort());
        },
    ],
});

Error Handling and Retries

PlaywrightCrawler includes built-in retry logic:

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 3,

    // The error that exhausted the retries is passed as the second argument
    async failedRequestHandler({ request, log }, error) {
        log.error(`Request ${request.url} failed: ${error.message}`);

        // Custom error handling logic
        if (error.message.includes('timeout')) {
            // Maybe increase timeout for this URL
        }
    },

    requestHandlerTimeoutSecs: 60, // Timeout for each request
});

Comparison with Other Crawlers

| Feature | PlaywrightCrawler | CheerioCrawler | PuppeteerCrawler |
|---------|-------------------|----------------|------------------|
| JavaScript Execution | ✅ Yes | ❌ No | ✅ Yes |
| Speed | ⚡ Moderate | ⚡⚡⚡ Fast | ⚡ Moderate |
| Memory Usage | 💾💾💾 High | 💾 Low | 💾💾💾 High |
| Browser Support | Chromium, Firefox, WebKit | N/A | Chrome only |
| Best For | Modern SPAs, complex sites | Static HTML | Chrome-specific features |

Conclusion

PlaywrightCrawler is the ideal choice when you need full browser automation capabilities, JavaScript execution, or complex user interactions. It's particularly well-suited for:

  • Single-page applications and modern JavaScript frameworks
  • Websites with dynamic content loading
  • Complex authentication flows
  • Sites requiring user interactions (clicks, scrolling, form submissions)
  • Situations where you need cross-browser compatibility

However, if you're scraping static HTML content or need maximum speed and efficiency, consider using CheerioCrawler instead. The choice between crawlers should be based on the specific requirements of your scraping project, balancing functionality needs with resource constraints.

For production deployments, always test your crawler configuration with realistic traffic patterns and monitor resource usage to optimize performance and costs.
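As a starting point for such monitoring, crawler.run() resolves with aggregate statistics once the crawl finishes. A minimal sketch (the option values are illustrative, not recommendations):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxConcurrency: 5,          // illustrative ceiling; tune for your hardware
    maxRequestsPerMinute: 120,  // throttle to match your target traffic pattern
    async requestHandler({ request, log }) {
        log.info(`Processing: ${request.url}`);
    },
});

// The returned statistics include finished/failed request counts,
// retry totals, and timing data useful for capacity planning
const stats = await crawler.run(['https://example.com']);
console.log(stats);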

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
