What is PlaywrightCrawler in Crawlee and when should I use it?
PlaywrightCrawler is one of the core crawler classes in Crawlee, a powerful web scraping and browser automation library. It leverages Playwright, a modern browser automation framework developed by Microsoft, to crawl and extract data from websites that require JavaScript rendering, complex interactions, or browser-like behavior.
Understanding PlaywrightCrawler
PlaywrightCrawler is designed for crawling web pages that cannot be scraped with simple HTTP requests. It launches real browser instances (Chromium, Firefox, or WebKit) to execute JavaScript, handle dynamic content, and interact with web pages just like a human user would.
Key Features
- Full Browser Automation: Executes JavaScript and renders dynamic content
- Multi-Browser Support: Works with Chromium, Firefox, and WebKit
- Built-in Request Management: Automatic request queuing and retry logic
- Smart Crawling: Intelligent link extraction and URL management
- Session Management: Handles cookies, local storage, and authentication
- Resource Optimization: Automatic browser instance pooling and management
- Error Handling: Robust error recovery and retry mechanisms
Basic Usage
Here's a simple example of using PlaywrightCrawler to scrape a website:
JavaScript/TypeScript

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Maximum number of pages to crawl
    maxRequestsPerCrawl: 50,

    // Request handler - called for each page
    async requestHandler({ request, page, enqueueLinks, pushData, log }) {
        log.info(`Processing: ${request.url}`);

        // Wait for content to load
        await page.waitForSelector('.product-title');

        // Extract data
        const title = await page.$eval('.product-title', (el) => el.textContent);
        const price = await page.$eval('.product-price', (el) => el.textContent);
        log.info(`Found product: ${title} - ${price}`);

        // Save data to the default dataset
        await pushData({
            url: request.url,
            title,
            price,
        });

        // Enqueue additional links
        await enqueueLinks({
            selector: 'a.product-link',
            label: 'PRODUCT',
        });
    },

    // Handle failed requests
    async failedRequestHandler({ request, log }) {
        log.error(`Request ${request.url} failed too many times`);
    },
});

// Start crawling
await crawler.run(['https://example-shop.com/products']);
```
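The example above enqueues links with the label `PRODUCT` but processes every page in a single handler. Crawlee also ships a router that dispatches requests by label, which keeps listing and detail logic separate. A minimal sketch, with illustrative selectors and URL:

```javascript
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

// Runs only for requests enqueued with label: 'PRODUCT'
router.addHandler('PRODUCT', async ({ request, page, pushData, log }) => {
    log.info(`Scraping product: ${request.url}`);
    const title = await page.locator('.product-title').textContent();
    await pushData({ url: request.url, title });
});

// Fallback for all unlabeled requests, e.g. listing pages
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ selector: 'a.product-link', label: 'PRODUCT' });
});

const crawler = new PlaywrightCrawler({ requestHandler: router });
await crawler.run(['https://example-shop.com/products']);
```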
Python

```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=50,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page
        log = context.log
        log.info(f'Processing: {context.request.url}')

        # Wait for content to load
        await page.wait_for_selector('.product-title')

        # Extract data
        title = await page.locator('.product-title').inner_text()
        price = await page.locator('.product-price').inner_text()
        log.info(f'Found product: {title} - {price}')

        # Save data
        await context.push_data({
            'url': context.request.url,
            'title': title,
            'price': price,
        })

        # Enqueue additional links
        await context.enqueue_links(
            selector='a.product-link',
            label='PRODUCT',
        )

    # Start crawling
    await crawler.run(['https://example-shop.com/products'])


if __name__ == '__main__':
    asyncio.run(main())
```
When to Use PlaywrightCrawler
Use PlaywrightCrawler When:
1. JavaScript-Rendered Content
Modern single-page applications (SPAs) built with React, Vue, Angular, or similar frameworks render content dynamically in the browser, so the data you need often never appears in the initial HTML response. For scraping them, PlaywrightCrawler is essential.
```javascript
const crawler = new PlaywrightCrawler({
    async requestHandler({ page, log }) {
        // Wait for the React/Vue app to render
        await page.waitForSelector('[data-testid="content-loaded"]');

        // Content is now fully rendered
        const data = await page.evaluate(() => {
            return window.__INITIAL_STATE__;
        });
    },
});
```
2. Complex User Interactions
When you need to click buttons, fill forms, scroll pages, or handle pop-ups and modals:
```javascript
const crawler = new PlaywrightCrawler({
    async requestHandler({ page }) {
        // Click the "Load More" button
        await page.click('button.load-more');

        // Wait for new content (a fixed timeout is fragile;
        // prefer waiting for a selector when possible)
        await page.waitForTimeout(2000);

        // Fill the search form
        await page.fill('input[name="search"]', 'laptop');
        await page.click('button[type="submit"]');

        // Wait for results
        await page.waitForSelector('.search-results');
    },
});
```
3. AJAX Requests and Dynamic Loading
Websites that load data via AJAX requests after initial page load:
```javascript
const crawler = new PlaywrightCrawler({
    async requestHandler({ page }) {
        // Wait for a specific network request to complete.
        // Caveat: if the response can arrive before the handler runs,
        // set this listener up in a preNavigationHook instead.
        await page.waitForResponse(
            (response) => response.url().includes('/api/products') && response.status() === 200,
        );

        // Extract data after the AJAX load
        const products = await page.$$eval('.product-item', (items) =>
            items.map((item) => ({
                name: item.querySelector('.name').textContent,
                price: item.querySelector('.price').textContent,
            })),
        );
    },
});
```
4. Authentication and Session Management
When you need to log in or maintain authenticated sessions:
```javascript
const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page, session }) => {
            // Log in only once per session
            if (!session.userData.loggedIn) {
                await page.goto('https://example.com/login');
                await page.fill('input[name="email"]', 'user@example.com');
                await page.fill('input[name="password"]', 'password123');
                await page.click('button[type="submit"]');
                await page.waitForSelector('.user-dashboard');
                session.userData.loggedIn = true;
            }
        },
    ],
    async requestHandler({ page }) {
        // All requests are now authenticated
        const userData = await page.$eval('.user-profile', (el) => el.textContent);
    },
});
```
5. Websites with Anti-Bot Detection
PlaywrightCrawler with Crawlee's built-in browser fingerprinting helps avoid detection:
```javascript
const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
        },
    },
    // Crawlee automatically handles fingerprinting
    useSessionPool: true,
    persistCookiesPerSession: true,
});
```
When NOT to Use PlaywrightCrawler
Use CheerioCrawler Instead When:
- Static HTML Content: If the website serves all content in the initial HTML response
- High-Speed Scraping: When you need to scrape thousands of pages quickly
- Resource Constraints: When running on limited memory or CPU
- Simple Data Extraction: When basic CSS selectors or XPath are sufficient
Example with CheerioCrawler for simple static pages:
```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        // Much faster for static content
        const title = $('h1').text();
        const links = $('a').map((i, el) => $(el).attr('href')).get();
    },
});
```
Advanced PlaywrightCrawler Configuration
Browser Selection
Choose between browser engines by passing the corresponding Playwright launcher:

```javascript
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: {
        // Use Firefox instead of the default Chromium
        launcher: firefox,
        // Or WebKit:
        // launcher: webkit, // also imported from 'playwright'
    },
});
```
Request Interception and Blocking
Optimize performance by blocking unnecessary resources:
```javascript
const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Block images, stylesheets, and fonts
            await page.route('**/*', (route) => {
                const resourceType = route.request().resourceType();
                if (['image', 'stylesheet', 'font'].includes(resourceType)) {
                    route.abort();
                } else {
                    route.continue();
                }
            });
        },
    ],
});
```
Concurrent Crawling
Control how many pages are processed in parallel:

```javascript
const crawler = new PlaywrightCrawler({
    // Start with one concurrent page and scale up to ten
    minConcurrency: 1,
    maxConcurrency: 10,
});
```
Custom Browser Context Options
Configure browser behavior:
```javascript
const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
            args: ['--no-sandbox', '--disable-setuid-sandbox'],
        },
    },
    browserPoolOptions: {
        useFingerprints: true,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                devices: ['mobile'],
                operatingSystems: ['android'],
            },
        },
    },
});
```
Performance Considerations
Memory Usage
PlaywrightCrawler is memory-intensive because it runs full browser instances. Monitor and limit resource usage:
```javascript
const crawler = new PlaywrightCrawler({
    // Limit concurrent browsers
    maxConcurrency: 3,
    browserPoolOptions: {
        // Retire each browser after it has processed this many pages
        retireBrowserAfterPageCount: 100,
    },
});
```

Note that Crawlee opens and closes pages for you, so there is no need to call `page.close()` in a hook.
Speed Optimization
```javascript
const crawler = new PlaywrightCrawler({
    // Reuse sessions and their browser state
    useSessionPool: true,
    // Disable unnecessary features
    preNavigationHooks: [
        async ({ page }) => {
            // Block images and CSS
            await page.route('**/*.{png,jpg,jpeg,gif,svg,css}', (route) => route.abort());
        },
    ],
});
```
Error Handling and Retries
PlaywrightCrawler includes built-in retry logic:
```javascript
const crawler = new PlaywrightCrawler({
    maxRequestRetries: 3,
    requestHandlerTimeoutSecs: 60, // Timeout for each request
    // Note: the error is passed as a second argument, not on the context
    async failedRequestHandler({ request, log }, error) {
        log.error(`Request ${request.url} failed: ${error.message}`);
        // Custom error handling logic
        if (error.message.includes('timeout')) {
            // Maybe increase the timeout for this URL
        }
    },
});
```
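`failedRequestHandler` fires only after all retries are exhausted. Crawlee also accepts an `errorHandler` option that runs after each failed attempt, before the retry, which is the place to react between attempts. A sketch, with session retirement shown as one possible reaction:

```javascript
const crawler = new PlaywrightCrawler({
    maxRequestRetries: 3,
    // Runs after each failed attempt, before the request is retried
    async errorHandler({ request, session, log }, error) {
        log.warning(`Attempt on ${request.url} failed: ${error.message}`);
        // e.g. rotate the session if the site appears to be blocking us
        if (error.message.includes('403')) {
            session?.retire();
        }
    },
});
```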
Comparison with Other Crawlers
| Feature | PlaywrightCrawler | CheerioCrawler | PuppeteerCrawler |
|---------|-------------------|----------------|------------------|
| JavaScript Execution | ✅ Yes | ❌ No | ✅ Yes |
| Speed | ⚡ Moderate | ⚡⚡⚡ Fast | ⚡ Moderate |
| Memory Usage | 💾💾💾 High | 💾 Low | 💾💾💾 High |
| Browser Support | Chromium, Firefox, WebKit | N/A | Chrome only |
| Best For | Modern SPAs, complex sites | Static HTML | Chrome-specific features |
Conclusion
PlaywrightCrawler is the ideal choice when you need full browser automation capabilities, JavaScript execution, or complex user interactions. It's particularly well-suited for:
- Single-page applications and modern JavaScript frameworks
- Websites with dynamic content loading
- Complex authentication flows
- Sites requiring user interactions (clicks, scrolling, form submissions)
- Situations where you need cross-browser compatibility
However, if you're scraping static HTML content or need maximum speed and efficiency, consider using CheerioCrawler instead. The choice between crawlers should be based on the specific requirements of your scraping project, balancing functionality needs with resource constraints.
For production deployments, always test your crawler configuration with realistic traffic patterns and monitor resource usage to optimize performance and costs.
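For that kind of monitoring, one convenient hook is that `crawler.run()` resolves with the crawler's final statistics, which can be logged or shipped to a metrics system. A sketch, assuming the v3 JavaScript API and its statistics field names:

```javascript
const stats = await crawler.run(['https://example-shop.com/products']);

// Final statistics: finished/failed request counts and average durations
console.log(`Finished: ${stats.requestsFinished}, failed: ${stats.requestsFailed}`);
console.log(`Avg request duration: ${stats.requestAvgFinishedDurationMillis} ms`);
```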