How Do I Choose Between Different Crawler Types in Crawlee?
Crawlee offers three main crawler types—CheerioCrawler, PuppeteerCrawler, and PlaywrightCrawler—each designed for specific web scraping scenarios. Choosing the right crawler depends on factors like website complexity, JavaScript rendering requirements, performance needs, and resource constraints. This guide will help you understand the strengths and use cases of each crawler type to make an informed decision.
Understanding Crawlee's Crawler Types
CheerioCrawler: Lightweight HTML Parsing
CheerioCrawler is the fastest and most resource-efficient option in Crawlee. It uses the Cheerio library to parse static HTML content without rendering JavaScript, making it ideal for simple websites that don't rely heavily on client-side rendering.
Key characteristics:

- Speed: Extremely fast; can handle thousands of requests per minute
- Resource usage: Minimal CPU and memory consumption
- JavaScript: Does not execute JavaScript
- Best for: Static HTML websites, APIs, and server-rendered content
When to use CheerioCrawler:

- The target website serves fully rendered HTML from the server
- Content is available in the initial HTML response (no dynamic loading)
- You need maximum speed and efficiency
- You're scraping large volumes of simple pages
- The website doesn't use JavaScript for content rendering
Example:
```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        const title = $('title').text();
        const articles = $('article h2').map((i, el) => $(el).text()).get();
        console.log(`Title: ${title}`);
        console.log(`Articles found: ${articles.length}`);
        // Enqueue pagination links
        await enqueueLinks({
            selector: 'a.next-page',
        });
    },
});

await crawler.run(['https://example.com/blog']);
```
PuppeteerCrawler: Full Browser Automation with Chromium
PuppeteerCrawler uses Google's Puppeteer library to control a headless Chromium browser. It executes JavaScript, handles dynamic content, and can interact with web pages just like a real user.
Key characteristics:

- Speed: Moderate; slower than Cheerio but faster than Playwright
- Resource usage: High CPU and memory consumption (runs a full browser)
- JavaScript: Full JavaScript execution and DOM manipulation
- Browser: Chromium only
- Best for: JavaScript-heavy websites, SPAs, and sites requiring user interactions
When to use PuppeteerCrawler:

- Content is loaded dynamically via JavaScript
- You need to interact with the page (clicks, form submissions, scrolling)
- The website uses AJAX requests to fetch data
- You need to handle browser sessions or cookies (see the sketch after the example below)
- You're comfortable with Chromium-only support
Example:
```javascript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        // Wait for dynamic content to load
        await page.waitForSelector('.product-list');
        // Scroll to trigger lazy loading
        await page.evaluate(() => {
            window.scrollTo(0, document.body.scrollHeight);
        });
        // Fixed delay for lazy-loaded items (page.waitForTimeout was removed
        // in recent Puppeteer versions; waiting for a selector is more robust)
        await new Promise((resolve) => setTimeout(resolve, 2000));
        // Extract data from JavaScript-rendered content
        const products = await page.$$eval('.product-item', (items) => {
            return items.map((item) => ({
                name: item.querySelector('h3')?.textContent,
                price: item.querySelector('.price')?.textContent,
            }));
        });
        console.log(`Found ${products.length} products`);
        // Enqueue next pages
        await enqueueLinks({
            selector: 'a.pagination-link',
        });
    },
    headless: true,
});

await crawler.run(['https://example.com/products']);
```
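The session and cookie handling mentioned above needs little custom code, because Crawlee's session pool can persist cookies per session. A minimal sketch, assuming those options; the URL is illustrative:

```javascript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // Rotate sessions and keep each session's cookies between its requests
    useSessionPool: true,
    persistCookiesPerSession: true,
    async requestHandler({ request, page, session }) {
        // Cookies captured by earlier requests in this session are already applied
        console.log(`Session ${session?.id} loaded ${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://example.com/account']);
```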
PlaywrightCrawler: Cross-Browser Automation
PlaywrightCrawler uses Microsoft's Playwright library, which supports multiple browser engines (Chromium, Firefox, and WebKit). It offers the most comprehensive browser automation capabilities.
Key characteristics:

- Speed: Slowest of the three in this comparison, though in practice close to Puppeteer
- Resource usage: High CPU and memory consumption
- JavaScript: Full JavaScript execution and DOM manipulation
- Browsers: Chromium, Firefox, and WebKit (Safari)
- Best for: Cross-browser testing, complex interactions, and maximum compatibility
When to use PlaywrightCrawler:

- You need cross-browser compatibility testing (a per-engine sketch follows the example below)
- The website behaves differently across browsers
- You need WebKit/Safari-specific features
- You require advanced browser automation features
- You want to handle AJAX requests across different browsers
Example:
```javascript
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        // Handle authentication modal (example credentials)
        const loginButton = await page.$('button.login');
        if (loginButton) {
            await loginButton.click();
            await page.fill('input[name="email"]', 'user@example.com');
            await page.fill('input[name="password"]', 'password');
            await page.click('button[type="submit"]');
            // Wait for the post-login page to settle
            await page.waitForLoadState('networkidle');
        }
        // Extract data after authentication
        const data = await page.evaluate(() => {
            return {
                userInfo: document.querySelector('.user-profile')?.textContent,
                notifications: Array.from(document.querySelectorAll('.notification'))
                    .map((n) => n.textContent),
            };
        });
        console.log('User data:', data);
    },
    launchContext: {
        // The launcher is a Playwright BrowserType, not a string:
        // chromium, firefox, or webkit
        launcher: firefox,
    },
});

await crawler.run(['https://example.com/dashboard']);
```
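When the goal is the cross-browser coverage listed above, one approach is to repeat the same crawl once per engine. A sketch, assuming all three Playwright engines are installed; the per-engine uniqueKey is needed because the default request queue would otherwise treat the URL as already handled after the first run:

```javascript
import { PlaywrightCrawler } from 'crawlee';
import { chromium, firefox, webkit } from 'playwright';

// Run the same extraction once per browser engine and compare results
for (const launcher of [chromium, firefox, webkit]) {
    const crawler = new PlaywrightCrawler({
        launchContext: { launcher },
        async requestHandler({ page }) {
            console.log(`${launcher.name()}: title is "${await page.title()}"`);
        },
    });
    await crawler.run([{
        url: 'https://example.com',
        // Distinct uniqueKey per engine so the shared queue doesn't deduplicate
        uniqueKey: `example-${launcher.name()}`,
    }]);
}
```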
Decision Matrix: Choosing the Right Crawler
Performance Comparison
| Crawler Type | Speed | Memory Usage | CPU Usage | JavaScript Support |
|-------------|-------|--------------|-----------|-------------------|
| CheerioCrawler | ⚡⚡⚡⚡⚡ | 💾 | 🔥 | ❌ |
| PuppeteerCrawler | ⚡⚡⚡ | 💾💾💾💾 | 🔥🔥🔥🔥 | ✅ |
| PlaywrightCrawler | ⚡⚡ | 💾💾💾💾💾 | 🔥🔥🔥🔥🔥 | ✅ |
Use Case Decision Tree
```
Does the website require JavaScript to render content?
├─ NO → Use CheerioCrawler
│        └─ Fastest, most efficient option
│
└─ YES → Does the website work only on specific browsers?
         ├─ NO → Use PuppeteerCrawler
         │        └─ Good balance of features and performance
         │
         └─ YES → Use PlaywrightCrawler
                  └─ Cross-browser support
```
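A quick way to answer the tree's first question is to fetch the page without a browser and check whether the data you need is already in the raw HTML. A minimal probe, assuming Node 18+ for the global fetch and a hypothetical .product-item selector:

```javascript
import * as cheerio from 'cheerio';

// Fetch the raw server response; no JavaScript is executed
const response = await fetch('https://example.com/products');
const $ = cheerio.load(await response.text());

const found = $('.product-item').length;
console.log(found > 0
    ? `Found ${found} items in static HTML; CheerioCrawler should suffice`
    : 'Nothing in static HTML; content is likely rendered client-side');
```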
Crawler Types in Crawlee for Python
Crawlee for Python offers similar crawler types. Here's how to choose between them:
CheerioCrawler in Python (BeautifulSoupCrawler)
```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        title = context.soup.find('title').text
        articles = [h2.text for h2 in context.soup.select('article h2')]
        print(f'Title: {title}')
        print(f'Articles found: {len(articles)}')
        await context.enqueue_links(selector='a.next-page')

    await crawler.run(['https://example.com/blog'])


if __name__ == '__main__':
    asyncio.run(main())
```
PlaywrightCrawler in Python
Crawlee for Python has no Puppeteer-based crawler; PlaywrightCrawler covers browser automation there:
```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page
        # Wait for dynamic content
        await page.wait_for_selector('.product-list')
        # Extract data from the rendered DOM
        products = await page.query_selector_all('.product-item')
        product_data = []
        for product in products:
            name = await product.query_selector('h3')
            price = await product.query_selector('.price')
            product_data.append({
                'name': await name.text_content() if name else None,
                'price': await price.text_content() if price else None,
            })
        print(f'Found {len(product_data)} products')
        await context.enqueue_links(selector='a.pagination-link')

    await crawler.run(['https://example.com/products'])


if __name__ == '__main__':
    asyncio.run(main())
```
Advanced Considerations
Hybrid Approach
You can use multiple crawler types in the same project. Start with CheerioCrawler for initial discovery, then switch to a browser-based crawler for JavaScript-heavy pages:
```javascript
import { CheerioCrawler, PuppeteerCrawler, RequestQueue } from 'crawlee';

// A separate, named queue keeps the detail pages out of the Cheerio
// crawler's (default) queue
const detailQueue = await RequestQueue.open('product-details');

const cheerioCrawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Extract absolute links to product pages
        const productLinks = $('a.product-link')
            .map((i, el) => $(el).attr('href'))
            .get()
            .filter((href) => typeof href === 'string')
            .map((href) => new URL(href, request.loadedUrl).href);
        // Hand the JavaScript-heavy pages off to the browser crawler's queue
        await detailQueue.addRequests(
            productLinks.map((url) => ({ url, label: 'PRODUCT_DETAIL' })),
        );
    },
});

const puppeteerCrawler = new PuppeteerCrawler({
    requestQueue: detailQueue,
    async requestHandler({ request, page }) {
        // Only process product detail pages
        if (request.label === 'PRODUCT_DETAIL') {
            await page.waitForSelector('.product-details');
            const details = await page.evaluate(() => {
                return {
                    description: document.querySelector('.description')?.textContent,
                    specifications: document.querySelector('.specs')?.innerHTML,
                };
            });
            console.log('Product details:', details);
        }
    },
});

// Discover cheaply first, then render only the pages that need a browser
await cheerioCrawler.run(['https://example.com/catalog']);
await puppeteerCrawler.run();
```
Resource Management
When using browser-based crawlers, configure resource limits to prevent system overload:
```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxConcurrency: 5, // Limit concurrent browser instances
    requestHandlerTimeoutSecs: 60, // Prevent hanging requests
    navigationTimeoutSecs: 30, // Timeout for page navigation
    launchContext: {
        launchOptions: {
            headless: true,
            args: [
                '--no-sandbox',
                '--disable-setuid-sandbox',
                '--disable-dev-shm-usage', // Overcome limited resource problems
            ],
        },
    },
    async requestHandler({ request, page }) {
        // Your scraping logic
    },
});
```
Practical Recommendations
Start Simple, Scale Up
- Test with CheerioCrawler first: Always try the lightweight option to see if it works
- Upgrade to PuppeteerCrawler: If you need JavaScript rendering
- Use PlaywrightCrawler: Only when you need cross-browser support or specific browser features
Monitor and Optimize
- Track execution time: Measure how long each crawler takes
- Monitor memory usage: Browser-based crawlers can consume significant RAM
- Adjust concurrency: Lower concurrency for resource-intensive crawlers
- Use request filtering: Don't load unnecessary resources (images, CSS) when they're not needed; see the sketch below
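For that last point, browser crawlers can abort requests for resource types the extraction doesn't need. One way to do this with PlaywrightCrawler is a pre-navigation hook that installs a route filter; the blocked types below are illustrative:

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Abort images, stylesheets, fonts, and media before navigating
            await page.route('**/*', (route) => {
                const type = route.request().resourceType();
                return ['image', 'stylesheet', 'font', 'media'].includes(type)
                    ? route.abort()
                    : route.continue();
            });
        },
    ],
    async requestHandler({ page }) {
        // The DOM still builds normally; only the heavy assets are missing
        console.log(await page.title());
    },
});

await crawler.run(['https://example.com/products']);
```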
Cost Considerations
Running browser-based crawlers (Puppeteer/Playwright) costs more in terms of:

- Server resources: Higher CPU and memory requirements
- Cloud hosting: More expensive instances needed
- Execution time: Slower scraping means longer-running jobs
CheerioCrawler can be 10-50x faster and use 90% less memory than browser-based alternatives.
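To put numbers behind these trade-offs on your own workload, the statistics object that crawler.run() resolves with is a cheap starting point. A sketch; the URL is illustrative:

```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $ }) {
        $('title').text(); // minimal work; we only want timing numbers
    },
});

// run() resolves with final statistics for the whole crawl
const stats = await crawler.run(['https://example.com/blog']);
console.log(`Finished: ${stats.requestsFinished}, failed: ${stats.requestsFailed}`);
console.log(`Avg request duration: ${stats.requestAvgFinishedDurationMillis} ms`);
```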
Conclusion
Choosing the right Crawlee crawler type is crucial for building efficient web scrapers. Use CheerioCrawler for speed and efficiency with static HTML, PuppeteerCrawler for JavaScript-heavy websites with Chromium, and PlaywrightCrawler when you need cross-browser compatibility. Consider starting with the simplest option and upgrading only when necessary to balance performance with functionality.
For more advanced scenarios involving browser automation, you may also want to explore how to navigate to different pages using Puppeteer or learn about handling browser events for more complex interactions.