Are There Any Good Crawlee Tutorials for Beginners?
Yes, there are several excellent Crawlee tutorials and learning resources available for beginners. Whether you're new to web scraping or transitioning from other frameworks, Crawlee offers comprehensive documentation, official guides, and community resources to help you get started quickly.
Official Crawlee Documentation and Tutorials
The best place to start learning Crawlee is the official Crawlee documentation at crawlee.dev. The documentation is well-structured, beginner-friendly, and includes practical examples for both JavaScript/TypeScript and Python implementations.
Getting Started Guide
The official Getting Started guide walks you through:
- Installation and Setup: Installing Crawlee via npm or pip
- First Crawler: Building your first web scraper
- Core Concepts: Understanding crawlers, request queues, and data storage
- Best Practices: Following recommended patterns from the start
Here's a simple example from the official tutorial for JavaScript:
import { PlaywrightCrawler, Dataset } from 'crawlee';

// Create a PlaywrightCrawler instance
const crawler = new PlaywrightCrawler({
    // Handle each request with this function
    async requestHandler({ request, page, enqueueLinks, log }) {
        log.info(`Processing ${request.url}...`);
        // Extract the page title
        const title = await page.title();
        // Save data to the default dataset
        await Dataset.pushData({
            url: request.url,
            title,
        });
        // Find and enqueue all links on the page
        await enqueueLinks();
    },
    // Set maximum concurrency
    maxConcurrency: 10,
});

// Start the crawler with initial URLs
await crawler.run(['https://example.com']);
And the equivalent Python version:
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main():
    # Create a PlaywrightCrawler instance
    crawler = PlaywrightCrawler(
        # Stop the crawl after 100 requests
        max_requests_per_crawl=100,
    )

    # Define the request handler
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}...')
        # Extract the page title
        title = await context.page.title()
        # Save data to the default dataset
        await context.push_data({
            'url': context.request.url,
            'title': title,
        })
        # Find and enqueue all links on the page
        await context.enqueue_links()

    # Start the crawler with initial URLs
    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())
Step-by-Step Tutorial: Building Your First Crawler
Let's walk through a complete beginner tutorial for scraping a website with Crawlee.
Step 1: Installation
First, install Crawlee in your project:
For JavaScript/TypeScript:
npm install crawlee playwright
For Python:
pip install 'crawlee[playwright]'
playwright install
Step 2: Choose Your Crawler Type
Crawlee offers different crawler types depending on your needs:
- CheerioCrawler: Fast, lightweight, for static HTML pages
- PlaywrightCrawler: Full browser automation, handles JavaScript-rendered content
- PuppeteerCrawler: Similar to PlaywrightCrawler but uses Puppeteer
- JSDOMCrawler: Server-side JavaScript execution without a full browser
For beginners, CheerioCrawler is great for simple scraping tasks, while PlaywrightCrawler is better when you need to handle AJAX requests or interact with dynamic content.
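To see how little changes when you swap crawler types, here is a minimal sketch (not from the official tutorial) that extracts the same page title with both classes; example.com stands in for whatever site you target:

import { CheerioCrawler, PlaywrightCrawler } from 'crawlee';

// CheerioCrawler parses the downloaded HTML, so the handler receives a Cheerio object ($)
const staticCrawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        log.info(`${request.url}: ${$('title').text()}`);
    },
});

// PlaywrightCrawler drives a real browser, so the handler receives a Playwright page
const browserCrawler = new PlaywrightCrawler({
    async requestHandler({ request, page, log }) {
        log.info(`${request.url}: ${await page.title()}`);
    },
});

// Run whichever fits the site; the surrounding code stays the same
await staticCrawler.run(['https://example.com']);
await browserCrawler.run(['https://example.com']);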
Step 3: Create a Simple Scraper
Here's a practical example that scrapes product information:
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, log }) {
        log.info(`Scraping: ${request.url}`);
        // Check if this is a product page
        if (request.label === 'PRODUCT') {
            const title = $('h1.product-title').text().trim();
            const price = $('.product-price').text().trim();
            const description = $('.product-description').text().trim();
            // Save the extracted data
            await Dataset.pushData({
                url: request.url,
                title,
                price,
                description,
            });
        } else {
            // Enqueue product links
            await enqueueLinks({
                selector: 'a.product-link',
                label: 'PRODUCT',
            });
        }
    },
    maxRequestsPerCrawl: 50,
});

await crawler.run(['https://example-store.com/products']);
Step 4: Handle Pagination
Most real-world scenarios require handling pagination. Here's how:
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log }) {
        log.info(`Processing page: ${request.url}`);
        // Extract data from the current page
        const products = await page.$$eval('.product-item', (items) => {
            return items.map((item) => ({
                name: item.querySelector('.product-name')?.textContent?.trim(),
                price: item.querySelector('.product-price')?.textContent?.trim(),
            }));
        });
        await Dataset.pushData(products);
        // Enqueue the link to the next page
        await enqueueLinks({
            selector: 'a.pagination-next',
            label: 'LIST',
        });
    },
});

await crawler.run(['https://example.com/products?page=1']);
Advanced Crawlee Tutorial Topics
Once you've mastered the basics, explore these intermediate topics:
Request Queue Management
Crawlee's RequestQueue helps you manage URLs efficiently:
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open();

// Add requests with custom data
await requestQueue.addRequest({
    url: 'https://example.com/product/123',
    userData: {
        category: 'electronics',
        priority: 'high',
    },
});

const crawler = new PlaywrightCrawler({
    requestQueue,
    async requestHandler({ request, page, log }) {
        log.info(`Category: ${request.userData.category}`);
        // Process request...
    },
});

await crawler.run();
Session Management and Proxies
For scraping websites that require handling authentication or using proxies:
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8000',
        'http://proxy2.example.com:8000',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    useSessionPool: true,
    persistCookiesPerSession: true,
    async requestHandler({ request, page, session, log }) {
        log.info(`Using session: ${session.id}`);
        // Your scraping logic...
    },
});
Error Handling and Retries
Crawlee automatically retries failed requests, but you can customize this behavior:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 5,
    maxRequestsPerCrawl: 1000,
    async requestHandler({ request, page, log }) {
        try {
            await page.waitForSelector('.content', { timeout: 10000 });
            // Extract data...
        } catch (error) {
            log.error(`Failed to scrape ${request.url}: ${error.message}`);
            throw error; // This will trigger a retry
        }
    },
    async failedRequestHandler({ request, log }) {
        log.error(`Request ${request.url} failed too many times.`);
    },
});
Community Resources and Video Tutorials
Beyond the official documentation, several community resources can help beginners:
YouTube Tutorials
Search for "Crawlee tutorial" on YouTube to find video walkthroughs. Look for recent tutorials (2023 or later) to ensure they cover the latest version.
GitHub Examples
The Crawlee GitHub repository contains an examples folder with numerous real-world use cases:
- E-commerce scraping
- News aggregation
- Social media data extraction
- API integration examples
Apify Academy
Since Crawlee is developed by Apify, their Apify Academy offers free courses on web scraping fundamentals that apply directly to Crawlee.
Practical Project Ideas for Learning
The best way to learn Crawlee is through hands-on projects. Here are beginner-friendly ideas:
- Job Board Scraper: Extract job listings with titles, companies, and descriptions
- Product Price Monitor: Track prices across multiple e-commerce sites
- News Aggregator: Collect articles from various news websites
- Real Estate Listings: Scrape property information and prices
- Social Media Profile Scraper: Extract public profile information
Comparing Crawlee with Other Frameworks
As a beginner, it helps to understand when to use Crawlee versus other tools:
| Feature | Crawlee | Scrapy | Puppeteer/Playwright |
|---------|---------|--------|----------------------|
| JavaScript Support | ✅ Native | ❌ No | ✅ Native |
| Python Support | ✅ Yes | ✅ Native | ⚠️ Limited |
| Browser Automation | ✅ Built-in | ⚠️ Via plugins | ✅ Native |
| Request Queue | ✅ Advanced | ✅ Built-in | ❌ Manual |
| Auto-scaling | ✅ Yes | ⚠️ Limited | ❌ No |
| Learning Curve | 🟢 Easy | 🟡 Moderate | 🟡 Moderate |
Crawlee shines when you need both simple HTTP scraping and complex browser automation in the same framework.
Best Practices for Crawlee Beginners
Follow these tips to avoid common pitfalls:
- Start with CheerioCrawler: Use the lightest crawler that works for your use case
- Use Request Labels: Organize different page types with labels
- Implement Rate Limiting: Respect target websites with maxConcurrency and minConcurrency (see the configuration sketch after this list)
- Store Data Incrementally: Use Dataset.pushData() frequently to avoid data loss
- Test with Small Crawls: Set maxRequestsPerCrawl low during development
- Monitor Your Crawlers: Use the built-in logging to understand crawler behavior
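To tie several of these tips together, a development-time configuration might look like the sketch below; the numbers are illustrative starting points, not values recommended by the docs:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Rate limiting: keep concurrency low and bounded
    minConcurrency: 1,
    maxConcurrency: 5,
    maxRequestsPerMinute: 120,
    // Keep test crawls small while you develop
    maxRequestsPerCrawl: 25,
    async requestHandler({ request, log }) {
        // Built-in logging shows what the crawler is doing
        log.info(`Fetched ${request.url}`);
    },
});

await crawler.run(['https://example.com']);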
Troubleshooting Common Beginner Issues
Issue: Crawler Doesn't Find Elements
// Bad: Not waiting for content to load
const title = await page.$eval('.title', el => el.textContent);
// Good: Wait for element before accessing
await page.waitForSelector('.title', { timeout: 5000 });
const title = await page.$eval('.title', el => el.textContent);
Issue: Too Many Concurrent Requests
// Configure concurrency appropriately
const crawler = new PlaywrightCrawler({
    maxConcurrency: 5, // Start conservative
    minConcurrency: 1,
    maxRequestsPerMinute: 60,
});
Issue: Memory Problems with Large Crawls
// Push data incrementally instead of holding it all in memory
const dataset = await Dataset.open();
await dataset.pushData(data);
// Export periodically
await dataset.exportToJSON('output');
Next Steps After Basic Tutorials
Once you've completed beginner tutorials, explore:
- TypeScript Integration: Add type safety to your crawlers (see the sketch after this list)
- Cloud Deployment: Run crawlers on Apify platform or AWS
- Advanced Selectors: Master CSS selectors and XPath
- Custom Storage: Implement MongoDB or PostgreSQL integration
- Monitoring and Alerting: Set up crawler health monitoring
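As a taste of the TypeScript route, here is a minimal sketch (not taken from the official docs) that types the records a crawler stores; PageRecord is an arbitrary name chosen for illustration:

import { CheerioCrawler, Dataset } from 'crawlee';

// Shape of the records this crawler saves (illustrative)
type PageRecord = {
    url: string;
    title: string;
};

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        const record: PageRecord = {
            url: request.url,
            title: $('title').text().trim(),
        };
        await Dataset.pushData(record);
    },
});

await crawler.run(['https://example.com']);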
Conclusion
Crawlee offers excellent tutorials and documentation for beginners, making it one of the most accessible web scraping frameworks available. Start with the official documentation, build simple projects, and gradually explore advanced features. The combination of comprehensive guides, practical examples, and active community support makes Crawlee an ideal choice for anyone learning web scraping.
Whether you're scraping single page applications or traditional websites, Crawlee provides the tools and tutorials you need to succeed. Begin with simple CheerioCrawler examples, progress to browser automation with PlaywrightCrawler, and soon you'll be building production-ready web scrapers with confidence.