Where Can I Find Crawlee Documentation and Examples?
Crawlee is a powerful web scraping and browser automation library that provides comprehensive documentation and numerous examples for developers. Whether you're building a simple web scraper or a complex data extraction pipeline, understanding where to find quality resources is essential for success.
Official Crawlee Documentation
JavaScript Documentation
The primary Crawlee documentation for JavaScript/TypeScript is hosted at crawlee.dev. This official resource provides:
- API Reference: Complete documentation for all classes, methods, and interfaces
- Guides and Tutorials: Step-by-step instructions for common use cases
- Migration Guides: Help transitioning from other scraping tools
- Best Practices: Performance optimization and production deployment tips
The documentation is organized so you can follow along hands-on; start by installing Crawlee:
# Install Crawlee for JavaScript/Node.js
npm install crawlee
# or
yarn add crawlee
Python Documentation
For Python developers, Crawlee's documentation is available at crawlee.dev/python. The Python version includes:
- Installation Instructions: Setup guides for various operating systems
- Quick Start Tutorial: Get up and running in minutes
- API Documentation: Detailed Python-specific API reference
- Examples Repository: Real-world scraping scenarios
# Install Crawlee for Python
pip install crawlee
# or
poetry add crawlee
Key Documentation Sections
1. Getting Started Guide
The getting started section walks you through your first Crawlee scraper. Here's a basic example from the docs:
JavaScript Example:
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log }) {
        log.info(`Processing ${request.url}...`);

        // Extract data from the page
        const data = await page.evaluate(() => {
            return {
                title: document.querySelector('h1')?.textContent,
                description: document.querySelector('meta[name="description"]')?.content,
            };
        });

        // Save the data
        await Dataset.pushData(data);

        // Find and enqueue links
        await enqueueLinks({
            selector: 'a[href]',
            label: 'detail',
        });
    },
});

await crawler.run(['https://example.com']);
Python Example:
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main():
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}...')

        # Extract data from the page
        data = await context.page.evaluate('''() => {
            return {
                title: document.querySelector('h1')?.textContent,
                description: document.querySelector('meta[name="description"]')?.content
            };
        }''')

        # Save the data
        await context.push_data(data)

        # Find and enqueue links
        await context.enqueue_links(selector='a[href]')

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())
2. API Reference Documentation
The API reference provides exhaustive documentation for every class and method. Key classes include:
- PlaywrightCrawler: For JavaScript-heavy sites that need a real browser, driven by Playwright
- CheerioCrawler: For static HTML parsing (faster and more lightweight; see the sketch after this list)
- PuppeteerCrawler: Browser automation via Puppeteer, useful for projects already built on it
- HttpCrawler: For API scraping and simple HTTP requests
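For static pages, CheerioCrawler is usually the quickest way to start. Below is a minimal sketch of its request-handler pattern; the selector, record fields, and start URL are placeholder assumptions, not values from the official docs:

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        log.info(`Parsing ${request.url}...`);

        // $ is a Cheerio handle over the downloaded HTML (no browser involved)
        const title = $('h1').first().text();

        await Dataset.pushData({ url: request.url, title });
    },
});

// Placeholder start URL
await crawler.run(['https://example.com']);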
3. Examples and Use Cases
The documentation includes detailed examples for:
- E-commerce scraping: Product listings, prices, reviews
- Job board crawling: Structured job posting data
- News aggregation: Article extraction and monitoring
- Real estate data: Property listings and market data
- Social media monitoring: Public profile information
GitHub Repository and Examples
Official Examples Repository
Crawlee maintains an extensive examples repository on GitHub:
- JavaScript Examples: github.com/apify/crawlee/tree/master/packages/crawlee/examples
- Python Examples: github.com/apify/crawlee-python/tree/master/examples
These repositories contain production-ready code samples including:
// Advanced proxy rotation example
import { PlaywrightCrawler, ProxyConfiguration, Dataset } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8000',
        'http://proxy2.example.com:8000',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    async requestHandler({ request, page, log }) {
        // Dismiss any dialogs (alerts, confirms) that would block the page
        page.on('dialog', async (dialog) => {
            await dialog.accept();
        });

        // Wait for dynamic content
        await page.waitForSelector('.product-list', { timeout: 30000 });

        const products = await page.$$eval('.product-item', (items) => {
            return items.map((item) => ({
                name: item.querySelector('.name')?.textContent,
                price: item.querySelector('.price')?.textContent,
                url: item.querySelector('a')?.href,
            }));
        });

        await Dataset.pushData(products);
    },
    maxRequestsPerCrawl: 100,
});

await crawler.run(['https://example-shop.com/products']);
Community Resources
Discord Community
Crawlee has an active Discord community where developers share examples and get help:
- Server: discord.gg/jyEM2PRvMU
- Support channels: Ask questions and get real-time help
- Examples sharing: Community members share their scraping solutions
- Announcements: Stay updated on new features and releases
Stack Overflow
Search for questions tagged with crawlee:
# Search on Stack Overflow
[crawlee] your search query
YouTube Tutorials
The Apify YouTube channel features video tutorials covering:
- Introduction to Crawlee
- Building scrapers for specific websites
- Advanced techniques and optimization
- Handling dynamic content and AJAX requests
Apify Platform Integration
Crawlee is developed by Apify, and the Apify Platform provides additional resources:
- Apify Actors: Pre-built scraping solutions using Crawlee
- Templates: Starter projects for common scraping scenarios
- Apify SDK Documentation: Extended functionality for cloud deployment
- Video Courses: Free courses on web scraping with Crawlee
// Deploy a Crawlee scraper to Apify
import { Actor } from 'apify';
import { PlaywrightCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput();

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, log }) {
        // Your scraping logic
        const data = await page.evaluate(() => ({
            title: document.title,
            url: window.location.href,
        }));
        await Dataset.pushData(data);
    },
});

await crawler.run(input.startUrls);

await Actor.exit();
Advanced Documentation Topics
Request Queue Management
Documentation on managing request queues for large-scale scraping:
import asyncio

from crawlee.storages import RequestQueue

async def main() -> None:
    # Open (or create) the default request queue
    queue = await RequestQueue.open()

    # Add multiple URLs
    await queue.add_request('https://example.com/page1')
    await queue.add_request('https://example.com/page2')

    # Fetch the next request to process
    request = await queue.fetch_next_request()

asyncio.run(main())
Storage and Data Export
Learn about Crawlee's storage system for datasets, key-value stores, and request queues (a short usage sketch follows the list below). The documentation covers:
- Dataset exports: JSON, CSV, Excel formats
- Key-value storage: For configuration and state management
- Request queue persistence: Resumable crawls
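As a rough illustration of how these storages fit together, here is a minimal sketch using Crawlee's JavaScript storage classes; the record key, field names, and export target are example assumptions:

import { Dataset, KeyValueStore } from 'crawlee';

// Datasets collect the scraped records (append-only)
await Dataset.pushData({ title: 'Example product', price: 19.99 });

// Key-value stores hold configuration and crawl state
await KeyValueStore.setValue('CRAWL_STATE', { lastRun: new Date().toISOString() });
const state = await KeyValueStore.getValue('CRAWL_STATE');

// Export the default dataset into the default key-value store as CSV
await Dataset.exportToCSV('OUTPUT');

The storage guide in the official docs covers the remaining export formats and the on-disk layout of these storages.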
Session Management and Cookies
Documentation on handling authentication and maintaining sessions:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // The crawler creates and rotates sessions from its own pool
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        maxPoolSize: 20,
        sessionOptions: {
            maxAgeSecs: 3600,
            maxUsageCount: 50,
        },
    },
    async requestHandler({ session, page, log }) {
        // A session (and its cookies) is automatically assigned to each request
        log.info(`Using session: ${session.id}`);
    },
});
TypeScript Support
The JavaScript documentation includes comprehensive TypeScript definitions and examples:
import { PlaywrightCrawler, Dataset } from 'crawlee';

interface ProductData {
    name: string;
    price: number;
    url: string;
}

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, log }) {
        const products: ProductData[] = await page.evaluate(() => {
            return Array.from(document.querySelectorAll('.product')).map((el) => ({
                name: el.querySelector('.name')?.textContent ?? '',
                price: parseFloat(el.querySelector('.price')?.textContent ?? '0'),
                url: el.querySelector('a')?.href ?? '',
            }));
        });

        await Dataset.pushData(products);
    },
});
Keeping Up-to-Date
To stay current with Crawlee documentation updates:
- GitHub Releases: Watch the repository for release notes
- Blog Posts: Visit blog.apify.com for announcements
- Newsletter: Subscribe to Apify's developer newsletter
- Twitter/X: Follow @apify for updates
Troubleshooting and FAQ
The documentation includes a comprehensive troubleshooting section covering:
- Memory management for large crawls
- Debugging tips and logging configuration (a small configuration sketch follows this list)
- Common errors and their solutions
- Performance optimization strategies
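As a taste of the debugging tips, here is a sketch of turning on verbose logging and capping a crawl while you test; the numeric limits are arbitrary example values:

import { log, LogLevel, PlaywrightCrawler } from 'crawlee';

// Verbose logging helps when requests fail or selectors do not match
log.setLevel(LogLevel.DEBUG);

const crawler = new PlaywrightCrawler({
    // Keep test crawls small while debugging
    maxRequestsPerCrawl: 50,
    // Limit parallelism to reduce memory pressure
    maxConcurrency: 5,
    async requestHandler({ request, log }) {
        log.debug(`Visited ${request.url}`);
    },
});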
Conclusion
Crawlee provides extensive, well-maintained documentation across multiple platforms. Whether you prefer JavaScript or Python, the official documentation at crawlee.dev offers comprehensive guides, API references, and practical examples. Combined with the active GitHub repository, Discord community, and Apify Platform resources, developers have all the tools needed to build robust web scraping solutions.
Start with the official documentation's getting started guide, explore the examples repository for your specific use case, and leverage the community resources when you need help. The documentation is regularly updated with new features and best practices, making it an invaluable resource for both beginners and experienced developers.