Are There Any Good Crawlee Tutorials for Beginners?
Yes, there are several excellent Crawlee tutorials and learning resources available for beginners. Whether you're new to web scraping or transitioning from other frameworks, Crawlee offers comprehensive documentation, official guides, and community resources to help you get started quickly.
Official Crawlee Documentation and Tutorials
The best place to start learning Crawlee is the official Crawlee documentation at crawlee.dev. The documentation is well-structured, beginner-friendly, and includes practical examples for both JavaScript/TypeScript and Python implementations.
Getting Started Guide
The official Getting Started guide walks you through:
- Installation and Setup: Installing Crawlee via npm or pip
- First Crawler: Building your first web scraper
- Core Concepts: Understanding crawlers, request queues, and data storage
- Best Practices: Following recommended patterns from the start
Here's a simple example from the official tutorial for JavaScript:
import { PlaywrightCrawler, Dataset } from 'crawlee';

// Create a PlaywrightCrawler instance
const crawler = new PlaywrightCrawler({
    // Handle each request with this function
    async requestHandler({ request, page, enqueueLinks, log }) {
        log.info(`Processing ${request.url}...`);
        // Extract the page title
        const title = await page.title();
        // Save data to the default dataset
        await Dataset.pushData({
            url: request.url,
            title,
        });
        // Find and enqueue all links on the page
        await enqueueLinks();
    },
    // Set maximum concurrency
    maxConcurrency: 10,
});

// Start the crawler with initial URLs
await crawler.run(['https://example.com']);
And the equivalent Python version:
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main():
    # Create a PlaywrightCrawler instance
    crawler = PlaywrightCrawler(
        # Stop the crawl after 100 requests
        max_requests_per_crawl=100,
    )

    # Define the request handler
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}...')
        # Extract the page title
        title = await context.page.title()
        # Save data to the default dataset
        await context.push_data({
            'url': context.request.url,
            'title': title,
        })
        # Find and enqueue all links on the page
        await context.enqueue_links()

    # Start the crawler with initial URLs
    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())
Step-by-Step Tutorial: Building Your First Crawler
Let's walk through a complete beginner tutorial for scraping a website with Crawlee.
Step 1: Installation
First, install Crawlee in your project:
For JavaScript/TypeScript:
npm install crawlee playwright
For Python:
pip install 'crawlee[playwright]'
playwright install
Step 2: Choose Your Crawler Type
Crawlee offers different crawler types depending on your needs:
- CheerioCrawler: Fast, lightweight, for static HTML pages
- PlaywrightCrawler: Full browser automation, handles JavaScript-rendered content
- PuppeteerCrawler: Similar to PlaywrightCrawler but uses Puppeteer
- JSDOMCrawler: Server-side JavaScript execution without a full browser
For beginners, CheerioCrawler is great for simple scraping tasks, while PlaywrightCrawler is better when you need to handle AJAX requests or interact with dynamic content.
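To see how little changes when you swap crawler types, here is a minimal sketch (not from the official tutorial) that extracts the same page title with both classes; example.com stands in for whatever site you target:

import { CheerioCrawler, PlaywrightCrawler } from 'crawlee';

// CheerioCrawler parses the downloaded HTML, so the handler receives a Cheerio object ($)
const staticCrawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        log.info(`${request.url}: ${$('title').text()}`);
    },
});

// PlaywrightCrawler drives a real browser, so the handler receives a Playwright page
const browserCrawler = new PlaywrightCrawler({
    async requestHandler({ request, page, log }) {
        log.info(`${request.url}: ${await page.title()}`);
    },
});

// Run whichever fits the site; the surrounding code stays the same
await staticCrawler.run(['https://example.com']);
await browserCrawler.run(['https://example.com']);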
Step 3: Create a Simple Scraper
Here's a practical example that scrapes product information:
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, log }) {
        log.info(`Scraping: ${request.url}`);
        // Check if this is a product page
        if (request.label === 'PRODUCT') {
            const title = $('h1.product-title').text().trim();
            const price = $('.product-price').text().trim();
            const description = $('.product-description').text().trim();
            // Save the extracted data
            await Dataset.pushData({
                url: request.url,
                title,
                price,
                description,
            });
        } else {
            // Enqueue product links
            await enqueueLinks({
                selector: 'a.product-link',
                label: 'PRODUCT',
            });
        }
    },
    maxRequestsPerCrawl: 50,
});

await crawler.run(['https://example-store.com/products']);
Step 4: Handle Pagination
Most real-world scenarios require handling pagination. Here's how:
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log }) {
        log.info(`Processing page: ${request.url}`);
        // Extract data from the current page
        const products = await page.$$eval('.product-item', (items) => {
            return items.map((item) => ({
                name: item.querySelector('.product-name')?.textContent?.trim(),
                price: item.querySelector('.product-price')?.textContent?.trim(),
            }));
        });
        await Dataset.pushData(products);
        // Enqueue the link to the next page
        await enqueueLinks({
            selector: 'a.pagination-next',
            label: 'LIST',
        });
    },
});

await crawler.run(['https://example.com/products?page=1']);
Advanced Crawlee Tutorial Topics
Once you've mastered the basics, explore these intermediate topics:
Request Queue Management
Crawlee's RequestQueue helps you manage URLs efficiently:
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open();

// Add requests with custom data
await requestQueue.addRequest({
    url: 'https://example.com/product/123',
    userData: {
        category: 'electronics',
        priority: 'high',
    },
});

const crawler = new PlaywrightCrawler({
    requestQueue,
    async requestHandler({ request, page, log }) {
        log.info(`Category: ${request.userData.category}`);
        // Process request...
    },
});

await crawler.run();
Session Management and Proxies
For scraping websites that require handling authentication or using proxies:
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8000',
        'http://proxy2.example.com:8000',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    useSessionPool: true,
    persistCookiesPerSession: true,
    async requestHandler({ request, page, session, log }) {
        log.info(`Using session: ${session.id}`);
        // Your scraping logic...
    },
});
Error Handling and Retries
Crawlee automatically retries failed requests, but you can customize this behavior:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 5,
    maxRequestsPerCrawl: 1000,
    async requestHandler({ request, page, log }) {
        try {
            await page.waitForSelector('.content', { timeout: 10000 });
            // Extract data...
        } catch (error) {
            log.error(`Failed to scrape ${request.url}: ${error.message}`);
            throw error; // This will trigger a retry
        }
    },
    async failedRequestHandler({ request, log }) {
        log.error(`Request ${request.url} failed too many times.`);
    },
});
Community Resources and Video Tutorials
Beyond the official documentation, several community resources can help beginners:
YouTube Tutorials
Search for "Crawlee tutorial" on YouTube to find video walkthroughs. Look for recent tutorials (2023 or later) to ensure they cover the latest version.
GitHub Examples
The Crawlee GitHub repository contains an examples folder with numerous real-world use cases:
- E-commerce scraping
- News aggregation
- Social media data extraction
- API integration examples
Apify Academy
Since Crawlee is developed by Apify, their Apify Academy offers free courses on web scraping fundamentals that apply directly to Crawlee.
Practical Project Ideas for Learning
The best way to learn Crawlee is through hands-on projects. Here are beginner-friendly ideas:
- Job Board Scraper: Extract job listings with titles, companies, and descriptions
- Product Price Monitor: Track prices across multiple e-commerce sites
- News Aggregator: Collect articles from various news websites
- Real Estate Listings: Scrape property information and prices
- Social Media Profile Scraper: Extract public profile information
Comparing Crawlee with Other Frameworks
As a beginner, it helps to understand when to use Crawlee versus other tools:
| Feature | Crawlee | Scrapy | Puppeteer/Playwright |
|---------|---------|--------|----------------------|
| JavaScript Support | ✅ Native | ❌ No | ✅ Native |
| Python Support | ✅ Yes | ✅ Native | ⚠️ Limited |
| Browser Automation | ✅ Built-in | ⚠️ Via plugins | ✅ Native |
| Request Queue | ✅ Advanced | ✅ Built-in | ❌ Manual |
| Auto-scaling | ✅ Yes | ⚠️ Limited | ❌ No |
| Learning Curve | 🟢 Easy | 🟡 Moderate | 🟡 Moderate |
Crawlee shines when you need both simple HTTP scraping and complex browser automation in the same framework.
Best Practices for Crawlee Beginners
Follow these tips to avoid common pitfalls:
- Start with CheerioCrawler: Use the lightest crawler that works for your use case
- Use Request Labels: Organize different page types with labels
- Implement Rate Limiting: Respect target websites with maxConcurrency and minConcurrency (see the configuration sketch after this list)
- Store Data Incrementally: Use Dataset.pushData() frequently to avoid data loss
- Test with Small Crawls: Set maxRequestsPerCrawl low during development
- Monitor Your Crawlers: Use the built-in logging to understand crawler behavior
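To tie several of these tips together, a development-time configuration might look like the sketch below; the numbers are illustrative starting points, not values recommended by the docs:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Rate limiting: keep concurrency low and bounded
    minConcurrency: 1,
    maxConcurrency: 5,
    maxRequestsPerMinute: 120,
    // Keep test crawls small while you develop
    maxRequestsPerCrawl: 25,
    async requestHandler({ request, log }) {
        // Built-in logging shows what the crawler is doing
        log.info(`Fetched ${request.url}`);
    },
});

await crawler.run(['https://example.com']);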
Troubleshooting Common Beginner Issues
Issue: Crawler Doesn't Find Elements
// Bad: Not waiting for content to load
const title = await page.$eval('.title', el => el.textContent);
// Good: Wait for element before accessing
await page.waitForSelector('.title', { timeout: 5000 });
const title = await page.$eval('.title', el => el.textContent);
Issue: Too Many Concurrent Requests
// Configure concurrency appropriately
const crawler = new PlaywrightCrawler({
    maxConcurrency: 5, // Start conservative
    minConcurrency: 1,
    maxRequestsPerMinute: 60,
});
Issue: Memory Problems with Large Crawls
// Push data incrementally instead of holding it all in memory
const dataset = await Dataset.open();
await dataset.pushData(data);
// Export periodically
await dataset.exportToJSON('output');
Next Steps After Basic Tutorials
Once you've completed beginner tutorials, explore:
- TypeScript Integration: Add type safety to your crawlers (see the sketch after this list)
- Cloud Deployment: Run crawlers on Apify platform or AWS
- Advanced Selectors: Master CSS selectors and XPath
- Custom Storage: Implement MongoDB or PostgreSQL integration
- Monitoring and Alerting: Set up crawler health monitoring
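As a taste of the TypeScript route, here is a minimal sketch (not taken from the official docs) that types the records a crawler stores; PageRecord is an arbitrary name chosen for illustration:

import { CheerioCrawler, Dataset } from 'crawlee';

// Shape of the records this crawler saves (illustrative)
type PageRecord = {
    url: string;
    title: string;
};

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        const record: PageRecord = {
            url: request.url,
            title: $('title').text().trim(),
        };
        await Dataset.pushData(record);
    },
});

await crawler.run(['https://example.com']);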
Conclusion
Crawlee offers excellent tutorials and documentation for beginners, making it one of the most accessible web scraping frameworks available. Start with the official documentation, build simple projects, and gradually explore advanced features. The combination of comprehensive guides, practical examples, and active community support makes Crawlee an ideal choice for anyone learning web scraping.
Whether you're scraping single page applications or traditional websites, Crawlee provides the tools and tutorials you need to succeed. Begin with simple CheerioCrawler examples, progress to browser automation with PlaywrightCrawler, and soon you'll be building production-ready web scrapers with confidence.