Are There Any Good Crawlee Tutorials for Beginners?

Yes, there are several excellent Crawlee tutorials and learning resources available for beginners. Whether you're new to web scraping or transitioning from other frameworks, Crawlee offers comprehensive documentation, official guides, and community resources to help you get started quickly.

Official Crawlee Documentation and Tutorials

The best place to start learning Crawlee is the official Crawlee documentation at crawlee.dev. The documentation is well-structured, beginner-friendly, and includes practical examples for both JavaScript/TypeScript and Python implementations.

Getting Started Guide

The official Getting Started guide walks you through:

  1. Installation and Setup: Installing Crawlee via npm or pip
  2. First Crawler: Building your first web scraper
  3. Core Concepts: Understanding crawlers, request queues, and data storage
  4. Best Practices: Following recommended patterns from the start

Here's a simple example from the official tutorial for JavaScript:

import { PlaywrightCrawler, Dataset } from 'crawlee';

// Create a PlaywrightCrawler instance
const crawler = new PlaywrightCrawler({
    // Handle each request with this function
    async requestHandler({ request, page, enqueueLinks, log }) {
        log.info(`Processing ${request.url}...`);

        // Extract the page title
        const title = await page.title();

        // Save data to the default dataset
        await Dataset.pushData({
            url: request.url,
            title,
        });

        // Find and enqueue all links on the page
        await enqueueLinks();
    },
    // Set maximum concurrency
    maxConcurrency: 10,
});

// Start the crawler with initial URLs
await crawler.run(['https://example.com']);

And the equivalent Python version:

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main():
    # Create a PlaywrightCrawler instance
    crawler = PlaywrightCrawler(
        # Limit the total number of requests for this crawl
        max_requests_per_crawl=100,
    )

    # Define the request handler
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}...')

        # Extract the page title
        title = await context.page.title()

        # Save data to the default dataset
        await context.push_data({
            'url': context.request.url,
            'title': title,
        })

        # Find and enqueue all links on the page
        await context.enqueue_links()

    # Start the crawler with initial URLs
    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())

Step-by-Step Tutorial: Building Your First Crawler

Let's walk through a complete beginner tutorial for scraping a website with Crawlee.

Step 1: Installation

First, install Crawlee in your project:

For JavaScript/TypeScript:

npm install crawlee playwright
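
If you prefer starting from a ready-made template, the Crawlee CLI can also scaffold a project for you (it prompts you to pick a template; the available templates vary between versions):

npx crawlee create my-crawler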

For Python:

pip install 'crawlee[playwright]'
playwright install

Step 2: Choose Your Crawler Type

Crawlee offers different crawler types depending on your needs:

  • CheerioCrawler: Fast, lightweight, for static HTML pages
  • PlaywrightCrawler: Full browser automation, handles JavaScript-rendered content
  • PuppeteerCrawler: Similar to PlaywrightCrawler but uses Puppeteer
  • JSDOMCrawler: Server-side JavaScript execution without a full browser

For beginners, CheerioCrawler is great for simple scraping tasks, while PlaywrightCrawler is better when you need to handle AJAX requests or interact with dynamic content.

Step 3: Create a Simple Scraper

Here's a practical example that scrapes product information:

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, log }) {
        log.info(`Scraping: ${request.url}`);

        // Check if this is a product page
        if (request.label === 'PRODUCT') {
            const title = $('h1.product-title').text().trim();
            const price = $('.product-price').text().trim();
            const description = $('.product-description').text().trim();

            // Save the extracted data
            await Dataset.pushData({
                url: request.url,
                title,
                price,
                description,
            });
        } else {
            // Enqueue product links
            await enqueueLinks({
                selector: 'a.product-link',
                label: 'PRODUCT',
            });
        }
    },
    maxRequestsPerCrawl: 50,
});

await crawler.run(['https://example-store.com/products']);

Step 4: Handle Pagination

Most real-world scenarios require handling pagination. Here's how:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks, log }) {
        log.info(`Processing page: ${request.url}`);

        // Extract data from current page
        const products = await page.$$eval('.product-item', items => {
            return items.map(item => ({
                name: item.querySelector('.product-name')?.textContent?.trim(),
                price: item.querySelector('.product-price')?.textContent?.trim(),
            }));
        });

        await Dataset.pushData(products);

        // Enqueue the link to the next page
        await enqueueLinks({
            selector: 'a.pagination-next',
            label: 'LIST',
        });
    },
});

await crawler.run(['https://example.com/products?page=1']);

Advanced Crawlee Tutorial Topics

Once you've mastered the basics, explore these intermediate topics:

Request Queue Management

Crawlee's RequestQueue helps you manage URLs efficiently:

import { PlaywrightCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open();

// Add requests with custom data
await requestQueue.addRequest({
    url: 'https://example.com/product/123',
    userData: {
        category: 'electronics',
        priority: 'high',
    },
});

const crawler = new PlaywrightCrawler({
    requestQueue,
    async requestHandler({ request, page, log }) {
        log.info(`Category: ${request.userData.category}`);
        // Process request...
    },
});

await crawler.run();
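
If you only need to seed a few start URLs with metadata, you don't have to open a queue explicitly. Here's a minimal sketch of the same idea that passes request objects with userData straight to crawler.run():

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, log }) {
        // userData travels with the request and is available in the handler
        log.info(`Category: ${request.userData.category}`);
    },
});

// Request objects (with userData) can be passed directly to run()
await crawler.run([
    {
        url: 'https://example.com/product/123',
        userData: { category: 'electronics', priority: 'high' },
    },
]);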

Session Management and Proxies

For scraping websites that require handling authentication or using proxies:

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8000',
        'http://proxy2.example.com:8000',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    useSessionPool: true,
    persistCookiesPerSession: true,
    async requestHandler({ request, page, session, log }) {
        log.info(`Using session: ${session.id}`);
        // Your scraping logic...
    },
});

Error Handling and Retries

Crawlee automatically retries failed requests, but you can customize this behavior:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 5,
    maxRequestsPerCrawl: 1000,
    async requestHandler({ request, page, log }) {
        try {
            await page.waitForSelector('.content', { timeout: 10000 });
            // Extract data...
        } catch (error) {
            log.error(`Failed to scrape ${request.url}: ${error.message}`);
            throw error; // This will trigger a retry
        }
    },
    async failedRequestHandler({ request, log }) {
        log.error(`Request ${request.url} failed too many times.`);
    },
});

Community Resources and Video Tutorials

Beyond the official documentation, several community resources can help beginners:

YouTube Tutorials

Search for "Crawlee tutorial" on YouTube to find video walkthroughs. Look for recent tutorials (2023 or later) to ensure they cover the latest version.

GitHub Examples

The Crawlee GitHub repository contains an examples folder with numerous real-world use cases:

  • E-commerce scraping
  • News aggregation
  • Social media data extraction
  • API integration examples

Apify Academy

Since Crawlee is developed by Apify, their Apify Academy offers free courses on web scraping fundamentals that apply directly to Crawlee.

Practical Project Ideas for Learning

The best way to learn Crawlee is through hands-on projects. Here are beginner-friendly ideas:

  1. Job Board Scraper: Extract job listings with titles, companies, and descriptions
  2. Product Price Monitor: Track prices across multiple e-commerce sites (see the sketch after this list)
  3. News Aggregator: Collect articles from various news websites
  4. Real Estate Listings: Scrape property information and prices
  5. Social Media Profile Scraper: Extract public profile information
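
As a concrete starting point, here is a minimal sketch of idea 2, the price monitor, built with CheerioCrawler. The selectors and URLs are placeholders to adapt to the store you are tracking:

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        log.info(`Checking price on ${request.url}`);

        // Placeholder selectors; adjust them to the store's actual markup
        const name = $('.product-name').first().text().trim();
        const price = $('.product-price').first().text().trim();

        // Record a timestamp so price changes can be tracked over time
        await Dataset.pushData({
            url: request.url,
            name,
            price,
            checkedAt: new Date().toISOString(),
        });
    },
    maxRequestsPerCrawl: 20,
});

// Placeholder product URLs to monitor
await crawler.run([
    'https://example-store.com/product/1',
    'https://example-store.com/product/2',
]);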

Comparing Crawlee with Other Frameworks

As a beginner, it helps to understand when to use Crawlee versus other tools:

| Feature | Crawlee | Scrapy | Puppeteer/Playwright |
|---------|---------|--------|----------------------|
| JavaScript Support | ✅ Native | ❌ No | ✅ Native |
| Python Support | ✅ Yes | ✅ Native | ⚠️ Limited |
| Browser Automation | ✅ Built-in | ⚠️ Via plugins | ✅ Native |
| Request Queue | ✅ Advanced | ✅ Built-in | ❌ Manual |
| Auto-scaling | ✅ Yes | ⚠️ Limited | ❌ No |
| Learning Curve | 🟢 Easy | 🟡 Moderate | 🟡 Moderate |

Crawlee shines when you need both simple HTTP scraping and complex browser automation in the same framework.

Best Practices for Crawlee Beginners

Follow these tips to avoid common pitfalls (a configuration sketch applying several of them follows the list):

  1. Start with CheerioCrawler: Use the lightest crawler that works for your use case
  2. Use Request Labels: Organize different page types with labels
  3. Implement Rate Limiting: Respect target websites with maxConcurrency, minConcurrency, and maxRequestsPerMinute
  4. Store Data Incrementally: Use Dataset.pushData() frequently to avoid data loss
  5. Test with Small Crawls: Set maxRequestsPerCrawl low during development
  6. Monitor Your Crawlers: Use the built-in logging to understand crawler behavior
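
Here is a minimal configuration sketch that applies several of these tips at once (the selector and label are placeholders):

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    // Tip 3: conservative concurrency and request rate
    minConcurrency: 1,
    maxConcurrency: 5,
    maxRequestsPerMinute: 60,

    // Tip 5: keep crawls small during development
    maxRequestsPerCrawl: 25,

    async requestHandler({ request, $, enqueueLinks, log }) {
        // Tip 2: route page types with labels
        if (request.label === 'DETAIL') {
            // Tip 4: store data incrementally, as soon as it is extracted
            await Dataset.pushData({
                url: request.url,
                title: $('h1').text().trim(),
            });
        } else {
            await enqueueLinks({ selector: 'a.detail-link', label: 'DETAIL' });
        }
    },
});

await crawler.run(['https://example.com']);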

Troubleshooting Common Beginner Issues

Issue: Crawler Doesn't Find Elements

// Bad: Not waiting for content to load
const title = await page.$eval('.title', el => el.textContent);

// Good: Wait for element before accessing
await page.waitForSelector('.title', { timeout: 5000 });
const title = await page.$eval('.title', el => el.textContent);

Issue: Too Many Concurrent Requests

// Configure concurrency appropriately
const crawler = new PlaywrightCrawler({
    maxConcurrency: 5, // Start conservative
    minConcurrency: 1,
    maxRequestsPerMinute: 60,
});

Issue: Memory Problems with Large Crawls

// Push data incrementally instead of accumulating it in memory
const dataset = await Dataset.open();
await dataset.pushData(data); // `data` is whatever you extracted for the current page

// Export periodically to a key-value store record
await dataset.exportToJSON('output');

Next Steps After Basic Tutorials

Once you've completed beginner tutorials, explore:

  • TypeScript Integration: Add type safety to your crawlers (see the sketch after this list)
  • Cloud Deployment: Run crawlers on the Apify platform or AWS
  • Advanced Selectors: Master CSS selectors and XPath
  • Custom Storage: Implement MongoDB or PostgreSQL integration
  • Monitoring and Alerting: Set up crawler health monitoring
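
As an example of the first item, here is a small TypeScript sketch that types the records a crawler stores; the ProductRecord interface is a hypothetical shape used only for illustration:

import { CheerioCrawler, Dataset } from 'crawlee';

// Hypothetical shape of the records this crawler collects
interface ProductRecord {
    url: string;
    title: string;
}

// Opening the dataset with a type parameter makes pushData type-checked
const dataset = await Dataset.open<ProductRecord>('products');

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        await dataset.pushData({
            url: request.url,
            title: $('h1').text().trim(),
        });
    },
});

await crawler.run(['https://example.com']);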

Conclusion

Crawlee offers excellent tutorials and documentation for beginners, making it one of the most accessible web scraping frameworks available. Start with the official documentation, build simple projects, and gradually explore advanced features. The combination of comprehensive guides, practical examples, and active community support makes Crawlee an ideal choice for anyone learning web scraping.

Whether you're scraping single page applications or traditional websites, Crawlee provides the tools and tutorials you need to succeed. Begin with simple CheerioCrawler examples, progress to browser automation with PlaywrightCrawler, and soon you'll be building production-ready web scrapers with confidence.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
