Where is the Crawlee GitHub Repository?
The official Crawlee GitHub repository is located at https://github.com/apify/crawlee. This is the primary source for Crawlee's codebase, documentation, examples, and community contributions. The repository is actively maintained by Apify and the open-source community, making it an essential resource for developers working with this powerful web scraping and browser automation framework.
Overview of the Crawlee Repository
Crawlee is an open-source web scraping and browser automation library available for both JavaScript/TypeScript (Node.js) and Python. The GitHub repository serves as the central hub for all Crawlee-related resources, including:
- Source code for the Crawlee library
- Comprehensive documentation and API references
- Example projects demonstrating various use cases
- Issue tracker for bug reports and feature requests
- Discussion forums for community support
- Contribution guidelines for developers who want to contribute
Repository Structure
The Crawlee project spans a JavaScript/TypeScript monorepo and a separate Python repository:
JavaScript/TypeScript Version
The main branch contains the Node.js implementation of Crawlee, which includes several specialized crawler packages:
# Clone the repository
git clone https://github.com/apify/crawlee.git
cd crawlee
# Install dependencies
npm install
# Build all packages
npm run build
The repository includes these core packages:
- crawlee - The main package with all crawler types
- @crawlee/core - Core functionality shared across all crawlers
- @crawlee/cheerio - Fast HTML crawler using Cheerio
- @crawlee/puppeteer - Browser automation with Puppeteer
- @crawlee/playwright - Browser automation with Playwright
- @crawlee/jsdom - DOM manipulation using JSDOM
- @crawlee/http - Simple HTTP crawler
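If you only need one crawler type, you can depend on its scoped package directly instead of the all-in-one crawlee meta-package. A minimal sketch (assuming the current 3.x layout, where the scoped packages export the same classes the meta-package re-exports):
// Installed with: npm install @crawlee/cheerio
import { CheerioCrawler } from '@crawlee/cheerio';

// Equivalent import via the meta-package:
// import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request }) => {
        console.log(`Fetched ${request.url}`);
    },
});
Both imports resolve to the same class, so switching between them requires no code changes beyond the import path.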
Python Version
The Python implementation is maintained in a separate repository, https://github.com/apify/crawlee-python:
# Clone the Python repository
git clone https://github.com/apify/crawlee-python.git
cd crawlee-python
# Install in development mode
pip install -e .
Key Features Available in the Repository
1. Example Projects
The repository contains numerous example projects in the examples directory that demonstrate common web scraping scenarios:
// Example: Basic CheerioCrawler usage from the repository
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $, enqueueLinks }) => {
        const title = $('title').text();
        console.log(`Title of ${request.url}: ${title}`);
        // Enqueue all links on the page matching the glob
        await enqueueLinks({
            globs: ['https://example.com/**'],
        });
    },
});

await crawler.run(['https://example.com']);
For scenarios requiring JavaScript rendering, the examples also cover browser-based crawlers such as PuppeteerCrawler:
// Example: PuppeteerCrawler from repository examples
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        // Wait for dynamic content to load
        await page.waitForSelector('.dynamic-content');
        const title = await page.title();
        console.log(`Page title: ${title}`);
        // Enqueue additional pages
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);
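Browser behavior is configurable through the crawler's launchContext, which forwards launch options to Puppeteer. A brief sketch; the headless and args values shown here are illustrative, and the available options depend on your installed Puppeteer version:
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        // Passed through to puppeteer.launch()
        launchOptions: {
            headless: true,
            args: ['--no-sandbox'], // commonly needed inside containers
        },
    },
    requestHandler: async ({ page }) => {
        console.log(`Page title: ${await page.title()}`);
    },
});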
2. Documentation
The repository includes comprehensive documentation in markdown format:
- Getting Started guides for beginners
- API Reference with detailed parameter descriptions
- Migration guides for upgrading between versions
- Best practices for efficient web scraping
3. Issue Tracking and Bug Reports
The GitHub Issues section allows you to:
- Report bugs with detailed reproduction steps
- Request new features
- Track the status of known issues
- Search for solutions to common problems
# Search for issues related to proxy configuration
# Visit: https://github.com/apify/crawlee/issues?q=is%3Aissue+proxy
How to Contribute to Crawlee
The repository welcomes contributions from the community. Here's how to get started:
1. Fork and Clone
# Fork the repository on GitHub, then clone your fork
git clone https://github.com/YOUR_USERNAME/crawlee.git
cd crawlee
# Add upstream remote
git remote add upstream https://github.com/apify/crawlee.git
2. Set Up Development Environment
# Install dependencies
npm install
# Run tests
npm test
# Run linting
npm run lint
3. Create a Pull Request
# Create a new branch
git checkout -b feature/my-new-feature
# Make your changes and commit
git add .
git commit -m "Add new feature"
# Push to your fork
git push origin feature/my-new-feature
Then open a pull request on GitHub with a clear description of your changes.
Repository Resources
Package Versions and Releases
The repository maintains detailed release notes for each version:
# View all releases
# Visit: https://github.com/apify/crawlee/releases
# Install specific version
npm install crawlee@3.5.0
Community and Support
- GitHub Discussions: Ask questions and share ideas
- Discord Server: Real-time chat with the community (link in repository README)
- Stack Overflow: Tag questions with crawlee
TypeScript Support
Crawlee is written in TypeScript, providing excellent type safety. The repository includes all type definitions:
import { CheerioCrawler, Dataset } from 'crawlee';

interface ProductData {
    title: string;
    price: number;
    url: string;
}

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $, log }) => {
        const products: ProductData[] = [];
        $('.product').each((_, element) => {
            const title = $(element).find('.title').text();
            const price = parseFloat($(element).find('.price').text());
            products.push({
                title,
                price,
                url: request.url,
            });
        });
        // Save to the default dataset
        await Dataset.pushData(products);
    },
});
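The data saved with Dataset.pushData() can be read back after the crawl, for example to post-process or export it. A minimal sketch using the default dataset (the { items } result shape comes from Crawlee's dataset API):
import { Dataset } from 'crawlee';

// After crawler.run() finishes, read back the stored items
const dataset = await Dataset.open();
const { items } = await dataset.getData();
console.log(`Collected ${items.length} records`);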
Advanced Features from the Repository
Session Management
Crawlee includes sophisticated session handling, which is useful for rotating identities and maintaining authenticated, cookie-based state across requests:
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // The crawler manages its own session pool; configure it here
    sessionPoolOptions: {
        maxPoolSize: 50,
        sessionOptions: {
            maxUsageCount: 100,
        },
    },
    requestHandler: async ({ request, session }) => {
        console.log(`Using session: ${session?.id}`);
        // Your scraping logic here
    },
});
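Sessions pay off when you react to blocking. The sketch below retires a session on an HTTP 403 so the pool rotates to a fresh identity on retry; session.retire() and the response object are part of the crawling context, though which status codes are worth handling depends on the target site:
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, session, response }) => {
        if (response.statusCode === 403) {
            // Mark this identity as burned; the pool will hand out a new one
            session?.retire();
            throw new Error(`Blocked on ${request.url}, retrying with a fresh session`);
        }
        // Normal scraping logic here
    },
});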
Request Queue Management
The repository provides robust queue management for large-scale scraping:
import { CheerioCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open();

// Add initial URLs
await requestQueue.addRequest({ url: 'https://example.com' });
await requestQueue.addRequest({ url: 'https://example.com/products' });

const crawler = new CheerioCrawler({
    requestQueue,
    requestHandler: async ({ request, enqueueLinks }) => {
        // Process the page and enqueue more product URLs
        await enqueueLinks({
            globs: ['https://example.com/products/*'],
        });
    },
});

await crawler.run();
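For larger seed lists, URLs can be added in batches instead of one addRequest() call per URL. A short sketch using addRequests(), which accepts an array of request objects in Crawlee 3.x:
import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();

// Batch-add seed URLs in a single call
await queue.addRequests([
    { url: 'https://example.com/category/1' },
    { url: 'https://example.com/category/2' },
    { url: 'https://example.com/category/3' },
]);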
Error Handling and Retries
Crawlee includes built-in error handling with configurable retry logic:
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 5,
    maxConcurrency: 10,
    requestHandlerTimeoutSecs: 60,
    requestHandler: async ({ request, log }) => {
        log.info(`Processing ${request.url}`);
        // Your scraping logic
    },
    failedRequestHandler: async ({ request, log }) => {
        log.error(`Request ${request.url} failed after retries`);
    },
});
Python Repository Usage
For Python developers, the companion crawlee-python repository provides a full Python implementation:
from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.router import Router

router = Router()

@router.default_handler
async def default_handler(context):
    """Default request handler."""
    page = context.page
    await page.wait_for_selector('.content')
    title = await page.title()
    print(f'Page title: {title}')
    # Enqueue additional links
    await context.enqueue_links()

async def main():
    crawler = PlaywrightCrawler(
        request_handler=router,
        max_requests_per_crawl=100,
    )
    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())
Staying Updated
To stay current with Crawlee development:
Watch the Repository
Click the "Watch" button on GitHub to receive notifications about:
- New releases
- Important security updates
- Breaking changes
Follow Release Notes
# Check latest release
# Visit: https://github.com/apify/crawlee/releases/latest
# View changelog
# Visit: https://github.com/apify/crawlee/blob/master/CHANGELOG.md
Subscribe to Newsletter
The Apify team maintains a newsletter with updates about Crawlee and web scraping best practices.
Conclusion
The Crawlee GitHub repository at https://github.com/apify/crawlee is the definitive resource for everything related to this powerful web scraping framework. Whether you're looking for source code, documentation, examples, or community support, the repository provides comprehensive resources for developers at all skill levels. By engaging with the repository—whether through using the code, reporting issues, or contributing improvements—you become part of a vibrant community dedicated to making web scraping more accessible and efficient.
The active development and maintenance of the repository ensure that Crawlee continues to evolve with modern web scraping needs, incorporating new features, performance improvements, and security updates regularly. Make sure to star the repository to bookmark it for future reference and to show your support for this excellent open-source project.