Where is the Crawlee GitHub Repository?

The official Crawlee GitHub repository is located at https://github.com/apify/crawlee. This is the primary source for Crawlee's codebase, documentation, examples, and community contributions. The repository is actively maintained by Apify and the open-source community, making it an essential resource for developers working with this powerful web scraping and browser automation framework.

Overview of the Crawlee Repository

Crawlee is an open-source web scraping and browser automation library available for both JavaScript/TypeScript (Node.js) and Python; the Python version lives in its own repository, apify/crawlee-python. The main GitHub repository serves as the central hub for the Node.js library and its related resources, including:

  • Source code for the Crawlee library
  • Comprehensive documentation and API references
  • Example projects demonstrating various use cases
  • Issue tracker for bug reports and feature requests
  • Discussion forums for community support
  • Contribution guidelines for developers who want to contribute

Repository Structure

The Crawlee monorepo is organized into multiple packages and versions:

JavaScript/TypeScript Version

The repository's default branch contains the Node.js implementation of Crawlee, which is split into several specialized crawler packages:

# Clone the repository
git clone https://github.com/apify/crawlee.git
cd crawlee

# Install dependencies
npm install

# Build all packages
npm run build

The repository includes these core packages:

  • crawlee - The main package with all crawler types
  • @crawlee/core - Core functionality shared across all crawlers
  • @crawlee/cheerio - Fast HTML crawler using Cheerio
  • @crawlee/puppeteer - Browser automation with Puppeteer
  • @crawlee/playwright - Browser automation with Playwright
  • @crawlee/jsdom - Crawler that parses pages with JSDOM
  • @crawlee/http - Simple HTTP crawler
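
If you only need one crawler type, you can depend on the corresponding scoped package directly; the umbrella crawlee package re-exports the same classes. A minimal sketch using @crawlee/cheerio on its own:

// npm install @crawlee/cheerio
import { CheerioCrawler } from '@crawlee/cheerio';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $ }) => {
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

await crawler.run(['https://example.com']);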

Python Version

The Python implementation lives in its own repository, https://github.com/apify/crawlee-python, rather than in a directory of the Node.js monorepo:

# Clone the Python implementation (separate repository)
git clone https://github.com/apify/crawlee-python.git
cd crawlee-python

# For regular use, install the published package instead
pip install 'crawlee[all]'

Key Features Available in the Repository

1. Example Projects

The repository contains numerous example projects in the examples directory that demonstrate common web scraping scenarios:

// Example: Basic CheerioCrawler usage from the repository
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $, enqueueLinks }) => {
        const title = $('title').text();
        console.log(`Title of ${request.url}: ${title}`);

        // Enqueue all links on the page
        await enqueueLinks({
            globs: ['https://example.com/**'],
        });
    },
});

await crawler.run(['https://example.com']);

For scenarios that require JavaScript rendering, the examples also cover the browser-based crawlers built on Puppeteer and Playwright:

// Example: PuppeteerCrawler from repository examples
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        // Wait for dynamic content to load
        await page.waitForSelector('.dynamic-content');

        const title = await page.title();
        console.log(`Page title: ${title}`);

        // Enqueue additional pages
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);

2. Documentation

The repository includes comprehensive documentation in markdown format:

  • Getting Started guides for beginners
  • API Reference with detailed parameter descriptions
  • Migration guides for upgrading between versions
  • Best practices for efficient web scraping

3. Issue Tracking and Bug Reports

The GitHub Issues section allows you to:

  • Report bugs with detailed reproduction steps
  • Request new features
  • Track the status of known issues
  • Search for solutions to common problems

# Search for issues related to proxy configuration
# Visit: https://github.com/apify/crawlee/issues?q=is%3Aissue+proxy

How to Contribute to Crawlee

The repository welcomes contributions from the community. Here's how to get started:

1. Fork and Clone

# Fork the repository on GitHub, then clone your fork
git clone https://github.com/YOUR_USERNAME/crawlee.git
cd crawlee

# Add upstream remote
git remote add upstream https://github.com/apify/crawlee.git

2. Set Up Development Environment

# Install dependencies
npm install

# Run tests
npm test

# Run linting
npm run lint

3. Create a Pull Request

# Create a new branch
git checkout -b feature/my-new-feature

# Make your changes and commit
git add .
git commit -m "Add new feature"

# Push to your fork
git push origin feature/my-new-feature

Then open a pull request on GitHub with a clear description of your changes.

Repository Resources

Package Versions and Releases

The repository maintains detailed release notes for each version:

# View all releases
# Visit: https://github.com/apify/crawlee/releases

# Install specific version
npm install crawlee@3.5.0

Community and Support

  • GitHub Discussions: Ask questions and share ideas
  • Discord Server: Real-time chat with the community (link in repository README)
  • Stack Overflow: Tag questions with crawlee

TypeScript Support

Crawlee is written in TypeScript, providing excellent type safety. The repository includes all type definitions:

import { CheerioCrawler, Dataset } from 'crawlee';

interface ProductData {
    title: string;
    price: number;
    url: string;
}

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $, log }) => {
        const products: ProductData[] = [];

        $('.product').each((_, element) => {
            const title = $(element).find('.title').text();
            const price = parseFloat($(element).find('.price').text());

            products.push({
                title,
                price,
                url: request.url,
            });
        });

        // Save to dataset
        await Dataset.pushData(products);
    },
});
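
The type definitions also cover request routing. Below is a small sketch using createCheerioRouter(), which keeps each handler typed against the Cheerio crawling context; the DETAIL label and URLs are illustrative:

import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// Handler for the start URLs (default label)
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({
        globs: ['https://example.com/products/*'],
        label: 'DETAIL',
    });
});

// Handler for product detail pages
router.addHandler('DETAIL', async ({ request, $, log }) => {
    log.info(`Product page: ${request.url} - ${$('title').text()}`);
});

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run(['https://example.com']);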

Advanced Features from the Repository

Session Management

Crawlee includes a session pool that ties cookies and identity (such as a proxy) to reusable sessions, which helps with avoiding blocks and with logged-in scraping:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // The session pool is enabled by default; the options below tune it
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        maxPoolSize: 50,
        sessionOptions: {
            maxUsageCount: 100, // retire a session after 100 requests
        },
    },
    requestHandler: async ({ request, session }) => {
        console.log(`Using session: ${session.id}`);
        // Your scraping logic here
    },
});
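
If a site starts blocking a particular session, you can rotate it manually from the request handler. A minimal sketch, with an illustrative blocking check (Crawlee can also detect blocked status codes on its own):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, session, $ }) => {
        // Illustrative check: many sites serve a CAPTCHA page instead of an error
        if ($('title').text().toLowerCase().includes('captcha')) {
            session.retire(); // drop this session so the pool hands out a fresh one
            throw new Error(`Blocked on ${request.url}, retrying with a new session`);
        }
        // Normal scraping logic for successful responses
    },
});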

Request Queue Management

The repository provides robust queue management for large-scale scraping:

import { CheerioCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open();

// Add initial URLs
await requestQueue.addRequest({ url: 'https://example.com' });
await requestQueue.addRequest({ url: 'https://example.com/products' });

const crawler = new CheerioCrawler({
    requestQueue,
    requestHandler: async ({ request, enqueueLinks }) => {
        // Process page and enqueue more URLs
        await enqueueLinks({
            globs: ['https://example.com/products/*'],
        });
    },
});

await crawler.run();
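
For simple crawls you usually do not need to open the queue yourself; the crawler creates a default request queue behind the scenes. A shorter sketch of the same setup using crawler.addRequests():

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ enqueueLinks }) => {
        await enqueueLinks({
            globs: ['https://example.com/products/*'],
        });
    },
});

// Adds the start URLs to the crawler's default request queue
await crawler.addRequests([
    'https://example.com',
    'https://example.com/products',
]);

await crawler.run();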

Error Handling and Retries

Crawlee includes built-in error handling with configurable retry logic:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 5,
    maxConcurrency: 10,
    requestHandlerTimeoutSecs: 60,

    requestHandler: async ({ request, log }) => {
        log.info(`Processing ${request.url}`);
        // Your scraping logic
    },

    failedRequestHandler: async ({ request, log }) => {
        log.error(`Request ${request.url} failed after retries`);
    },
});
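
Alongside failedRequestHandler, the crawler options also accept an errorHandler that runs before each retry, which is handy for logging transient failures. A small sketch (option names as in Crawlee 3.x):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 5,

    requestHandler: async ({ request, log }) => {
        log.info(`Processing ${request.url}`);
        // Your scraping logic
    },

    // Called after each failed attempt that will still be retried
    errorHandler: async ({ request, log }, error) => {
        log.warning(`Retrying ${request.url}: ${error.message}`);
    },

    // Called once, after all retries are exhausted
    failedRequestHandler: async ({ request, log }) => {
        log.error(`Giving up on ${request.url}`);
    },
});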

Python Repository Usage

For Python developers, Crawlee for Python (maintained in the apify/crawlee-python repository) offers the same crawling model:

from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.router import Router

router = Router()

@router.default_handler
async def default_handler(context):
    """Default request handler."""
    page = context.page
    await page.wait_for_selector('.content')

    title = await page.title()
    print(f'Page title: {title}')

    # Enqueue additional links
    await context.enqueue_links()

async def main():
    crawler = PlaywrightCrawler(
        request_handler=router,
        max_requests_per_crawl=100,
    )

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())

Staying Updated

To stay current with Crawlee development:

Watch the Repository

Click the "Watch" button on GitHub to receive notifications about: - New releases - Important security updates - Breaking changes

Follow Release Notes

# Check latest release
# Visit: https://github.com/apify/crawlee/releases/latest

# View changelog
# Visit: https://github.com/apify/crawlee/blob/master/CHANGELOG.md

Subscribe to Newsletter

The Apify team maintains a newsletter with updates about Crawlee and web scraping best practices.

Conclusion

The Crawlee GitHub repository at https://github.com/apify/crawlee is the definitive resource for everything related to this powerful web scraping framework. Whether you're looking for source code, documentation, examples, or community support, the repository provides comprehensive resources for developers at all skill levels. By engaging with the repository—whether through using the code, reporting issues, or contributing improvements—you become part of a vibrant community dedicated to making web scraping more accessible and efficient.

The active development and maintenance of the repository ensure that Crawlee continues to evolve with modern web scraping needs, incorporating new features, performance improvements, and security updates regularly. Make sure to star the repository to bookmark it for future reference and to show your support for this excellent open-source project.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
