How do I contribute to Crawlee open source project?
Contributing to the Crawlee open source project is a rewarding way to give back to the web scraping community, improve your coding skills, and collaborate with developers worldwide. Crawlee is actively maintained by Apify and welcomes contributions from developers of all skill levels. This guide will walk you through everything you need to know to start contributing.
Understanding the Crawlee Project Structure
Crawlee is a web scraping and browser automation library available for both Node.js/TypeScript and Python. The project is hosted on GitHub and consists of multiple packages organized in a monorepo structure.
The main repository is located at https://github.com/apify/crawlee, which contains:
- Core packages: The fundamental crawling functionality
- Browser crawlers: Integration with Puppeteer, Playwright, and other browser automation tools
- HTTP crawlers: Lightweight crawlers for static content
- Utilities: Helper functions and tools for web scraping tasks
- Documentation: Comprehensive guides and API references
- Examples: Sample projects demonstrating various use cases
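In practice, a change usually lands in one scoped package and reaches users through the umbrella crawlee package, which re-exports them. Here is a minimal sketch of that relationship (assuming the @crawlee/cheerio scoped name, which mirrors the monorepo layout):
// Both imports resolve to the same class; the scoped package is the one you edit
import { CheerioCrawler } from 'crawlee';
// import { CheerioCrawler } from '@crawlee/cheerio';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $ }) => {
        // Log the page title to verify the crawler works end to end
        console.log($('title').text());
    },
});

await crawler.run(['https://example.com']);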
Setting Up Your Development Environment
Before you can contribute code, you need to set up your local development environment.
Fork and Clone the Repository
Start by forking the Crawlee repository to your GitHub account, then clone it locally:
# Clone your fork
git clone https://github.com/YOUR_USERNAME/crawlee.git
cd crawlee
# Add the upstream repository
git remote add upstream https://github.com/apify/crawlee.git
Install Dependencies
Crawlee uses npm workspaces for managing its monorepo structure. Install all dependencies:
# Install all dependencies
npm install
# Build all packages
npm run build
For Python contributions, you'll need to set up a Python environment:
# Create a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -e ".[dev]"
Verify Your Setup
Run the test suite to ensure everything is working correctly:
# Run tests for JavaScript/TypeScript
npm test
# Run tests for a specific package
npm test -w packages/core
# Run Python tests
pytest
Types of Contributions
There are several ways you can contribute to Crawlee, regardless of your experience level.
Code Contributions
Code contributions include bug fixes, new features, performance improvements, and refactoring. Here's how to approach them:
1. Find an Issue or Propose a Feature
Browse the GitHub Issues page to find open issues labeled as good first issue or help wanted. These are great starting points for new contributors.
# Create a new branch for your work
git checkout -b fix/issue-description
# Or for a feature
git checkout -b feat/feature-description
2. Write Clean, Tested Code
Crawlee maintains high code quality standards. When writing code:
// Example: A reusable request handler factory for CheerioCrawler
import type { CheerioCrawlingContext } from 'crawlee';

// Always add TypeScript types
interface CustomHandlerOptions {
    maxRetries: number;
    timeout: number;
}

// Document your code with JSDoc comments
/**
 * Creates a request handler with retry configuration.
 * @param options - Configuration options for the handler
 * @returns A configured request handler function
 */
export function createCustomHandler(options: CustomHandlerOptions) {
    return async ({ request, $, enqueueLinks }: CheerioCrawlingContext) => {
        const { maxRetries, timeout } = options;
        // Your implementation here

        // Extract data
        const title = $('h1').text();

        // Enqueue discovered links
        await enqueueLinks({
            selector: 'a[href]',
            label: 'DETAIL',
        });

        return { title };
    };
}
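Once written, such a factory plugs straight into a crawler. A short sketch of how the createCustomHandler function above might be wired up (the option values are arbitrary):
import { CheerioCrawler } from 'crawlee';
import { createCustomHandler } from './custom-handler';

const crawler = new CheerioCrawler({
    // The factory returns a handler compatible with CheerioCrawler's context
    requestHandler: createCustomHandler({ maxRetries: 3, timeout: 30_000 }),
});

await crawler.run(['https://example.com']);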
3. Add Tests
Every code contribution should include appropriate tests:
// Example test file: custom-handler.test.ts
import { describe, it, expect, vi } from 'vitest';
import * as cheerio from 'cheerio';
import { createCustomHandler } from './custom-handler';

describe('createCustomHandler', () => {
    it('should create a handler with correct options', () => {
        const handler = createCustomHandler({
            maxRetries: 3,
            timeout: 5000,
        });
        expect(handler).toBeInstanceOf(Function);
    });

    it('should extract title correctly', async () => {
        const handler = createCustomHandler({
            maxRetries: 3,
            timeout: 5000,
        });
        // Provide only the context fields the handler actually uses
        const mockContext = {
            request: { url: 'https://example.com' },
            $: cheerio.load('<h1>Test Title</h1>'),
            enqueueLinks: vi.fn(),
        } as any;
        const result = await handler(mockContext);
        expect(result.title).toBe('Test Title');
    });
});
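If your handler has side effects, it's worth asserting on them through the mocks as well. Here's a sketch of one more test case for the describe block above (it reuses the same imports):
it('should enqueue detail links', async () => {
    const enqueueLinks = vi.fn();
    const handler = createCustomHandler({ maxRetries: 3, timeout: 5000 });
    await handler({
        request: { url: 'https://example.com' },
        $: cheerio.load('<a href="/detail">Detail</a>'),
        enqueueLinks,
    } as any);
    // The handler should enqueue links matching its selector with the DETAIL label
    expect(enqueueLinks).toHaveBeenCalledWith({
        selector: 'a[href]',
        label: 'DETAIL',
    });
});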
Documentation Contributions
Documentation is crucial for any open source project. You can contribute by:
- Fixing typos and grammatical errors
- Improving existing documentation clarity
- Adding examples and tutorials
- Translating documentation to other languages
- Creating video tutorials or blog posts
Documentation files are typically located in the docs/ directory and written in Markdown:
# Example Documentation Addition
## Using Custom Storage
Crawlee allows you to configure custom storage backends for datasets, key-value stores, and request queues by passing a storage client to the Configuration class.
\`\`\`javascript
import { CheerioCrawler, Configuration } from 'crawlee';
import { MemoryStorage } from '@crawlee/memory-storage';

// Use an in-memory storage client instead of the default on-disk one
const config = new Configuration({
    storageClient: new MemoryStorage(),
});

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $ }) => {
        // Your scraping logic
    },
}, config);

await crawler.run(['https://example.com']);
\`\`\`
This approach is particularly useful when [handling browser sessions in Puppeteer](/faq/puppeteer/how-to-handle-browser-sessions-in-puppeteer) for authenticated scraping scenarios.
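A documentation snippet along those lines might look like the following sketch, built on Crawlee's session pool (useSessionPool, persistCookiesPerSession, and session.retire() are existing Crawlee APIs; the .captcha selector is a placeholder):
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // Reuse sessions and their cookies across requests
    useSessionPool: true,
    persistCookiesPerSession: true,
    requestHandler: async ({ page, session }) => {
        // Retire a session that appears blocked so a fresh one takes over
        if (await page.$('.captcha')) {
            session?.retire();
        }
    },
});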
Reporting Bugs
High-quality bug reports help maintainers fix issues quickly. A good bug report includes:
- Clear title: Describe the issue concisely
- Environment details: Node.js version, Crawlee version, operating system
- Steps to reproduce: Minimal code example that demonstrates the bug
- Expected behavior: What should happen
- Actual behavior: What actually happens
- Additional context: Error messages, stack traces, screenshots
// Example minimal reproducible code
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page }) => {
        // This causes the bug
        await page.waitForSelector('.non-existent-selector', {
            timeout: 1000,
        });
    },
});

await crawler.run(['https://example.com']);
Making Your First Pull Request
Once you've made your changes, it's time to submit a pull request (PR).
Pre-commit Checklist
Before submitting, run these checks and make sure they all pass:
# Run linter
npm run lint
# Fix linting issues automatically
npm run lint:fix
# Run tests
npm test
# Build the project
npm run build
# Check types
npm run type-check
Commit Message Guidelines
Crawlee follows the Conventional Commits format for commit messages:
# Feature
git commit -m "feat: add retry mechanism to PlaywrightCrawler"
# Bug fix
git commit -m "fix: resolve memory leak in request queue"
# Documentation
git commit -m "docs: update CheerioCrawler examples"
# Refactoring
git commit -m "refactor: simplify session pool logic"
# Tests
git commit -m "test: add integration tests for proxy rotation"
Create the Pull Request
Push your changes and create a PR:
# Push to your fork
git push origin your-branch-name
Visit GitHub and click "New Pull Request". In your PR description:
- Reference related issues (e.g., "Fixes #123")
- Describe what changed and why
- Highlight any breaking changes
- Add screenshots or examples if applicable
Code Review Process
After submitting your PR:
- Automated checks: CI/CD pipelines will run tests and linting
- Maintainer review: A project maintainer will review your code
- Feedback: You may receive comments requesting changes
- Iteration: Make requested changes and push updates
- Approval: Once approved, a maintainer will merge your PR
Be patient and responsive during code review. When handling errors in Puppeteer or similar complex scenarios, reviewers may suggest alternative approaches that improve reliability.
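For example, a reviewer might ask for explicit failure handling instead of letting a missing selector fail the whole request. A rough sketch (the h1 selector and timeout are illustrative):
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page, log }) => {
        // Handle a missing element explicitly rather than throwing
        const heading = await page.waitForSelector('h1', { timeout: 5000 }).catch(() => null);
        if (!heading) {
            log.warning(`No heading found on ${page.url()}`);
            return;
        }
    },
    // Runs after a request has exhausted all of its retries
    failedRequestHandler: async ({ request, log }) => {
        log.error(`Request ${request.url} failed too many times.`);
    },
});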
Contributing to the Community
Code isn't the only way to contribute. You can also:
Answer Questions
- Participate in GitHub Discussions
- Help others on Stack Overflow with the crawlee tag
- Join the Apify Discord server and answer questions
- Share your knowledge on social media and developer forums
Create Examples and Tutorials
// Example: E-commerce scraper tutorial
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        // Navigate to product listing
        if (request.label === 'CATEGORY') {
            await enqueueLinks({
                selector: '.product-card a',
                label: 'PRODUCT',
            });
        }

        // Scrape product details
        if (request.label === 'PRODUCT') {
            const title = await page.locator('h1.product-title').textContent();
            const price = await page.locator('.price').textContent();
            const description = await page.locator('.description').textContent();

            await Dataset.pushData({
                url: request.url,
                title,
                price,
                description,
            });
        }
    },
});

await crawler.run([{
    url: 'https://example-shop.com/category',
    label: 'CATEGORY',
}]);
This type of practical example helps developers understand how to crawl single page applications using Puppeteer and adapt similar techniques with Crawlee.
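As a sketch of that adaptation: Crawlee's browser crawlers inject helpers such as infiniteScroll into the request handler context, which is handy for client-rendered pages (the URL and .item selector here are placeholders):
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, infiniteScroll, pushData }) => {
        // Wait for the client-side app to render its first items
        await page.waitForSelector('.item');
        // Keep scrolling until no new content loads
        await infiniteScroll();
        const items = await page.locator('.item').allTextContents();
        await pushData({ url: page.url(), items });
    },
});

await crawler.run(['https://example-spa.com']);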
Improve Tooling and Infrastructure
Help improve the development experience by contributing to:
- Build scripts and CI/CD pipelines
- Development environment setup
- Testing infrastructure
- Documentation generation tools
Best Practices for Contributors
- Start small: Begin with documentation fixes or simple bug fixes before tackling major features
- Communicate early: Open an issue or discussion before starting work on significant changes
- Follow coding standards: Use the project's linting rules and style guide
- Write comprehensive tests: Aim for high test coverage on new code
- Update documentation: Always update relevant docs when changing functionality
- Be respectful: Follow the project's code of conduct and maintain professional communication
- Stay updated: Regularly sync your fork with the upstream repository
# Keep your fork updated
git fetch upstream
git checkout master
git merge upstream/master
git push origin master
Getting Help
If you need assistance:
- GitHub Discussions: Ask questions about contributing
- Discord: Join the Apify Discord server for real-time help
- Documentation: Review the Crawlee documentation
- Examples: Study the examples in the repository
- Issue comments: Tag maintainers if you need clarification on an issue
Conclusion
Contributing to Crawlee is an excellent way to improve your web scraping skills, collaborate with talented developers, and make a meaningful impact on a widely-used open source project. Whether you're fixing bugs, writing documentation, or building new features, your contributions help make web scraping more accessible to developers worldwide.
Start by exploring the GitHub repository, setting up your development environment, and looking for issues labeled good first issue. Don't hesitate to ask questions; the Crawlee community is welcoming and eager to help new contributors succeed.
Remember, every contribution matters, no matter how small. Your first PR might be fixing a typo, but it's an important step toward becoming an active member of the open source community. Happy contributing!