How do I contribute to Crawlee open source project?
Contributing to the Crawlee open source project is a rewarding way to give back to the web scraping community, improve your coding skills, and collaborate with developers worldwide. Crawlee is actively maintained by Apify and welcomes contributions from developers of all skill levels. This guide will walk you through everything you need to know to start contributing.
Understanding the Crawlee Project Structure
Crawlee is a web scraping and browser automation library available for both Node.js/TypeScript and Python. The project is hosted on GitHub and consists of multiple packages organized in a monorepo structure.
The main repository is located at https://github.com/apify/crawlee, which contains:
- Core packages: The fundamental crawling functionality
- Browser crawlers: Integration with Puppeteer, Playwright, and other browser automation tools
- HTTP crawlers: Lightweight crawlers for static content
- Utilities: Helper functions and tools for web scraping tasks
- Documentation: Comprehensive guides and API references
- Examples: Sample projects demonstrating various use cases
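In practice, a change usually lands in one scoped package and reaches users through the umbrella crawlee package, which re-exports them. Here is a minimal sketch of that relationship (assuming the @crawlee/cheerio scoped name, which mirrors the monorepo layout):
// Both imports resolve to the same class; the scoped package is the one you edit
import { CheerioCrawler } from 'crawlee';
// import { CheerioCrawler } from '@crawlee/cheerio';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $ }) => {
        // Log the page title to verify the crawler works end to end
        console.log($('title').text());
    },
});

await crawler.run(['https://example.com']);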
Setting Up Your Development Environment
Before you can contribute code, you need to set up your local development environment.
Fork and Clone the Repository
Start by forking the Crawlee repository to your GitHub account, then clone it locally:
# Clone your fork
git clone https://github.com/YOUR_USERNAME/crawlee.git
cd crawlee
# Add the upstream repository
git remote add upstream https://github.com/apify/crawlee.git
Install Dependencies
Crawlee uses npm workspaces for managing its monorepo structure. Install all dependencies:
# Install all dependencies
npm install
# Build all packages
npm run build
For Python contributions, you'll need to set up a Python environment:
# Create a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -e ".[dev]"
Verify Your Setup
Run the test suite to ensure everything is working correctly:
# Run tests for JavaScript/TypeScript
npm test
# Run tests for a specific package
npm test -w packages/core
# Run Python tests
pytest
Types of Contributions
There are several ways you can contribute to Crawlee, regardless of your experience level.
Code Contributions
Code contributions include bug fixes, new features, performance improvements, and refactoring. Here's how to approach them:
1. Find an Issue or Propose a Feature
Browse the GitHub Issues page to find open issues labeled as good first issue or help wanted. These are great starting points for new contributors.
# Create a new branch for your work
git checkout -b fix/issue-description
# Or for a feature
git checkout -b feat/feature-description
2. Write Clean, Tested Code
Crawlee maintains high code quality standards. When writing code:
// Example: A reusable request handler factory for CheerioCrawler
import type { CheerioCrawlingContext } from 'crawlee';

// Always add TypeScript types
interface CustomHandlerOptions {
    maxRetries: number;
    timeout: number;
}

// Document your code with JSDoc comments
/**
 * Creates a request handler with retry configuration.
 * @param options - Configuration options for the handler
 * @returns A configured request handler function
 */
export function createCustomHandler(options: CustomHandlerOptions) {
    return async ({ request, $, enqueueLinks }: CheerioCrawlingContext) => {
        const { maxRetries, timeout } = options;
        // Your implementation here

        // Extract data
        const title = $('h1').text();

        // Enqueue discovered links
        await enqueueLinks({
            selector: 'a[href]',
            label: 'DETAIL',
        });

        return { title };
    };
}
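Once written, such a factory plugs straight into a crawler. A short sketch of how the createCustomHandler function above might be wired up (the option values are arbitrary):
import { CheerioCrawler } from 'crawlee';
import { createCustomHandler } from './custom-handler';

const crawler = new CheerioCrawler({
    // The factory returns a handler compatible with CheerioCrawler's context
    requestHandler: createCustomHandler({ maxRetries: 3, timeout: 30_000 }),
});

await crawler.run(['https://example.com']);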
3. Add Tests
Every code contribution should include appropriate tests:
// Example test file: custom-handler.test.ts
import { describe, it, expect, vi } from 'vitest';
import * as cheerio from 'cheerio';
import { createCustomHandler } from './custom-handler';

describe('createCustomHandler', () => {
    it('should create a handler with correct options', () => {
        const handler = createCustomHandler({
            maxRetries: 3,
            timeout: 5000,
        });
        expect(handler).toBeInstanceOf(Function);
    });

    it('should extract title correctly', async () => {
        const handler = createCustomHandler({
            maxRetries: 3,
            timeout: 5000,
        });
        // Provide only the context fields the handler actually uses
        const mockContext = {
            request: { url: 'https://example.com' },
            $: cheerio.load('<h1>Test Title</h1>'),
            enqueueLinks: vi.fn(),
        } as any;
        const result = await handler(mockContext);
        expect(result.title).toBe('Test Title');
    });
});
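If your handler has side effects, it's worth asserting on them through the mocks as well. Here's a sketch of one more test case for the describe block above (it reuses the same imports):
it('should enqueue detail links', async () => {
    const enqueueLinks = vi.fn();
    const handler = createCustomHandler({ maxRetries: 3, timeout: 5000 });
    await handler({
        request: { url: 'https://example.com' },
        $: cheerio.load('<a href="/detail">Detail</a>'),
        enqueueLinks,
    } as any);
    // The handler should enqueue links matching its selector with the DETAIL label
    expect(enqueueLinks).toHaveBeenCalledWith({
        selector: 'a[href]',
        label: 'DETAIL',
    });
});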
Documentation Contributions
Documentation is crucial for any open source project. You can contribute by:
- Fixing typos and grammatical errors
- Improving existing documentation clarity
- Adding examples and tutorials
- Translating documentation to other languages
- Creating video tutorials or blog posts
Documentation files are typically located in the docs/ directory and written in Markdown:
# Example Documentation Addition
## Using Custom Storage
Crawlee allows you to configure custom storage backends for datasets, key-value stores, and request queues by passing a storage client to the Configuration class.
\`\`\`javascript
import { CheerioCrawler, Configuration } from 'crawlee';
import { MemoryStorage } from '@crawlee/memory-storage';

// Use an in-memory storage client instead of the default on-disk one
const config = new Configuration({
    storageClient: new MemoryStorage(),
});

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $ }) => {
        // Your scraping logic
    },
}, config);

await crawler.run(['https://example.com']);
\`\`\`
This approach is particularly useful when [handling browser sessions in Puppeteer](/faq/puppeteer/how-to-handle-browser-sessions-in-puppeteer) for authenticated scraping scenarios.
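A documentation snippet along those lines might look like the following sketch, built on Crawlee's session pool (useSessionPool, persistCookiesPerSession, and session.retire() are existing Crawlee APIs; the .captcha selector is a placeholder):
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // Reuse sessions and their cookies across requests
    useSessionPool: true,
    persistCookiesPerSession: true,
    requestHandler: async ({ page, session }) => {
        // Retire a session that appears blocked so a fresh one takes over
        if (await page.$('.captcha')) {
            session?.retire();
        }
    },
});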
Reporting Bugs
High-quality bug reports help maintainers fix issues quickly. A good bug report includes:
- Clear title: Describe the issue concisely
- Environment details: Node.js version, Crawlee version, operating system
- Steps to reproduce: Minimal code example that demonstrates the bug
- Expected behavior: What should happen
- Actual behavior: What actually happens
- Additional context: Error messages, stack traces, screenshots
// Example minimal reproducible code
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page }) => {
        // This causes the bug
        await page.waitForSelector('.non-existent-selector', {
            timeout: 1000,
        });
    },
});

await crawler.run(['https://example.com']);
Making Your First Pull Request
Once you've made your changes, it's time to submit a pull request (PR).
Pre-commit Checklist
Before submitting, run these checks and make sure they all pass:
# Run linter
npm run lint
# Fix linting issues automatically
npm run lint:fix
# Run tests
npm test
# Build the project
npm run build
# Check types
npm run type-check
Commit Message Guidelines
Crawlee follows the Conventional Commits format for commit messages:
# Feature
git commit -m "feat: add retry mechanism to PlaywrightCrawler"
# Bug fix
git commit -m "fix: resolve memory leak in request queue"
# Documentation
git commit -m "docs: update CheerioCrawler examples"
# Refactoring
git commit -m "refactor: simplify session pool logic"
# Tests
git commit -m "test: add integration tests for proxy rotation"
Create the Pull Request
Push your changes and create a PR:
# Push to your fork
git push origin your-branch-name
Visit GitHub and click "New Pull Request". In your PR description:
- Reference related issues (e.g., "Fixes #123")
- Describe what changed and why
- Highlight any breaking changes
- Add screenshots or examples if applicable
Code Review Process
After submitting your PR:
- Automated checks: CI/CD pipelines will run tests and linting
- Maintainer review: A project maintainer will review your code
- Feedback: You may receive comments requesting changes
- Iteration: Make requested changes and push updates
- Approval: Once approved, a maintainer will merge your PR
Be patient and responsive during code review. When handling errors in Puppeteer or similar complex scenarios, reviewers may suggest alternative approaches that improve reliability.
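For example, a reviewer might ask for explicit failure handling instead of letting a missing selector fail the whole request. A rough sketch (the h1 selector and timeout are illustrative):
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page, log }) => {
        // Handle a missing element explicitly rather than throwing
        const heading = await page.waitForSelector('h1', { timeout: 5000 }).catch(() => null);
        if (!heading) {
            log.warning(`No heading found on ${page.url()}`);
            return;
        }
    },
    // Runs after a request has exhausted all of its retries
    failedRequestHandler: async ({ request, log }) => {
        log.error(`Request ${request.url} failed too many times.`);
    },
});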
Contributing to the Community
Code isn't the only way to contribute. You can also:
Answer Questions
- Participate in GitHub Discussions
- Help others on Stack Overflow with the crawlee tag
- Join the Apify Discord server and answer questions
- Share your knowledge on social media and developer forums
Create Examples and Tutorials
// Example: E-commerce scraper tutorial
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        // Navigate to product listing
        if (request.label === 'CATEGORY') {
            await enqueueLinks({
                selector: '.product-card a',
                label: 'PRODUCT',
            });
        }

        // Scrape product details
        if (request.label === 'PRODUCT') {
            const title = await page.locator('h1.product-title').textContent();
            const price = await page.locator('.price').textContent();
            const description = await page.locator('.description').textContent();

            await Dataset.pushData({
                url: request.url,
                title,
                price,
                description,
            });
        }
    },
});

await crawler.run([{
    url: 'https://example-shop.com/category',
    label: 'CATEGORY',
}]);
This type of practical example helps developers understand how to crawl single page applications using Puppeteer and adapt similar techniques with Crawlee.
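As a sketch of that adaptation: Crawlee's browser crawlers inject helpers such as infiniteScroll into the request handler context, which is handy for client-rendered pages (the URL and .item selector here are placeholders):
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, infiniteScroll, pushData }) => {
        // Wait for the client-side app to render its first items
        await page.waitForSelector('.item');
        // Keep scrolling until no new content loads
        await infiniteScroll();
        const items = await page.locator('.item').allTextContents();
        await pushData({ url: page.url(), items });
    },
});

await crawler.run(['https://example-spa.com']);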
Improve Tooling and Infrastructure
Help improve the development experience by contributing to:
- Build scripts and CI/CD pipelines
- Development environment setup
- Testing infrastructure
- Documentation generation tools
Best Practices for Contributors
- Start small: Begin with documentation fixes or simple bug fixes before tackling major features
- Communicate early: Open an issue or discussion before starting work on significant changes
- Follow coding standards: Use the project's linting rules and style guide
- Write comprehensive tests: Aim for high test coverage on new code
- Update documentation: Always update relevant docs when changing functionality
- Be respectful: Follow the project's code of conduct and maintain professional communication
- Stay updated: Regularly sync your fork with the upstream repository
# Keep your fork updated
git fetch upstream
git checkout master
git merge upstream/master
git push origin master
Getting Help
If you need assistance:
- GitHub Discussions: Ask questions about contributing
- Discord: Join the Apify Discord server for real-time help
- Documentation: Review the Crawlee documentation
- Examples: Study the examples in the repository
- Issue comments: Tag maintainers if you need clarification on an issue
Conclusion
Contributing to Crawlee is an excellent way to improve your web scraping skills, collaborate with talented developers, and make a meaningful impact on a widely-used open source project. Whether you're fixing bugs, writing documentation, or building new features, your contributions help make web scraping more accessible to developers worldwide.
Start by exploring the GitHub repository, setting up your development environment, and looking for issues labeled good first issue. Don't hesitate to ask questions; the Crawlee community is welcoming and eager to help new contributors succeed.
Remember, every contribution matters, no matter how small. Your first PR might be fixing a typo, but it's an important step toward becoming an active member of the open source community. Happy contributing!