Where is the Crawlee GitHub Repository?
The official Crawlee GitHub repository is located at https://github.com/apify/crawlee. This is the primary source for Crawlee's codebase, documentation, examples, and community contributions. The repository is actively maintained by Apify and the open-source community, making it an essential resource for developers working with this powerful web scraping and browser automation framework.
Overview of the Crawlee Repository
Crawlee is an open-source web scraping and browser automation library available for both JavaScript/TypeScript (Node.js) and Python. The GitHub repository serves as the central hub for all Crawlee-related resources, including:
- Source code for the Crawlee library
- Comprehensive documentation and API references
- Example projects demonstrating various use cases
- Issue tracker for bug reports and feature requests
- Discussion forums for community support
- Contribution guidelines for developers who want to contribute
Repository Structure
The Crawlee project spans a JavaScript/TypeScript monorepo and a separate Python repository:
JavaScript/TypeScript Version
The main branch contains the Node.js implementation of Crawlee, which includes several specialized crawler packages:
# Clone the repository
git clone https://github.com/apify/crawlee.git
cd crawlee
# Install dependencies
npm install
# Build all packages
npm run build
The repository includes these core packages:
- crawlee - The main package with all crawler types
- @crawlee/core - Core functionality shared across all crawlers
- @crawlee/cheerio - Fast HTML crawler using Cheerio
- @crawlee/puppeteer - Browser automation with Puppeteer
- @crawlee/playwright - Browser automation with Playwright
- @crawlee/jsdom - DOM manipulation using JSDOM
- @crawlee/http - Simple HTTP crawler
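If you only need one crawler type, you can depend on its scoped package directly instead of the all-in-one crawlee meta-package. A minimal sketch (assuming the current 3.x layout, where the scoped packages export the same classes the meta-package re-exports):
// Installed with: npm install @crawlee/cheerio
import { CheerioCrawler } from '@crawlee/cheerio';

// Equivalent import via the meta-package:
// import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request }) => {
        console.log(`Fetched ${request.url}`);
    },
});
Both imports resolve to the same class, so switching between them requires no code changes beyond the import path.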
Python Version
The Python implementation is maintained in a separate repository, https://github.com/apify/crawlee-python:
# Clone the Python repository
git clone https://github.com/apify/crawlee-python.git
cd crawlee-python
# Install in development mode
pip install -e .
Key Features Available in the Repository
1. Example Projects
The repository contains numerous example projects in the examples directory that demonstrate common web scraping scenarios:
// Example: Basic CheerioCrawler usage from the repository
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $, enqueueLinks }) => {
        const title = $('title').text();
        console.log(`Title of ${request.url}: ${title}`);
        // Enqueue all links on the page matching the glob
        await enqueueLinks({
            globs: ['https://example.com/**'],
        });
    },
});

await crawler.run(['https://example.com']);
For scenarios requiring JavaScript rendering, the examples also cover browser-based crawlers such as PuppeteerCrawler:
// Example: PuppeteerCrawler from repository examples
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        // Wait for dynamic content to load
        await page.waitForSelector('.dynamic-content');
        const title = await page.title();
        console.log(`Page title: ${title}`);
        // Enqueue additional pages
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);
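Browser behavior is configurable through the crawler's launchContext, which forwards launch options to Puppeteer. A brief sketch; the headless and args values shown here are illustrative, and the available options depend on your installed Puppeteer version:
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        // Passed through to puppeteer.launch()
        launchOptions: {
            headless: true,
            args: ['--no-sandbox'], // commonly needed inside containers
        },
    },
    requestHandler: async ({ page }) => {
        console.log(`Page title: ${await page.title()}`);
    },
});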
2. Documentation
The repository includes comprehensive documentation in markdown format:
- Getting Started guides for beginners
- API Reference with detailed parameter descriptions
- Migration guides for upgrading between versions
- Best practices for efficient web scraping
3. Issue Tracking and Bug Reports
The GitHub Issues section allows you to:
- Report bugs with detailed reproduction steps
- Request new features
- Track the status of known issues
- Search for solutions to common problems
# Search for issues related to proxy configuration
# Visit: https://github.com/apify/crawlee/issues?q=is%3Aissue+proxy
How to Contribute to Crawlee
The repository welcomes contributions from the community. Here's how to get started:
1. Fork and Clone
# Fork the repository on GitHub, then clone your fork
git clone https://github.com/YOUR_USERNAME/crawlee.git
cd crawlee
# Add upstream remote
git remote add upstream https://github.com/apify/crawlee.git
2. Set Up Development Environment
# Install dependencies
npm install
# Run tests
npm test
# Run linting
npm run lint
3. Create a Pull Request
# Create a new branch
git checkout -b feature/my-new-feature
# Make your changes and commit
git add .
git commit -m "Add new feature"
# Push to your fork
git push origin feature/my-new-feature
Then open a pull request on GitHub with a clear description of your changes.
Repository Resources
Package Versions and Releases
The repository maintains detailed release notes for each version:
# View all releases
# Visit: https://github.com/apify/crawlee/releases
# Install specific version
npm install crawlee@3.5.0
Community and Support
- GitHub Discussions: Ask questions and share ideas
- Discord Server: Real-time chat with the community (link in repository README)
- Stack Overflow: Tag questions with crawlee
TypeScript Support
Crawlee is written in TypeScript, providing excellent type safety. The repository includes all type definitions:
import { CheerioCrawler, Dataset } from 'crawlee';

interface ProductData {
    title: string;
    price: number;
    url: string;
}

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $, log }) => {
        const products: ProductData[] = [];
        $('.product').each((_, element) => {
            const title = $(element).find('.title').text();
            const price = parseFloat($(element).find('.price').text());
            products.push({
                title,
                price,
                url: request.url,
            });
        });
        // Save to the default dataset
        await Dataset.pushData(products);
    },
});
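The data saved with Dataset.pushData() can be read back after the crawl, for example to post-process or export it. A minimal sketch using the default dataset (the { items } result shape comes from Crawlee's dataset API):
import { Dataset } from 'crawlee';

// After crawler.run() finishes, read back the stored items
const dataset = await Dataset.open();
const { items } = await dataset.getData();
console.log(`Collected ${items.length} records`);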
Advanced Features from the Repository
Session Management
Crawlee includes sophisticated session handling, which is useful for rotating identities and maintaining authenticated, cookie-based state across requests:
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // The crawler manages its own session pool; configure it here
    sessionPoolOptions: {
        maxPoolSize: 50,
        sessionOptions: {
            maxUsageCount: 100,
        },
    },
    requestHandler: async ({ request, session }) => {
        console.log(`Using session: ${session?.id}`);
        // Your scraping logic here
    },
});
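Sessions pay off when you react to blocking. The sketch below retires a session on an HTTP 403 so the pool rotates to a fresh identity on retry; session.retire() and the response object are part of the crawling context, though which status codes are worth handling depends on the target site:
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, session, response }) => {
        if (response.statusCode === 403) {
            // Mark this identity as burned; the pool will hand out a new one
            session?.retire();
            throw new Error(`Blocked on ${request.url}, retrying with a fresh session`);
        }
        // Normal scraping logic here
    },
});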
Request Queue Management
The repository provides robust queue management for large-scale scraping:
import { CheerioCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open();

// Add initial URLs
await requestQueue.addRequest({ url: 'https://example.com' });
await requestQueue.addRequest({ url: 'https://example.com/products' });

const crawler = new CheerioCrawler({
    requestQueue,
    requestHandler: async ({ request, enqueueLinks }) => {
        // Process the page and enqueue more product URLs
        await enqueueLinks({
            globs: ['https://example.com/products/*'],
        });
    },
});

await crawler.run();
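For larger seed lists, URLs can be added in batches instead of one addRequest() call per URL. A short sketch using addRequests(), which accepts an array of request objects in Crawlee 3.x:
import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();

// Batch-add seed URLs in a single call
await queue.addRequests([
    { url: 'https://example.com/category/1' },
    { url: 'https://example.com/category/2' },
    { url: 'https://example.com/category/3' },
]);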
Error Handling and Retries
Crawlee includes built-in error handling with configurable retry logic:
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 5,
    maxConcurrency: 10,
    requestHandlerTimeoutSecs: 60,
    requestHandler: async ({ request, log }) => {
        log.info(`Processing ${request.url}`);
        // Your scraping logic
    },
    failedRequestHandler: async ({ request, log }) => {
        log.error(`Request ${request.url} failed after retries`);
    },
});
Python Repository Usage
For Python developers, the companion crawlee-python repository provides a full Python implementation:
from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.router import Router

router = Router()

@router.default_handler
async def default_handler(context):
    """Default request handler."""
    page = context.page
    await page.wait_for_selector('.content')
    title = await page.title()
    print(f'Page title: {title}')
    # Enqueue additional links
    await context.enqueue_links()

async def main():
    crawler = PlaywrightCrawler(
        request_handler=router,
        max_requests_per_crawl=100,
    )
    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())
Staying Updated
To stay current with Crawlee development:
Watch the Repository
Click the "Watch" button on GitHub to receive notifications about:
- New releases
- Important security updates
- Breaking changes
Follow Release Notes
# Check latest release
# Visit: https://github.com/apify/crawlee/releases/latest
# View changelog
# Visit: https://github.com/apify/crawlee/blob/master/CHANGELOG.md
Subscribe to Newsletter
The Apify team maintains a newsletter with updates about Crawlee and web scraping best practices.
Conclusion
The Crawlee GitHub repository at https://github.com/apify/crawlee is the definitive resource for everything related to this powerful web scraping framework. Whether you're looking for source code, documentation, examples, or community support, the repository provides comprehensive resources for developers at all skill levels. By engaging with the repository—whether through using the code, reporting issues, or contributing improvements—you become part of a vibrant community dedicated to making web scraping more accessible and efficient.
The active development and maintenance of the repository ensure that Crawlee continues to evolve with modern web scraping needs, incorporating new features, performance improvements, and security updates regularly. Make sure to star the repository to bookmark it for future reference and to show your support for this excellent open-source project.