What are the differences between Crawlee for Python and Crawlee for JavaScript?
Crawlee is available in both JavaScript (TypeScript) and Python implementations, each tailored to their respective ecosystems while maintaining similar core functionality. Understanding the differences between these two versions is crucial for choosing the right tool for your web scraping project.
Language and Ecosystem Fundamentals
JavaScript/TypeScript Version
The JavaScript version of Crawlee was the original implementation and remains the more mature of the two. It's written in TypeScript, providing excellent type safety and IDE support. The JavaScript ecosystem offers several advantages:
- Native async/await support: JavaScript's event loop makes it naturally suited for concurrent web scraping operations
- Large ecosystem: Access to thousands of npm packages for various scraping needs
- Browser automation integration: Seamless integration with Puppeteer and Playwright
- Active development: More frequent updates and feature additions
Python Version
The Python version of Crawlee is a port of the JavaScript version, adapted to Python's idioms and ecosystem:
- Pythonic syntax: Uses familiar Python patterns and conventions
- Type hints: Leverages Python's type hinting system for better code completion
- Scientific computing integration: Easy integration with data science libraries like pandas, NumPy
- Community familiarity: Appeals to Python developers and data scientists
Installation and Setup
JavaScript/TypeScript
# Install Crawlee with npm
npm install crawlee
# Or with yarn
yarn add crawlee
# Install with specific crawler type
npm install crawlee puppeteer
The JavaScript version requires Node.js 16 or higher and comes with TypeScript definitions built-in.
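The JavaScript package also ships a CLI for scaffolding new projects; assuming a recent crawlee release, a starter project (the name my-crawler is just a placeholder) can be generated with:
# Generate a new crawler project from an interactive template
npx crawlee create my-crawler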
Python
# Install Crawlee for Python
pip install crawlee
# Install with specific crawler dependencies
pip install 'crawlee[playwright]'
pip install 'crawlee[beautifulsoup]'
The Python version requires Python 3.9 or higher and uses type hints throughout for better IDE support.
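If you install the playwright extra, note that Playwright's browser binaries are a separate, one-time download:
# Download the browsers Playwright will drive (one-time setup step)
playwright install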
API and Syntax Differences
Basic Crawler Setup
JavaScript/TypeScript:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        const title = await page.title();
        console.log(`Title: ${title} - URL: ${request.url}`);
        // Enqueue all links found on the page
        await enqueueLinks({
            globs: ['https://example.com/**'],
        });
    },
    maxRequestsPerCrawl: 100,
});
await crawler.run(['https://example.com']);
Python:
from crawlee.playwright_crawler import PlaywrightCrawler
async def request_handler(context):
    page = context.page
    request = context.request
    title = await page.title()
    print(f'Title: {title} - URL: {request.url}')
    # Enqueue all links found on the page
    await context.enqueue_links(
        globs=['https://example.com/**'],
    )
crawler = PlaywrightCrawler(
    request_handler=request_handler,
    max_requests_per_crawl=100,
)
await crawler.run(['https://example.com'])
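As an aside, the official Python examples more often register handlers with a router decorator instead of passing request_handler to the constructor; a minimal sketch of that style (with the glob filter omitted) looks like this:
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
crawler = PlaywrightCrawler(max_requests_per_crawl=100)
@crawler.router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    # The crawling context bundles the page, the request and a logger
    context.log.info(f'Processing {context.request.url}')
    await context.enqueue_links()
await crawler.run(['https://example.com'])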
Configuration Options
Both versions support similar configuration options, but with syntax differences:
JavaScript:
const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request, enqueueLinks }) => {
        // Handler logic
    },
    maxConcurrency: 50,
    maxRequestsPerCrawl: 1000,
    requestHandlerTimeoutSecs: 60,
    maxRequestRetries: 3,
    sessionPoolOptions: {
        maxPoolSize: 100,
    },
});
Python:
from datetime import timedelta
from crawlee import ConcurrencySettings
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.sessions import SessionPool
# BeautifulSoupCrawler is the closest Python counterpart to CheerioCrawler
crawler = BeautifulSoupCrawler(
    request_handler=request_handler,
    # Concurrency limits are grouped into a ConcurrencySettings object
    concurrency_settings=ConcurrencySettings(max_concurrency=50),
    max_requests_per_crawl=1000,
    # The timeout is a timedelta, not a plain number of seconds
    request_handler_timeout=timedelta(seconds=60),
    max_request_retries=3,
    # The session pool is configured via a SessionPool instance rather than an options dict
    session_pool=SessionPool(max_pool_size=100),
)
Storage and Data Export
JavaScript Data Storage
import { Dataset } from 'crawlee';
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request }) => {
        const data = {
            url: request.url,
            title: await page.title(),
            content: await page.content(),
        };
        // Push data to default dataset
        await Dataset.pushData(data);
    },
});
// Export data after crawling
await crawler.run(['https://example.com']);
const dataset = await Dataset.open();
const data = await dataset.getData();
console.log(data.items);
Python Data Storage
from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.storages import Dataset
async def request_handler(context):
    page = context.page
    request = context.request
    data = {
        'url': request.url,
        'title': await page.title(),
        'content': await page.content(),
    }
    # Push data to default dataset
    await context.push_data(data)
crawler = PlaywrightCrawler(request_handler=request_handler)
await crawler.run(['https://example.com'])
# Access stored data
dataset = await Dataset.open()
data = await dataset.get_data()
print(data.items)
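Recent Python releases also provide a convenience helper on the crawler itself for getting results onto disk; assuming your installed version exposes export_data, the default dataset can be written to a file in one call:
# Dump everything collected in the default dataset to a single file
await crawler.export_data('results.json')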
Performance Considerations
JavaScript Performance
- Event-driven architecture: JavaScript's non-blocking I/O makes it highly efficient for concurrent requests
- Memory efficiency: Generally uses less memory for similar workloads
- Faster startup: Node.js typically starts faster than Python
- V8 optimization: Benefits from Google's highly optimized V8 engine
Python Performance
- GIL limitations: Python's Global Interpreter Lock limits true parallelism for CPU-bound work, though crawling is mostly I/O-bound, so asyncio still delivers high concurrency
- AsyncIO overhead: Python's async implementation carries more per-task overhead than JavaScript's event loop
- Better for data processing: Excels when combined with data analysis libraries
- Scientific computing: Superior when integrating with machine learning pipelines
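To make the GIL point concrete, here is a small standard-library sketch (no Crawlee involved) showing that I/O-bound tasks overlap on a single thread, which is why asyncio-based crawling scales well despite the GIL:
import asyncio
import time
async def fake_fetch(url: str) -> str:
    # Stand-in for a network round trip; awaiting releases the event loop
    await asyncio.sleep(1)
    return url
async def main() -> None:
    start = time.perf_counter()
    await asyncio.gather(*(fake_fetch(f'https://example.com/{i}') for i in range(10)))
    # Ten one-second "requests" finish in roughly one second, not ten
    print(f'Elapsed: {time.perf_counter() - start:.1f}s')
asyncio.run(main())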
Browser Automation Support
Both versions support similar browser automation capabilities, but with slight differences:
JavaScript Browser Support
// Puppeteer integration
import { PuppeteerCrawler } from 'crawlee';
const crawler = new PuppeteerCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
            args: ['--no-sandbox'],
        },
    },
    requestHandler: async ({ page }) => {
        // Wait for dynamic content
        await page.waitForSelector('.dynamic-content');
        const content = await page.$eval('.dynamic-content', el => el.textContent);
    },
});
Python Browser Support
from crawlee.playwright_crawler import PlaywrightCrawler
async def request_handler(context):
    page = context.page
    # Wait for dynamic content
    await page.wait_for_selector('.dynamic-content')
    content = await page.eval_on_selector('.dynamic-content', 'el => el.textContent')
crawler = PlaywrightCrawler(
    request_handler=request_handler,
    # Browser options are passed directly rather than through a launch context; extra
    # flags such as --no-sandbox go through the browser launch options (see the
    # PlaywrightCrawler API for the exact parameter)
    headless=True,
    browser_type='chromium',
)
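Because the Python version drives Playwright directly, you can also lean on Playwright's locator API, which waits for the element automatically; a minimal sketch of the same extraction:
async def request_handler(context):
    page = context.page
    # Locators auto-wait, so the explicit wait_for_selector call becomes optional
    content = await page.locator('.dynamic-content').text_content()
    await context.push_data({'url': context.request.url, 'content': content})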
Both implementations support handling browser sessions and complex browser automation scenarios.
Proxy and Session Management
JavaScript Proxy Configuration
import { ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8000',
        'http://proxy2.example.com:8000',
    ],
});
const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    useSessionPool: true,
    sessionPoolOptions: {
        sessionOptions: {
            maxUsageCount: 50,
        },
    },
});
Python Proxy Configuration
from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration
from crawlee.sessions import SessionPool
proxy_configuration = ProxyConfiguration(
    proxy_urls=[
        'http://proxy1.example.com:8000',
        'http://proxy2.example.com:8000',
    ],
)
crawler = PlaywrightCrawler(
    proxy_configuration=proxy_configuration,
    use_session_pool=True,
    # Per-session limits live on the SessionPool in Python; this assumes the pool
    # forwards create_session_settings to each new Session it creates
    session_pool=SessionPool(create_session_settings={'max_usage_count': 50}),
)
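Both implementations rotate through the proxy list for you; if you want to see which URL would be handed out next, and assuming the Python ProxyConfiguration mirrors the JavaScript newUrl helper as new_url, a quick check looks like this:
# Ask the configuration for the next proxy URL it would assign
proxy_url = await proxy_configuration.new_url()
print(proxy_url)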
Error Handling and Retries
JavaScript Error Handling
const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request }) => {
        try {
            // Scraping logic
        } catch (error) {
            console.error(`Error processing ${request.url}:`, error);
            throw error; // Re-throw to trigger a retry
        }
    },
    // The error object is passed as a second argument, not on the context
    failedRequestHandler: async ({ request }, error) => {
        console.log(`Request ${request.url} failed: ${error.message}`);
    },
    maxRequestRetries: 3,
});
Python Error Handling
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
async def request_handler(context):
    try:
        # Scraping logic
        pass
    except Exception as error:
        print(f'Error processing {context.request.url}: {error}')
        raise  # Re-raise to trigger a retry
crawler = BeautifulSoupCrawler(
    request_handler=request_handler,
    max_request_retries=3,
)
# The failed-request handler is registered with a decorator once the crawler exists,
# and it receives the error as a second argument after all retries are exhausted
@crawler.failed_request_handler
async def failed_request_handler(context, error):
    print(f'Request {context.request.url} failed: {error}')
Feature Parity and Maturity
JavaScript (More Mature)
- AutoscaledPool: Automatically scales concurrency up and down based on available CPU and memory
- Fingerprint generation: Better bot detection avoidance
- Request interception: More granular control over network requests
- Plugin ecosystem: Larger collection of community plugins
- Documentation: More comprehensive and up-to-date
Python (Catching Up)
- Core functionality: Most essential features are implemented
- Pythonic patterns: Better integration with Python ecosystem
- Data science integration: Easier to combine with pandas, NumPy
- Growing community: Actively developing new features
- Type hints: Excellent IDE support through type annotations
When to Choose Each Version
Choose JavaScript/TypeScript When:
- Performance is critical: You need maximum concurrency and minimal overhead
- Browser automation focus: Heavy reliance on Puppeteer or Playwright
- Cutting-edge features: You want access to the latest Crawlee features
- Existing Node.js infrastructure: You're already working in a JavaScript environment
- Large-scale scraping: You need to handle thousands of concurrent requests
Choose Python When:
- Data science integration: You plan to analyze scraped data with pandas, scikit-learn
- Team expertise: Your team is more comfortable with Python
- Rapid prototyping: You want to quickly test scraping strategies
- ML pipelines: Integrating scraping with machine learning workflows
- Scientific computing: Working with numerical data or research applications
Best Practices for Each Implementation
JavaScript Best Practices
// Use TypeScript for type safety
import { PlaywrightCrawler, Dataset } from 'crawlee';
// Define types for scraped data
interface ProductData {
    name: string;
    price: number;
    url: string;
}
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        // Use structured logging
        log.info(`Processing ${request.url}`);
        // Type-safe data extraction
        const data: ProductData = {
            name: await page.$eval('.product-name', el => el.textContent || ''),
            price: parseFloat(await page.$eval('.price', el => el.textContent || '0')),
            url: request.url,
        };
        await Dataset.pushData(data);
    },
});
Python Best Practices
from typing import TypedDict
import pandas as pd
from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.storages import Dataset
# Define typed data structure
class ProductData(TypedDict):
    name: str
    price: float
    url: str
async def request_handler(context):
    page = context.page
    request = context.request
    log = context.log
    # Use structured logging
    log.info(f'Processing {request.url}')
    # Type-safe data extraction
    data: ProductData = {
        'name': await page.eval_on_selector('.product-name', 'el => el.textContent') or '',
        'price': float(await page.eval_on_selector('.price', 'el => el.textContent') or '0'),
        'url': request.url,
    }
    await context.push_data(data)
# Easy pandas integration
crawler = PlaywrightCrawler(request_handler=request_handler)
await crawler.run(['https://example.com'])
# Convert to DataFrame for analysis
dataset = await Dataset.open()
data = await dataset.get_data()
df = pd.DataFrame(data.items)
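From there the usual pandas workflow applies, for example a quick summary and a CSV export (the file name is just an example):
# Inspect the scraped prices and persist the full table
print(df['price'].describe())
df.to_csv('products.csv', index=False)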
Conclusion
Both Crawlee implementations provide powerful web scraping capabilities, and the choice largely depends on your project requirements and team expertise. The JavaScript version offers better performance and more mature features, while the Python version provides excellent integration with data science tools and a more familiar syntax for Python developers. Both versions continue to evolve, with the Python implementation steadily catching up to feature parity with JavaScript.
For most production web scraping applications requiring maximum performance and concurrency, the JavaScript version is recommended. For data science projects, rapid prototyping, or teams with strong Python expertise, the Python version is an excellent choice.