What are the differences between Crawlee for Python and Crawlee for JavaScript?

Crawlee is available in both JavaScript (TypeScript) and Python implementations, each tailored to their respective ecosystems while maintaining similar core functionality. Understanding the differences between these two versions is crucial for choosing the right tool for your web scraping project.

Language and Ecosystem Fundamentals

JavaScript/TypeScript Version

The JavaScript version of Crawlee was the original implementation and remains the more mature of the two. It's written in TypeScript, providing excellent type safety and IDE support. The JavaScript ecosystem offers several advantages:

  • Native async/await support: JavaScript's event loop makes it naturally suited for concurrent web scraping operations
  • Large ecosystem: Access to thousands of npm packages for various scraping needs
  • Browser automation integration: Seamless integration with Puppeteer and Playwright
  • Active development: More frequent updates and feature additions

Python Version

The Python version of Crawlee is a port of the JavaScript version, adapted to Python's idioms and ecosystem:

  • Pythonic syntax: Uses familiar Python patterns and conventions
  • Type hints: Leverages Python's type hinting system for better code completion
  • Scientific computing integration: Easy integration with data science libraries like pandas, NumPy
  • Community familiarity: Appeals to Python developers and data scientists

Installation and Setup

JavaScript/TypeScript

# Install Crawlee with npm
npm install crawlee

# Or with yarn
yarn add crawlee

# Install with specific crawler type
npm install crawlee puppeteer

The JavaScript version requires Node.js 16 or higher and comes with TypeScript definitions built-in.
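
If you are starting a new project from scratch, the Crawlee CLI can also scaffold one from a template, as described in the Crawlee quick start (the exact template prompts may vary between versions):

# Bootstrap a new Crawlee project interactively
npx crawlee create my-crawler
cd my-crawler
npm start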

Python

# Install Crawlee for Python
pip install crawlee

# Install with specific crawler dependencies
pip install 'crawlee[playwright]'
pip install 'crawlee[beautifulsoup]'

The Python version requires Python 3.9 or higher and uses type hints for better IDE support.
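
Note that the playwright extra installs the Playwright Python package but not the browser binaries themselves; those are downloaded with Playwright's own CLI (a Playwright command, not part of Crawlee):

# Download the browser binaries used by PlaywrightCrawler
playwright install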

API and Syntax Differences

Basic Crawler Setup

JavaScript/TypeScript:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        const title = await page.title();
        console.log(`Title: ${title} - URL: ${request.url}`);

        // Enqueue all links found on the page
        await enqueueLinks({
            globs: ['https://example.com/**'],
        });
    },
    maxRequestsPerCrawl: 100,
});

await crawler.run(['https://example.com']);

Python:

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler

async def request_handler(context):
    page = context.page
    request = context.request

    title = await page.title()
    print(f'Title: {title} - URL: {request.url}')

    # Enqueue all links found on the page
    await context.enqueue_links(
        globs=['https://example.com/**']
    )

async def main():
    crawler = PlaywrightCrawler(
        request_handler=request_handler,
        max_requests_per_crawl=100,
    )
    await crawler.run(['https://example.com'])

asyncio.run(main())

Configuration Options

Both versions support similar configuration options, but with syntax differences:

JavaScript:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request, enqueueLinks }) => {
        // Handler logic
    },
    maxConcurrency: 50,
    maxRequestsPerCrawl: 1000,
    requestHandlerTimeoutSecs: 60,
    maxRequestRetries: 3,
    sessionPoolOptions: {
        maxPoolSize: 100,
    },
});

Python:

from datetime import timedelta

from crawlee import ConcurrencySettings
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.sessions import SessionPool

# BeautifulSoupCrawler is the Python counterpart of CheerioCrawler
crawler = BeautifulSoupCrawler(
    request_handler=request_handler,
    concurrency_settings=ConcurrencySettings(max_concurrency=50),  # concurrency lives in a settings object
    max_requests_per_crawl=1000,
    request_handler_timeout=timedelta(seconds=60),  # a timedelta rather than a plain number of seconds
    max_request_retries=3,
    use_session_pool=True,
    session_pool=SessionPool(max_pool_size=100),  # configured via a SessionPool instance, not a dict
)

Storage and Data Export

JavaScript Data Storage

import { Dataset, PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request }) => {
        const data = {
            url: request.url,
            title: await page.title(),
            content: await page.content(),
        };

        // Push data to default dataset
        await Dataset.pushData(data);
    },
});

// Export data after crawling
await crawler.run(['https://example.com']);
const dataset = await Dataset.open();
const data = await dataset.getData();
console.log(data.items);
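
Beyond reading items back with getData(), the dataset contents can also be written out in bulk. The following is a minimal sketch assuming the exportToJSON and exportToCSV helpers available in recent Crawlee releases, which save the export as a record in the default key-value store:

import { Dataset } from 'crawlee';

// Write all collected items out as a single JSON or CSV record
const dataset = await Dataset.open();
await dataset.exportToJSON('results');
await dataset.exportToCSV('results');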

Python Data Storage

from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.storages import Dataset

async def request_handler(context):
    page = context.page
    request = context.request

    data = {
        'url': request.url,
        'title': await page.title(),
        'content': await page.content(),
    }

    # Push data to default dataset
    await context.push_data(data)

# As in the basic example above, run the following inside an async main()
crawler = PlaywrightCrawler(request_handler=request_handler)
await crawler.run(['https://example.com'])

# Access stored data
dataset = await Dataset.open()
data = await dataset.get_data()
print(data.items)
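
On the Python side, a comparable shortcut is the crawler's export helper; a minimal sketch assuming the export_data() method available in recent crawlee releases, which writes the default dataset to a local file whose format follows the extension:

# Run inside the same async context as crawler.run()
await crawler.export_data('results.json')  # or 'results.csv'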

Performance Considerations

JavaScript Performance

  • Event-driven architecture: JavaScript's non-blocking I/O makes it highly efficient for concurrent requests
  • Memory efficiency: Generally uses less memory for similar workloads
  • Faster startup: Node.js typically starts faster than Python
  • V8 optimization: Benefits from Google's highly optimized V8 engine

Python Performance

  • GIL limitations: Python's Global Interpreter Lock can limit true parallelism in CPU-bound tasks
  • AsyncIO overhead: Python's async implementation has more overhead compared to JavaScript
  • Better for data processing: Excels when combined with data analysis libraries
  • Scientific computing: Superior when integrating with machine learning pipelines

Browser Automation Support

Both versions support similar browser automation capabilities, but with slight differences:

JavaScript Browser Support

// Puppeteer integration
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
            args: ['--no-sandbox'],
        },
    },
    requestHandler: async ({ page }) => {
        // Wait for dynamic content
        await page.waitForSelector('.dynamic-content');
        const content = await page.$eval('.dynamic-content', el => el.textContent);
    },
});

Python Browser Support

from crawlee.playwright_crawler import PlaywrightCrawler

async def request_handler(context):
    page = context.page
    # Wait for dynamic content
    await page.wait_for_selector('.dynamic-content')
    content = await page.eval_on_selector('.dynamic-content', 'el => el.textContent')

crawler = PlaywrightCrawler(
    request_handler=request_handler,
    headless=True,  # browser options are passed directly rather than via a launchContext object
    # Extra Playwright launch options (supported in recent crawlee versions)
    browser_launch_options={'args': ['--no-sandbox']},
)

Both implementations support handling browser sessions and complex browser automation scenarios.
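
For instance, both let you hook into the navigation step and adjust the browser before each page load. Below is a minimal sketch of the JavaScript side using the preNavigationHooks option; blocking heavy static assets this way is a common trick for speeding up browser-based crawls (the URL pattern is illustrative):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Runs before every navigation; abort requests for images and fonts
    preNavigationHooks: [
        async ({ page }) => {
            await page.route('**/*.{png,jpg,jpeg,gif,woff2}', (route) => route.abort());
        },
    ],
    requestHandler: async ({ page, request }) => {
        console.log(`Loaded ${request.url} without images or fonts`);
    },
});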

Proxy and Session Management

JavaScript Proxy Configuration

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8000',
        'http://proxy2.example.com:8000',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    useSessionPool: true,
    sessionPoolOptions: {
        sessionOptions: {
            maxUsageCount: 50,
        },
    },
});

Python Proxy Configuration

from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration
from crawlee.sessions import SessionPool

proxy_configuration = ProxyConfiguration(
    proxy_urls=[
        'http://proxy1.example.com:8000',
        'http://proxy2.example.com:8000',
    ],
)

crawler = PlaywrightCrawler(
    proxy_configuration=proxy_configuration,
    use_session_pool=True,
    # Session behaviour is configured through a SessionPool instance;
    # create_session_settings is forwarded to each newly created Session
    session_pool=SessionPool(
        create_session_settings={'max_usage_count': 50},
    ),
)

Error Handling and Retries

JavaScript Error Handling

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request }) => {
        try {
            // Scraping logic
        } catch (error) {
            console.error(`Error processing ${request.url}:`, error);
            throw error; // Re-throw to trigger retry
        }
    },
    // The error is passed as a separate second argument
    failedRequestHandler: async ({ request }, error) => {
        console.log(`Request ${request.url} failed: ${error.message}`);
    },
    maxRequestRetries: 3,
});

Python Error Handling

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

async def request_handler(context):
    try:
        # Scraping logic
        pass
    except Exception as error:
        print(f'Error processing {context.request.url}: {error}')
        raise  # Re-raise to trigger a retry

# BeautifulSoupCrawler is the Python counterpart of CheerioCrawler
crawler = BeautifulSoupCrawler(
    request_handler=request_handler,
    max_request_retries=3,
)

# Registered as a decorator; the exception is passed as a separate argument
@crawler.failed_request_handler
async def handle_failed_request(context, error):
    print(f'Request {context.request.url} failed: {error}')

Feature Parity and Maturity

JavaScript (More Mature)

  • AutoscaledPool: Advanced request queue management
  • Fingerprint generation: Better bot detection avoidance
  • Request interception: More granular control over network requests
  • Plugin ecosystem: Larger collection of community plugins
  • Documentation: More comprehensive and up-to-date

Python (Catching Up)

  • Core functionality: Most essential features are implemented
  • Pythonic patterns: Better integration with Python ecosystem
  • Data science integration: Easier to combine with pandas, NumPy
  • Growing community: Actively developing new features
  • Type hints: Excellent IDE support through type annotations

When to Choose Each Version

Choose JavaScript/TypeScript When:

  1. Performance is critical: You need maximum concurrency and minimal overhead
  2. Browser automation focus: Heavy reliance on Puppeteer or Playwright
  3. Cutting-edge features: You want access to the latest Crawlee features
  4. Existing Node.js infrastructure: You're already working in a JavaScript environment
  5. Large-scale scraping: You need to handle thousands of concurrent requests

Choose Python When:

  1. Data science integration: You plan to analyze scraped data with pandas, scikit-learn
  2. Team expertise: Your team is more comfortable with Python
  3. Rapid prototyping: You want to quickly test scraping strategies
  4. ML pipelines: Integrating scraping with machine learning workflows
  5. Scientific computing: Working with numerical data or research applications

Best Practices for Each Implementation

JavaScript Best Practices

// Use TypeScript for type safety
import { PlaywrightCrawler, Dataset } from 'crawlee';

// Define types for scraped data
interface ProductData {
    name: string;
    price: number;
    url: string;
}

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        // Use structured logging
        log.info(`Processing ${request.url}`);

        // Type-safe data extraction
        const data: ProductData = {
            name: await page.$eval('.product-name', el => el.textContent || ''),
            price: parseFloat(await page.$eval('.price', el => el.textContent || '0')),
            url: request.url,
        };

        await Dataset.pushData(data);
    },
});

Python Best Practices

from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.storages import Dataset
from typing import TypedDict
import pandas as pd

# Define typed data structure
class ProductData(TypedDict):
    name: str
    price: float
    url: str

async def request_handler(context):
    page = context.page
    request = context.request
    log = context.log

    # Use structured logging
    log.info(f'Processing {request.url}')

    # Type-safe data extraction
    data: ProductData = {
        'name': await page.eval_on_selector('.product-name', 'el => el.textContent') or '',
        'price': float(await page.eval_on_selector('.price', 'el => el.textContent') or '0'),
        'url': request.url,
    }

    await context.push_data(data)

# Easy pandas integration (run the following inside an async main(), as in the basic example)
crawler = PlaywrightCrawler(request_handler=request_handler)
await crawler.run(['https://example.com'])

# Convert to DataFrame for analysis
dataset = await Dataset.open()
data = await dataset.get_data()
df = pd.DataFrame(data.items)

Conclusion

Both Crawlee implementations provide powerful web scraping capabilities, and the choice largely depends on your project requirements and team expertise. The JavaScript version offers better performance and more mature features, while the Python version provides excellent integration with data science tools and a more familiar syntax for Python developers. Both versions continue to evolve, with the Python implementation steadily catching up to feature parity with JavaScript.

For most production web scraping applications requiring maximum performance and concurrency, the JavaScript version is recommended. For data science projects, rapid prototyping, or teams with strong Python expertise, the Python version is an excellent choice.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
