
How do I install Crawlee for my web scraping project?

Crawlee is a powerful web scraping and browser automation library available for both Node.js and Python. It provides a robust framework for building reliable crawlers with automatic scaling, proxy rotation, and smart request management. This guide covers all the installation methods and initial setup steps.

Prerequisites

Before installing Crawlee, ensure you have the following prerequisites installed on your system:

For Node.js (Crawlee for JavaScript/TypeScript)

  • Node.js: Version 16.0 or higher
  • npm or yarn: Package manager (comes with Node.js)

Check your Node.js version:

node --version

For Python (Crawlee for Python)

  • Python: Version 3.9 or higher
  • pip: Python package manager

Check your Python version:

python --version
# or
python3 --version

Installing Crawlee for Node.js

There are several ways to install and use Crawlee in your Node.js projects.

Method 1: Using npm (Recommended for Projects)

If you're building a complete web scraping project, install Crawlee as a dependency:

npm install crawlee

For TypeScript projects, Crawlee comes with built-in TypeScript definitions, so no additional type packages are needed.

If you need to use Playwright for browser automation (recommended for modern websites):

npm install crawlee playwright

Or if you prefer Puppeteer:

npm install crawlee puppeteer

For lightweight HTTP requests without a browser:

npm install crawlee cheerio

Method 2: Using Yarn

yarn add crawlee

With browser automation:

yarn add crawlee playwright
# or
yarn add crawlee puppeteer

Method 3: Using npx (Quick Start)

For quick experimentation or one-off scripts, you can create a Crawlee project using the official template:

npx crawlee create my-crawler

This interactive command will:

  1. Ask you to choose a template (Playwright, Puppeteer, Cheerio, or JSDOM)
  2. Set up a complete project structure
  3. Install all dependencies automatically
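
Once scaffolding finishes, you can run the generated crawler immediately; the official templates include an npm start script:

cd my-crawler
npm start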

Method 4: Using Crawlee CLI

Install the Crawlee CLI globally for easy project creation (the crawlee package ships the CLI binary):

npm install -g crawlee

Then create new projects:

crawlee create my-crawler

Installing Crawlee for Python

Crawlee for Python provides a Pythonic interface to the same powerful crawling features.

Using pip

pip install crawlee

If you want to use Playwright for browser automation (in shells such as zsh, quote the extra: pip install 'crawlee[playwright]'):

pip install crawlee[playwright]

After installation, install Playwright browsers:

playwright install
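
If you only need a single browser, you can limit the download; for example, Chromium only:

playwright install chromium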

For HTTP-only crawling with BeautifulSoup:

pip install crawlee[beautifulsoup]

Using pip with Virtual Environment (Recommended)

It's best practice to use a virtual environment:

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Install Crawlee
pip install crawlee[playwright]

# Install Playwright browsers
playwright install
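
To confirm the package landed in the active environment, inspect its metadata:

pip show crawlee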

Installing with Docker

For containerized deployments, you can use Docker to run Crawlee projects.

Node.js Dockerfile Example

FROM node:18-slim

# Install Playwright dependencies
RUN apt-get update && apt-get install -y \
    wget \
    gnupg \
    ca-certificates \
    fonts-liberation \
    libappindicator3-1 \
    libasound2 \
    libatk-bridge2.0-0 \
    libatk1.0-0 \
    libcups2 \
    libdbus-1-3 \
    libgdk-pixbuf2.0-0 \
    libnspr4 \
    libnss3 \
    libx11-xcb1 \
    libxcomposite1 \
    libxdamage1 \
    libxrandr2 \
    xdg-utils \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY package*.json ./
RUN npm install

COPY . .

CMD ["node", "src/main.js"]

Python Dockerfile Example

FROM python:3.11-slim

# Install Playwright dependencies
RUN apt-get update && apt-get install -y \
    wget \
    gnupg \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Install Playwright browsers
RUN playwright install --with-deps chromium

COPY . .

CMD ["python", "main.py"]

Verifying Your Installation

After installation, verify that Crawlee is working correctly.

Node.js Verification

Create a test file test.js:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request }) => {
        console.log(`Processing ${request.url}...`);
        const title = await page.title();
        console.log(`Page title: ${title}`);
    },
});

await crawler.run(['https://crawlee.dev']);
console.log('Crawlee is working!');

Run it (note that the import syntax and top-level await require ES modules, so set "type": "module" in your package.json or rename the file to test.mjs):

node test.js

Python Verification

Create a test file test.py:

import asyncio
from crawlee.playwright_crawler import PlaywrightCrawler

async def main():
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context):
        print(f'Processing {context.request.url}...')
        title = await context.page.title()
        print(f'Page title: {title}')

    await crawler.run(['https://crawlee.dev'])
    print('Crawlee is working!')

if __name__ == '__main__':
    asyncio.run(main())

Run it:

python test.py

Choosing the Right Crawlee Configuration

Crawlee supports multiple HTTP clients and browser automation tools. Here's when to use each:

Cheerio Crawler (Node.js) / BeautifulSoup Crawler (Python)

Best for:

  • Static HTML websites
  • Fast, lightweight scraping
  • Sites that don't require JavaScript execution

Installation:

# Node.js
npm install crawlee cheerio

# Python
pip install crawlee[beautifulsoup]
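
A minimal CheerioCrawler sketch in Node.js (the BeautifulSoup variant follows the same handler pattern in Python):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, $ }) => {
        // $ is a Cheerio handle over the downloaded HTML, jQuery-style
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

await crawler.run(['https://crawlee.dev']);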

Puppeteer Crawler

Best for:

  • JavaScript-heavy websites
  • Chrome/Chromium-specific features
  • When you need DevTools Protocol access

Installation:

npm install crawlee puppeteer
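
A minimal PuppeteerCrawler sketch looks almost identical, except the handler receives a full Puppeteer page object:

import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ request, page }) => {
        // page is a regular Puppeteer Page backed by the DevTools Protocol
        console.log(`${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://crawlee.dev']);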

Playwright Crawler

Best for:

  • Modern web applications
  • Cross-browser testing (Chrome, Firefox, Safari)
  • Advanced browser automation features
  • Sites with complex JavaScript rendering

Installation:

# Node.js
npm install crawlee playwright

# Python
pip install crawlee[playwright]
playwright install

Much like you would manage browser sessions by hand in Puppeteer, Crawlee takes care of browser instances for you, rotating sessions automatically and generating realistic browser fingerprints.
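
Both behaviors can be tuned on the crawler itself; the values below are illustrative, not defaults:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    useSessionPool: true,                    // rotate sessions across requests
    sessionPoolOptions: { maxPoolSize: 50 }, // illustrative pool size
    browserPoolOptions: {
        useFingerprints: true,               // generate realistic browser fingerprints
    },
    requestHandler: async ({ request }) => {
        console.log(`Fetched ${request.url}`);
    },
});

await crawler.run(['https://crawlee.dev']);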

Initial Project Structure

After installation, here's a recommended project structure:

Node.js Project

my-crawler/
├── src/
│   ├── main.js           # Main crawler script
│   └── routes.js         # Request handlers
├── storage/              # Data storage (auto-created)
├── package.json
└── .gitignore

Python Project

my-crawler/
├── main.py               # Main crawler script
├── requirements.txt
├── storage/              # Data storage (auto-created)
└── .gitignore
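
In both layouts, the storage/ directory and local dependencies should stay out of version control; a minimal .gitignore could look like this:

# Dependencies
node_modules/
venv/
__pycache__/

# Crawlee local storage
storage/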

Common Installation Issues and Solutions

Issue: Playwright browsers not installing

Solution: Manually install browsers after installing Crawlee:

# Node.js
npx playwright install

# Python
playwright install

Issue: Permission errors on Linux/macOS

Solution: Use pip with the --user flag or use a virtual environment:

pip install --user crawlee[playwright]

Issue: Missing system dependencies

Solution: Install system dependencies for Playwright:

# Ubuntu/Debian
npx playwright install-deps

# Or manually
sudo apt-get install -y \
    libnss3 \
    libxss1 \
    libasound2 \
    libatk-bridge2.0-0 \
    libgtk-3-0

Issue: Node.js version too old

Solution: Update Node.js using nvm:

# Install nvm
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.0/install.sh | bash

# Install latest Node.js
nvm install node
nvm use node

Next Steps After Installation

Once Crawlee is installed, you can:

  1. Configure request routing: Set up different handlers for different URL patterns (see the sketch after this list)
  2. Add proxy support: Implement proxy rotation for large-scale scraping (also shown in the sketch below)
  3. Set up data storage: Configure datasets, key-value stores, or custom storage
  4. Handle pagination: Implement logic to navigate to different pages using Puppeteer or Playwright
  5. Implement error handling: Add retry logic and error recovery mechanisms
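
As a sketch of the first two items, here is request routing combined with proxy rotation; the CSS selector and proxy URLs are placeholders you would replace with your own:

import { PlaywrightCrawler, ProxyConfiguration, createPlaywrightRouter } from 'crawlee';

// Route requests by label instead of one monolithic handler
const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info('Listing page, enqueueing detail links...');
    await enqueueLinks({ selector: 'a.detail', label: 'DETAIL' });
});

router.addHandler('DETAIL', async ({ request, log }) => {
    log.info(`Detail page: ${request.url}`);
});

// Rotate between the listed proxies on each request
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-a.example.com:8000',
        'http://proxy-b.example.com:8000',
    ],
});

const crawler = new PlaywrightCrawler({
    requestHandler: router,
    proxyConfiguration,
});

await crawler.run(['https://example.com']);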

Here's a complete starter example in JavaScript:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 50,
    requestHandler: async ({ request, page, enqueueLinks, log }) => {
        log.info(`Processing: ${request.url}`);

        // Extract data
        const title = await page.title();
        const data = await page.$$eval('article', articles =>
            articles.map(article => ({
                heading: article.querySelector('h2')?.textContent,
                text: article.querySelector('p')?.textContent,
            }))
        );

        // Save data
        await Dataset.pushData({ url: request.url, title, items: data });

        // Enqueue new links
        await enqueueLinks({
            selector: 'a[href]',
            strategy: 'same-domain',
        });
    },
});

await crawler.run(['https://example.com']);

And in Python:

import asyncio
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main():
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=50,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing: {context.request.url}')

        # Extract data
        title = await context.page.title()
        articles = await context.page.query_selector_all('article')

        data = []
        for article in articles:
            heading = await article.query_selector('h2')
            text = await article.query_selector('p')
            data.append({
                'heading': await heading.text_content() if heading else None,
                'text': await text.text_content() if text else None,
            })

        # Save data
        await context.push_data({
            'url': context.request.url,
            'title': title,
            'items': data,
        })

        # Enqueue new links
        await context.enqueue_links()

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())

Conclusion

Installing Crawlee is straightforward whether you're using Node.js or Python. The library's flexibility allows you to choose between lightweight HTTP crawling and full browser automation depending on your scraping needs. With proper installation and configuration, Crawlee provides a robust foundation for building scalable web scraping projects with features like automatic retries, proxy rotation, and intelligent request management.

For production deployments, consider using Docker containers and implementing robust error handling to ensure your crawlers run reliably at scale.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
