How Do I Install Crawlee for Python Web Scraping Projects?

Crawlee for Python is a powerful web scraping and browser automation library that provides a robust framework for building reliable scrapers. Installing Crawlee for Python is straightforward, but understanding the proper setup process ensures you have a stable development environment for your web scraping projects.

Prerequisites

Before installing Crawlee for Python, ensure you have the following prerequisites:

  • Python 3.9 or higher: older interpreters are not supported
  • pip: Python's package installer (usually comes with Python)
  • Virtual environment tool (recommended): venv, virtualenv, or conda

You can check your Python version with:

python --version
# or
python3 --version

Basic Installation with pip

The simplest way to install Crawlee for Python is using pip:

pip install crawlee

This command installs the core Crawlee library along with its basic dependencies. However, for production web scraping projects, you'll typically need additional components.
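
Even the core install is enough for plain HTTP crawling with the built-in HttpCrawler. Below is a minimal sketch; the crawlee.http_crawler module path and the http_response.read() call mirror the 0.2.x-era API and are worth double-checking against your installed version:

import asyncio

from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext

async def main():
    crawler = HttpCrawler()

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        # http_response exposes the raw HTTP response; read() returns the body bytes
        body = context.http_response.read()
        print(f"{context.request.url}: {len(body)} bytes")

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())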

Installing Crawlee with Specific Browser Support

Crawlee for Python supports multiple browser automation libraries. You can install Crawlee with specific browser support using pip extras:

Installing with Playwright Support

Playwright is a modern browser automation library that supports Chromium, Firefox, and WebKit. To install Crawlee with Playwright:

pip install 'crawlee[playwright]'

After installation, you need to install the browser binaries:

playwright install

This downloads the necessary browser binaries (Chromium, Firefox, and WebKit by default). If you only need specific browsers:

# Install only Chromium
playwright install chromium

# Install Chromium and Firefox
playwright install chromium firefox

Installing with BeautifulSoup Support

For simpler HTML parsing without full browser automation, you can use BeautifulSoup:

pip install 'crawlee[beautifulsoup]'

This is ideal for scraping static websites that don't require JavaScript rendering.
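
As a quick illustration, here is a minimal sketch of a BeautifulSoup-based crawl; it assumes the 0.2.x-era crawlee.beautifulsoup_crawler module path and the context.soup attribute holding the parsed page:

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        # context.soup is the page parsed with BeautifulSoup
        title = context.soup.title.string if context.soup.title else None
        print(f"{context.request.url}: {title}")

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())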

Installing with All Features

To install Crawlee with all available features:

pip install 'crawlee[all]'

This includes support for Playwright, BeautifulSoup, and other optional dependencies.

Setting Up a Virtual Environment (Recommended)

Creating a virtual environment is highly recommended to avoid dependency conflicts and maintain project isolation.

Using venv (Built-in)

# Create a virtual environment
python -m venv crawlee-env

# Activate on Linux/macOS
source crawlee-env/bin/activate

# Activate on Windows
crawlee-env\Scripts\activate

# Install Crawlee
pip install 'crawlee[playwright]'

Using virtualenv

# Install virtualenv if not already installed
pip install virtualenv

# Create virtual environment
virtualenv crawlee-env

# Activate (same as venv)
source crawlee-env/bin/activate  # Linux/macOS
crawlee-env\Scripts\activate     # Windows

# Install Crawlee
pip install 'crawlee[playwright]'

Using conda

# Create conda environment
conda create -n crawlee-env python=3.11

# Activate environment
conda activate crawlee-env

# Install Crawlee
pip install 'crawlee[playwright]'

Verifying the Installation

After installation, verify that Crawlee is properly installed:

from importlib.metadata import version

import crawlee  # raises ImportError if the installation is broken

print(f"Crawlee version: {version('crawlee')}")

You can also run a simple test to ensure everything works:

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main():
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        print(f"Processing: {context.request.url}")
        title = await context.page.title()
        print(f"Page title: {title}")

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())

Installing from Source (Development)

For development or testing the latest features, you can install Crawlee from the GitHub repository:

# Clone the repository
git clone https://github.com/apify/crawlee-python.git
cd crawlee-python

# Install in editable mode with development dependencies
pip install -e '.[dev]'

Common Installation Issues and Solutions

Issue: Python Version Incompatibility

If you see errors about Python version requirements:

# Check your Python version
python --version

# Use a specific Python version
python3.11 -m pip install 'crawlee[playwright]'

Issue: Playwright Browser Installation Fails

If playwright install fails, the most common cause on Linux is missing system libraries rather than permissions. Install the browsers together with their system dependencies:

# Install browsers along with the required system packages
playwright install --with-deps

# Or install only the system dependencies for already-downloaded browsers
sudo playwright install-deps

# Or set the PLAYWRIGHT_BROWSERS_PATH environment variable to a writable location
export PLAYWRIGHT_BROWSERS_PATH=/path/to/browsers
playwright install

Issue: Permission Denied Errors

On Linux/macOS, if you encounter permission errors:

# Use --user flag
pip install --user 'crawlee[playwright]'

# Or use sudo (not recommended)
sudo pip install 'crawlee[playwright]'

Dependency Management with requirements.txt

For reproducible installations across environments, create a requirements.txt file. Pin the versions you actually use; the pins below are examples:

crawlee[playwright]==0.2.0
playwright==1.40.0

Then install with:

pip install -r requirements.txt

Using Poetry for Dependency Management

Poetry provides advanced dependency management:

# Initialize Poetry project
poetry init

# Add Crawlee
poetry add 'crawlee[playwright]'

# Install dependencies
poetry install

Your pyproject.toml will contain:

[tool.poetry.dependencies]
python = "^3.9"
crawlee = {extras = ["playwright"], version = "^0.2.0"}

Docker Installation

For containerized environments, create a Dockerfile:

FROM python:3.11-slim

# Install system dependencies for Playwright
RUN apt-get update && apt-get install -y \
    wget \
    gnupg \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Copy requirements
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Install Playwright browsers
RUN playwright install --with-deps chromium

# Copy application code
COPY . .

CMD ["python", "scraper.py"]

Build and run:

docker build -t crawlee-scraper .
docker run crawlee-scraper

Configuration After Installation

After installing Crawlee, you may want to configure certain aspects:

Setting Storage Directory

Crawlee stores data in a default storage directory. You can customize this:

from crawlee import Configuration

config = Configuration.get_global_configuration()
config.storage_dir = './my-crawlee-storage'
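
The same can be achieved without code changes through an environment variable; a sketch assuming CRAWLEE_STORAGE_DIR, which Crawlee's configuration reads at startup:

import os

# Set before Crawlee reads its configuration (e.g. at the top of your script)
os.environ['CRAWLEE_STORAGE_DIR'] = './my-crawlee-storage'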

Configuring Logging

Adjust logging levels for debugging:

import logging

logging.basicConfig(level=logging.INFO)
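
If you want more detail from Crawlee itself without flooding your own logs, you can target its logger by name; 'crawlee' as the logger name is an assumption worth verifying against your installed version:

import logging

# Assumed logger name for the library; verify for your installed version
logging.getLogger('crawlee').setLevel(logging.DEBUG)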

Next Steps After Installation

Once Crawlee is installed, you can start building web scrapers. Here are some essential next steps:

  1. Learn the Crawlee architecture: Understand crawlers, request handlers, and storage
  2. Explore different crawler types: PlaywrightCrawler, BeautifulSoupCrawler, HttpCrawler
  3. Implement data extraction: Use selectors and parsing techniques similar to handling DOM elements in browser automation
  4. Handle navigation: Learn to navigate complex sites, similar to navigating pages in Puppeteer
  5. Configure request management: Set up proxies, headers, and retries (a minimal proxy sketch follows this list)
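
For item 5, here is a minimal proxy setup sketch; it assumes the ProxyConfiguration class from crawlee.proxy_configuration and uses a placeholder proxy URL:

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration

async def main():
    # Placeholder proxy URL; replace with your own
    proxy_configuration = ProxyConfiguration(
        proxy_urls=['http://user:password@proxy.example.com:8000'],
    )

    crawler = PlaywrightCrawler(proxy_configuration=proxy_configuration)
    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())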

Example: Complete Project Setup

Here's a complete example of setting up a new Crawlee project:

# Create project directory
mkdir my-crawlee-project
cd my-crawlee-project

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install Crawlee with Playwright
pip install 'crawlee[playwright]'

# Install Playwright browsers
playwright install chromium

# Create requirements.txt
pip freeze > requirements.txt

# Create main scraper file
cat > scraper.py << 'EOF'
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main():
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=50,
        headless=True,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        url = context.request.url
        print(f"Scraping: {url}")

        # Extract data
        title = await context.page.title()

        # Save data
        await context.push_data({
            'url': url,
            'title': title,
        })

    await crawler.run(['https://example.com'])

    # Export data
    await crawler.export_data('results.json')

if __name__ == '__main__':
    asyncio.run(main())
EOF

# Run the scraper
python scraper.py
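
Once the run completes, export_data should have written results.json. A quick sanity check, assuming the default layout of a JSON array of records:

import json

# Assumes export_data wrote a JSON array of records
with open('results.json') as f:
    records = json.load(f)

print(f"Scraped {len(records)} pages")
for record in records[:5]:
    print(record['url'], '-', record['title'])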

Updating Crawlee

To update Crawlee to the latest version:

pip install --upgrade crawlee

# Or with extras
pip install --upgrade 'crawlee[playwright]'

Uninstalling Crawlee

If you need to uninstall Crawlee:

pip uninstall crawlee

# Also remove Playwright if installed
pip uninstall playwright

Conclusion

Installing Crawlee for Python is a straightforward process that involves installing the package via pip, optionally installing browser dependencies with Playwright, and setting up a proper virtual environment. By following the installation steps outlined in this guide, you'll have a robust foundation for building scalable web scraping projects.

Remember to always use virtual environments to maintain project isolation, keep your dependencies up to date, and refer to the official Crawlee documentation for the latest features and best practices.

For more advanced web scraping scenarios, consider exploring browser automation techniques and understanding how to handle timeouts and error handling in your scraping workflows.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
