Can I Run Crawlee Scrapers in Docker Containers?

Yes, you can run Crawlee scrapers in Docker containers; in fact, containerization is the recommended approach for production deployments. Docker provides isolation, consistency across environments, and easier scaling for your web scraping projects. Crawlee works seamlessly with Docker: headless browser automation works out of the box, and headed mode is possible too if you add a virtual display such as Xvfb.

Why Use Docker with Crawlee?

Docker containerization offers several advantages for Crawlee-based web scraping projects:

  • Environment Consistency: Ensure your scraper runs identically across development, staging, and production
  • Dependency Management: Bundle all required dependencies, including browser binaries
  • Scalability: Deploy multiple container instances for horizontal scaling
  • Isolation: Prevent conflicts with other applications on the same server
  • Portability: Deploy to any Docker-compatible platform (AWS, GCP, Kubernetes, etc.)

Basic Dockerfile for Crawlee

Here's a compact Dockerfile for a Crawlee scraper on Node.js and Alpine, using the distribution's Chromium package. This setup is best suited to Puppeteer; Playwright does not officially support Alpine, so for Playwright prefer the Debian-based image shown afterwards:

FROM node:18-alpine

# Install dependencies required for Playwright/Puppeteer
RUN apk add --no-cache \
    chromium \
    nss \
    freetype \
    harfbuzz \
    ca-certificates \
    ttf-freefont

# Set environment variables
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true \
    PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser \
    PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1 \
    PLAYWRIGHT_CHROMIUM_EXECUTABLE_PATH=/usr/bin/chromium-browser

# Create app directory
WORKDIR /app

# Copy package files
COPY package*.json ./

# Install dependencies
RUN npm ci --omit=dev

# Copy application code
COPY . .

# Run the scraper
CMD ["node", "src/main.js"]

For Debian-based images with more comprehensive browser support:

FROM node:18

# Install dependencies for Chromium
RUN apt-get update && apt-get install -y \
    wget \
    gnupg \
    ca-certificates \
    fonts-liberation \
    libasound2 \
    libatk-bridge2.0-0 \
    libatk1.0-0 \
    libcups2 \
    libdbus-1-3 \
    libdrm2 \
    libgbm1 \
    libgtk-3-0 \
    libnspr4 \
    libnss3 \
    libxcomposite1 \
    libxdamage1 \
    libxrandr2 \
    xdg-utils \
    --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy package files
COPY package*.json ./

# Install dependencies and Playwright browsers
RUN npm ci

# Install Playwright browsers
RUN npx playwright install chromium

# Copy application code
COPY . .

CMD ["node", "src/main.js"]

Crawlee Scraper Example for Docker

Here's a sample Crawlee scraper designed to run in Docker:

// src/main.js
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Use headless mode in Docker
    headless: true,

    // Limit concurrency in containerized environments
    maxConcurrency: 5,

    // Configure browser launch options
    launchContext: {
        launchOptions: {
            args: [
                '--no-sandbox',
                '--disable-setuid-sandbox',
                '--disable-dev-shm-usage',
                '--disable-accelerated-2d-canvas',
                '--disable-gpu',
            ],
        },
    },

    async requestHandler({ request, page, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);

        const title = await page.title();

        await Dataset.pushData({
            url: request.url,
            title,
            timestamp: new Date().toISOString(),
        });

        // Enqueue links for crawling
        await enqueueLinks({
            strategy: 'same-hostname',
        });
    },

    failedRequestHandler({ request, log }) {
        log.error(`Request ${request.url} failed`);
    },
});

// Start the crawler
await crawler.run(['https://example.com']);

console.log('Crawler finished.');

Docker Compose Configuration

For local development and testing, use Docker Compose:

# docker-compose.yml
version: '3.8'

services:
  crawlee-scraper:
    build: .
    container_name: crawlee-scraper
    environment:
      - NODE_ENV=production
      - CRAWLEE_STORAGE_DIR=/app/storage
    volumes:
      - ./storage:/app/storage
      - ./src:/app/src
    mem_limit: 2g
    cpus: 2
    restart: unless-stopped

Run your scraper with Docker Compose:

# Build the image
docker-compose build

# Run the scraper
docker-compose up

# Run in detached mode
docker-compose up -d

# View logs
docker-compose logs -f

# Stop the container
docker-compose down

Python Crawlee in Docker

If you're using Crawlee for Python, here's a Dockerfile:

FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    wget \
    gnupg \
    ca-certificates \
    fonts-liberation \
    libnss3 \
    libxss1 \
    --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy requirements
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Install Playwright browsers
RUN playwright install chromium
RUN playwright install-deps chromium

# Copy application
COPY . .

CMD ["python", "src/main.py"]

Python scraper example:

# src/main.py
import asyncio
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler(
        headless=True,
        max_requests_per_crawl=50,
        browser_type='chromium',
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        title = await context.page.title()

        await context.push_data({
            'url': context.request.url,
            'title': title,
        })

        await context.enqueue_links()

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())

Best Practices for Docker Deployments

1. Memory and Resource Limits

Browser automation is memory-intensive. Set appropriate limits:

docker run -m 2g --cpus="2.0" crawlee-scraper

2. Shared Memory Size

Increase shared memory to prevent browser crashes:

docker run --shm-size=2g crawlee-scraper

Or in Docker Compose:

services:
  crawlee-scraper:
    build: .
    shm_size: '2gb'

3. Use Multi-Stage Builds

Optimize image size with multi-stage builds:

# Build stage
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .

# Production stage
FROM node:18-slim
WORKDIR /app
COPY --from=builder /app .
RUN npx playwright install-deps chromium
RUN npx playwright install chromium
CMD ["node", "src/main.js"]

4. Persist Storage

Mount volumes to persist Crawlee's storage:

docker run -v $(pwd)/storage:/app/storage crawlee-scraper
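
By default Crawlee writes dataset items as numbered JSON files under its storage directory, so with the volume mounted you can inspect results on the host after a run (exact file names may vary between Crawlee versions):

# List scraped items written by Dataset.pushData()
ls storage/datasets/default/

# Inspect a single item
cat storage/datasets/default/000000001.json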

5. Environment Variables

Configure scrapers via environment variables:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxConcurrency: parseInt(process.env.MAX_CONCURRENCY || '5'),
    maxRequestsPerCrawl: parseInt(process.env.MAX_REQUESTS || '100'),
});
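
With Docker Compose you can keep these values in an .env file and load them via env_file; MAX_CONCURRENCY and MAX_REQUESTS are the illustrative variable names from the snippet above:

# .env
MAX_CONCURRENCY=10
MAX_REQUESTS=500

# docker-compose.yml (excerpt)
services:
  crawlee-scraper:
    build: .
    env_file:
      - .env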

Building and Running Docker Images

Build your Docker image:

# Build the image
docker build -t my-crawlee-scraper .

# Run the container
docker run --name scraper my-crawlee-scraper

# Run with volume mounting
docker run -v $(pwd)/storage:/app/storage my-crawlee-scraper

# Run with environment variables
docker run -e MAX_CONCURRENCY=10 my-crawlee-scraper

# Run in interactive mode for debugging
docker run -it my-crawlee-scraper /bin/sh

Check container logs:

# Follow logs in real-time
docker logs -f scraper

# View last 100 lines
docker logs --tail 100 scraper

Handling Browser Configuration

When running Crawlee with Puppeteer or Playwright in Docker, you need specific browser arguments:

import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
            args: [
                '--no-sandbox',
                '--disable-setuid-sandbox',
                '--disable-dev-shm-usage',
                '--disable-accelerated-2d-canvas',
                '--no-first-run',
                '--no-zygote',
                '--single-process',
                '--disable-gpu',
            ],
        },
    },
    // ... rest of configuration
});

These arguments are crucial for stable browser automation in containerized environments.

Integration with Orchestration Platforms

Kubernetes Deployment

Deploy Crawlee scrapers to Kubernetes:

# crawlee-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: crawlee-scraper
spec:
  replicas: 3
  selector:
    matchLabels:
      app: crawlee-scraper
  template:
    metadata:
      labels:
        app: crawlee-scraper
    spec:
      containers:
      - name: scraper
        image: my-crawlee-scraper:latest
        resources:
          limits:
            memory: "2Gi"
            cpu: "1000m"
          requests:
            memory: "1Gi"
            cpu: "500m"
        volumeMounts:
        - name: storage
          mountPath: /app/storage
      volumes:
      - name: storage
        emptyDir: {}
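
Apply the manifest and monitor the pods with the standard kubectl workflow; the names below match the manifest above:

# Deploy and check status
kubectl apply -f crawlee-deployment.yaml
kubectl get pods -l app=crawlee-scraper

# Stream logs from the deployment
kubectl logs -f deployment/crawlee-scraper

A Deployment suits crawlers that run continuously; for one-off or scheduled crawls, a Kubernetes Job or CronJob built from the same image may be a better fit.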

AWS ECS Task Definition

{
  "family": "crawlee-scraper",
  "containerDefinitions": [
    {
      "name": "scraper",
      "image": "my-crawlee-scraper:latest",
      "memory": 2048,
      "cpu": 1024,
      "essential": true,
      "environment": [
        {
          "name": "NODE_ENV",
          "value": "production"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/crawlee-scraper",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}
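
You can register and launch this task definition with the AWS CLI; the cluster name below is a placeholder and the JSON file name is assumed:

# Register the task definition
aws ecs register-task-definition --cli-input-json file://task-definition.json

# Run it on an existing cluster
aws ecs run-task --cluster my-cluster --task-definition crawlee-scraper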

Troubleshooting Docker Issues

Common Problems and Solutions

Browser crashes with "Navigation timeout":

  • Increase shared memory: --shm-size=2g
  • Add --disable-dev-shm-usage to the browser args

"Failed to launch the browser process":

  • Ensure all system dependencies are installed
  • Add the --no-sandbox flag
  • Verify Chromium is installed correctly

High memory usage:

  • Limit concurrency in the Crawlee configuration
  • Set container memory limits
  • Close pages after processing

Permission errors:

  • Add --no-sandbox and --disable-setuid-sandbox
  • Run the container as a non-root user (see the Dockerfile sketch below)
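
For the non-root approach, you can create a dedicated user in the Dockerfile. This is a minimal sketch for the Debian-based image; the user name is arbitrary:

# Create an unprivileged user and hand over the app directory
RUN groupadd -r crawler && useradd -r -g crawler -m crawler \
    && chown -R crawler:crawler /app

# Drop privileges before starting the scraper
USER crawler

CMD ["node", "src/main.js"]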

Monitoring and Logging

Implement proper logging for containerized scrapers:

import { PlaywrightCrawler, LogLevel, log as defaultLog } from 'crawlee';

// Set the global log level based on environment
defaultLog.setLevel(
    process.env.NODE_ENV === 'production' ? LogLevel.INFO : LogLevel.DEBUG,
);

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, log }) {
        log.info('Processing page', { url: request.url });

        try {
            // Scraping logic
            const data = await page.evaluate(() => {
                return {
                    title: document.title,
                };
            });

            log.debug('Extracted data', { data });
        } catch (error) {
            log.error('Failed to process page', {
                error: error.message,
                stack: error.stack
            });
        }
    },
});

Conclusion

Running Crawlee scrapers in Docker containers is not only possible but highly recommended for production deployments. Docker provides the isolation, consistency, and scalability needed for robust web scraping operations. By following the best practices outlined above—including proper resource allocation, browser configuration, and handling timeouts appropriately—you can deploy reliable, production-ready Crawlee scrapers that scale with your needs.

Whether you're deploying to a simple VPS, Kubernetes cluster, or managed container services like AWS ECS or Google Cloud Run, Docker ensures your Crawlee scrapers run consistently and reliably across all environments.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
