Can I Run Crawlee Scrapers in Docker Containers?
Yes, you can absolutely run Crawlee scrapers in Docker containers, and it's actually a recommended approach for production deployments. Docker provides isolation, consistency across environments, and easier scaling for your web scraping projects. Crawlee works seamlessly with Docker, supporting both headless and headed browser automation.
Why Use Docker with Crawlee?
Docker containerization offers several advantages for Crawlee-based web scraping projects:
- Environment Consistency: Ensure your scraper runs identically across development, staging, and production
- Dependency Management: Bundle all required dependencies, including browser binaries
- Scalability: Deploy multiple container instances for horizontal scaling
- Isolation: Prevent conflicts with other applications on the same server
- Portability: Deploy to any Docker-compatible platform (AWS, GCP, Kubernetes, etc.)
Basic Dockerfile for Crawlee
Here's a compact, Alpine-based Dockerfile for a Crawlee scraper using Node.js and the distribution's Chromium package instead of a downloaded browser (note that Playwright does not officially support Alpine, so prefer the Debian variant below if you rely on PlaywrightCrawler):
FROM node:18-alpine
# Install dependencies required for Playwright/Puppeteer
RUN apk add --no-cache \
chromium \
nss \
freetype \
harfbuzz \
ca-certificates \
ttf-freefont
# Set environment variables
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true \
PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser \
PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1 \
PLAYWRIGHT_CHROMIUM_EXECUTABLE_PATH=/usr/bin/chromium-browser
# Create app directory
WORKDIR /app
# Copy package files
COPY package*.json ./
# Install dependencies
RUN npm ci --omit=dev
# Copy application code
COPY . .
# Run the scraper
CMD ["node", "src/main.js"]
For Debian-based images with more comprehensive browser support:
FROM node:18
# Install dependencies for Chromium
RUN apt-get update && apt-get install -y \
wget \
gnupg \
ca-certificates \
fonts-liberation \
libasound2 \
libatk-bridge2.0-0 \
libatk1.0-0 \
libcups2 \
libdbus-1-3 \
libdrm2 \
libgbm1 \
libgtk-3-0 \
libnspr4 \
libnss3 \
libxcomposite1 \
libxdamage1 \
libxrandr2 \
xdg-utils \
--no-install-recommends \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Copy package files
COPY package*.json ./
# Install dependencies and Playwright browsers
RUN npm ci
# Install Playwright browsers
RUN npx playwright install chromium
# Copy application code
COPY . .
CMD ["node", "src/main.js"]
Crawlee Scraper Example for Docker
Here's a sample Crawlee scraper designed to run in Docker:
// src/main.js
import { PlaywrightCrawler, Dataset } from 'crawlee';
const crawler = new PlaywrightCrawler({
    // Use headless mode in Docker
    headless: true,
    // Limit concurrency in containerized environments
    maxConcurrency: 5,
    // Configure browser launch options
    launchContext: {
        launchOptions: {
            args: [
                '--no-sandbox',
                '--disable-setuid-sandbox',
                '--disable-dev-shm-usage',
                '--disable-accelerated-2d-canvas',
                '--disable-gpu',
            ],
        },
    },
    async requestHandler({ request, page, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);

        const title = await page.title();

        await Dataset.pushData({
            url: request.url,
            title,
            timestamp: new Date().toISOString(),
        });

        // Enqueue links for crawling
        await enqueueLinks({
            strategy: 'same-hostname',
        });
    },
    failedRequestHandler({ request, log }) {
        log.error(`Request ${request.url} failed`);
    },
});

// Start the crawler
await crawler.run(['https://example.com']);

console.log('Crawler finished.');
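The example uses ES module imports and top-level await, so the project's package.json needs "type": "module" (or the entry file must use the .mjs extension). A minimal package.json sketch; the package name and version ranges are illustrative:

{
  "name": "crawlee-docker-scraper",
  "version": "1.0.0",
  "type": "module",
  "main": "src/main.js",
  "scripts": {
    "start": "node src/main.js"
  },
  "dependencies": {
    "crawlee": "^3.0.0",
    "playwright": "^1.40.0"
  }
}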
Docker Compose Configuration
For local development and testing, use Docker Compose:
# docker-compose.yml
version: '3.8'

services:
  crawlee-scraper:
    build: .
    container_name: crawlee-scraper
    environment:
      - NODE_ENV=production
      - CRAWLEE_STORAGE_DIR=/app/storage
    volumes:
      - ./storage:/app/storage
      - ./src:/app/src
    mem_limit: 2g
    cpus: 2
    restart: unless-stopped
Run your scraper with Docker Compose:
# Build the image
docker-compose build
# Run the scraper
docker-compose up
# Run in detached mode
docker-compose up -d
# View logs
docker-compose logs -f
# Stop the container
docker-compose down
Python Crawlee in Docker
If you're using Crawlee for Python, here's a Dockerfile:
FROM python:3.11-slim
# Install system dependencies
RUN apt-get update && apt-get install -y \
wget \
gnupg \
ca-certificates \
fonts-liberation \
libnss3 \
libxss1 \
--no-install-recommends \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Copy requirements
COPY requirements.txt .
# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Install Playwright browsers
RUN playwright install chromium
RUN playwright install-deps chromium
# Copy application
COPY . .
CMD ["python", "src/main.py"]
Python scraper example:
# src/main.py
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        headless=True,
        max_requests_per_crawl=50,
        browser_type='chromium',
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')
        title = await context.page.title()
        await context.push_data({
            'url': context.request.url,
            'title': title,
        })
        await context.enqueue_links()

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
Best Practices for Docker Deployments
1. Memory and Resource Limits
Browser automation is memory-intensive. Set appropriate limits:
docker run -m 2g --cpus="2.0" crawlee-scraper
2. Shared Memory Size
Increase shared memory to prevent browser crashes:
docker run --shm-size=2g crawlee-scraper
Or in Docker Compose:
services:
  crawlee-scraper:
    build: .
    shm_size: '2gb'
3. Use Multi-Stage Builds
Optimize image size with multi-stage builds:
# Build stage
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
# Production stage
FROM node:18-slim
WORKDIR /app
COPY --from=builder /app .
RUN npx playwright install-deps chromium
RUN npx playwright install chromium
CMD ["node", "src/main.js"]
4. Persist Storage
Mount volumes to persist Crawlee's storage:
docker run -v $(pwd)/storage:/app/storage crawlee-scraper
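If you mount the volume somewhere other than /app/storage inside the container, point Crawlee at it with the CRAWLEE_STORAGE_DIR variable (the same variable used in the Compose file above). The /data/storage path here is just an example:

docker run \
  -v $(pwd)/storage:/data/storage \
  -e CRAWLEE_STORAGE_DIR=/data/storage \
  my-crawlee-scraper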
5. Environment Variables
Configure scrapers via environment variables:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
maxConcurrency: parseInt(process.env.MAX_CONCURRENCY || '5'),
maxRequestsPerCrawl: parseInt(process.env.MAX_REQUESTS || '100'),
});
Building and Running Docker Images
Build your Docker image:
# Build the image
docker build -t my-crawlee-scraper .
# Run the container
docker run --name scraper my-crawlee-scraper
# Run with volume mounting
docker run -v $(pwd)/storage:/app/storage my-crawlee-scraper
# Run with environment variables
docker run -e MAX_CONCURRENCY=10 my-crawlee-scraper
# Run in interactive mode for debugging
docker run -it my-crawlee-scraper /bin/sh
Check container logs:
# Follow logs in real-time
docker logs -f scraper
# View last 100 lines
docker logs --tail 100 scraper
Handling Browser Configuration
When running Crawlee with Puppeteer or Playwright in Docker, you need specific browser arguments:
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
            args: [
                '--no-sandbox',
                '--disable-setuid-sandbox',
                '--disable-dev-shm-usage',
                '--disable-accelerated-2d-canvas',
                '--no-first-run',
                '--no-zygote',
                '--single-process',
                '--disable-gpu',
            ],
        },
    },
    // ... rest of configuration
});
These arguments are crucial for stable browser automation in containerized environments.
Integration with Orchestration Platforms
Kubernetes Deployment
Deploy Crawlee scrapers to Kubernetes:
# crawlee-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: crawlee-scraper
spec:
  replicas: 3
  selector:
    matchLabels:
      app: crawlee-scraper
  template:
    metadata:
      labels:
        app: crawlee-scraper
    spec:
      containers:
        - name: scraper
          image: my-crawlee-scraper:latest
          resources:
            limits:
              memory: "2Gi"
              cpu: "1000m"
            requests:
              memory: "1Gi"
              cpu: "500m"
          volumeMounts:
            - name: storage
              mountPath: /app/storage
      volumes:
        - name: storage
          emptyDir: {}
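Apply the manifest and check the pods with the usual kubectl commands:

kubectl apply -f crawlee-deployment.yaml
kubectl get pods -l app=crawlee-scraper
kubectl logs -f deployment/crawlee-scraper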
AWS ECS Task Definition
{
  "family": "crawlee-scraper",
  "containerDefinitions": [
    {
      "name": "scraper",
      "image": "my-crawlee-scraper:latest",
      "memory": 2048,
      "cpu": 1024,
      "essential": true,
      "environment": [
        {
          "name": "NODE_ENV",
          "value": "production"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/crawlee-scraper",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}
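Assuming the JSON above is saved as task-definition.json, you can register and launch it with the AWS CLI (the cluster name below is a placeholder for your own):

# Register the task definition
aws ecs register-task-definition --cli-input-json file://task-definition.json

# Run it on an existing cluster
aws ecs run-task --cluster my-cluster --task-definition crawlee-scraper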
Troubleshooting Docker Issues
Common Problems and Solutions
Browser crashes with "Navigation timeout":
- Increase shared memory: --shm-size=2g
- Add --disable-dev-shm-usage to the browser args

"Failed to launch the browser process":
- Ensure all required system dependencies are installed
- Add the --no-sandbox flag
- Verify Chromium is installed correctly

High memory usage:
- Limit concurrency in the Crawlee configuration
- Set container memory limits
- Close pages after processing

Permission errors:
- Add --no-sandbox and --disable-setuid-sandbox
- Run the container as a non-root user (see the Dockerfile sketch after this list)
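For the non-root approach, the official node images already ship an unprivileged node user, so on the Debian-based Dockerfiles shown above it only takes two extra lines before CMD:

# Hand the app directory to the built-in non-root user and switch to it
RUN chown -R node:node /app
USER node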
Monitoring and Logging
Implement proper logging for containerized scrapers:
import { PlaywrightCrawler, log, LogLevel } from 'crawlee';

// Set the global log level based on environment
log.setLevel(process.env.NODE_ENV === 'production' ? LogLevel.INFO : LogLevel.DEBUG);

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, log }) {
        log.info('Processing page', { url: request.url });

        try {
            // Scraping logic
            const data = await page.evaluate(() => {
                return {
                    title: document.title,
                };
            });
            log.debug('Extracted data', { data });
        } catch (error) {
            log.error('Failed to process page', {
                error: error.message,
                stack: error.stack,
            });
        }
    },
});
Conclusion
Running Crawlee scrapers in Docker containers is not only possible but highly recommended for production deployments. Docker provides the isolation, consistency, and scalability needed for robust web scraping operations. By following the best practices outlined above, including proper resource allocation, shared memory sizing, and browser configuration, you can deploy reliable, production-ready Crawlee scrapers that scale with your needs.
Whether you're deploying to a simple VPS, Kubernetes cluster, or managed container services like AWS ECS or Google Cloud Run, Docker ensures your Crawlee scrapers run consistently and reliably across all environments.