How do I use Scrapy with Docker?
Using Scrapy with Docker provides a consistent, isolated environment for your web scraping projects. Docker containers ensure your Scrapy spiders run reliably across different systems and make deployment to production environments seamless. This guide covers everything from basic containerization to production-ready configurations.
Why Use Scrapy with Docker?
Docker offers several advantages for Scrapy projects:
- Environment Consistency: Eliminates "it works on my machine" problems
- Dependency Management: Isolates Python packages and system dependencies
- Scalability: Easy horizontal scaling of spider instances
- Deployment: Simplified deployment to cloud platforms and servers
- Version Control: Reproducible builds with specific package versions
Basic Dockerfile for Scrapy
Create a Dockerfile in your Scrapy project root:
# Use Python slim image for smaller container size
FROM python:3.11-slim
# Set working directory
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
gcc \
g++ \
libxml2-dev \
libxslt-dev \
libffi-dev \
libssl-dev \
&& rm -rf /var/lib/apt/lists/*
# Copy requirements file
COPY requirements.txt .
# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Copy project files
COPY . .
# Create non-root user for security
RUN useradd -m scrapy_user && chown -R scrapy_user:scrapy_user /app
USER scrapy_user
# Default command to run spider
CMD ["scrapy", "list"]
Requirements File Setup
Create a requirements.txt file with your dependencies:
scrapy>=2.8.0
scrapy-user-agents>=0.1.1
scrapy-rotating-proxies>=0.6.2
requests>=2.28.0
pandas>=1.5.0
python-dotenv>=0.19.0
Building and Running the Container
Build your Docker image:
# Build the image
docker build -t my-scrapy-project .
# Run a specific spider
docker run --rm my-scrapy-project scrapy crawl my_spider
# Run with output to host directory
docker run --rm -v $(pwd)/output:/app/output my-scrapy-project scrapy crawl my_spider -o output/data.json
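The commands above assume the project already contains a spider named my_spider. If you are following along without one, a minimal placeholder might look like the sketch below; the module path, the field names, and the books.toscrape.com demo site are assumptions for illustration only:
# myproject/spiders/my_spider.py - hypothetical example spider
import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = ["http://books.toscrape.com/"]  # assumed demo site

    def parse(self, response):
        # Yield one item per product card; field names match the pipeline shown later
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
                "url": response.urljoin(book.css("h3 a::attr(href)").get()),
            }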
Docker Compose Configuration
For more complex setups, use a docker-compose.yml:
version: '3.8'
services:
  scrapy:
    build: .
    volumes:
      - ./output:/app/output
      - ./logs:/app/logs
    environment:
      - SCRAPY_SETTINGS_MODULE=myproject.settings
    depends_on:
      - redis
      - postgres
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: scrapy_data
      POSTGRES_USER: scrapy
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
volumes:
  postgres_data:
Run with docker-compose:
# Start all services
docker-compose up
# Run a specific spider
docker-compose run --rm scrapy scrapy crawl my_spider
# Scale scrapy instances
docker-compose up --scale scrapy=3
Advanced Dockerfile with Multi-stage Build
For production environments, use multi-stage builds to reduce image size:
# Build stage
FROM python:3.11-slim as builder
WORKDIR /app
# Install build dependencies
RUN apt-get update && apt-get install -y \
gcc \
g++ \
libxml2-dev \
libxslt-dev \
libffi-dev \
libssl-dev
# Copy and install requirements
COPY requirements.txt .
RUN pip wheel --no-cache-dir --no-deps --wheel-dir /app/wheels -r requirements.txt
# Production stage
FROM python:3.11-slim
WORKDIR /app
# Install runtime dependencies only
RUN apt-get update && apt-get install -y \
libxml2 \
libxslt1.1 \
&& rm -rf /var/lib/apt/lists/*
# Copy wheels from builder stage
COPY --from=builder /app/wheels /wheels
COPY requirements.txt .
# Install Python packages from wheels
RUN pip install --no-cache-dir /wheels/*
# Copy application code
COPY . .
# Create non-root user
RUN useradd -m scrapy_user && chown -R scrapy_user:scrapy_user /app
USER scrapy_user
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
CMD python -c "import scrapy; print('Scrapy is working')" || exit 1
CMD ["scrapy", "list"]
Environment Configuration
Use environment variables for configuration:
# settings.py
import os
# Use environment variables for sensitive data
USER_AGENT = os.getenv('USER_AGENT', 'my-scrapy-bot 1.0')
CONCURRENT_REQUESTS = int(os.getenv('CONCURRENT_REQUESTS', '16'))
DOWNLOAD_DELAY = float(os.getenv('DOWNLOAD_DELAY', '1'))
# Database configuration
DATABASE_URL = os.getenv('DATABASE_URL', 'postgresql://scrapy:password@postgres:5432/scrapy_data')
# Redis configuration for distributed crawling
REDIS_URL = os.getenv('REDIS_URL', 'redis://redis:6379')
Create a .env file for local development:
USER_AGENT=my-scrapy-bot 1.0
CONCURRENT_REQUESTS=16
DOWNLOAD_DELAY=1
DATABASE_URL=postgresql://scrapy:password@localhost:5432/scrapy_data
REDIS_URL=redis://localhost:6379
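Since python-dotenv is already listed in requirements.txt, you can optionally load this file at the top of settings.py for local runs; inside Docker the same variables are usually injected by the container instead. A minimal sketch:
# settings.py (very top) - optional: load .env values for local runs
from dotenv import load_dotenv

load_dotenv()  # no-op if the file is missing, e.g. inside the container
The os.getenv() calls shown earlier then pick up the values automatically. You can also pass the same file straight to Docker with docker run --env-file .env so local runs and containers share one configuration.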
Production Deployment Strategies
1. Single Container Deployment
# Build production image
docker build -f Dockerfile.prod -t my-scrapy:latest .
# Run with resource limits
docker run -d \
--name scrapy-worker \
--memory=512m \
--cpus=1 \
--restart=unless-stopped \
-v /host/logs:/app/logs \
-v /host/output:/app/output \
my-scrapy:latest scrapy crawl my_spider
2. Kubernetes Deployment
Create a k8s-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scrapy-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scrapy
  template:
    metadata:
      labels:
        app: scrapy
    spec:
      containers:
        - name: scrapy
          image: my-scrapy:latest
          command: ["scrapy", "crawl", "my_spider"]
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          env:
            - name: CONCURRENT_REQUESTS
              value: "8"
            - name: DOWNLOAD_DELAY
              value: "2"
Handling Data Persistence
Volume Mounting for Output
# Mount output directory
docker run --rm \
-v $(pwd)/scraped_data:/app/output \
my-scrapy:latest \
scrapy crawl my_spider -o output/results.json
Database Integration
For persistent data storage, Scrapy running in Docker needs a properly configured database pipeline, with connection details supplied through environment variables:
# pipelines.py
import psycopg2
import os


class PostgresPipeline:
    def __init__(self):
        self.connection = None
        self.cursor = None

    def open_spider(self, spider):
        db_settings = {
            'host': os.getenv('DB_HOST', 'postgres'),
            'database': os.getenv('DB_NAME', 'scrapy_data'),
            'user': os.getenv('DB_USER', 'scrapy'),
            'password': os.getenv('DB_PASSWORD', 'password'),
            'port': os.getenv('DB_PORT', '5432'),
        }
        self.connection = psycopg2.connect(**db_settings)
        self.cursor = self.connection.cursor()

    def process_item(self, item, spider):
        # Insert item into database
        insert_query = """
            INSERT INTO scraped_items (title, price, url)
            VALUES (%s, %s, %s)
        """
        self.cursor.execute(insert_query, (item['title'], item['price'], item['url']))
        self.connection.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connection.close()
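To activate the pipeline, register it in settings.py and add a PostgreSQL driver (for example psycopg2-binary) to requirements.txt. The project package name myproject and the priority value are assumptions, so adjust them to your project:
# settings.py - enable the pipeline (assumes the project package is named "myproject")
ITEM_PIPELINES = {
    "myproject.pipelines.PostgresPipeline": 300,
}
The pipeline also assumes a scraped_items table with title, price, and url columns already exists in the database.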
Monitoring and Logging
Structured Logging Configuration
# settings.py
import os
# Logging configuration
LOG_LEVEL = os.getenv('LOG_LEVEL', 'INFO')
LOG_FILE = '/app/logs/scrapy.log'
# Custom log format for Docker
LOG_FORMAT = '%(levelname)s: %(message)s'
# Enable stats collection
STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector'
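In containers it is often preferable to let Scrapy log to stderr so that docker logs and the json-file driver capture the output. One optional variation on the settings above is to make the log file path environment-driven:
# settings.py - only write a log file when LOG_FILE is set
LOG_FILE = os.getenv('LOG_FILE')  # e.g. /app/logs/scrapy.log; None means log to stderr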
Docker Compose with Monitoring
version: '3.8'
services:
  scrapy:
    build: .
    volumes:
      - ./logs:/app/logs
    environment:
      - LOG_LEVEL=INFO
    depends_on:
      - redis
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
  redis:
    image: redis:7-alpine
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
volumes:
  grafana_data:
Performance Optimization
Resource Management
# Optimize Python for containers
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
ENV PIP_NO_CACHE_DIR=1
ENV PIP_DISABLE_PIP_VERSION_CHECK=1
# Use faster JSON library
RUN pip install ujson
# Optimize Scrapy settings for containers
ENV SCRAPY_SETTINGS_MODULE=myproject.docker_settings
Memory-Efficient Settings
# docker_settings.py
from .settings import *
# Optimize for container environment
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_DELAY = 1
# Reduce memory usage
REACTOR_THREADPOOL_MAXSIZE = 10
DNS_TIMEOUT = 60
DOWNLOAD_TIMEOUT = 180
# Enable autothrottling
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
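To make the container's memory limit visible to Scrapy itself, you can additionally enable the built-in memory usage extension. The thresholds below are assumptions chosen to sit just under a 512 MB container limit; tune them to match your docker --memory setting:
# docker_settings.py - stop the crawl before the container is OOM-killed
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 450    # hard limit: close the spider above this usage
MEMUSAGE_WARNING_MB = 400  # log a warning before the limit is reached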
Troubleshooting Common Issues
1. Permission Issues
# Fix permission issues
RUN useradd -m -u 1000 scrapy_user
RUN chown -R scrapy_user:scrapy_user /app
USER scrapy_user
2. Memory Limits
# Monitor memory usage
docker stats scrapy-container
# Set memory limits
docker run --memory=1g --memory-swap=2g my-scrapy:latest
3. Network Issues
# docker-compose.yml
services:
  scrapy:
    build: .
    # network_mode: "host"  # Option 1: use host networking if needed
    networks:               # Option 2: attach to a custom network (not combinable with network_mode)
      - scrapy_network

networks:
  scrapy_network:
    driver: bridge
Best Practices
- Use Multi-stage Builds: Reduce final image size by separating build and runtime stages
- Non-root User: Always run containers as non-root for security
- Health Checks: Implement health checks for container orchestration
- Resource Limits: Set appropriate CPU and memory limits
- Logging: Use structured logging and proper log levels
- Environment Variables: Externalize configuration using environment variables
- Data Persistence: Use volumes for data that needs to persist between container restarts
Just as running Puppeteer in Docker requires careful browser configuration inside the container, running Scrapy in Docker requires proper resource management and environment setup for optimal performance.
Conclusion
Docker provides an excellent foundation for deploying Scrapy applications. By containerizing your web scraping projects, you gain portability, scalability, and consistency across different environments. Start with a simple Dockerfile for development, then gradually add production features like multi-stage builds, monitoring, and orchestration as your needs grow.
Remember to regularly update your base images and dependencies, monitor resource usage, and implement proper logging for production deployments. With these practices, your Scrapy spiders will run reliably in any Docker-enabled environment.