How do I use Scrapy with Docker?

Using Scrapy with Docker provides a consistent, isolated environment for your web scraping projects. Docker containers ensure your Scrapy spiders run reliably across different systems and make deployment to production environments seamless. This guide covers everything from basic containerization to production-ready configurations.

Why Use Scrapy with Docker?

Docker offers several advantages for Scrapy projects:

  • Environment Consistency: Eliminates "it works on my machine" problems
  • Dependency Management: Isolates Python packages and system dependencies
  • Scalability: Easy horizontal scaling of spider instances
  • Deployment: Simplified deployment to cloud platforms and servers
  • Version Control: Reproducible builds with specific package versions

Basic Dockerfile for Scrapy

Create a Dockerfile in your Scrapy project root:

# Use Python slim image for smaller container size
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    libxml2-dev \
    libxslt-dev \
    libffi-dev \
    libssl-dev \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements file
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy project files
COPY . .

# Create non-root user for security
RUN useradd -m scrapy_user && chown -R scrapy_user:scrapy_user /app
USER scrapy_user

# Default command to run spider
CMD ["scrapy", "list"]

Requirements File Setup

Create a requirements.txt file with your dependencies:

scrapy>=2.8.0
scrapy-user-agents>=0.1.1
scrapy-rotating-proxies>=0.6.2
requests>=2.28.0
pandas>=1.5.0
python-dotenv>=0.19.0

Building and Running the Container

Build your Docker image:

# Build the image
docker build -t my-scrapy-project .

# Run a specific spider
docker run --rm my-scrapy-project scrapy crawl my_spider

# Run with output to host directory
docker run --rm -v $(pwd)/output:/app/output my-scrapy-project scrapy crawl my_spider -o output/data.json
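
You can also override individual Scrapy settings at run time with Scrapy's standard -s flag instead of rebuilding the image, which is convenient when tuning politeness settings per run:

# Override settings at run time without rebuilding the image
docker run --rm my-scrapy-project scrapy crawl my_spider \
  -s DOWNLOAD_DELAY=2 \
  -s CONCURRENT_REQUESTS=8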

Docker Compose Configuration

For more complex setups, use docker-compose.yml:

version: '3.8'

services:
  scrapy:
    build: .
    volumes:
      - ./output:/app/output
      - ./logs:/app/logs
    environment:
      - SCRAPY_SETTINGS_MODULE=myproject.settings
    depends_on:
      - redis
      - postgres

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: scrapy_data
      POSTGRES_USER: scrapy
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"

volumes:
  postgres_data:

Run with docker-compose:

# Start all services
docker-compose up

# Run a specific spider
docker-compose run --rm scrapy scrapy crawl my_spider

# Scale scrapy instances
docker-compose up --scale scrapy=3

Advanced Dockerfile with Multi-stage Build

For production environments, use multi-stage builds to reduce image size:

# Build stage
FROM python:3.11-slim as builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    libxml2-dev \
    libxslt-dev \
    libffi-dev \
    libssl-dev

# Copy and install requirements
COPY requirements.txt .
RUN pip wheel --no-cache-dir --no-deps --wheel-dir /app/wheels -r requirements.txt

# Production stage
FROM python:3.11-slim

WORKDIR /app

# Install runtime dependencies only
RUN apt-get update && apt-get install -y \
    libxml2 \
    libxslt1.1 \
    && rm -rf /var/lib/apt/lists/*

# Copy wheels from builder stage
COPY --from=builder /app/wheels /wheels
COPY requirements.txt .

# Install Python packages from wheels
RUN pip install --no-cache-dir /wheels/*

# Copy application code
COPY . .

# Create non-root user
RUN useradd -m scrapy_user && chown -R scrapy_user:scrapy_user /app
USER scrapy_user

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD python -c "import scrapy; print('Scrapy is working')" || exit 1

CMD ["scrapy", "list"]

Environment Configuration

Use environment variables for configuration:

# settings.py
import os

# Use environment variables for sensitive data
USER_AGENT = os.getenv('USER_AGENT', 'my-scrapy-bot 1.0')
CONCURRENT_REQUESTS = int(os.getenv('CONCURRENT_REQUESTS', '16'))
DOWNLOAD_DELAY = float(os.getenv('DOWNLOAD_DELAY', '1'))

# Database configuration
DATABASE_URL = os.getenv('DATABASE_URL', 'postgresql://scrapy:password@postgres:5432/scrapy_data')

# Redis configuration for distributed crawling
REDIS_URL = os.getenv('REDIS_URL', 'redis://redis:6379')
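
The REDIS_URL setting above only matters if something consumes it. If you want the Redis service from the compose file to coordinate distributed crawling across several containers, the scrapy-redis package is the usual choice; it is not in the requirements file above, so treat this as an optional sketch:

# settings.py -- optional scrapy-redis setup (requires adding scrapy-redis to requirements.txt)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # shared request queue in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # shared dedupe across containers
SCHEDULER_PERSIST = True                                     # keep the queue between runs
# scrapy-redis reads the REDIS_URL setting defined above to locate the Redis service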

Create a .env file for local development:

USER_AGENT=my-scrapy-bot 1.0
CONCURRENT_REQUESTS=16
DOWNLOAD_DELAY=1
DATABASE_URL=postgresql://scrapy:password@localhost:5432/scrapy_data
REDIS_URL=redis://localhost:6379
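
The .env file is not read automatically. Either load it in settings.py with python-dotenv (already in the requirements file) or hand the variables to Docker directly; both approaches are sketched below:

# settings.py -- load .env during local development
from dotenv import load_dotenv
load_dotenv()  # does nothing if the file is missing, so it is safe inside containers too

# Or pass the variables to the container at run time
docker run --rm --env-file .env my-scrapy-project scrapy crawl my_spider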

Production Deployment Strategies

1. Single Container Deployment

# Build production image
docker build -f Dockerfile.prod -t my-scrapy:latest .

# Run with resource limits
docker run -d \
  --name scrapy-worker \
  --memory=512m \
  --cpus=1 \
  --restart=unless-stopped \
  -v /host/logs:/app/logs \
  -v /host/output:/app/output \
  my-scrapy:latest scrapy crawl my_spider

2. Kubernetes Deployment

Create a k8s-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: scrapy-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scrapy
  template:
    metadata:
      labels:
        app: scrapy
    spec:
      containers:
      - name: scrapy
        image: my-scrapy:latest
        command: ["scrapy", "crawl", "my_spider"]
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        env:
        - name: CONCURRENT_REQUESTS
          value: "8"
        - name: DOWNLOAD_DELAY
          value: "2"

Handling Data Persistence

Volume Mounting for Output

# Mount output directory
docker run --rm \
  -v $(pwd)/scraped_data:/app/output \
  my-scrapy:latest \
  scrapy crawl my_spider -o output/results.json

Database Integration

For persistent data storage, a containerized Scrapy project needs a pipeline that writes to a database reachable from the container, with connection details supplied via environment variables:

# pipelines.py
import psycopg2
import os

class PostgresPipeline:
    def __init__(self):
        self.connection = None
        self.cursor = None

    def open_spider(self, spider):
        db_settings = {
            'host': os.getenv('DB_HOST', 'postgres'),
            'database': os.getenv('DB_NAME', 'scrapy_data'),
            'user': os.getenv('DB_USER', 'scrapy'),
            'password': os.getenv('DB_PASSWORD', 'password'),
            'port': os.getenv('DB_PORT', '5432'),
        }
        self.connection = psycopg2.connect(**db_settings)
        self.cursor = self.connection.cursor()

    def process_item(self, item, spider):
        # Insert item into database
        insert_query = """
        INSERT INTO scraped_items (title, price, url) 
        VALUES (%s, %s, %s)
        """
        self.cursor.execute(insert_query, (item['title'], item['price'], item['url']))
        self.connection.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connection.close()
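
The pipeline only runs once it is enabled in settings.py. It also assumes that psycopg2-binary has been added to requirements.txt and that the scraped_items table already exists; neither is shown above, so treat them as prerequisites:

# settings.py -- enable the Postgres pipeline
ITEM_PIPELINES = {
    'myproject.pipelines.PostgresPipeline': 300,
}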

Monitoring and Logging

Structured Logging Configuration

# settings.py
import os

# Logging configuration
LOG_LEVEL = os.getenv('LOG_LEVEL', 'INFO')
LOG_FILE = '/app/logs/scrapy.log'

# Custom log format for Docker
LOG_FORMAT = '%(levelname)s: %(message)s'

# Enable stats collection
STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector'
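
One caveat: writing logs only to a file inside the container hides them from docker logs and from the json-file logging driver used below. A small alternative, sketched here, is to make the file path optional via an environment variable so that leaving it unset sends logs to stderr:

# settings.py -- alternative: log to stderr unless LOG_FILE is explicitly set
LOG_FILE = os.getenv('LOG_FILE')  # None -> Scrapy logs to stderr, which `docker logs` captures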

Docker Compose with Monitoring

version: '3.8'

services:
  scrapy:
    build: .
    volumes:
      - ./logs:/app/logs
    environment:
      - LOG_LEVEL=INFO
    depends_on:
      - redis
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

volumes:
  grafana_data:
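
The compose file mounts a ./prometheus.yml that you still have to provide. Scrapy does not expose Prometheus metrics out of the box, so the scrape target below (a hypothetical exporter on port 9410) is an assumption; a minimal config looks like this:

# prometheus.yml -- minimal scrape config (exporter port is an assumption)
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'scrapy'
    static_configs:
      - targets: ['scrapy:9410']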

Performance Optimization

Resource Management

# Optimize Python for containers
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
ENV PIP_NO_CACHE_DIR=1
ENV PIP_DISABLE_PIP_VERSION_CHECK=1

# Use faster JSON library
RUN pip install ujson

# Optimize Scrapy settings for containers
ENV SCRAPY_SETTINGS_MODULE=myproject.docker_settings

Memory-Efficient Settings

# docker_settings.py
from .settings import *

# Optimize for container environment
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_DELAY = 1

# Reduce memory usage
REACTOR_THREADPOOL_MAXSIZE = 10
DNS_TIMEOUT = 60
DOWNLOAD_TIMEOUT = 180

# Enable autothrottling
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

Troubleshooting Common Issues

1. Permission Issues

# Fix permission issues
RUN useradd -m -u 1000 scrapy_user
RUN chown -R scrapy_user:scrapy_user /app
USER scrapy_user
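
When you mount host directories, the file owner on the host must also match the UID inside the container. An alternative sketch is to override the user at run time so writes to the mounted volume stay owned by you:

# Run as the invoking host user so files written to ./output remain writable on the host
docker run --rm --user $(id -u):$(id -g) \
  -v $(pwd)/output:/app/output \
  my-scrapy:latest scrapy crawl my_spider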

2. Memory Limits

# Monitor memory usage
docker stats scrapy-container

# Set memory limits
docker run --memory=1g --memory-swap=2g my-scrapy:latest
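
Scrapy's built-in memory usage extension pairs well with Docker memory limits: it can stop the spider cleanly before the kernel OOM-kills the container. Set its limit somewhat below the container's own:

# settings.py -- stop gracefully before hitting the container's memory limit
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 900      # container limit above is 1g, leave headroom
MEMUSAGE_WARNING_MB = 700    # log a warning before shutting down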

3. Network Issues

# docker-compose.yml
services:
  scrapy:
    build: .
    # Option 1 (default here): attach to a custom bridge network
    networks:
      - scrapy_network
    # Option 2: use host networking instead -- remove the networks key above first,
    # since network_mode and networks cannot be combined
    # network_mode: "host"

networks:
  scrapy_network:
    driver: bridge
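
To verify that the scrapy container can actually resolve its dependencies by service name (the names here match the compose files above), a quick check from inside the container helps:

# Resolve the postgres service name from inside the scrapy container
docker-compose run --rm scrapy python -c "import socket; print(socket.gethostbyname('postgres'))"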

Best Practices

  1. Use Multi-stage Builds: Reduce final image size by separating build and runtime stages
  2. Non-root User: Always run containers as non-root for security
  3. Health Checks: Implement health checks for container orchestration
  4. Resource Limits: Set appropriate CPU and memory limits
  5. Logging: Use structured logging and proper log levels
  6. Environment Variables: Externalize configuration using environment variables
  7. Data Persistence: Use volumes for data that needs to persist between container restarts

Just as running Puppeteer in Docker requires careful browser configuration inside the container, running Scrapy in Docker depends on proper resource management and environment setup for optimal performance.

Conclusion

Docker provides an excellent foundation for deploying Scrapy applications. By containerizing your web scraping projects, you gain portability, scalability, and consistency across different environments. Start with a simple Dockerfile for development, then gradually add production features like multi-stage builds, monitoring, and orchestration as your needs grow.

Remember to regularly update your base images and dependencies, monitor resource usage, and implement proper logging for production deployments. With these practices, your Scrapy spiders will run reliably in any Docker-enabled environment.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
