How do I deploy Scrapy spiders to production?

Deploying Scrapy spiders to production requires careful planning for scalability, reliability, and monitoring. This comprehensive guide covers the essential steps and best practices for running Scrapy spiders in production environments.

1. Preparation and Code Optimization

Before deploying, ensure your spider code is production-ready:

Configure Settings for Production

Create a production settings file settings_production.py:

# settings_production.py
from .settings import *

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure delays and concurrency
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True  # vary each delay between 0.5x and 1.5x DOWNLOAD_DELAY
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Enable autothrottling
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Configure logging
LOG_LEVEL = 'INFO'
LOG_FILE = '/var/log/scrapy/scrapy.log'

# Enable telnet console for debugging
TELNETCONSOLE_ENABLED = True
TELNETCONSOLE_PORT = [6023, 6073]

# Configure item pipelines for production
ITEM_PIPELINES = {
    'myproject.pipelines.ValidationPipeline': 300,
    'myproject.pipelines.DatabasePipeline': 400,
}

# Database configuration
DATABASE_URL = 'postgresql://user:password@localhost/production_db'

# User agent rotation (custom setting; requires a downloader middleware, see the sketch after this block)
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
]
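Note that USER_AGENT_LIST is not a built-in Scrapy setting; it only takes effect if a downloader middleware reads it. A minimal sketch of such a middleware (module and class names are placeholders, adjust to your project):

# middlewares.py: sketch of a user-agent rotation middleware for USER_AGENT_LIST
import random

class RotateUserAgentMiddleware:
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the list defined in settings_production.py
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        # Overwrite the User-Agent header with a random entry for each request
        request.headers['User-Agent'] = random.choice(self.user_agents)

Enable it by adding 'myproject.middlewares.RotateUserAgentMiddleware' to DOWNLOADER_MIDDLEWARES with a priority around 400.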

Implement Robust Error Handling

# spiders/production_spider.py
import scrapy

class ProductionSpider(scrapy.Spider):
    name = 'production_spider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.failed_urls = []

    def parse(self, response):
        try:
            # Your parsing logic here
            yield self.extract_data(response)
        except Exception as e:
            self.logger.error(f"Error parsing {response.url}: {e}")
            self.failed_urls.append(response.url)

    def extract_data(self, response):
        # Placeholder extraction; replace with real selectors for your target site
        return {'url': response.url, 'title': response.css('title::text').get()}

    def closed(self, reason):
        if self.failed_urls:
            self.logger.warning(f"Failed URLs: {len(self.failed_urls)}")
            # Optionally save failed URLs for retry
            with open('/tmp/failed_urls.txt', 'w') as f:
                f.write('\n'.join(self.failed_urls))
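The try/except above only catches errors raised while parsing a response; request-level failures such as DNS errors, timeouts, and connection resets never reach parse(). A sketch of two extra methods for the same spider (assuming start_urls is defined on the spider) that record those failures via an errback:

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

    def handle_error(self, failure):
        # failure wraps the underlying exception; most failures carry the originating request
        request = getattr(failure, 'request', None)
        url = request.url if request else 'unknown'
        self.logger.error(f"Request failed: {url} ({failure.value!r})")
        if request:
            self.failed_urls.append(url)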

2. Containerization with Docker

Docker provides consistent deployment environments across different systems:

Create a Dockerfile

FROM python:3.9-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    libxml2-dev \
    libxslt-dev \
    libffi-dev \
    libssl-dev \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Create non-root user and a log directory it can write to
RUN useradd -m scrapy && \
    mkdir -p /var/log/scrapy && \
    chown scrapy:scrapy /var/log/scrapy
USER scrapy

# Expose telnet console port
EXPOSE 6023

# Default command
CMD ["scrapy", "crawl", "my_spider"]

Docker Compose for Development

# docker-compose.yml
version: '3.8'

services:
  scrapy:
    build: .
    volumes:
      - ./logs:/var/log/scrapy
      - ./data:/app/data
    environment:
      - SCRAPY_SETTINGS_MODULE=myproject.settings_production
    depends_on:
      - redis
      - postgresql

  redis:
    image: redis:6-alpine
    ports:
      - "6379:6379"

  postgresql:
    image: postgres:13
    environment:
      POSTGRES_DB: scrapy_db
      POSTGRES_USER: scrapy
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data

volumes:
  postgres_data:

3. Scaling with Scrapy-Redis

For distributed crawling, use Scrapy-Redis to share queues between multiple spider instances:

Configure Scrapy-Redis

# settings_distributed.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True

REDIS_HOST = 'redis'
REDIS_PORT = 6379
REDIS_DB = 0

# Optional: Use Redis for item pipeline
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}
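To feed the shared queue, spiders typically subclass RedisSpider from scrapy-redis and read their start URLs from a Redis list instead of a hardcoded start_urls. A minimal sketch (the redis_key name is an assumption):

# spiders/distributed_spider.py: minimal scrapy-redis spider sketch
from scrapy_redis.spiders import RedisSpider

class DistributedSpider(RedisSpider):
    name = 'distributed_spider'
    # Redis list this spider pulls start URLs from
    redis_key = 'distributed_spider:start_urls'

    def parse(self, response):
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }

Seed the queue from any machine with redis-cli lpush distributed_spider:start_urls https://example.com; every running instance then pulls work from the shared scheduler.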

Deploy Multiple Spider Instances

# Start multiple spider instances
docker run -d --name spider1 my-scrapy-image scrapy crawl my_spider
docker run -d --name spider2 my-scrapy-image scrapy crawl my_spider
docker run -d --name spider3 my-scrapy-image scrapy crawl my_spider

4. Scheduling and Orchestration

Using Cron for Simple Scheduling

# Add to crontab
# Run spider every hour
0 * * * * docker run --rm my-scrapy-image scrapy crawl my_spider

# Run spider daily at 2 AM
0 2 * * * docker run --rm my-scrapy-image scrapy crawl daily_spider

Using Apache Airflow for Complex Workflows

# airflow_dag.py
from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator  # Airflow 2.x provider import
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'scrapy_pipeline',
    default_args=default_args,
    description='Scrapy spider pipeline',
    schedule_interval='@daily',
    catchup=False,
)

run_spider = DockerOperator(
    task_id='run_scrapy_spider',
    image='my-scrapy-image:latest',
    command='scrapy crawl my_spider',
    network_mode='bridge',
    dag=dag,
)

5. Monitoring and Logging

Implement Comprehensive Logging

# Custom logging pipeline
class LoggingPipeline:
    def __init__(self):
        self.items_processed = 0

    def process_item(self, item, spider):
        self.items_processed += 1
        spider.logger.info(f"Processed item #{self.items_processed}: {item.get('title', 'Unknown')}")
        return item

    def close_spider(self, spider):
        # Dropped items are tracked by Scrapy's stats collector (item_dropped_count)
        dropped = spider.crawler.stats.get_value('item_dropped_count', 0)
        spider.logger.info(f"Spider finished. Processed: {self.items_processed}, Dropped: {dropped}")

Set Up Monitoring with Prometheus

# monitoring.py
from prometheus_client import Counter, Histogram, start_http_server

# Metrics
SCRAPED_ITEMS = Counter('scrapy_items_scraped_total', 'Total scraped items')
REQUEST_DURATION = Histogram('scrapy_request_duration_seconds', 'Request duration')

class MonitoringPipeline:
    def open_spider(self, spider):
        # Expose metrics at http://<host>:8000/metrics for Prometheus to scrape
        start_http_server(8000)

    def process_item(self, item, spider):
        SCRAPED_ITEMS.inc()
        return item
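The REQUEST_DURATION histogram above is only useful if something feeds it; a sketch of a downloader middleware (placed in the same monitoring.py) that records per-request latency:

# Also in monitoring.py: sketch of a middleware that feeds REQUEST_DURATION
import time

class RequestTimingMiddleware:
    def process_request(self, request, spider):
        # Remember when the request entered the downloader
        request.meta['_start_time'] = time.time()

    def process_response(self, request, response, spider):
        started = request.meta.get('_start_time')
        if started is not None:
            REQUEST_DURATION.observe(time.time() - started)
        return response

Register MonitoringPipeline under ITEM_PIPELINES and RequestTimingMiddleware under DOWNLOADER_MIDDLEWARES so both metrics are populated.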

6. Deployment Strategies

Blue-Green Deployment

#!/bin/bash
# deploy.sh

# Build new image
docker build -t my-scrapy-image:new .

# Test new version
docker run --rm my-scrapy-image:new scrapy check

# Deploy new version
docker stop scrapy-production
docker run -d --name scrapy-production-new my-scrapy-image:new

# Health check
if docker exec scrapy-production-new scrapy list | grep -q my_spider; then
    docker rm scrapy-production
    docker rename scrapy-production-new scrapy-production
    docker tag my-scrapy-image:new my-scrapy-image:latest
else
    # Roll back: remove the failed container and restart the previous version
    docker stop scrapy-production-new
    docker rm scrapy-production-new
    docker start scrapy-production
    echo "Deployment failed, previous version restored"
    exit 1
fi

Kubernetes Deployment

# scrapy-deployment.yaml
apiVersion: apps/v1
kind: Deployment
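# Note: a Deployment restarts the spider container whenever a crawl finishes;
# for one-shot or scheduled crawls, a Kubernetes Job or CronJob is usually a better fit.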
metadata:
  name: scrapy-spider
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scrapy-spider
  template:
    metadata:
      labels:
        app: scrapy-spider
    spec:
      containers:
      - name: scrapy
        image: my-scrapy-image:latest
        command: ["scrapy", "crawl", "my_spider"]
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1"
        env:
        - name: SCRAPY_SETTINGS_MODULE
          value: "myproject.settings_production"

7. Security Considerations

Implement Security Best Practices

# Secure, respectful crawling settings
ROBOTSTXT_OBEY = True
# Allow selected error responses through to your callbacks instead of silently discarding them
HTTPERROR_ALLOWED_CODES = [404, 500]

# Identify your bot honestly
USER_AGENT = 'MyBot/1.0 (+http://www.example.com/bot)'

# Proxy rotation (requires the scrapy-rotating-proxies package, see below)
ROTATING_PROXY_LIST_PATH = '/app/proxy_list.txt'

# Rate limiting
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True  # vary each delay between 0.5x and 1.5x DOWNLOAD_DELAY
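ROTATING_PROXY_LIST_PATH comes from the third-party scrapy-rotating-proxies package and has no effect until that package's middlewares are enabled; a sketch of the usual settings from its documentation:

# Enable proxy rotation (pip install scrapy-rotating-proxies)
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}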

8. Performance Optimization

Database Connection Pooling

# database_pipeline.py
import psycopg2.pool

class DatabasePipeline:
    def __init__(self):
        self.connection_pool = psycopg2.pool.ThreadedConnectionPool(
            1, 20,  # min and max connections
            host='localhost',
            database='scrapy_db',
            user='scrapy',
            password='password'
        )

    def process_item(self, item, spider):
        conn = self.connection_pool.getconn()
        try:
            with conn.cursor() as cursor:
                cursor.execute("INSERT INTO items (title, url) VALUES (%s, %s)",
                               (item['title'], item['url']))
            conn.commit()
        except Exception:
            conn.rollback()
            raise
        finally:
            self.connection_pool.putconn(conn)
        return item
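Hardcoding credentials in the pipeline makes rotation painful; a sketch of the same pipeline reading connection details from Scrapy settings via from_crawler (the DB_* setting names are assumptions to be defined in settings_production.py or supplied via the environment):

# database_pipeline.py: sketch pulling connection details from settings
import psycopg2.pool

class PooledDatabasePipeline:
    def __init__(self, settings):
        self.connection_pool = psycopg2.pool.ThreadedConnectionPool(
            1, 20,
            host=settings.get('DB_HOST', 'localhost'),
            database=settings.get('DB_NAME', 'scrapy_db'),
            user=settings.get('DB_USER', 'scrapy'),
            password=settings.get('DB_PASSWORD', ''),
        )

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    # process_item is identical to DatabasePipeline above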

9. Health Checks and Monitoring

Implement Health Check Endpoints

# health_check.py
import sys

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def health_check():
    try:
        # Verify that project settings load and at least one spider is registered
        settings = get_project_settings()
        process = CrawlerProcess(settings)
        return len(process.spider_loader.list()) > 0
    except Exception as e:
        print(f"Health check failed: {e}")
        return False

if __name__ == '__main__':
    if health_check():
        print("Spider is healthy")
        sys.exit(0)
    else:
        print("Spider is unhealthy")
        sys.exit(1)

Best Practices Summary

  1. Environment Configuration: Use environment-specific settings files
  2. Error Handling: Implement comprehensive error handling and logging
  3. Resource Management: Configure appropriate delays and concurrency limits
  4. Monitoring: Set up metrics collection and alerting
  5. Security: Follow security best practices for web scraping
  6. Scalability: Use distributed crawling with Scrapy-Redis when needed
  7. Testing: Test thoroughly in staging environments before production deployment

If you also run browser-based scraping (for example with a headless browser), the same Docker-based approach applies: containers give you consistent, reproducible environments for those more complex setups as well.

By following these practices, you'll have a robust, scalable, and maintainable Scrapy deployment that can handle production workloads effectively.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
