How do I deploy Scrapy spiders to production?
Deploying Scrapy spiders to production requires planning for scalability, reliability, and monitoring. This guide covers the essential steps and best practices, from production settings and containerization through scheduling, monitoring, and scaling.
1. Preparation and Code Optimization
Before deploying, ensure your spider code is production-ready:
Configure Settings for Production
Create a production settings file, settings_production.py:
# settings_production.py
from .settings import *
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure delays and concurrency
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True  # waits between 0.5x and 1.5x DOWNLOAD_DELAY
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
# Enable autothrottling
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Configure logging
LOG_LEVEL = 'INFO'
LOG_FILE = '/var/log/scrapy/scrapy.log'
# Enable telnet console for debugging
TELNETCONSOLE_ENABLED = True
TELNETCONSOLE_PORT = [6023, 6073]
# Configure item pipelines for production
ITEM_PIPELINES = {
'myproject.pipelines.ValidationPipeline': 300,
'myproject.pipelines.DatabasePipeline': 400,
}
# Database configuration (custom setting read by the item pipelines;
# in real deployments prefer injecting credentials via environment variables)
DATABASE_URL = 'postgresql://user:password@localhost/production_db'
# User agent rotation (custom setting; Scrapy only uses it if a downloader
# middleware reads it, see the sketch after this block)
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]
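Note that USER_AGENT_LIST is not a setting Scrapy understands on its own; it only takes effect if a downloader middleware reads it, either a third-party package or a small custom one. A minimal sketch of such a middleware, assuming the setting above and a hypothetical myproject/middlewares.py module:

# middlewares.py (hypothetical): pick a random user agent for every request
import random

class RandomUserAgentMiddleware:
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the list defined in settings_production.py
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        # Overwrite the User-Agent header before the request is downloaded
        request.headers['User-Agent'] = random.choice(self.user_agents)

Enable it alongside the other settings:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
}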
Implement Robust Error Handling
# spiders/production_spider.py
import scrapy

class ProductionSpider(scrapy.Spider):
    name = 'production_spider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.failed_urls = []

    def parse(self, response):
        try:
            # Your parsing logic here (extract_data is a placeholder)
            yield self.extract_data(response)
        except Exception as e:
            self.logger.error(f"Error parsing {response.url}: {e}")
            self.failed_urls.append(response.url)

    def closed(self, reason):
        # Called automatically when the spider finishes
        if self.failed_urls:
            self.logger.warning(f"Failed URLs: {len(self.failed_urls)}")
            # Optionally save failed URLs for retry
            with open('/tmp/failed_urls.txt', 'w') as f:
                f.write('\n'.join(self.failed_urls))
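The try/except above only catches errors raised while parsing; requests that fail outright (timeouts, DNS errors, connection resets) never reach parse(). Scrapy delivers those failures to an errback, so a pair of methods like the following could be added to the spider above (a sketch, assuming the spider defines start_urls):

    def start_requests(self):
        # Attach an errback so download failures are recorded too
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

    def handle_error(self, failure):
        # failure.request is the Request that could not be completed
        self.logger.error(f"Request failed: {failure.request.url} ({failure.value!r})")
        self.failed_urls.append(failure.request.url)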
2. Containerization with Docker
Docker provides consistent deployment environments across different systems:
Create a Dockerfile
FROM python:3.9-slim
# Install system dependencies
RUN apt-get update && apt-get install -y \
gcc \
libxml2-dev \
libxslt-dev \
libffi-dev \
libssl-dev \
&& rm -rf /var/lib/apt/lists/*
# Set working directory
WORKDIR /app
# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Create log directory
RUN mkdir -p /var/log/scrapy
# Create non-root user
RUN useradd -m scrapy
USER scrapy
# Expose telnet console port
EXPOSE 6023
# Default command
CMD ["scrapy", "crawl", "my_spider"]
Docker Compose for Development
# docker-compose.yml
version: '3.8'

services:
  scrapy:
    build: .
    volumes:
      - ./logs:/var/log/scrapy
      - ./data:/app/data
    environment:
      - SCRAPY_SETTINGS_MODULE=myproject.settings_production
    depends_on:
      - redis
      - postgresql
  redis:
    image: redis:6-alpine
    ports:
      - "6379:6379"
  postgresql:
    image: postgres:13
    environment:
      POSTGRES_DB: scrapy_db
      POSTGRES_USER: scrapy
      POSTGRES_PASSWORD: password
    volumes:
      - postgres_data:/var/lib/postgresql/data

volumes:
  postgres_data:
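With that file in place the whole stack comes up locally with a single command, and the volume mounts put logs and scraped data on the host:

docker compose up --build -d
docker compose logs -f scrapy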
3. Scaling with Scrapy-Redis
For distributed crawling, use Scrapy-Redis to share queues between multiple spider instances:
Configure Scrapy-Redis
# settings_distributed.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True
REDIS_HOST = 'redis'
REDIS_PORT = 6379
REDIS_DB = 0
# Optional: Use Redis for item pipeline
ITEM_PIPELINES = {
'scrapy_redis.pipelines.RedisPipeline': 300,
}
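To make multiple instances share work, the spiders themselves can inherit from scrapy-redis's RedisSpider, which pulls start URLs from a shared Redis list instead of a hard-coded start_urls attribute. A minimal sketch, assuming the redis_key name below:

# spiders/distributed_spider.py
from scrapy_redis.spiders import RedisSpider

class DistributedSpider(RedisSpider):
    name = 'my_spider'
    # Every instance pops its start URLs from this shared Redis list
    redis_key = 'my_spider:start_urls'

    def parse(self, response):
        # Parsing logic here; discovered requests go back through the shared scheduler
        yield {'url': response.url, 'title': response.css('title::text').get()}

Seed the queue from any host that can reach Redis, for example: redis-cli lpush my_spider:start_urls https://example.com/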
Deploy Multiple Spider Instances
# Start multiple spider instances
docker run -d --name spider1 my-scrapy-image scrapy crawl my_spider
docker run -d --name spider2 my-scrapy-image scrapy crawl my_spider
docker run -d --name spider3 my-scrapy-image scrapy crawl my_spider
4. Scheduling and Orchestration
Using Cron for Simple Scheduling
# Add to crontab
# Run spider every hour
0 * * * * docker run --rm my-scrapy-image scrapy crawl my_spider
# Run spider daily at 2 AM
0 2 * * * docker run --rm my-scrapy-image scrapy crawl daily_spider
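Cron will happily start a new run even if the previous one has not finished. Wrapping the command in flock keeps runs from overlapping (a sketch, using a lock file under /tmp):

# Run hourly, but skip this run if the previous crawl still holds the lock
0 * * * * flock -n /tmp/my_spider.lock docker run --rm my-scrapy-image scrapy crawl my_spider >> /var/log/scrapy/cron.log 2>&1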
Using Apache Airflow for Complex Workflows
# airflow_dag.py
from datetime import datetime, timedelta

from airflow import DAG
# Airflow 2.x: DockerOperator ships with the Docker provider package
# (the old airflow.operators.docker_operator module is deprecated)
from airflow.providers.docker.operators.docker import DockerOperator

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'start_date': datetime(2024, 1, 1),
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'scrapy_pipeline',
    default_args=default_args,
    description='Scrapy spider pipeline',
    schedule_interval='@daily',
    catchup=False,
)

run_spider = DockerOperator(
    task_id='run_scrapy_spider',
    image='my-scrapy-image:latest',
    command='scrapy crawl my_spider',
    network_mode='bridge',
    dag=dag,
)
5. Monitoring and Logging
Implement Comprehensive Logging
# Custom logging pipeline
class LoggingPipeline:
    def __init__(self):
        self.items_processed = 0
        self.items_dropped = 0

    def process_item(self, item, spider):
        self.items_processed += 1
        spider.logger.info(f"Processed item #{self.items_processed}: {item.get('title', 'Unknown')}")
        return item

    def close_spider(self, spider):
        spider.logger.info(f"Spider finished. Processed: {self.items_processed}, Dropped: {self.items_dropped}")
Set Up Monitoring with Prometheus
# monitoring.py
from prometheus_client import Counter, Histogram, start_http_server

# Metrics
SCRAPED_ITEMS = Counter('scrapy_items_scraped_total', 'Total scraped items')
REQUEST_DURATION = Histogram('scrapy_request_duration_seconds', 'Request duration')

class MonitoringPipeline:
    def open_spider(self, spider):
        # Expose the metrics endpoint on port 8000 when the spider starts
        start_http_server(8000)

    def process_item(self, item, spider):
        SCRAPED_ITEMS.inc()
        return item
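The REQUEST_DURATION histogram only becomes useful once something observes into it. One option is a small downloader middleware that timestamps each request on the way out and records the elapsed time when the response comes back (a sketch; the import assumes the monitoring.py module above is importable):

# Downloader middleware that feeds the Prometheus histogram
import time

from monitoring import REQUEST_DURATION

class RequestTimingMiddleware:
    def process_request(self, request, spider):
        request.meta['request_start'] = time.time()

    def process_response(self, request, response, spider):
        start = request.meta.get('request_start')
        if start is not None:
            REQUEST_DURATION.observe(time.time() - start)
        return response

Register it in DOWNLOADER_MIDDLEWARES like any other middleware.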
6. Deployment Strategies
Blue-Green Deployment
#!/bin/bash
# deploy.sh

# Build new image
docker build -t my-scrapy-image:new .

# Run Scrapy's contract checks against the new image before switching over
docker run --rm my-scrapy-image:new scrapy check

# Deploy new version
docker stop scrapy-production
docker run -d --name scrapy-production-new my-scrapy-image:new

# Health check: the spider must be listed inside the new container
if docker exec scrapy-production-new scrapy list | grep -q my_spider; then
    docker rm scrapy-production
    docker rename scrapy-production-new scrapy-production
    docker tag my-scrapy-image:new my-scrapy-image:latest
else
    # Roll back: remove the broken container and restart the old one
    docker stop scrapy-production-new
    docker rm scrapy-production-new
    docker start scrapy-production
    echo "Deployment failed"
    exit 1
fi
Kubernetes Deployment
# scrapy-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scrapy-spider
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scrapy-spider
  template:
    metadata:
      labels:
        app: scrapy-spider
    spec:
      containers:
        - name: scrapy
          image: my-scrapy-image:latest
          command: ["scrapy", "crawl", "my_spider"]
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1"
          env:
            - name: SCRAPY_SETTINGS_MODULE
              value: "myproject.settings_production"
7. Security Considerations
Implement Security Best Practices
# Secure, polite defaults
ROBOTSTXT_OBEY = True
HTTPERROR_ALLOWED_CODES = [404, 500]  # only pass these error status codes to callbacks
# Identify your bot honestly with a descriptive user agent and contact URL
USER_AGENT = 'MyBot/1.0 (+http://www.example.com/bot)'
# Proxy rotation (setting used by the third-party scrapy-rotating-proxies middleware)
ROTATING_PROXY_LIST_PATH = '/app/proxy_list.txt'
# Rate limiting
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True  # waits between 0.5x and 1.5x DOWNLOAD_DELAY
8. Performance Optimization
Database Connection Pooling
# database_pipeline.py
import psycopg2.pool

class DatabasePipeline:
    def __init__(self):
        # Pool of 1-20 connections shared across item callbacks
        self.connection_pool = psycopg2.pool.ThreadedConnectionPool(
            1, 20,
            host='localhost',
            database='scrapy_db',
            user='scrapy',
            password='password'
        )

    def process_item(self, item, spider):
        conn = self.connection_pool.getconn()
        try:
            with conn.cursor() as cursor:
                cursor.execute(
                    "INSERT INTO items (title, url) VALUES (%s, %s)",
                    (item['title'], item['url'])
                )
            conn.commit()
        finally:
            self.connection_pool.putconn(conn)
        return item

    def close_spider(self, spider):
        # Release all pooled connections when the crawl ends
        self.connection_pool.closeall()
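Rather than hard-coding credentials, the pipeline can read the DATABASE_URL defined in settings_production.py through from_crawler; psycopg2 accepts the full URL as a DSN. A sketch of that variant:

# Variant that reads DATABASE_URL from the Scrapy settings
import psycopg2.pool

class DatabasePipeline:
    def __init__(self, database_url):
        # The DSN string is passed straight through to psycopg2.connect()
        self.connection_pool = psycopg2.pool.ThreadedConnectionPool(1, 20, database_url)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get('DATABASE_URL'))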
9. Health Checks and Monitoring
Implement Health Check Endpoints
# health_check.py
import sys

from scrapy.spiderloader import SpiderLoader
from scrapy.utils.project import get_project_settings

def health_check():
    try:
        # Verify the project settings load and at least one spider is discoverable
        settings = get_project_settings()
        spiders = SpiderLoader.from_settings(settings).list()
        return len(spiders) > 0
    except Exception as e:
        print(f"Health check failed: {e}")
        return False

if __name__ == '__main__':
    if health_check():
        print("Spider is healthy")
        sys.exit(0)
    else:
        print("Spider is unhealthy")
        sys.exit(1)
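The script also plugs into Docker's built-in health checking; a HEALTHCHECK instruction in the Dockerfile above lets docker ps and most orchestrators surface the result (assuming health_check.py is copied into /app with the rest of the code):

HEALTHCHECK --interval=5m --timeout=30s CMD python health_check.py || exit 1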
Best Practices Summary
- Environment Configuration: Use environment-specific settings files
- Error Handling: Implement comprehensive error handling and logging
- Resource Management: Configure appropriate delays and concurrency limits
- Monitoring: Set up metrics collection and alerting
- Security: Follow security best practices for web scraping
- Scalability: Use distributed crawling with Scrapy-Redis when needed
- Testing: Test thoroughly in staging environments before production deployment
If you later add browser-based scraping (for example a headless browser driven alongside Scrapy), the same containerization approach applies: Docker keeps those heavier, more fragile dependencies reproducible across environments.
By following these practices, you'll have a robust, scalable, and maintainable Scrapy deployment that can handle production workloads effectively.