How do I monitor Scrapy spiders in production?

Monitoring Scrapy spiders in production is crucial for maintaining reliable web scraping operations. Effective monitoring helps you detect issues early, optimize performance, and ensure your spiders continue running smoothly. This comprehensive guide covers various monitoring approaches, tools, and best practices for production Scrapy deployments.

Essential Monitoring Components

1. Logging Configuration

Proper logging is the foundation of spider monitoring. Configure comprehensive logging to capture spider activities, errors, and performance metrics.

# settings.py

# Enable comprehensive logging
LOG_LEVEL = 'INFO'
LOG_FILE = 'scrapy.log'

# Custom log format with timestamps and logger names
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'

# Keep the default in-memory stats collector (listed explicitly for clarity)
STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector'

# Make sure logging is enabled and written as UTF-8
LOG_ENABLED = True
LOG_ENCODING = 'utf-8'

# Custom logging settings (Scrapy does not apply this dict by itself -- see below)
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'verbose': {
            'format': '{levelname} {asctime} {module} {process:d} {thread:d} {message}',
            'style': '{',
        },
    },
    'handlers': {
        'file': {
            'level': 'INFO',
            'class': 'logging.FileHandler',
            'filename': 'spider_production.log',
            'formatter': 'verbose',
        },
        'console': {
            'level': 'DEBUG',
            'class': 'logging.StreamHandler',
            'formatter': 'verbose',
        },
    },
    'loggers': {
        'scrapy': {
            'handlers': ['file', 'console'],
            'level': 'INFO',
            'propagate': False,
        },
    },
}
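
Note that Scrapy only applies the flat LOG_* settings shown above; it does not read a LOGGING dict on its own. One simple approach (a sketch, assuming the dict is defined in settings.py as above) is to apply it yourself with the standard library's dictConfig:

# settings.py (continued) -- apply the dict manually, Scrapy will not pick it up by itself
import logging.config
logging.config.dictConfig(LOGGING)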

2. Custom Statistics Collection

Implement custom statistics to track spider-specific metrics beyond Scrapy's default stats.

# middlewares.py
from urllib.parse import urlparse

from scrapy import signals
from scrapy.exceptions import NotConfigured
import time
import psutil  # third-party dependency: pip install psutil
import logging

class ProductionMonitoringMiddleware:
    def __init__(self, stats, settings):
        self.stats = stats
        self.start_time = None
        self.logger = logging.getLogger(__name__)

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('PRODUCTION_MONITORING_ENABLED'):
            raise NotConfigured('Production monitoring disabled')

        middleware = cls(crawler.stats, crawler.settings)
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(middleware.item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(middleware.request_reached_downloader, 
                              signal=signals.request_reached_downloader)

        return middleware

    def spider_opened(self, spider):
        self.start_time = time.time()
        self.logger.info(f"Spider {spider.name} started monitoring")

        # Log system resources
        memory_usage = psutil.virtual_memory().percent
        cpu_usage = psutil.cpu_percent()
        self.stats.set_value('monitoring/memory_usage_start', memory_usage)
        self.stats.set_value('monitoring/cpu_usage_start', cpu_usage)

    def spider_closed(self, spider, reason):
        if self.start_time:
            runtime = time.time() - self.start_time
            self.stats.set_value('monitoring/total_runtime', runtime)

        # Final resource usage
        memory_usage = psutil.virtual_memory().percent
        cpu_usage = psutil.cpu_percent()
        self.stats.set_value('monitoring/memory_usage_end', memory_usage)
        self.stats.set_value('monitoring/cpu_usage_end', cpu_usage)

        self.logger.info(f"Spider {spider.name} monitoring completed. Reason: {reason}")

    def item_scraped(self, item, response, spider):
        # Track the average scrape rate (items per minute) since the spider started
        if self.start_time:
            elapsed_minutes = max((time.time() - self.start_time) / 60, 1e-6)
            items_per_minute = self.stats.get_value('item_scraped_count', 0) / elapsed_minutes
            self.stats.set_value('monitoring/items_per_minute', items_per_minute)

    def request_reached_downloader(self, request, spider):
        # Count requests per domain to spot skewed crawl patterns
        domain = urlparse(request.url).netloc or 'unknown'
        self.stats.inc_value(f'monitoring/requests_per_domain/{domain}')
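
Because this component only connects to signals, it can be registered as a Scrapy extension (a downloader middleware slot would also work). A minimal settings sketch, assuming the class lives in myproject/middlewares.py:

# settings.py
PRODUCTION_MONITORING_ENABLED = True

EXTENSIONS = {
    'myproject.middlewares.ProductionMonitoringMiddleware': 500,
}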

3. Health Check Endpoints

Create health check endpoints to monitor spider status programmatically.

# health_check.py
import json
import time

from twisted.web import server, resource
from twisted.internet import reactor

class HealthCheckResource(resource.Resource):
    isLeaf = True

    def __init__(self, crawler_process):
        super().__init__()
        self.crawler_process = crawler_process
        self.start_time = time.time()

    def render_GET(self, request):
        request.setHeader(b"content-type", b"application/json")

        health_data = {
            "status": "healthy",
            "uptime": time.time() - self.start_time,
            "active_spiders": len(self.crawler_process.crawlers),
            "timestamp": time.time()
        }

        # Check if any crawlers are running
        running_crawlers = []
        for crawler in self.crawler_process.crawlers:
            if hasattr(crawler, 'spider') and crawler.spider:
                stats = crawler.stats.get_stats()
                running_crawlers.append({
                    "name": crawler.spider.name,
                    "items_scraped": stats.get('item_scraped_count', 0),
                    "pages_crawled": stats.get('response_received_count', 0),
                    # messages logged at ERROR level (includes spider exceptions)
                    "errors": stats.get('log_count/ERROR', 0)
                })

        health_data["running_spiders"] = running_crawlers

        if not running_crawlers:
            health_data["status"] = "no_active_spiders"

        return json.dumps(health_data, indent=2).encode('utf-8')

def start_health_check_server(crawler_process, port=8080):
    """Register the health check endpoint on Scrapy's Twisted reactor.

    Scrapy already runs a Twisted reactor, so no extra thread is needed;
    call this before crawler_process.start() and the port starts listening
    as soon as the reactor is running.
    """
    root = HealthCheckResource(crawler_process)
    site = server.Site(root)
    reactor.listenTCP(port, site)
    print(f"Health check server will listen on port {port}")
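
A small launcher script can tie the health check endpoint to a crawl. This is a sketch that assumes the default Twisted reactor and a project spider named 'myspider' (a placeholder):

# run_with_health_check.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from health_check import start_health_check_server

process = CrawlerProcess(get_project_settings())
start_health_check_server(process, port=8080)  # endpoint starts listening once the reactor runs
process.crawl('myspider')  # placeholder spider name
process.start()            # blocks until the crawl finishes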

Advanced Monitoring Strategies

4. Database Monitoring Integration

Track spider performance and results in a database for historical analysis.

# monitoring_pipeline.py
import sqlite3
import time
from datetime import datetime

class MonitoringPipeline:
    def __init__(self, db_path='spider_monitoring.db'):
        self.db_path = db_path
        self.setup_database()

    def setup_database(self):
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()

        # Create monitoring table
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS spider_runs (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                spider_name TEXT,
                start_time TIMESTAMP,
                end_time TIMESTAMP,
                items_scraped INTEGER,
                pages_crawled INTEGER,
                errors INTEGER,
                status TEXT,
                runtime_seconds REAL,
                memory_usage REAL,
                cpu_usage REAL
            )
        ''')

        # Create real-time metrics table
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS real_time_metrics (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                spider_name TEXT,
                timestamp TIMESTAMP,
                metric_name TEXT,
                metric_value REAL
            )
        ''')

        conn.commit()
        conn.close()

    def open_spider(self, spider):
        self.spider_start_time = time.time()
        self.spider_name = spider.name

    def close_spider(self, spider):
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()

        stats = spider.crawler.stats.get_stats()
        end_time = time.time()
        runtime = end_time - self.spider_start_time

        cursor.execute('''
            INSERT INTO spider_runs 
            (spider_name, start_time, end_time, items_scraped, pages_crawled, 
             errors, status, runtime_seconds, memory_usage, cpu_usage)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        ''', (
            self.spider_name,
            datetime.fromtimestamp(self.spider_start_time).isoformat(),  # store ISO strings
            datetime.fromtimestamp(end_time).isoformat(),
            stats.get('item_scraped_count', 0),
            stats.get('response_received_count', 0),
            stats.get('log_count/ERROR', 0),  # errors logged during the run
            'completed',
            runtime,
            stats.get('monitoring/memory_usage_end', 0),
            stats.get('monitoring/cpu_usage_end', 0)
        ))

        conn.commit()
        conn.close()

    def process_item(self, item, spider):
        # Log real-time metrics
        self.log_metric(spider.name, 'items_processed', 1)
        return item

    def log_metric(self, spider_name, metric_name, value):
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()

        cursor.execute('''
            INSERT INTO real_time_metrics (spider_name, timestamp, metric_name, metric_value)
            VALUES (?, ?, ?, ?)
        ''', (spider_name, datetime.now().isoformat(), metric_name, value))

        conn.commit()
        conn.close()
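
Enable the pipeline like any other item pipeline (a sketch; the module path assumes the class lives in myproject/monitoring_pipeline.py):

# settings.py
ITEM_PIPELINES = {
    'myproject.monitoring_pipeline.MonitoringPipeline': 800,
}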

5. Email and Slack Alerting

Set up automated alerts for spider failures and performance issues.

# alerting.py
import smtplib
import time
import requests  # third-party dependency: pip install requests
import json
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from scrapy import signals

class AlertingMiddleware:
    def __init__(self, settings):
        self.email_settings = {
            'smtp_server': settings.get('ALERT_SMTP_SERVER'),
            'smtp_port': settings.getint('ALERT_SMTP_PORT', 587),
            'username': settings.get('ALERT_EMAIL_USERNAME'),
            'password': settings.get('ALERT_EMAIL_PASSWORD'),
            'recipients': settings.getlist('ALERT_EMAIL_RECIPIENTS')
        }
        self.slack_webhook = settings.get('ALERT_SLACK_WEBHOOK')
        self.error_threshold = settings.getint('ALERT_ERROR_THRESHOLD', 10)

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls(crawler.settings)
        crawler.signals.connect(middleware.spider_error, signal=signals.spider_error)
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_error(self, failure, response, spider):
        # log_count/ERROR counts messages logged at ERROR level, including spider exceptions
        error_count = spider.crawler.stats.get_value('log_count/ERROR', 0)

        if error_count > self.error_threshold:
            message = f"Spider {spider.name} has exceeded the error threshold ({error_count} errors)"
            self.send_alert("Spider Error Alert", message, "error")

    def spider_closed(self, spider, reason):
        if reason != 'finished':
            message = f"Spider {spider.name} closed unexpectedly. Reason: {reason}"
            self.send_alert("Spider Closure Alert", message, "warning")

    def send_alert(self, subject, message, alert_type):
        # Send email alert
        if self.email_settings['smtp_server']:
            self.send_email_alert(subject, message)

        # Send Slack alert
        if self.slack_webhook:
            self.send_slack_alert(subject, message, alert_type)

    def send_email_alert(self, subject, message):
        try:
            msg = MIMEMultipart()
            msg['From'] = self.email_settings['username']
            msg['To'] = ', '.join(self.email_settings['recipients'])
            msg['Subject'] = subject

            msg.attach(MIMEText(message, 'plain'))

            server = smtplib.SMTP(self.email_settings['smtp_server'], 
                                self.email_settings['smtp_port'])
            server.starttls()
            server.login(self.email_settings['username'], 
                        self.email_settings['password'])

            text = msg.as_string()
            server.sendmail(self.email_settings['username'], 
                          self.email_settings['recipients'], text)
            server.quit()
        except Exception as e:
            print(f"Failed to send email alert: {e}")

    def send_slack_alert(self, subject, message, alert_type):
        color_map = {
            'error': '#FF0000',
            'warning': '#FFA500',
            'info': '#00FF00'
        }

        payload = {
            "attachments": [{
                "color": color_map.get(alert_type, '#808080'),
                "title": subject,
                "text": message,
                "ts": int(time.time())
            }]
        }

        try:
            response = requests.post(self.slack_webhook, 
                                   data=json.dumps(payload),
                                   headers={'Content-Type': 'application/json'})
            response.raise_for_status()
        except Exception as e:
            print(f"Failed to send Slack alert: {e}")
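
One way to wire this in is to register the class as an extension and supply the alert settings it reads. A sketch with placeholder values (the module path is an assumption):

# settings.py
EXTENSIONS = {
    'myproject.alerting.AlertingMiddleware': 500,
}

ALERT_SMTP_SERVER = 'smtp.example.com'            # placeholder
ALERT_SMTP_PORT = 587
ALERT_EMAIL_USERNAME = 'alerts@example.com'       # placeholder
ALERT_EMAIL_PASSWORD = 'change-me'                # placeholder
ALERT_EMAIL_RECIPIENTS = ['oncall@example.com']   # placeholder
ALERT_SLACK_WEBHOOK = 'https://hooks.slack.com/services/...'  # placeholder
ALERT_ERROR_THRESHOLD = 10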

Monitoring Tools Integration

6. Prometheus and Grafana Integration

Export Scrapy metrics to Prometheus for advanced monitoring and visualization.

# prometheus_exporter.py
from prometheus_client import Counter, Histogram, Gauge, start_http_server  # pip install prometheus-client
from scrapy import signals
import time

class PrometheusMiddleware:
    def __init__(self):
        # Define metrics
        self.items_scraped = Counter('scrapy_items_scraped_total', 
                                   'Total items scraped', ['spider'])
        self.requests_made = Counter('scrapy_requests_total', 
                                   'Total requests made', ['spider', 'status'])
        self.response_time = Histogram('scrapy_response_time_seconds', 
                                     'Response time in seconds', ['spider'])
        self.active_spiders = Gauge('scrapy_active_spiders', 
                                  'Number of active spiders')
        self.spider_runtime = Gauge('scrapy_spider_runtime_seconds', 
                                  'Spider runtime in seconds', ['spider'])

        # Start Prometheus metrics server
        start_http_server(8000)

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(middleware.item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(middleware.response_received, signal=signals.response_received)
        return middleware

    def spider_opened(self, spider):
        self.active_spiders.inc()
        spider.start_time = time.time()

    def spider_closed(self, spider):
        self.active_spiders.dec()
        if hasattr(spider, 'start_time'):
            runtime = time.time() - spider.start_time
            self.spider_runtime.labels(spider=spider.name).set(runtime)

    def item_scraped(self, item, response, spider):
        self.items_scraped.labels(spider=spider.name).inc()

    def response_received(self, response, request, spider):
        self.requests_made.labels(spider=spider.name,
                                  status=str(response.status)).inc()

        # Scrapy records the fetch time in request.meta['download_latency']
        latency = request.meta.get('download_latency')
        if latency is not None:
            self.response_time.labels(spider=spider.name).observe(latency)
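
Like the other signal-based components, the exporter can be registered as an extension (a sketch; the module path is an assumption):

# settings.py
EXTENSIONS = {
    'myproject.prometheus_exporter.PrometheusMiddleware': 500,
}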

Production Deployment Monitoring

7. Docker Container Monitoring

When running Scrapy in Docker containers, implement container-specific monitoring.

# docker-compose.yml with monitoring
version: '3.8'
services:
  scrapy-spider:
    build: .
    environment:
      - SCRAPY_SETTINGS_MODULE=myproject.settings.production
    volumes:
      - ./logs:/app/logs
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    ports:
      - "8080:8080"  # Health check endpoint
      - "8000:8000"  # Prometheus metrics

  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
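
The Prometheus container above mounts a prometheus.yml that still needs a scrape job for the spider's metrics endpoint. A minimal sketch, using the service name and port from the compose file:

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'scrapy'
    static_configs:
      - targets: ['scrapy-spider:8000']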

8. Command Line Monitoring Tools

Use command-line tools to monitor running spiders and system resources.

# Monitor Scrapy spider processes
ps aux | grep scrapy

# Check system resource usage
htop

# Monitor log files in real-time
tail -f spider_production.log

# Check spider statistics
curl http://localhost:8080/health | jq

# Monitor network connections
netstat -an | grep :80

# Check disk usage for logs and data
df -h
du -sh logs/

# Monitor memory usage of Scrapy processes
pmap -x $(pidof python)

9. Log Aggregation and Analysis

Set up centralized logging for multiple spider instances.

# structured_logging.py
import json
import logging
from datetime import datetime, timezone

class StructuredLogFormatter(logging.Formatter):
    def format(self, record):
        log_entry = {
            'timestamp': datetime.now(timezone.utc).isoformat(),
            'level': record.levelname,
            'logger': record.name,
            'message': record.getMessage(),
            'module': record.module,
            'function': record.funcName,
            'line': record.lineno
        }

        # Add spider-specific context if available
        if hasattr(record, 'spider'):
            log_entry['spider'] = record.spider
        if hasattr(record, 'url'):
            log_entry['url'] = record.url
        if hasattr(record, 'status_code'):
            log_entry['status_code'] = record.status_code

        return json.dumps(log_entry)

# Usage in settings.py (apply with logging.config.dictConfig, as shown earlier)
LOGGING = {
    'version': 1,
    'formatters': {
        'structured': {
            '()': 'myproject.structured_logging.StructuredLogFormatter',
        },
    },
    },
    'handlers': {
        'structured_file': {
            'level': 'INFO',
            'class': 'logging.handlers.RotatingFileHandler',
            'filename': 'scrapy_structured.log',
            'maxBytes': 10485760,  # 10MB
            'backupCount': 5,
            'formatter': 'structured',
        },
    },
    'loggers': {
        'scrapy': {
            'handlers': ['structured_file'],
            'level': 'INFO',
        },
    },
}

Performance Monitoring Scripts

Create monitoring scripts to track key performance indicators.

#!/usr/bin/env python3
# monitor.py
import json
import sqlite3
import psutil
from datetime import datetime, timedelta

class SpiderMonitor:
    def __init__(self, db_path='spider_monitoring.db'):
        self.db_path = db_path

    def get_recent_runs(self, hours=24):
        """Get spider runs from the last N hours"""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()

        since = datetime.now() - timedelta(hours=hours)
        cursor.execute('''
            SELECT spider_name, start_time, end_time, items_scraped,
                   pages_crawled, errors, runtime_seconds
            FROM spider_runs
            WHERE start_time > ?
            ORDER BY start_time DESC
        ''', (since.isoformat(),))

        runs = cursor.fetchall()
        conn.close()
        return runs

    def get_performance_stats(self):
        """Calculate performance statistics"""
        runs = self.get_recent_runs(24)

        if not runs:
            return {"message": "No recent runs found"}

        total_items = sum(run[3] for run in runs)
        total_pages = sum(run[4] for run in runs)
        total_errors = sum(run[5] for run in runs)
        avg_runtime = sum(run[6] for run in runs) / len(runs)

        return {
            "total_runs": len(runs),
            "total_items_scraped": total_items,
            "total_pages_crawled": total_pages,
            "total_errors": total_errors,
            "average_runtime": avg_runtime,
            "error_rate": (total_errors / total_pages) * 100 if total_pages > 0 else 0,
            "items_per_hour": total_items / 24
        }

    def check_system_health(self):
        """Check system resource usage"""
        return {
            "cpu_percent": psutil.cpu_percent(interval=1),
            "memory_percent": psutil.virtual_memory().percent,
            "disk_percent": psutil.disk_usage('/').percent,
            "load_average": psutil.getloadavg() if hasattr(psutil, 'getloadavg') else None
        }

    def generate_report(self):
        """Generate comprehensive monitoring report"""
        perf_stats = self.get_performance_stats()
        system_health = self.check_system_health()

        report = {
            "timestamp": datetime.now().isoformat(),
            "performance": perf_stats,
            "system_health": system_health
        }

        return report

if __name__ == "__main__":
    monitor = SpiderMonitor()
    report = monitor.generate_report()
    print(json.dumps(report, indent=2))
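
The report script can then be scheduled, for example with cron, to capture periodic snapshots (a sketch; the paths are placeholders):

# Run the monitoring report every 15 minutes and append it to a log file
*/15 * * * * /usr/bin/python3 /app/monitor.py >> /app/logs/monitor_report.log 2>&1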

Best Practices for Production Monitoring

Performance Optimization

  • Monitor memory usage and implement memory management strategies
  • Track request/response ratios to identify bottlenecks
  • Set up automated scaling based on performance metrics
  • Use connection pooling and tune concurrent requests (see the settings sketch below)
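
A minimal settings sketch covering the points above, based on Scrapy's built-in memory usage extension and AutoThrottle (the numbers are placeholders, not recommendations):

# settings.py
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Built-in memory watchdog: warn, then shut the spider down, at the given limits (MB)
MEMUSAGE_ENABLED = True
MEMUSAGE_WARNING_MB = 512
MEMUSAGE_LIMIT_MB = 1024

# Adapt concurrency to observed latencies instead of hammering slow sites
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0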

Error Handling

  • Implement comprehensive error categorization and tracking
  • Set up automated retry mechanisms for transient failures (see the retry settings sketch after this list)
  • Monitor error patterns to identify systematic issues
  • Create alerts for critical failure thresholds
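
Scrapy's built-in RetryMiddleware already handles transient failures; a minimal settings sketch (the values are starting points):

# settings.py
RETRY_ENABLED = True
RETRY_TIMES = 3  # additional attempts per request
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]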

Security Monitoring

  • Track blocked requests and potential bot detection (see the middleware sketch after this list)
  • Monitor IP rotation and proxy performance
  • Log authentication failures and security-related events
  • Implement rate limiting monitoring
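
A small downloader middleware can surface likely blocks (403/429 responses) in the stats so they can feed the alerting shown earlier. This is a sketch; the class name, status codes, and module path are assumptions:

# block_monitoring.py
from urllib.parse import urlparse
import logging

class BlockDetectionMiddleware:
    """Counts responses that usually indicate blocking, per domain."""

    BLOCK_CODES = {403, 429}

    @classmethod
    def from_crawler(cls, crawler):
        mw = cls()
        mw.stats = crawler.stats
        return mw

    def process_response(self, request, response, spider):
        if response.status in self.BLOCK_CODES:
            domain = urlparse(request.url).netloc or 'unknown'
            self.stats.inc_value(f'monitoring/blocked_responses/{domain}')
            logging.getLogger(__name__).warning(
                "Possible block (%s) on %s", response.status, request.url)
        return response

Enable it under DOWNLOADER_MIDDLEWARES like any other downloader middleware, and alert on the monitoring/blocked_responses/* stats.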

Resource Management

  • Monitor disk space usage for logs and data storage
  • Track network bandwidth consumption
  • Implement resource usage alerts and automatic cleanup (see the housekeeping sketch after this list)
  • Set up log rotation and archival policies
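
A small housekeeping script can back the cleanup and disk-space bullets above (a sketch; the path, retention period, and threshold are placeholders):

# cleanup_logs.py
import os
import shutil
import time

LOG_DIR = '/app/logs'      # placeholder path
MAX_AGE_DAYS = 14          # placeholder retention period
DISK_ALERT_PERCENT = 90    # warn when the volume is this full

def prune_old_logs():
    """Delete log files older than the retention period."""
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for name in os.listdir(LOG_DIR):
        path = os.path.join(LOG_DIR, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
            print(f"Removed old log file: {path}")

def check_disk():
    """Print a warning when the log volume is nearly full."""
    usage = shutil.disk_usage(LOG_DIR)
    percent = usage.used / usage.total * 100
    if percent >= DISK_ALERT_PERCENT:
        print(f"WARNING: log volume is {percent:.1f}% full")

if __name__ == '__main__':
    prune_old_logs()
    check_disk()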

Conclusion

Effective monitoring of Scrapy spiders in production requires a multi-layered approach combining logging, metrics collection, health checks, and alerting. By implementing comprehensive monitoring strategies similar to those used in monitoring network requests in browser automation, you can ensure reliable and efficient web scraping operations.

The monitoring setup should evolve with your infrastructure needs, incorporating tools like Prometheus, Grafana, and centralized logging systems. Regular review of monitoring data helps optimize spider performance and prevent issues before they impact your scraping operations. For complex deployment scenarios, consider implementing containerized monitoring solutions to maintain consistency across different environments.

Remember to establish monitoring baselines, set up automated alerts for anomalies, and regularly review and update your monitoring strategy as your scraping requirements evolve. Effective monitoring not only helps maintain system reliability but also provides valuable insights for optimizing your web scraping infrastructure.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
