How do I monitor Scrapy spiders in production?
Monitoring Scrapy spiders in production is essential for reliable web scraping: it lets you detect issues early, optimize performance, and confirm that your spiders keep running as expected. This guide covers monitoring approaches, tools, and best practices for production Scrapy deployments.
Essential Monitoring Components
1. Logging Configuration
Proper logging is the foundation of spider monitoring. Configure comprehensive logging to capture spider activities, errors, and performance metrics.
# settings.py

# Enable comprehensive logging
LOG_ENABLED = True
LOG_ENCODING = 'utf-8'
LOG_LEVEL = 'INFO'
LOG_FILE = 'scrapy.log'

# Include a timestamp and the logger name in every record
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'

# Stats collection (MemoryStatsCollector is Scrapy's default, listed here explicitly)
STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector'

# Optional dictConfig-style logging configuration. Scrapy does not apply this
# setting automatically; pass it to logging.config.dictConfig() yourself
# (see the snippet after this block).
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'verbose': {
            'format': '{levelname} {asctime} {module} {process:d} {thread:d} {message}',
            'style': '{',
        },
    },
    'handlers': {
        'file': {
            'level': 'INFO',
            'class': 'logging.FileHandler',
            'filename': 'spider_production.log',
            'formatter': 'verbose',
        },
        'console': {
            'level': 'DEBUG',
            'class': 'logging.StreamHandler',
            'formatter': 'verbose',
        },
    },
    'loggers': {
        'scrapy': {
            'handlers': ['file', 'console'],
            'level': 'INFO',
            'propagate': False,
        },
    },
}
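Scrapy will not pick up a LOGGING dict on its own. One way to apply it, a minimal sketch assuming the dict lives in your project's settings module and the module and function names below are illustrative, is to hand it to the standard library's dictConfig() before any spiders start:
# logging_setup.py (sketch — module and function names are assumptions)
import logging.config

from scrapy.utils.project import get_project_settings

def apply_dict_logging():
    # Read the LOGGING dict defined in settings.py and apply it via the
    # standard library; call this once, before starting the crawler process.
    settings = get_project_settings()
    logging_config = settings.getdict('LOGGING')
    if logging_config:
        logging.config.dictConfig(logging_config)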
2. Custom Statistics Collection
Implement custom statistics to track spider-specific metrics beyond Scrapy's default stats.
# middlewares.py
import logging
import time
from urllib.parse import urlparse

import psutil
from scrapy import signals
from scrapy.exceptions import NotConfigured


class ProductionMonitoringMiddleware:
    def __init__(self, stats, settings):
        self.stats = stats
        self.start_time = None
        self.logger = logging.getLogger(__name__)

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('PRODUCTION_MONITORING_ENABLED'):
            raise NotConfigured('Production monitoring disabled')
        middleware = cls(crawler.stats, crawler.settings)
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(middleware.item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(middleware.request_reached_downloader,
                                signal=signals.request_reached_downloader)
        return middleware

    def spider_opened(self, spider):
        self.start_time = time.time()
        self.logger.info(f"Spider {spider.name} started monitoring")
        # Record system resources at startup
        self.stats.set_value('monitoring/memory_usage_start', psutil.virtual_memory().percent)
        self.stats.set_value('monitoring/cpu_usage_start', psutil.cpu_percent())

    def spider_closed(self, spider, reason):
        if self.start_time:
            runtime = time.time() - self.start_time
            self.stats.set_value('monitoring/total_runtime', runtime)
        # Record final resource usage
        self.stats.set_value('monitoring/memory_usage_end', psutil.virtual_memory().percent)
        self.stats.set_value('monitoring/cpu_usage_end', psutil.cpu_percent())
        self.logger.info(f"Spider {spider.name} monitoring completed. Reason: {reason}")

    def item_scraped(self, item, response, spider):
        # Track the average scraping rate in items per minute
        if self.start_time:
            elapsed_minutes = (time.time() - self.start_time) / 60
            # Clamp to one minute so very early readings are not wildly inflated
            items_per_minute = self.stats.get_value('item_scraped_count', 0) / max(elapsed_minutes, 1)
            self.stats.set_value('monitoring/items_per_minute', items_per_minute)

    def request_reached_downloader(self, request, spider):
        # Count requests per target domain
        domain = urlparse(request.url).netloc or 'unknown'
        self.stats.inc_value(f'monitoring/requests_per_domain/{domain}')
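Because the class only connects to signals, it can be registered as an extension rather than placed in a middleware chain. A sketch, assuming it lives in myproject/middlewares.py:
# settings.py (sketch — the module path is an assumption)
PRODUCTION_MONITORING_ENABLED = True

# The class only listens to signals, so registering it under EXTENSIONS is enough.
EXTENSIONS = {
    'myproject.middlewares.ProductionMonitoringMiddleware': 500,
}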
3. Health Check Endpoints
Create health check endpoints to monitor spider status programmatically.
# health_check.py
import json
import time

from twisted.web import server, resource
from twisted.internet import reactor


class HealthCheckResource(resource.Resource):
    isLeaf = True

    def __init__(self, crawler_process):
        resource.Resource.__init__(self)
        self.crawler_process = crawler_process
        self.start_time = time.time()

    def render_GET(self, request):
        request.setHeader(b"content-type", b"application/json")
        health_data = {
            "status": "healthy",
            "uptime": time.time() - self.start_time,
            "active_spiders": len(self.crawler_process.crawlers),
            "timestamp": time.time()
        }
        # Collect per-spider stats for every crawler that is currently running
        running_crawlers = []
        for crawler in self.crawler_process.crawlers:
            if hasattr(crawler, 'spider') and crawler.spider:
                stats = crawler.stats.get_stats()
                running_crawlers.append({
                    "name": crawler.spider.name,
                    "items_scraped": stats.get('item_scraped_count', 0),
                    "pages_crawled": stats.get('response_received_count', 0),
                    "errors": stats.get('spider_exceptions', 0)
                })
        health_data["running_spiders"] = running_crawlers
        if not running_crawlers:
            health_data["status"] = "no_active_spiders"
        return json.dumps(health_data, indent=2).encode('utf-8')


def start_health_check_server(crawler_process, port=8080):
    """Register the health check endpoint on the Twisted reactor Scrapy already runs."""
    root = HealthCheckResource(crawler_process)
    site = server.Site(root)
    # Scrapy drives a Twisted reactor in the main thread, so schedule the listener
    # on that reactor instead of spawning a separate thread: the reactor is not
    # thread-safe, and listenTCP returns immediately anyway.
    reactor.callWhenRunning(reactor.listenTCP, port, site)
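A minimal way to wire this in, a sketch that assumes your project defines a spider called "example", is to register the health check before starting the crawl:
# run_with_health_check.py (sketch — the spider name is hypothetical)
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from health_check import start_health_check_server

process = CrawlerProcess(get_project_settings())
start_health_check_server(process, port=8080)  # endpoint goes live once the reactor starts
process.crawl('example')                       # hypothetical spider name
process.start()                                # blocks until all crawls finish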
Advanced Monitoring Strategies
4. Database Monitoring Integration
Track spider performance and results in a database for historical analysis.
# monitoring_pipeline.py
import sqlite3
import time
from datetime import datetime


class MonitoringPipeline:
    def __init__(self, db_path='spider_monitoring.db'):
        self.db_path = db_path
        self.setup_database()

    def setup_database(self):
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        # Create monitoring table
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS spider_runs (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                spider_name TEXT,
                start_time TIMESTAMP,
                end_time TIMESTAMP,
                items_scraped INTEGER,
                pages_crawled INTEGER,
                errors INTEGER,
                status TEXT,
                runtime_seconds REAL,
                memory_usage REAL,
                cpu_usage REAL
            )
        ''')
        # Create real-time metrics table
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS real_time_metrics (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                spider_name TEXT,
                timestamp TIMESTAMP,
                metric_name TEXT,
                metric_value REAL
            )
        ''')
        conn.commit()
        conn.close()

    def open_spider(self, spider):
        self.spider_start_time = time.time()
        self.spider_name = spider.name

    def close_spider(self, spider):
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        stats = spider.crawler.stats.get_stats()
        end_time = time.time()
        runtime = end_time - self.spider_start_time
        cursor.execute('''
            INSERT INTO spider_runs
            (spider_name, start_time, end_time, items_scraped, pages_crawled,
             errors, status, runtime_seconds, memory_usage, cpu_usage)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        ''', (
            self.spider_name,
            datetime.fromtimestamp(self.spider_start_time),
            datetime.fromtimestamp(end_time),
            stats.get('item_scraped_count', 0),
            stats.get('response_received_count', 0),
            stats.get('spider_exceptions', 0),
            'completed',
            runtime,
            stats.get('monitoring/memory_usage_end', 0),
            stats.get('monitoring/cpu_usage_end', 0)
        ))
        conn.commit()
        conn.close()

    def process_item(self, item, spider):
        # Log real-time metrics
        self.log_metric(spider.name, 'items_processed', 1)
        return item

    def log_metric(self, spider_name, metric_name, value):
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        cursor.execute('''
            INSERT INTO real_time_metrics (spider_name, timestamp, metric_name, metric_value)
            VALUES (?, ?, ?, ?)
        ''', (spider_name, datetime.now(), metric_name, value))
        conn.commit()
        conn.close()
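To activate the pipeline, register it in ITEM_PIPELINES. A sketch, assuming the module path myproject.monitoring_pipeline; the high order number keeps it running after your data-processing pipelines:
# settings.py (sketch — the module path is an assumption)
ITEM_PIPELINES = {
    'myproject.monitoring_pipeline.MonitoringPipeline': 800,  # run late, after data pipelines
}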
5. Email and Slack Alerting
Set up automated alerts for spider failures and performance issues.
# alerting.py
import json
import smtplib
import time
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

import requests
from scrapy import signals


class AlertingMiddleware:
    def __init__(self, settings):
        self.email_settings = {
            'smtp_server': settings.get('ALERT_SMTP_SERVER'),
            'smtp_port': settings.getint('ALERT_SMTP_PORT', 587),
            'username': settings.get('ALERT_EMAIL_USERNAME'),
            'password': settings.get('ALERT_EMAIL_PASSWORD'),
            'recipients': settings.getlist('ALERT_EMAIL_RECIPIENTS')
        }
        self.slack_webhook = settings.get('ALERT_SLACK_WEBHOOK')
        self.error_threshold = settings.getint('ALERT_ERROR_THRESHOLD', 10)

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls(crawler.settings)
        crawler.signals.connect(middleware.spider_error, signal=signals.spider_error)
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_error(self, failure, response, spider):
        error_count = spider.crawler.stats.get_value('spider_exceptions', 0)
        if error_count > self.error_threshold:
            message = f"Spider {spider.name} has exceeded error threshold ({error_count} errors)"
            self.send_alert("Spider Error Alert", message, "error")

    def spider_closed(self, spider, reason):
        if reason != 'finished':
            message = f"Spider {spider.name} closed unexpectedly. Reason: {reason}"
            self.send_alert("Spider Closure Alert", message, "warning")

    def send_alert(self, subject, message, alert_type):
        # Send email alert
        if self.email_settings['smtp_server']:
            self.send_email_alert(subject, message)
        # Send Slack alert
        if self.slack_webhook:
            self.send_slack_alert(subject, message, alert_type)

    def send_email_alert(self, subject, message):
        try:
            msg = MIMEMultipart()
            msg['From'] = self.email_settings['username']
            msg['To'] = ', '.join(self.email_settings['recipients'])
            msg['Subject'] = subject
            msg.attach(MIMEText(message, 'plain'))
            server = smtplib.SMTP(self.email_settings['smtp_server'],
                                  self.email_settings['smtp_port'])
            server.starttls()
            server.login(self.email_settings['username'],
                         self.email_settings['password'])
            server.sendmail(self.email_settings['username'],
                            self.email_settings['recipients'], msg.as_string())
            server.quit()
        except Exception as e:
            print(f"Failed to send email alert: {e}")

    def send_slack_alert(self, subject, message, alert_type):
        color_map = {
            'error': '#FF0000',
            'warning': '#FFA500',
            'info': '#00FF00'
        }
        payload = {
            "attachments": [{
                "color": color_map.get(alert_type, '#808080'),
                "title": subject,
                "text": message,
                "ts": int(time.time())
            }]
        }
        try:
            response = requests.post(self.slack_webhook,
                                     data=json.dumps(payload),
                                     headers={'Content-Type': 'application/json'})
            response.raise_for_status()
        except Exception as e:
            print(f"Failed to send Slack alert: {e}")
Monitoring Tools Integration
6. Prometheus and Grafana Integration
Export Scrapy metrics to Prometheus for advanced monitoring and visualization.
# prometheus_exporter.py
import time

from prometheus_client import Counter, Histogram, Gauge, start_http_server
from scrapy import signals


class PrometheusMiddleware:
    def __init__(self):
        # Define metrics
        self.items_scraped = Counter('scrapy_items_scraped_total',
                                     'Total items scraped', ['spider'])
        self.requests_made = Counter('scrapy_requests_total',
                                     'Total requests made', ['spider', 'status'])
        self.response_time = Histogram('scrapy_response_time_seconds',
                                       'Response time in seconds', ['spider'])
        self.active_spiders = Gauge('scrapy_active_spiders',
                                    'Number of active spiders')
        self.spider_runtime = Gauge('scrapy_spider_runtime_seconds',
                                    'Spider runtime in seconds', ['spider'])
        # Start the Prometheus metrics endpoint. Instantiate this class only once
        # per process, or the port and metric registrations will collide.
        start_http_server(8000)

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(middleware.item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(middleware.response_received, signal=signals.response_received)
        return middleware

    def spider_opened(self, spider):
        self.active_spiders.inc()
        spider.start_time = time.time()

    def spider_closed(self, spider):
        self.active_spiders.dec()
        if hasattr(spider, 'start_time'):
            runtime = time.time() - spider.start_time
            self.spider_runtime.labels(spider=spider.name).set(runtime)

    def item_scraped(self, item, response, spider):
        self.items_scraped.labels(spider=spider.name).inc()

    def response_received(self, response, request, spider):
        self.requests_made.labels(spider=spider.name,
                                  status=str(response.status)).inc()
        # Scrapy records the download time in request.meta['download_latency']
        latency = request.meta.get('download_latency')
        if latency is not None:
            self.response_time.labels(spider=spider.name).observe(latency)
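To use the exporter, register it once for the process and point Prometheus at port 8000, the port opened by start_http_server and published in the docker-compose file below. A sketch, assuming the module path myproject.prometheus_exporter:
# settings.py (sketch — the module path is an assumption)
EXTENSIONS = {
    'myproject.prometheus_exporter.PrometheusMiddleware': 500,
}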
Production Deployment Monitoring
7. Docker Container Monitoring
When running Scrapy in Docker containers, implement container-specific monitoring.
# docker-compose.yml with monitoring
version: '3.8'

services:
  scrapy-spider:
    build: .
    environment:
      - SCRAPY_SETTINGS_MODULE=myproject.settings.production
    volumes:
      - ./logs:/app/logs
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    ports:
      - "8080:8080"  # Health check endpoint
      - "8000:8000"  # Prometheus metrics

  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
8. Command Line Monitoring Tools
Use command-line tools to monitor running spiders and system resources.
# Monitor Scrapy spider processes
ps aux | grep scrapy
# Check system resource usage
htop
# Monitor log files in real-time
tail -f spider_production.log
# Check spider statistics
curl http://localhost:8080/health | jq
# Monitor network connections
netstat -an | grep :80
# Check disk usage for logs and data
df -h
du -sh logs/
# Monitor memory usage of Scrapy processes
pmap -x $(pidof python)
9. Log Aggregation and Analysis
Set up centralized logging for multiple spider instances.
# structured_logging.py
import json
import logging
from datetime import datetime, timezone


class StructuredLogFormatter(logging.Formatter):
    def format(self, record):
        log_entry = {
            'timestamp': datetime.now(timezone.utc).isoformat(),
            'level': record.levelname,
            'logger': record.name,
            'message': record.getMessage(),
            'module': record.module,
            'function': record.funcName,
            'line': record.lineno
        }
        # Add spider-specific context if available (attached via the `extra` argument)
        if hasattr(record, 'spider'):
            log_entry['spider'] = record.spider
        if hasattr(record, 'url'):
            log_entry['url'] = record.url
        if hasattr(record, 'status_code'):
            log_entry['status_code'] = record.status_code
        return json.dumps(log_entry)
# Usage in settings.py — apply with logging.config.dictConfig() as shown earlier;
# Scrapy does not consume this dict on its own
LOGGING = {
    'version': 1,
    'formatters': {
        'structured': {
            '()': 'myproject.structured_logging.StructuredLogFormatter',
        },
    },
    'handlers': {
        'structured_file': {
            'level': 'INFO',
            'class': 'logging.handlers.RotatingFileHandler',
            'filename': 'scrapy_structured.log',
            'maxBytes': 10485760,  # 10 MB
            'backupCount': 5,
            'formatter': 'structured',
        },
    },
    'loggers': {
        'scrapy': {
            'handlers': ['structured_file'],
            'level': 'INFO',
        },
    },
}
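The formatter only emits the spider, url, and status_code fields when those attributes are present on the log record. A minimal sketch of how a spider can attach them with the standard `extra` argument; the spider below is hypothetical:
# myspider.py (sketch — a hypothetical spider showing the `extra` argument)
import logging

import scrapy

logger = logging.getLogger(__name__)


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Attach structured context; StructuredLogFormatter picks these up
        # as attributes on the log record.
        logger.info("page processed", extra={
            'spider': self.name,
            'url': response.url,
            'status_code': response.status,
        })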
Performance Monitoring Scripts
Create monitoring scripts to track key performance indicators.
#!/usr/bin/env python3
# monitor.py
import json
import sqlite3
from datetime import datetime, timedelta

import psutil


class SpiderMonitor:
    def __init__(self, db_path='spider_monitoring.db'):
        self.db_path = db_path

    def get_recent_runs(self, hours=24):
        """Get spider runs from the last N hours"""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        since = datetime.now() - timedelta(hours=hours)
        cursor.execute('''
            SELECT spider_name, start_time, end_time, items_scraped,
                   pages_crawled, errors, runtime_seconds
            FROM spider_runs
            WHERE start_time > ?
            ORDER BY start_time DESC
        ''', (since,))
        runs = cursor.fetchall()
        conn.close()
        return runs

    def get_performance_stats(self):
        """Calculate performance statistics over the last 24 hours"""
        runs = self.get_recent_runs(24)
        if not runs:
            return {"message": "No recent runs found"}
        total_items = sum(run[3] for run in runs)
        total_pages = sum(run[4] for run in runs)
        total_errors = sum(run[5] for run in runs)
        avg_runtime = sum(run[6] for run in runs) / len(runs)
        return {
            "total_runs": len(runs),
            "total_items_scraped": total_items,
            "total_pages_crawled": total_pages,
            "total_errors": total_errors,
            "average_runtime": avg_runtime,
            "error_rate": (total_errors / total_pages) * 100 if total_pages > 0 else 0,
            "items_per_hour": total_items / 24
        }

    def check_system_health(self):
        """Check system resource usage"""
        return {
            "cpu_percent": psutil.cpu_percent(interval=1),
            "memory_percent": psutil.virtual_memory().percent,
            "disk_percent": psutil.disk_usage('/').percent,
            "load_average": psutil.getloadavg() if hasattr(psutil, 'getloadavg') else None
        }

    def generate_report(self):
        """Generate a comprehensive monitoring report"""
        report = {
            "timestamp": datetime.now().isoformat(),
            "performance": self.get_performance_stats(),
            "system_health": self.check_system_health()
        }
        return report


if __name__ == "__main__":
    monitor = SpiderMonitor()
    report = monitor.generate_report()
    print(json.dumps(report, indent=2))
Best Practices for Production Monitoring
Performance Optimization
- Monitor memory usage and implement memory management strategies
- Track request/response ratios to identify bottlenecks
- Set up automated scaling based on performance metrics
- Use connection pooling and optimize concurrent requests (a settings sketch follows this list)
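Several of the memory and concurrency points above map directly onto built-in Scrapy settings. A hedged starting point; the numbers are illustrative and should be tuned for your targets:
# settings.py (sketch — illustrative values, not recommendations)
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 0.5

# Let Scrapy adapt the request rate to how the target responds
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0

# Built-in memory guard: warn first, then close the spider if usage keeps growing
MEMUSAGE_ENABLED = True
MEMUSAGE_WARNING_MB = 1024
MEMUSAGE_LIMIT_MB = 2048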
Error Handling
- Implement comprehensive error categorization and tracking
- Set up automated retry mechanisms for transient failures (see the sketch after this list)
- Monitor error patterns to identify systematic issues
- Create alerts for critical failure thresholds
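For the retry point above, Scrapy's built-in RetryMiddleware is usually enough; a sketch of the relevant settings with illustrative values:
# settings.py (sketch — illustrative values)
RETRY_ENABLED = True
RETRY_TIMES = 3  # retries per request, on top of the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]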
Security Monitoring
- Track blocked requests and potential bot detection
- Monitor IP rotation and proxy performance
- Log authentication failures and security-related events
- Implement rate limiting monitoring
Resource Management
- Monitor disk space usage for logs and data storage
- Track network bandwidth consumption
- Implement resource usage alerts and automatic cleanup
- Set up log rotation and archival policies
Conclusion
Effective monitoring of Scrapy spiders in production requires a multi-layered approach combining logging, metrics collection, health checks, and alerting. Much like monitoring network requests in browser automation, the goal is to make every stage of the pipeline observable so that scraping operations stay reliable and efficient.
The monitoring setup should evolve with your infrastructure needs, incorporating tools like Prometheus, Grafana, and centralized logging systems. Regular review of monitoring data helps optimize spider performance and prevent issues before they impact your scraping operations. For complex deployment scenarios, consider implementing containerized monitoring solutions to maintain consistency across different environments.
Remember to establish monitoring baselines, set up automated alerts for anomalies, and regularly review and update your monitoring strategy as your scraping requirements evolve. Effective monitoring not only helps maintain system reliability but also provides valuable insights for optimizing your web scraping infrastructure.