How can I implement logging and monitoring for PHP web scraping projects?
Reliable PHP web scraping projects depend on solid logging and monitoring: they let you track performance, debug failures, and keep scrapers running smoothly in production. This guide covers logging strategies, monitoring techniques, and best practices for PHP web scraping applications.
Understanding the Importance of Logging and Monitoring
Web scraping operations face unique challenges including network failures, anti-bot measures, rate limiting, and dynamic content changes. Without proper logging and monitoring, these issues can go undetected, leading to data loss and system failures. A robust monitoring system provides:
- Real-time visibility into scraping operations
- Error detection and alerting for immediate issue resolution
- Performance metrics to optimize scraping efficiency
- Audit trails for compliance and debugging
- Resource utilization tracking to prevent system overload
Setting Up PSR-3 Compatible Logging
The PHP-FIG PSR-3 logging standard provides a consistent interface for logging across different libraries. Monolog is the most popular PSR-3 compatible logging library for PHP applications.
Installing and Configuring Monolog
composer require monolog/monolog
Create a comprehensive logging configuration:
<?php
use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Handler\RotatingFileHandler;
use Monolog\Handler\SlackWebhookHandler;
use Monolog\Formatter\LineFormatter;
use Monolog\Processor\IntrospectionProcessor;
use Monolog\Processor\MemoryUsageProcessor;

class ScrapingLogger
{
    private $logger;

    public function __construct($name = 'web-scraper')
    {
        $this->logger = new Logger($name);

        // Console output for development; %extra% makes the processor
        // data (caller info, memory usage) visible in the formatted line
        $consoleHandler = new StreamHandler('php://stdout', Logger::DEBUG);
        $consoleHandler->setFormatter(new LineFormatter(
            "[%datetime%] %level_name%: %message% %context% %extra%\n"
        ));

        // Rotating file handler for production
        $fileHandler = new RotatingFileHandler(
            '/var/log/scraper/scraper.log',
            7, // Keep 7 days of logs
            Logger::INFO
        );

        // Critical errors to Slack. Note the argument order: the minimum
        // log level is the eighth parameter, after the attachment options.
        $slackHandler = new SlackWebhookHandler(
            'YOUR_SLACK_WEBHOOK_URL',
            null,           // channel (webhook default)
            'scraper-bot',  // bot username
            true,           // use attachments
            null,           // icon emoji
            false,          // short attachments
            false,          // include context and extra
            Logger::ERROR   // minimum level
        );

        // Add processors for additional context
        $this->logger->pushProcessor(new IntrospectionProcessor());
        $this->logger->pushProcessor(new MemoryUsageProcessor());

        $this->logger->pushHandler($consoleHandler);
        $this->logger->pushHandler($fileHandler);
        $this->logger->pushHandler($slackHandler);
    }

    public function getLogger()
    {
        return $this->logger;
    }
}
Implementing Structured Logging for Scraping Operations
Structured logging with consistent context makes log analysis much easier. Create a specialized scraping logger class:
<?php
class ScrapingOperationLogger
{
    private $logger;
    private $sessionId;
    private $startTime;

    public function __construct($logger, $sessionId = null)
    {
        $this->logger = $logger;
        $this->sessionId = $sessionId ?: uniqid('scrape_');
        $this->startTime = microtime(true);
    }

    public function logRequest($url, $method = 'GET', $headers = [])
    {
        $this->logger->info('HTTP Request initiated', [
            'session_id' => $this->sessionId,
            'url' => $url,
            'method' => $method,
            'headers' => $headers,
            'timestamp' => date('c')
        ]);
    }

    public function logResponse($url, $statusCode, $responseTime, $contentLength = null)
    {
        $level = $statusCode >= 400 ? 'error' : 'info';
        $this->logger->log($level, 'HTTP Response received', [
            'session_id' => $this->sessionId,
            'url' => $url,
            'status_code' => $statusCode,
            'response_time_ms' => round($responseTime * 1000, 2),
            'content_length' => $contentLength,
            'timestamp' => date('c')
        ]);
    }

    public function logDataExtraction($url, $extractedCount, $extractionTime)
    {
        $this->logger->info('Data extraction completed', [
            'session_id' => $this->sessionId,
            'url' => $url,
            'extracted_items' => $extractedCount,
            'extraction_time_ms' => round($extractionTime * 1000, 2),
            'timestamp' => date('c')
        ]);
    }

    public function logError($message, $context = [])
    {
        $this->logger->error($message, array_merge($context, [
            'session_id' => $this->sessionId,
            'timestamp' => date('c')
        ]));
    }

    public function logSessionSummary($totalRequests, $successfulRequests, $totalItems)
    {
        $duration = microtime(true) - $this->startTime;
        $successRate = $totalRequests > 0 ? ($successfulRequests / $totalRequests) * 100 : 0;

        $this->logger->info('Scraping session completed', [
            'session_id' => $this->sessionId,
            'total_requests' => $totalRequests,
            'successful_requests' => $successfulRequests,
            'success_rate_percent' => round($successRate, 2),
            'total_items_extracted' => $totalItems,
            'session_duration_seconds' => round($duration, 2),
            'requests_per_minute' => round(($totalRequests / $duration) * 60, 2),
            'timestamp' => date('c')
        ]);
    }
}
Performance Monitoring and Metrics Collection
Implement comprehensive performance monitoring to track scraping efficiency and system health:
<?php
class ScrapingMetrics
{
    private $metrics = [];
    private $startTime;

    public function __construct()
    {
        $this->startTime = microtime(true);
        $this->metrics = [
            'requests_total' => 0,
            'requests_successful' => 0,
            'requests_failed' => 0,
            'response_times' => [],
            'status_codes' => [],
            'memory_usage' => [],
            'items_extracted' => 0,
            'errors' => []
        ];
    }

    public function recordRequest($url, $statusCode, $responseTime, $memoryUsage = null)
    {
        $this->metrics['requests_total']++;
        if ($statusCode >= 200 && $statusCode < 400) {
            $this->metrics['requests_successful']++;
        } else {
            $this->metrics['requests_failed']++;
        }
        $this->metrics['response_times'][] = $responseTime;
        $this->metrics['status_codes'][$statusCode] =
            ($this->metrics['status_codes'][$statusCode] ?? 0) + 1;
        if ($memoryUsage) {
            $this->metrics['memory_usage'][] = $memoryUsage;
        }
    }

    public function recordDataExtraction($itemCount)
    {
        $this->metrics['items_extracted'] += $itemCount;
    }

    public function recordError($error, $url = null)
    {
        $this->metrics['errors'][] = [
            'error' => $error,
            'url' => $url,
            'timestamp' => date('c')
        ];
    }

    public function getMetricsSummary()
    {
        $duration = microtime(true) - $this->startTime;
        $responseTimes = $this->metrics['response_times'];

        return [
            'session_duration' => round($duration, 2),
            'total_requests' => $this->metrics['requests_total'],
            'successful_requests' => $this->metrics['requests_successful'],
            'failed_requests' => $this->metrics['requests_failed'],
            'success_rate' => $this->calculateSuccessRate(),
            'average_response_time' => $this->calculateAverageResponseTime(),
            'min_response_time' => !empty($responseTimes) ? min($responseTimes) : 0,
            'max_response_time' => !empty($responseTimes) ? max($responseTimes) : 0,
            'requests_per_minute' => round(($this->metrics['requests_total'] / $duration) * 60, 2),
            'items_per_minute' => round(($this->metrics['items_extracted'] / $duration) * 60, 2),
            'status_code_distribution' => $this->metrics['status_codes'],
            'total_items_extracted' => $this->metrics['items_extracted'],
            'total_errors' => count($this->metrics['errors']),
            'memory_usage' => $this->calculateMemoryStats()
        ];
    }

    private function calculateSuccessRate()
    {
        return $this->metrics['requests_total'] > 0
            ? round(($this->metrics['requests_successful'] / $this->metrics['requests_total']) * 100, 2)
            : 0;
    }

    private function calculateAverageResponseTime()
    {
        $times = $this->metrics['response_times'];
        return !empty($times) ? round(array_sum($times) / count($times), 3) : 0;
    }

    private function calculateMemoryStats()
    {
        $memory = $this->metrics['memory_usage'];
        if (empty($memory)) {
            return ['current' => memory_get_usage(true), 'peak' => memory_get_peak_usage(true)];
        }
        return [
            'average' => round(array_sum($memory) / count($memory)),
            'min' => min($memory),
            'max' => max($memory),
            'current' => memory_get_usage(true),
            'peak' => memory_get_peak_usage(true)
        ];
    }
}
Real-time Monitoring and Alerting
Implement real-time monitoring capabilities to detect issues immediately:
<?php
class ScrapingMonitor
{
    private $logger;
    private $metrics;
    private $thresholds;

    public function __construct($logger, $metrics)
    {
        $this->logger = $logger;
        $this->metrics = $metrics;
        $this->thresholds = [
            'max_response_time' => 10.0,              // seconds
            'min_success_rate' => 85.0,               // percentage
            'max_error_rate' => 15.0,                 // percentage
            'max_memory_usage' => 512 * 1024 * 1024,  // 512MB
            'max_consecutive_failures' => 5
        ];
    }

    public function checkHealthStatus()
    {
        $summary = $this->metrics->getMetricsSummary();
        $alerts = [];

        // Check response time
        if ($summary['average_response_time'] > $this->thresholds['max_response_time']) {
            $alerts[] = [
                'level' => 'warning',
                'message' => 'High average response time detected',
                'value' => $summary['average_response_time'],
                'threshold' => $this->thresholds['max_response_time']
            ];
        }

        // Check success rate
        if ($summary['success_rate'] < $this->thresholds['min_success_rate']) {
            $alerts[] = [
                'level' => 'critical',
                'message' => 'Low success rate detected',
                'value' => $summary['success_rate'],
                'threshold' => $this->thresholds['min_success_rate']
            ];
        }

        // Check memory usage
        $currentMemory = memory_get_usage(true);
        if ($currentMemory > $this->thresholds['max_memory_usage']) {
            $alerts[] = [
                'level' => 'warning',
                'message' => 'High memory usage detected',
                'value' => round($currentMemory / 1024 / 1024, 2) . 'MB',
                'threshold' => round($this->thresholds['max_memory_usage'] / 1024 / 1024, 2) . 'MB'
            ];
        }

        // Send alerts if any issues detected
        foreach ($alerts as $alert) {
            $this->sendAlert($alert);
        }

        return [
            'status' => empty($alerts) ? 'healthy' : 'warning',
            'alerts' => $alerts,
            'metrics' => $summary
        ];
    }

    private function sendAlert($alert)
    {
        $level = $alert['level'] === 'critical' ? 'critical' : 'warning';
        $this->logger->log($level, $alert['message'], [
            'current_value' => $alert['value'],
            'threshold' => $alert['threshold'],
            'alert_level' => $alert['level'],
            'timestamp' => date('c')
        ]);
    }
}
Integrating with External Monitoring Services
For production environments, integrate with external monitoring services for comprehensive observability. The example below reports to StatsD (and thus Datadog) using the domnikl/statsd client, installed with composer require domnikl/statsd:
<?php
// Example integration with StatsD/Datadog
class ExternalMetricsReporter
{
    private $statsd;

    public function __construct($statsdHost = 'localhost', $statsdPort = 8125)
    {
        $this->statsd = new \Domnikl\Statsd\Client(
            new \Domnikl\Statsd\Connection\UdpSocket($statsdHost, $statsdPort),
            'webscraper'
        );
    }

    public function reportMetrics($metrics)
    {
        // Counter metrics
        $this->statsd->count('requests.total', $metrics['total_requests']);
        $this->statsd->count('requests.successful', $metrics['successful_requests']);
        $this->statsd->count('requests.failed', $metrics['failed_requests']);
        $this->statsd->count('items.extracted', $metrics['total_items_extracted']);

        // Gauge metrics
        $this->statsd->gauge('performance.success_rate', $metrics['success_rate']);
        $this->statsd->gauge('performance.requests_per_minute', $metrics['requests_per_minute']);
        $this->statsd->gauge('memory.current', $metrics['memory_usage']['current']);

        // Timing metrics
        $this->statsd->timing('response_time.average', $metrics['average_response_time'] * 1000);
        $this->statsd->timing('response_time.max', $metrics['max_response_time'] * 1000);
    }
}
Error Handling and Recovery Logging
Implement comprehensive error handling with detailed logging:
<?php
class ScrapingErrorHandler
{
    private $logger;
    private $retryAttempts = [];

    public function __construct($logger)
    {
        $this->logger = $logger;
    }

    public function handleError($error, $url, $context = [])
    {
        $errorId = uniqid('error_');
        $this->logger->error('Scraping error occurred', [
            'error_id' => $errorId,
            'error_message' => $error->getMessage(),
            'error_code' => $error->getCode(),
            'url' => $url,
            'context' => $context,
            'stack_trace' => $error->getTraceAsString(),
            'timestamp' => date('c')
        ]);
        return $errorId;
    }

    public function logRetryAttempt($url, $attempt, $maxAttempts, $delay)
    {
        $this->retryAttempts[$url] = ($this->retryAttempts[$url] ?? 0) + 1;
        $this->logger->warning('Retry attempt initiated', [
            'url' => $url,
            'attempt' => $attempt,
            'max_attempts' => $maxAttempts,
            'delay_seconds' => $delay,
            'total_retries' => $this->retryAttempts[$url],
            'timestamp' => date('c')
        ]);
    }

    public function logRecovery($url, $finalAttempt)
    {
        $this->logger->info('Request recovered successfully', [
            'url' => $url,
            'successful_attempt' => $finalAttempt,
            'total_retries' => $this->retryAttempts[$url] ?? 0,
            'timestamp' => date('c')
        ]);
    }
}
Console Commands for Log Analysis
Create console commands for analyzing scraping logs:
#!/bin/bash
# analyze_scraping_logs.sh
echo "=== Scraping Session Analysis ==="
# Extract session summaries
echo "Recent Session Summaries:"
grep "Scraping session completed" /var/log/scraper/scraper.log | tail -5
# Calculate average success rate
echo -e "\nAverage Success Rate (Last 24 hours):"
grep "$(date -d '1 day ago' '+%Y-%m-%d')" /var/log/scraper/scraper.log | \
grep "success_rate_percent" | \
awk -F'success_rate_percent":' '{print $2}' | \
awk -F',' '{sum+=$1; count++} END {if(count>0) print sum/count"%"}'
# Top error URLs
echo -e "\nTop Error URLs:"
grep "ERROR" /var/log/scraper/scraper.log | \
grep -o '"url":"[^"]*"' | \
sort | uniq -c | sort -nr | head -10
# Memory usage trends
echo -e "\nMemory Usage Trends:"
grep "memory_usage" /var/log/scraper/scraper.log | tail -10
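The awk pipelines above get fragile as context fields grow. A common alternative is line-delimited JSON logs, which tools like jq can query directly. A sketch, assuming the Monolog install from earlier; the channel name and log fields are illustrative:

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // Composer autoloader for Monolog

use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Formatter\JsonFormatter;

$logger = new Logger('web-scraper');

// One JSON object per line: trivially parsed by jq, Logstash, etc.
$handler = new StreamHandler('php://stdout', Logger::INFO);
$handler->setFormatter(new JsonFormatter());
$logger->pushHandler($handler);

$logger->info('HTTP Response received', [
    'url' => 'https://example.com',
    'status_code' => 200,
    'response_time_ms' => 412.5,
]);
```

With JSON logs, a query like the success-rate analysis becomes a one-liner such as jq 'select(.message == "Scraping session completed") | .context.success_rate_percent' scraper.log.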
Integration with PHP Web Scraping Workflows
When implementing monitoring in complex scraping scenarios, such as when handling errors or managing timeouts in Puppeteer, ensure your logging captures cross-technology interactions and provides enough context for debugging across both the PHP and browser-automation layers.
Practical Implementation Example
<?php
// Complete implementation example. makeHttpRequest() and extractData()
// are placeholders for your own HTTP client and parsing code.
$logger = new ScrapingLogger();
$metrics = new ScrapingMetrics();
$monitor = new ScrapingMonitor($logger->getLogger(), $metrics);
$errorHandler = new ScrapingErrorHandler($logger->getLogger());
$operationLogger = new ScrapingOperationLogger($logger->getLogger());

// Example target and headers for illustration
$url = 'https://example.com';
$headers = ['User-Agent' => 'MyScraper/1.0'];

try {
    $startTime = microtime(true);

    // Make HTTP request
    $operationLogger->logRequest($url, 'GET', $headers);
    $response = makeHttpRequest($url, $headers);
    $responseTime = microtime(true) - $startTime;

    // Log response and record metrics
    $operationLogger->logResponse($url, $response->getStatusCode(), $responseTime, strlen($response->getBody()));
    $metrics->recordRequest($url, $response->getStatusCode(), $responseTime, memory_get_usage(true));

    // Extract data
    $extractionStart = microtime(true);
    $extractedData = extractData($response->getBody());
    $extractionTime = microtime(true) - $extractionStart;

    $operationLogger->logDataExtraction($url, count($extractedData), $extractionTime);
    $metrics->recordDataExtraction(count($extractedData));
} catch (Exception $e) {
    $errorId = $errorHandler->handleError($e, $url, ['headers' => $headers]);
    $metrics->recordError($e->getMessage(), $url);

    // Implement retry logic with logging
    for ($attempt = 1; $attempt <= 3; $attempt++) {
        $errorHandler->logRetryAttempt($url, $attempt, 3, 5);
        sleep(5);
        try {
            $response = makeHttpRequest($url, $headers); // retry the request
            $errorHandler->logRecovery($url, $attempt);
            break;
        } catch (Exception $retryError) {
            if ($attempt === 3) {
                $errorHandler->handleError($retryError, $url, ['final_attempt' => true]);
            }
        }
    }
}

// Check health status and generate alerts
$healthStatus = $monitor->checkHealthStatus();
if ($healthStatus['status'] !== 'healthy') {
    // Handle alerts: send notifications, trigger actions, etc.
    foreach ($healthStatus['alerts'] as $alert) {
        // ...
    }
}
Best Practices for Production Monitoring
1. Log Rotation and Retention
Configure automatic log rotation to prevent disk space issues:
// Use RotatingFileHandler with appropriate retention
$handler = new RotatingFileHandler('/var/log/scraper.log', 30, Logger::INFO);
2. Structured Context
Always include relevant context in logs for easier analysis and debugging.
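One simple way to enforce consistent context is a helper (hypothetical, not part of Monolog) that merges a fixed base context into every call, so required fields like session_id are never omitted:

```php
<?php

// Hypothetical helper: merges a fixed base context into each log call's
// context array so required fields are always present.
function withBaseContext(array $baseContext, array $context): array
{
    // Per-call values win over the base defaults.
    return array_merge($baseContext, $context);
}

// Example base fields shared by every log entry in a session
$base = ['session_id' => 'scrape_abc123', 'scraper_version' => '1.4.0'];

$context = withBaseContext($base, ['url' => 'https://example.com', 'status_code' => 200]);
// $context now carries session_id and scraper_version alongside the per-call fields
```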
3. Performance Impact
Monitor the performance impact of logging itself, especially in high-throughput scenarios.
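One mitigation worth knowing is Monolog's FingersCrossedHandler: it buffers records in memory and only writes them out once a trigger level is reached, so quiet, successful runs pay almost nothing for verbose debug logging. A sketch, assuming the same Monolog install; the log path is illustrative:

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // Composer autoloader for Monolog

use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Handler\FingersCrossedHandler;

$logger = new Logger('web-scraper');

// Buffer everything; flush the whole buffer only when a WARNING (or worse) occurs.
$inner = new StreamHandler('/tmp/scraper-debug.log', Logger::DEBUG);
$logger->pushHandler(new FingersCrossedHandler($inner, Logger::WARNING));

$logger->debug('Parsed page');          // buffered, not written yet
$logger->warning('Request throttled');  // trigger: both records are flushed
```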
4. Security Considerations
Never log sensitive information like passwords, API keys, or personal data.
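A Monolog processor is just a callable that sees each record before any handler, which makes it a natural place for redaction. A minimal sketch; the key list is illustrative and should be extended for your application:

```php
<?php

// Recursively mask values for known-sensitive keys before they reach any handler.
function redactContext(array $context): array
{
    $sensitive = ['password', 'api_key', 'authorization', 'cookie', 'token'];
    foreach ($context as $key => $value) {
        if (is_array($value)) {
            $context[$key] = redactContext($value);
        } elseif (in_array(strtolower((string) $key), $sensitive, true)) {
            $context[$key] = '***REDACTED***';
        }
    }
    return $context;
}

// Wiring it up as a processor (Monolog 2 passes records as arrays):
// $logger->pushProcessor(function (array $record) {
//     $record['context'] = redactContext($record['context']);
//     return $record;
// });
```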
5. Alerting Thresholds
Set appropriate alerting thresholds based on your specific requirements and historical data.
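Rather than guessing, thresholds can be derived from historical data; for example, set max_response_time near the 95th percentile of past response times so alerts fire only on genuinely unusual slowness. A minimal nearest-rank percentile sketch in plain PHP (the sample data is invented):

```php
<?php

// Nearest-rank percentile: value at the ceil(p/100 * n)-th position of the sorted list.
function percentile(array $values, float $p): float
{
    if (empty($values)) {
        return 0.0;
    }
    sort($values);
    $rank = (int) ceil(($p / 100) * count($values));
    return (float) $values[max(0, $rank - 1)];
}

// Historical response times (seconds) collected from previous sessions.
$history = [0.8, 1.1, 0.9, 4.2, 1.0, 1.3, 0.7, 9.8, 1.2, 1.1];

// Alert when the average response time exceeds the historical p95.
$maxResponseTime = percentile($history, 95);
```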
Conclusion
Implementing comprehensive logging and monitoring for PHP web scraping projects is essential for maintaining reliable, efficient operations. By combining structured logging with real-time monitoring and alerting, you can quickly identify and resolve issues while gaining valuable insights into your scraping performance. Remember to balance the detail level of logging with system performance, and always consider security implications when designing your monitoring strategy.
The combination of PSR-3 compatible logging, performance metrics, error handling, and external monitoring integration provides a robust foundation for production web scraping applications. Regular analysis of logs and metrics will help you optimize your scraping strategies and maintain high success rates over time.