How can I implement logging and monitoring for PHP web scraping projects?

Implementing proper logging and monitoring is crucial for maintaining reliable PHP web scraping projects. Effective monitoring helps you track performance, debug issues, and ensure your scrapers run smoothly in production environments. This guide covers comprehensive logging strategies, monitoring techniques, and best practices for PHP web scraping applications.

Understanding the Importance of Logging and Monitoring

Web scraping operations face unique challenges including network failures, anti-bot measures, rate limiting, and dynamic content changes. Without proper logging and monitoring, these issues can go undetected, leading to data loss and system failures. A robust monitoring system provides:

  • Real-time visibility into scraping operations
  • Error detection and alerting for immediate issue resolution
  • Performance metrics to optimize scraping efficiency
  • Audit trails for compliance and debugging
  • Resource utilization tracking to prevent system overload

Setting Up PSR-3 Compatible Logging

The PHP-FIG PSR-3 logging standard provides a consistent interface for logging across different libraries. Monolog is the most popular PSR-3 compatible logging library for PHP applications.

Installing and Configuring Monolog

composer require monolog/monolog

Create a comprehensive logging configuration:

<?php
use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Monolog\Handler\RotatingFileHandler;
use Monolog\Handler\SlackWebhookHandler;
use Monolog\Formatter\LineFormatter;
use Monolog\Processor\IntrospectionProcessor;
use Monolog\Processor\MemoryUsageProcessor;

class ScrapingLogger
{
    private $logger;

    public function __construct($name = 'web-scraper')
    {
        $this->logger = new Logger($name);

        // Console output for development
        $consoleHandler = new StreamHandler('php://stdout', Logger::DEBUG);
        $consoleHandler->setFormatter(new LineFormatter(
            "[%datetime%] %level_name%: %message% %context%\n"
        ));

        // Rotating file handler for production
        $fileHandler = new RotatingFileHandler(
            '/var/log/scraper/scraper.log',
            7, // Keep 7 days of logs
            Logger::INFO
        );

        // Critical errors to Slack (Monolog 2 constructor signature)
        $slackHandler = new SlackWebhookHandler(
            'YOUR_SLACK_WEBHOOK_URL',
            null,            // channel (use the webhook default)
            'scraper-bot',   // username
            true,            // use attachments
            null,            // icon emoji
            false,           // use short attachment
            true,            // include context and extra
            Logger::ERROR    // minimum level
        );

        // Add processors for additional context
        $this->logger->pushProcessor(new IntrospectionProcessor());
        $this->logger->pushProcessor(new MemoryUsageProcessor());

        $this->logger->pushHandler($consoleHandler);
        $this->logger->pushHandler($fileHandler);
        $this->logger->pushHandler($slackHandler);
    }

    public function getLogger()
    {
        return $this->logger;
    }
}

Implementing Structured Logging for Scraping Operations

Structured logging with consistent context makes log analysis much easier. Create a specialized scraping logger class:

<?php
class ScrapingOperationLogger
{
    private $logger;
    private $sessionId;
    private $startTime;

    public function __construct($logger, $sessionId = null)
    {
        $this->logger = $logger;
        $this->sessionId = $sessionId ?: uniqid('scrape_');
        $this->startTime = microtime(true);
    }

    public function logRequest($url, $method = 'GET', $headers = [])
    {
        $this->logger->info('HTTP Request initiated', [
            'session_id' => $this->sessionId,
            'url' => $url,
            'method' => $method,
            'headers' => $headers,
            'timestamp' => date('c')
        ]);
    }

    public function logResponse($url, $statusCode, $responseTime, $contentLength = null)
    {
        $level = $statusCode >= 400 ? 'error' : 'info';

        $this->logger->log($level, 'HTTP Response received', [
            'session_id' => $this->sessionId,
            'url' => $url,
            'status_code' => $statusCode,
            'response_time_ms' => round($responseTime * 1000, 2),
            'content_length' => $contentLength,
            'timestamp' => date('c')
        ]);
    }

    public function logDataExtraction($url, $extractedCount, $extractionTime)
    {
        $this->logger->info('Data extraction completed', [
            'session_id' => $this->sessionId,
            'url' => $url,
            'extracted_items' => $extractedCount,
            'extraction_time_ms' => round($extractionTime * 1000, 2),
            'timestamp' => date('c')
        ]);
    }

    public function logError($message, $context = [])
    {
        $this->logger->error($message, array_merge($context, [
            'session_id' => $this->sessionId,
            'timestamp' => date('c')
        ]));
    }

    public function logSessionSummary($totalRequests, $successfulRequests, $totalItems)
    {
        $duration = microtime(true) - $this->startTime;
        $successRate = $totalRequests > 0 ? ($successfulRequests / $totalRequests) * 100 : 0;

        $this->logger->info('Scraping session completed', [
            'session_id' => $this->sessionId,
            'total_requests' => $totalRequests,
            'successful_requests' => $successfulRequests,
            'success_rate_percent' => round($successRate, 2),
            'total_items_extracted' => $totalItems,
            'session_duration_seconds' => round($duration, 2),
            'requests_per_minute' => round(($totalRequests / $duration) * 60, 2),
            'timestamp' => date('c')
        ]);
    }
}

Performance Monitoring and Metrics Collection

Implement comprehensive performance monitoring to track scraping efficiency and system health:

<?php
class ScrapingMetrics
{
    private $metrics = [];
    private $startTime;

    public function __construct()
    {
        $this->startTime = microtime(true);
        $this->metrics = [
            'requests_total' => 0,
            'requests_successful' => 0,
            'requests_failed' => 0,
            'response_times' => [],
            'status_codes' => [],
            'memory_usage' => [],
            'items_extracted' => 0,
            'errors' => []
        ];
    }

    public function recordRequest($url, $statusCode, $responseTime, $memoryUsage = null)
    {
        $this->metrics['requests_total']++;

        if ($statusCode >= 200 && $statusCode < 400) {
            $this->metrics['requests_successful']++;
        } else {
            $this->metrics['requests_failed']++;
        }

        $this->metrics['response_times'][] = $responseTime;
        $this->metrics['status_codes'][$statusCode] = 
            ($this->metrics['status_codes'][$statusCode] ?? 0) + 1;

        if ($memoryUsage) {
            $this->metrics['memory_usage'][] = $memoryUsage;
        }
    }

    public function recordDataExtraction($itemCount)
    {
        $this->metrics['items_extracted'] += $itemCount;
    }

    public function recordError($error, $url = null)
    {
        $this->metrics['errors'][] = [
            'error' => $error,
            'url' => $url,
            'timestamp' => date('c')
        ];
    }

    public function getMetricsSummary()
    {
        $duration = microtime(true) - $this->startTime;
        $responseTimes = $this->metrics['response_times'];

        return [
            'session_duration' => round($duration, 2),
            'total_requests' => $this->metrics['requests_total'],
            'successful_requests' => $this->metrics['requests_successful'],
            'failed_requests' => $this->metrics['requests_failed'],
            'success_rate' => $this->calculateSuccessRate(),
            'average_response_time' => $this->calculateAverageResponseTime(),
            'min_response_time' => !empty($responseTimes) ? min($responseTimes) : 0,
            'max_response_time' => !empty($responseTimes) ? max($responseTimes) : 0,
            'requests_per_minute' => round(($this->metrics['requests_total'] / $duration) * 60, 2),
            'items_per_minute' => round(($this->metrics['items_extracted'] / $duration) * 60, 2),
            'status_code_distribution' => $this->metrics['status_codes'],
            'total_items_extracted' => $this->metrics['items_extracted'],
            'total_errors' => count($this->metrics['errors']),
            'memory_usage' => $this->calculateMemoryStats()
        ];
    }

    private function calculateSuccessRate()
    {
        return $this->metrics['requests_total'] > 0 
            ? round(($this->metrics['requests_successful'] / $this->metrics['requests_total']) * 100, 2)
            : 0;
    }

    private function calculateAverageResponseTime()
    {
        $times = $this->metrics['response_times'];
        return !empty($times) ? round(array_sum($times) / count($times), 3) : 0;
    }

    private function calculateMemoryStats()
    {
        $memory = $this->metrics['memory_usage'];
        if (empty($memory)) {
            return ['current' => memory_get_usage(true), 'peak' => memory_get_peak_usage(true)];
        }

        return [
            'average' => round(array_sum($memory) / count($memory)),
            'min' => min($memory),
            'max' => max($memory),
            'current' => memory_get_usage(true),
            'peak' => memory_get_peak_usage(true)
        ];
    }
}

Real-time Monitoring and Alerting

Implement real-time monitoring capabilities to detect issues immediately:

<?php
class ScrapingMonitor
{
    private $logger;
    private $metrics;
    private $thresholds;

    public function __construct($logger, $metrics)
    {
        $this->logger = $logger;
        $this->metrics = $metrics;
        $this->thresholds = [
            'max_response_time' => 10.0, // seconds
            'min_success_rate' => 85.0,  // percentage
            'max_error_rate' => 15.0,    // percentage
            'max_memory_usage' => 512 * 1024 * 1024, // 512MB
            'max_consecutive_failures' => 5
        ];
    }

    public function checkHealthStatus()
    {
        $summary = $this->metrics->getMetricsSummary();
        $alerts = [];

        // Check response time
        if ($summary['average_response_time'] > $this->thresholds['max_response_time']) {
            $alerts[] = [
                'level' => 'warning',
                'message' => 'High average response time detected',
                'value' => $summary['average_response_time'],
                'threshold' => $this->thresholds['max_response_time']
            ];
        }

        // Check success rate
        if ($summary['success_rate'] < $this->thresholds['min_success_rate']) {
            $alerts[] = [
                'level' => 'critical',
                'message' => 'Low success rate detected',
                'value' => $summary['success_rate'],
                'threshold' => $this->thresholds['min_success_rate']
            ];
        }

        // Check memory usage
        $currentMemory = memory_get_usage(true);
        if ($currentMemory > $this->thresholds['max_memory_usage']) {
            $alerts[] = [
                'level' => 'warning',
                'message' => 'High memory usage detected',
                'value' => round($currentMemory / 1024 / 1024, 2) . 'MB',
                'threshold' => round($this->thresholds['max_memory_usage'] / 1024 / 1024, 2) . 'MB'
            ];
        }

        // Send alerts if any issues detected
        foreach ($alerts as $alert) {
            $this->sendAlert($alert);
        }

        $hasCritical = in_array('critical', array_column($alerts, 'level'), true);

        return [
            'status' => empty($alerts) ? 'healthy' : ($hasCritical ? 'critical' : 'warning'),
            'alerts' => $alerts,
            'metrics' => $summary
        ];
    }

    private function sendAlert($alert)
    {
        $level = $alert['level'] === 'critical' ? 'critical' : 'warning';

        $this->logger->log($level, $alert['message'], [
            'current_value' => $alert['value'],
            'threshold' => $alert['threshold'],
            'alert_level' => $alert['level'],
            'timestamp' => date('c')
        ]);
    }
}

Integrating with External Monitoring Services

For production environments, integrate with external monitoring services for comprehensive observability:

<?php
// Example integration with StatsD/Datadog (composer require domnikl/statsd)
class ExternalMetricsReporter
{
    private $statsd;

    public function __construct($statsdHost = 'localhost', $statsdPort = 8125)
    {
        $this->statsd = new \Domnikl\Statsd\Client(
            new \Domnikl\Statsd\Connection\UdpSocket($statsdHost, $statsdPort),
            'webscraper'
        );
    }

    public function reportMetrics($metrics)
    {
        // Counter metrics
        $this->statsd->count('requests.total', $metrics['total_requests']);
        $this->statsd->count('requests.successful', $metrics['successful_requests']);
        $this->statsd->count('requests.failed', $metrics['failed_requests']);
        $this->statsd->count('items.extracted', $metrics['total_items_extracted']);

        // Gauge metrics
        $this->statsd->gauge('performance.success_rate', $metrics['success_rate']);
        $this->statsd->gauge('performance.requests_per_minute', $metrics['requests_per_minute']);
        $this->statsd->gauge('memory.current', $metrics['memory_usage']['current']);

        // Timing metrics
        $this->statsd->timing('response_time.average', $metrics['average_response_time'] * 1000);
        $this->statsd->timing('response_time.max', $metrics['max_response_time'] * 1000);
    }
}

Error Handling and Recovery Logging

Implement comprehensive error handling with detailed logging:

<?php
class ScrapingErrorHandler
{
    private $logger;
    private $retryAttempts = [];

    public function __construct($logger)
    {
        $this->logger = $logger;
    }

    public function handleError($error, $url, $context = [])
    {
        $errorId = uniqid('error_');

        $this->logger->error('Scraping error occurred', [
            'error_id' => $errorId,
            'error_message' => $error->getMessage(),
            'error_code' => $error->getCode(),
            'url' => $url,
            'context' => $context,
            'stack_trace' => $error->getTraceAsString(),
            'timestamp' => date('c')
        ]);

        return $errorId;
    }

    public function logRetryAttempt($url, $attempt, $maxAttempts, $delay)
    {
        $this->retryAttempts[$url] = ($this->retryAttempts[$url] ?? 0) + 1;

        $this->logger->warning('Retry attempt initiated', [
            'url' => $url,
            'attempt' => $attempt,
            'max_attempts' => $maxAttempts,
            'delay_seconds' => $delay,
            'total_retries' => $this->retryAttempts[$url],
            'timestamp' => date('c')
        ]);
    }

    public function logRecovery($url, $finalAttempt)
    {
        $this->logger->info('Request recovered successfully', [
            'url' => $url,
            'successful_attempt' => $finalAttempt,
            'total_retries' => $this->retryAttempts[$url] ?? 0,
            'timestamp' => date('c')
        ]);
    }
}

Console Commands for Log Analysis

Create console commands for analyzing scraping logs:

#!/bin/bash
# analyze_scraping_logs.sh

echo "=== Scraping Session Analysis ==="

# Extract session summaries
echo "Recent Session Summaries:"
grep "Scraping session completed" /var/log/scraper/scraper.log | tail -5

# Calculate average success rate
echo -e "\nAverage Success Rate (previous day):"
grep "$(date -d '1 day ago' '+%Y-%m-%d')" /var/log/scraper/scraper.log | \
grep "success_rate_percent" | \
awk -F'success_rate_percent":' '{print $2}' | \
awk -F',' '{sum+=$1; count++} END {if(count>0) print sum/count"%"}'

# Top error URLs
echo -e "\nTop Error URLs:"
grep "ERROR" /var/log/scraper/scraper.log | \
grep -o '"url":"[^"]*"' | \
sort | uniq -c | sort -nr | head -10

# Memory usage trends
echo -e "\nMemory Usage Trends:"
grep "memory_usage" /var/log/scraper/scraper.log | tail -10

Integration with PHP Web Scraping Workflows

When implementing monitoring in complex scraping scenarios, such as when a PHP scraper delegates JavaScript rendering to Puppeteer and must handle browser errors or timeouts, ensure your logging captures cross-technology interactions and provides comprehensive debugging information.

Practical Implementation Example

<?php
// Complete implementation example. makeHttpRequest() and extractData()
// are placeholders for your application's HTTP client and parsing logic.
$logger = new ScrapingLogger();
$metrics = new ScrapingMetrics();
$monitor = new ScrapingMonitor($logger->getLogger(), $metrics);
$errorHandler = new ScrapingErrorHandler($logger->getLogger());
$operationLogger = new ScrapingOperationLogger($logger->getLogger());

$url = 'https://example.com/products';
$headers = ['User-Agent' => 'MyScraper/1.0'];

try {
    $startTime = microtime(true);

    // Make HTTP request
    $operationLogger->logRequest($url, 'GET', $headers);
    $response = makeHttpRequest($url, $headers);
    $responseTime = microtime(true) - $startTime;

    // Log response and record metrics
    $operationLogger->logResponse($url, $response->getStatusCode(), $responseTime, strlen($response->getBody()));
    $metrics->recordRequest($url, $response->getStatusCode(), $responseTime, memory_get_usage(true));

    // Extract data
    $extractionStart = microtime(true);
    $extractedData = extractData($response->getBody());
    $extractionTime = microtime(true) - $extractionStart;

    $operationLogger->logDataExtraction($url, count($extractedData), $extractionTime);
    $metrics->recordDataExtraction(count($extractedData));

} catch (Exception $e) {
    $errorId = $errorHandler->handleError($e, $url, ['headers' => $headers]);
    $metrics->recordError($e->getMessage(), $url);

    // Implement retry logic with logging
    for ($attempt = 1; $attempt <= 3; $attempt++) {
        $errorHandler->logRetryAttempt($url, $attempt, 3, 5);
        sleep(5);

        try {
            // Re-issue the request here; if it succeeds, log the recovery
            $errorHandler->logRecovery($url, $attempt);
            break;
        } catch (Exception $retryError) {
            if ($attempt === 3) {
                $errorHandler->handleError($retryError, $url, ['final_attempt' => true]);
            }
        }
    }
}

// Check health status and generate alerts
$healthStatus = $monitor->checkHealthStatus();
if ($healthStatus['status'] !== 'healthy') {
    // Handle alerts
    foreach ($healthStatus['alerts'] as $alert) {
        // Send notifications, trigger actions, etc.
    }
}

Best Practices for Production Monitoring

1. Log Rotation and Retention

Configure automatic log rotation to prevent disk space issues:

// Use RotatingFileHandler with appropriate retention
$handler = new RotatingFileHandler('/var/log/scraper.log', 30, Logger::INFO);

2. Structured Context

Always include relevant context in logs for easier analysis and debugging.
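For example, a small context builder can guarantee that every entry carries the same correlating fields. This is a sketch; withBaseContext() is a hypothetical helper name, not part of any library:

```php
<?php
// Hypothetical context builder: merges shared fields (session id, scraper
// version) into every context array so entries can be correlated later.
function withBaseContext(array $base): callable
{
    return fn (array $context = []): array =>
        array_merge($base, $context, ['timestamp' => date('c')]);
}

$ctx = withBaseContext([
    'session_id' => uniqid('scrape_'),
    'scraper_version' => '1.4.0', // example value
]);

// Usage with any PSR-3 logger:
// $logger->info('HTTP Request initiated', $ctx(['url' => $url]));
```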

3. Performance Impact

Monitor the performance impact of logging itself, especially in high-throughput scenarios.
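One way to quantify that impact is a micro-benchmark of the log sink itself. This is a sketch; measureLoggingOverhead() is a hypothetical helper:

```php
<?php
// Hypothetical micro-benchmark: time N calls against a sink to estimate
// per-call logging cost before enabling verbose levels in production.
function measureLoggingOverhead(callable $logFn, int $iterations = 1000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $logFn("benchmark message $i");
    }
    return (microtime(true) - $start) / $iterations * 1000; // ms per call
}

// Compare a no-op sink against a real write:
$stream = fopen('php://temp', 'w');
printf("no-op: %.4f ms/call\n", measureLoggingOverhead(fn ($m) => null));
printf("write: %.4f ms/call\n", measureLoggingOverhead(
    fn ($m) => fwrite($stream, $m . "\n")
));
```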

4. Security Considerations

Never log sensitive information like passwords, API keys, or personal data.
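A simple safeguard is to redact known-sensitive keys before the context reaches any handler. This is a sketch; redactSensitive() and its key list are assumptions to adapt for your application:

```php
<?php
// Hypothetical redaction helper: recursively masks values whose keys look
// sensitive. The key list is an assumption; extend it as needed.
function redactSensitive(
    array $context,
    array $sensitiveKeys = ['password', 'api_key', 'authorization', 'cookie', 'token']
): array {
    foreach ($context as $key => $value) {
        if (is_array($value)) {
            $context[$key] = redactSensitive($value, $sensitiveKeys);
        } elseif (in_array(strtolower((string) $key), $sensitiveKeys, true)) {
            $context[$key] = '***REDACTED***';
        }
    }
    return $context;
}

// Usage: $logger->info('Request sent', redactSensitive($context));
```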

5. Alerting Thresholds

Set appropriate alerting thresholds based on your specific requirements and historical data.
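Rather than hard-coding limits like the ones in ScrapingMonitor above, you can derive them from historical samples. The sketch below uses a percentile; percentile() is a hypothetical helper using linear interpolation:

```php
<?php
// Hypothetical helper: linear-interpolated percentile over a sample set.
function percentile(array $samples, float $pct): float
{
    sort($samples);
    $rank = ($pct / 100) * (count($samples) - 1);
    $low = (int) floor($rank);
    $high = (int) ceil($rank);
    return $samples[$low] + ($rank - $low) * ($samples[$high] - $samples[$low]);
}

// Historical response times in seconds (example data):
$history = [0.4, 0.5, 0.6, 0.7, 0.9, 1.1, 1.4, 2.0, 3.5, 8.2];

// Alert when the average response time exceeds the historical 95th percentile:
$thresholds['max_response_time'] = percentile($history, 95.0);
```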

Conclusion

Implementing comprehensive logging and monitoring for PHP web scraping projects is essential for maintaining reliable, efficient operations. By combining structured logging with real-time monitoring and alerting, you can quickly identify and resolve issues while gaining valuable insights into your scraping performance. Remember to balance the detail level of logging with system performance, and always consider security implications when designing your monitoring strategy.

The combination of PSR-3 compatible logging, performance metrics, error handling, and external monitoring integration provides a robust foundation for production web scraping applications. Regular analysis of logs and metrics will help you optimize your scraping strategies and maintain high success rates over time.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question" -G \
  --data-urlencode "url=https://example.com" \
  --data-urlencode "question=What is the main topic?" \
  --data-urlencode "api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields" -G \
  --data-urlencode "url=https://example.com" \
  --data-urlencode "fields[title]=Page title" \
  --data-urlencode "fields[price]=Product price" \
  --data-urlencode "api_key=YOUR_API_KEY"
