How can I monitor HTTP performance metrics in web scraping?
Monitoring HTTP performance metrics is crucial for maintaining efficient and reliable web scraping operations. By tracking key performance indicators like response times, throughput, error rates, and resource utilization, you can identify bottlenecks, optimize your scraping strategies, and ensure consistent data collection. This guide covers comprehensive approaches to implementing performance monitoring in your web scraping applications.
Key HTTP Performance Metrics to Monitor
Response Time Metrics
- Total Response Time: Complete request-response cycle duration
- DNS Lookup Time: Time to resolve domain names
- Connection Time: Time to establish TCP connection
- SSL Handshake Time: Time for SSL/TLS negotiation
- Time to First Byte (TTFB): Time until first response byte received
- Content Download Time: Time to download response body
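Each of these phases can be measured directly from libcurl's timing counters. Here is a minimal sketch using pycurl (assuming pycurl is installed; the URL is only a placeholder):

import pycurl
from io import BytesIO

def timing_breakdown(url):
    """Return libcurl's per-phase timings (seconds) for a single GET request."""
    buffer = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEDATA, buffer)
    c.setopt(pycurl.FOLLOWLOCATION, True)
    c.perform()
    # Note: libcurl reports these as cumulative offsets from the start of the
    # transfer, so subtract adjacent values to get per-phase durations.
    timings = {
        'dns_lookup': c.getinfo(pycurl.NAMELOOKUP_TIME),
        'tcp_connect': c.getinfo(pycurl.CONNECT_TIME),
        'ssl_handshake_done': c.getinfo(pycurl.APPCONNECT_TIME),
        'time_to_first_byte': c.getinfo(pycurl.STARTTRANSFER_TIME),
        'total': c.getinfo(pycurl.TOTAL_TIME),
    }
    c.close()
    return timings

print(timing_breakdown('https://httpbin.org/get'))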
Throughput Metrics
- Requests per Second (RPS): Number of requests processed per second
- Data Transfer Rate: Bytes downloaded per second
- Concurrent Connections: Number of simultaneous connections
- Queue Length: Number of pending requests
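Throughput is easiest to reason about over a sliding window rather than since process start. A small sketch (the class name and window size are illustrative, not part of any library):

import time
from collections import deque

class ThroughputTracker:
    """Tracks requests per second and bytes per second over a sliding window."""
    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = deque()  # (timestamp, bytes_downloaded)

    def record(self, num_bytes):
        now = time.time()
        self.events.append((now, num_bytes))
        # Drop events that have fallen out of the window
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()

    def requests_per_second(self):
        return len(self.events) / self.window

    def bytes_per_second(self):
        return sum(b for _, b in self.events) / self.window

tracker = ThroughputTracker(window_seconds=10)
tracker.record(2048)
print(tracker.requests_per_second(), tracker.bytes_per_second())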
Error and Reliability Metrics
- Error Rate: Percentage of failed requests
- Status Code Distribution: Breakdown of HTTP response codes
- Timeout Rate: Percentage of requests that timeout
- Retry Success Rate: Success rate of retried requests
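Timeout and retry outcomes are easy to lose track of because they happen outside normal responses. A rough sketch of counting them around a retry loop (class, method, and threshold names are illustrative):

import requests

class ReliabilityTracker:
    """Counts timeouts and retry outcomes alongside ordinary failures."""
    def __init__(self):
        self.total = self.timeouts = self.retries = self.retry_successes = 0

    def fetch_with_retry(self, session, url, max_retries=3, timeout=10):
        for attempt in range(max_retries + 1):
            self.total += 1
            if attempt > 0:
                self.retries += 1
            try:
                response = session.get(url, timeout=timeout)
                if attempt > 0:
                    self.retry_successes += 1
                return response
            except requests.Timeout:
                self.timeouts += 1
            except requests.RequestException:
                pass  # other network errors still count toward the total
        return None

    def stats(self):
        return {
            'timeout_rate': self.timeouts / self.total if self.total else 0,
            'retry_success_rate': self.retry_successes / self.retries if self.retries else 0,
        }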
Python Implementation with Requests and Monitoring
Here's a Python implementation using the requests library with built-in performance monitoring. Note that requests does not expose per-phase timings, so the DNS, connect, and TLS fields are placeholders here; a sketch that captures them with aiohttp follows the example.
import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from collections import defaultdict
import threading
from dataclasses import dataclass
from typing import Dict, List, Optional
import statistics
@dataclass
class RequestMetrics:
url: str
status_code: int
response_time: float
dns_lookup_time: float
connect_time: float
ssl_handshake_time: float
ttfb: float
content_length: int
error: Optional[str] = None
class PerformanceMonitor:
def __init__(self):
self.metrics: List[RequestMetrics] = []
self.error_counts = defaultdict(int)
self.status_codes = defaultdict(int)
self.lock = threading.Lock()
def record_request(self, metrics: RequestMetrics):
with self.lock:
self.metrics.append(metrics)
self.status_codes[metrics.status_code] += 1
if metrics.error:
self.error_counts[metrics.error] += 1
def get_stats(self) -> Dict:
with self.lock:
if not self.metrics:
return {}
response_times = [m.response_time for m in self.metrics]
ttfb_times = [m.ttfb for m in self.metrics]
return {
'total_requests': len(self.metrics),
'avg_response_time': statistics.mean(response_times),
'median_response_time': statistics.median(response_times),
'p95_response_time': statistics.quantiles(response_times, n=20)[18],
'avg_ttfb': statistics.mean(ttfb_times),
'error_rate': sum(self.error_counts.values()) / len(self.metrics),
'status_code_distribution': dict(self.status_codes),
'total_data_transferred': sum(m.content_length for m in self.metrics)
}
class MonitoredSession(requests.Session):
def __init__(self, monitor: PerformanceMonitor):
super().__init__()
self.monitor = monitor
# Configure retry strategy
retry_strategy = Retry(
total=3,
backoff_factor=1,
status_forcelist=[429, 500, 502, 503, 504],
)
adapter = HTTPAdapter(max_retries=retry_strategy)
self.mount("http://", adapter)
self.mount("https://", adapter)
def request(self, method, url, **kwargs):
start_time = time.time()
error = None
try:
            # requests exposes only the time until headers are parsed (response.elapsed);
            # per-phase timings would require instrumenting the underlying urllib3 connection
            response = super().request(method, url, **kwargs)
# Extract timing information (simplified - real implementation would use urllib3 hooks)
metrics = RequestMetrics(
url=url,
status_code=response.status_code,
                response_time=time.time() - start_time,  # full wall-clock time including download
dns_lookup_time=0, # Would extract from urllib3 connection
connect_time=0, # Would extract from urllib3 connection
ssl_handshake_time=0, # Would extract from urllib3 connection
                ttfb=response.elapsed.total_seconds(),  # requests' elapsed stops once headers are parsed
content_length=len(response.content),
error=None if response.ok else f"HTTP {response.status_code}"
)
except Exception as e:
error = str(e)
metrics = RequestMetrics(
url=url,
status_code=0,
response_time=time.time() - start_time,
dns_lookup_time=0,
connect_time=0,
ssl_handshake_time=0,
ttfb=0,
content_length=0,
error=error
)
raise
finally:
self.monitor.record_request(metrics)
return response
# Usage example
monitor = PerformanceMonitor()
session = MonitoredSession(monitor)
# Perform scraping requests
urls = ['https://httpbin.org/delay/1', 'https://httpbin.org/status/200']
for url in urls:
try:
response = session.get(url, timeout=10)
print(f"Scraped {url}: {response.status_code}")
except Exception as e:
print(f"Error scraping {url}: {e}")
# Print performance statistics
stats = monitor.get_stats()
print("Performance Statistics:")
for key, value in stats.items():
print(f" {key}: {value}")
JavaScript/Node.js Implementation with Axios
Here's a Node.js implementation using Axios with comprehensive performance monitoring:
const axios = require('axios');
const { performance } = require('perf_hooks');
class HTTPPerformanceMonitor {
constructor() {
this.metrics = [];
this.errorCounts = new Map();
this.statusCodes = new Map();
}
recordRequest(metrics) {
this.metrics.push(metrics);
const statusCode = metrics.statusCode || 0;
this.statusCodes.set(statusCode, (this.statusCodes.get(statusCode) || 0) + 1);
if (metrics.error) {
this.errorCounts.set(metrics.error, (this.errorCounts.get(metrics.error) || 0) + 1);
}
}
getStats() {
if (this.metrics.length === 0) return {};
const responseTimes = this.metrics.map(m => m.responseTime);
responseTimes.sort((a, b) => a - b);
return {
totalRequests: this.metrics.length,
avgResponseTime: responseTimes.reduce((a, b) => a + b, 0) / responseTimes.length,
medianResponseTime: responseTimes[Math.floor(responseTimes.length / 2)],
p95ResponseTime: responseTimes[Math.floor(responseTimes.length * 0.95)],
errorRate: Array.from(this.errorCounts.values()).reduce((a, b) => a + b, 0) / this.metrics.length,
statusCodeDistribution: Object.fromEntries(this.statusCodes),
totalDataTransferred: this.metrics.reduce((sum, m) => sum + (m.contentLength || 0), 0)
};
}
}
class MonitoredAxiosClient {
constructor(monitor) {
this.monitor = monitor;
this.client = axios.create({
timeout: 10000,
maxRedirects: 5
});
// Add request interceptor for timing
this.client.interceptors.request.use(
(config) => {
config.metadata = { startTime: performance.now() };
return config;
},
(error) => Promise.reject(error)
);
// Add response interceptor for metrics collection
this.client.interceptors.response.use(
(response) => {
this.recordMetrics(response, null);
return response;
},
(error) => {
this.recordMetrics(error.response, error);
return Promise.reject(error);
}
);
}
recordMetrics(response, error) {
const endTime = performance.now();
const config = error ? error.config : response.config;
const startTime = config?.metadata?.startTime || endTime;
const metrics = {
url: config?.url || 'unknown',
statusCode: response?.status || 0,
responseTime: endTime - startTime,
contentLength: this.getContentLength(response),
error: error ? error.message : null,
timestamp: new Date().toISOString()
};
this.monitor.recordRequest(metrics);
}
getContentLength(response) {
if (!response) return 0;
// Try to get content length from various sources
const contentLength = response.headers['content-length'];
if (contentLength) return parseInt(contentLength, 10);
// Estimate from response data
if (response.data) {
if (typeof response.data === 'string') return response.data.length;
if (Buffer.isBuffer(response.data)) return response.data.length;
return JSON.stringify(response.data).length;
}
return 0;
}
async get(url, config = {}) {
return this.client.get(url, config);
}
async post(url, data, config = {}) {
return this.client.post(url, data, config);
}
}
// Usage example
async function performScrapingWithMonitoring() {
const monitor = new HTTPPerformanceMonitor();
const client = new MonitoredAxiosClient(monitor);
const urls = [
'https://httpbin.org/delay/1',
'https://httpbin.org/json',
'https://httpbin.org/status/200'
];
// Perform concurrent requests
const promises = urls.map(async (url) => {
try {
const response = await client.get(url);
console.log(`Scraped ${url}: ${response.status}`);
return response;
} catch (error) {
console.error(`Error scraping ${url}:`, error.message);
}
});
await Promise.allSettled(promises);
// Print performance statistics
const stats = monitor.getStats();
console.log('\nPerformance Statistics:');
console.log(JSON.stringify(stats, null, 2));
}
performScrapingWithMonitoring();
Advanced Monitoring with Custom Instrumentation
For more detailed performance monitoring, you can implement custom instrumentation that tracks specific aspects of your scraping operations:
import time
import psutil
import threading
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, Any
import json
@dataclass
class SystemMetrics:
cpu_percent: float
memory_percent: float
network_io: Dict[str, int]
disk_io: Dict[str, int]
timestamp: float = field(default_factory=time.time)
class ComprehensiveMonitor:
def __init__(self):
self.request_metrics = []
self.system_metrics = []
self.active_requests = 0
self.lock = threading.Lock()
# Start system monitoring thread
self.monitoring = True
self.monitor_thread = threading.Thread(target=self._monitor_system)
self.monitor_thread.daemon = True
self.monitor_thread.start()
def _monitor_system(self):
"""Background thread to collect system metrics"""
while self.monitoring:
try:
cpu_percent = psutil.cpu_percent(interval=1)
memory = psutil.virtual_memory()
network = psutil.net_io_counters()
disk = psutil.disk_io_counters()
metrics = SystemMetrics(
cpu_percent=cpu_percent,
memory_percent=memory.percent,
network_io={
'bytes_sent': network.bytes_sent,
'bytes_recv': network.bytes_recv
},
disk_io={
'read_bytes': disk.read_bytes,
'write_bytes': disk.write_bytes
}
)
with self.lock:
self.system_metrics.append(metrics)
# Keep only last 100 measurements
if len(self.system_metrics) > 100:
self.system_metrics.pop(0)
except Exception as e:
print(f"System monitoring error: {e}")
time.sleep(5) # Collect metrics every 5 seconds
@contextmanager
def track_request(self, url: str):
"""Context manager to track individual request performance"""
start_time = time.time()
with self.lock:
self.active_requests += 1
try:
yield
finally:
end_time = time.time()
with self.lock:
self.active_requests -= 1
self.request_metrics.append({
'url': url,
'duration': end_time - start_time,
'timestamp': start_time,
'concurrent_requests': self.active_requests + 1
})
def get_comprehensive_stats(self) -> Dict[str, Any]:
with self.lock:
if not self.request_metrics:
return {'error': 'No request data available'}
# Calculate request statistics
            durations = [r['duration'] for r in self.request_metrics]
            concurrent_counts = [r['concurrent_requests'] for r in self.request_metrics]
            # Wall-clock span from the first request's start to the last request's end,
            # used below to compute a meaningful requests-per-second figure
            elapsed = (self.request_metrics[-1]['timestamp'] + self.request_metrics[-1]['duration']
                       - self.request_metrics[0]['timestamp'])
# Calculate system resource usage during scraping
if self.system_metrics:
avg_cpu = sum(m.cpu_percent for m in self.system_metrics) / len(self.system_metrics)
avg_memory = sum(m.memory_percent for m in self.system_metrics) / len(self.system_metrics)
peak_cpu = max(m.cpu_percent for m in self.system_metrics)
peak_memory = max(m.memory_percent for m in self.system_metrics)
else:
avg_cpu = avg_memory = peak_cpu = peak_memory = 0
return {
'request_stats': {
'total_requests': len(self.request_metrics),
'avg_duration': sum(durations) / len(durations),
'max_duration': max(durations),
'min_duration': min(durations),
'avg_concurrent': sum(concurrent_counts) / len(concurrent_counts),
'max_concurrent': max(concurrent_counts)
},
'system_stats': {
'avg_cpu_percent': avg_cpu,
'peak_cpu_percent': peak_cpu,
'avg_memory_percent': avg_memory,
'peak_memory_percent': peak_memory
},
'throughput': {
                'requests_per_second': len(self.request_metrics) / elapsed if elapsed > 0 else 0,
'active_requests': self.active_requests
}
}
def stop_monitoring(self):
self.monitoring = False
if self.monitor_thread.is_alive():
self.monitor_thread.join()
# Usage example
monitor = ComprehensiveMonitor()
def scrape_with_monitoring(urls):
"""Example scraping function with comprehensive monitoring"""
import requests
session = requests.Session()
session.headers.update({'User-Agent': 'Performance-Monitored-Scraper'})
for url in urls:
with monitor.track_request(url):
try:
response = session.get(url, timeout=10)
print(f"Scraped {url}: {response.status_code} ({len(response.content)} bytes)")
time.sleep(0.1) # Simulate processing time
except Exception as e:
print(f"Error scraping {url}: {e}")
# Run scraping with monitoring
urls = [
'https://httpbin.org/json',
'https://httpbin.org/html',
'https://httpbin.org/xml'
] * 10 # Scrape each URL 10 times
scrape_with_monitoring(urls)
# Get comprehensive statistics
stats = monitor.get_comprehensive_stats()
print("\nComprehensive Performance Statistics:")
print(json.dumps(stats, indent=2))
# Stop monitoring
monitor.stop_monitoring()
Integration with Monitoring Tools
Prometheus Integration
For production environments, integrate with monitoring systems like Prometheus:
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
import requests
# Define Prometheus metrics
http_requests_total = Counter('http_requests_total', 'Total HTTP requests', ['method', 'status'])
http_request_duration = Histogram('http_request_duration_seconds', 'HTTP request duration')
active_scraping_sessions = Gauge('active_scraping_sessions', 'Number of active scraping sessions')
class PrometheusMonitoredScraper:
def __init__(self):
self.session = requests.Session()
active_scraping_sessions.inc()
def scrape(self, url, method='GET'):
start_time = time.time()
try:
if method.upper() == 'GET':
response = self.session.get(url)
else:
response = self.session.post(url)
# Record metrics
http_requests_total.labels(method=method, status=response.status_code).inc()
http_request_duration.observe(time.time() - start_time)
return response
except Exception as e:
http_requests_total.labels(method=method, status='error').inc()
http_request_duration.observe(time.time() - start_time)
raise
def __del__(self):
active_scraping_sessions.dec()
# Start Prometheus metrics server
start_http_server(8000)
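A short run with the scraper defined above might look like this; once the process is running, the counters are exposed for Prometheus to scrape at http://localhost:8000/metrics (the URLs are just examples):

scraper = PrometheusMonitoredScraper()

for url in ['https://httpbin.org/status/200', 'https://httpbin.org/json']:
    try:
        scraper.scrape(url)
    except Exception as e:
        print(f"Scrape failed for {url}: {e}")

# Point a Prometheus scrape_config at this host and port, or inspect the
# counters manually with: curl http://localhost:8000/metrics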
Performance Optimization Based on Metrics
Use the collected metrics to adjust your scraping strategy; the same principles apply when monitoring network requests in Puppeteer for browser-based scraping, covered later in this guide:
def optimize_scraping_based_on_metrics(monitor: PerformanceMonitor):
"""Adjust scraping parameters based on performance metrics"""
stats = monitor.get_stats()
recommendations = []
# Check response time performance
if stats.get('avg_response_time', 0) > 5.0:
recommendations.append("Consider reducing timeout values or using faster endpoints")
# Check error rate
error_rate = stats.get('error_rate', 0)
if error_rate > 0.1: # More than 10% error rate
recommendations.append("High error rate detected - implement retry logic or rate limiting")
# Check for rate limiting (429 status codes)
status_dist = stats.get('status_code_distribution', {})
if status_dist.get(429, 0) > 0:
recommendations.append("Rate limiting detected - implement exponential backoff")
# Memory usage optimization
total_data = stats.get('total_data_transferred', 0)
if total_data > 100 * 1024 * 1024: # More than 100MB
recommendations.append("High data transfer - consider request filtering or compression")
return recommendations
# Example usage
recommendations = optimize_scraping_based_on_metrics(monitor)
for rec in recommendations:
print(f"Optimization recommendation: {rec}")
Monitoring Browser-Based Scraping
For browser-based scraping with tools like Puppeteer, you can implement performance monitoring at the page level:
const puppeteer = require('puppeteer');
class PuppeteerPerformanceMonitor {
constructor() {
this.pageMetrics = [];
this.networkMetrics = [];
}
async monitorPage(page) {
const startTime = Date.now();
// Monitor network requests
page.on('request', (request) => {
request.startTime = Date.now();
});
page.on('response', (response) => {
const request = response.request();
if (request.startTime) {
this.networkMetrics.push({
url: response.url(),
status: response.status(),
responseTime: Date.now() - request.startTime,
contentLength: response.headers()['content-length'] || 0,
resourceType: request.resourceType()
});
}
});
// Monitor page performance
await page.evaluateOnNewDocument(() => {
window.addEventListener('load', () => {
const perfData = performance.getEntriesByType('navigation')[0];
window.__pageMetrics = {
domContentLoaded: perfData.domContentLoadedEventEnd - perfData.domContentLoadedEventStart,
loadComplete: perfData.loadEventEnd - perfData.loadEventStart,
totalLoadTime: perfData.loadEventEnd - perfData.fetchStart
};
});
});
return {
getMetrics: async () => {
const pageMetrics = await page.evaluate(() => window.__pageMetrics || {});
return {
pageMetrics,
networkMetrics: this.networkMetrics,
totalNetworkRequests: this.networkMetrics.length,
avgNetworkResponseTime: this.networkMetrics.reduce((sum, m) => sum + m.responseTime, 0) / this.networkMetrics.length
};
}
};
}
}
// Usage example
async function scrapeWithPuppeteerMonitoring() {
const browser = await puppeteer.launch();
const page = await browser.newPage();
const monitor = new PuppeteerPerformanceMonitor();
const { getMetrics } = await monitor.monitorPage(page);
  // 'networkidle2' is Puppeteer's waitUntil option for waiting until network activity settles
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });
const metrics = await getMetrics();
console.log('Page Performance Metrics:', metrics);
await browser.close();
}
Best Practices for HTTP Performance Monitoring
Establish Baselines: Record performance metrics during normal operations to establish baseline performance levels.
Monitor Continuously: Implement continuous monitoring rather than periodic checks to catch performance degradation early.
Set Alerting Thresholds: Configure alerts for critical metrics like error rates above 5% or response times exceeding acceptable limits.
Track Business Metrics: Monitor not just technical metrics but also business-relevant indicators like data freshness and scraping completeness.
Use Circuit Breakers: Implement circuit breaker patterns to prevent cascading failures when performance degrades; a minimal sketch follows this list.
Optimize Based on Data: Regularly analyze performance data to identify optimization opportunities and bottlenecks.
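The circuit-breaker pattern mentioned above needs only a little state. A minimal sketch (the class name, thresholds, and cooldown are arbitrary examples, not a specific library's API):

import time

class CircuitBreaker:
    """Stops sending requests after repeated failures, then retries after a cooldown."""
    def __init__(self, failure_threshold=5, reset_timeout=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (requests allowed)

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: allow a probe request once the cooldown period has passed
        return time.time() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()

# Usage inside a scraping loop (requests session assumed):
#     if breaker.allow_request():
#         try:
#             response = session.get(url, timeout=10)
#             breaker.record_success()
#         except Exception:
#             breaker.record_failure()
#     else:
#         time.sleep(1)  # circuit open: back off instead of hammering the target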
For browser-based scraping scenarios, especially when handling timeouts in Puppeteer, apply similar monitoring principles to track page load times, JavaScript execution duration, and resource loading performance.
Console Commands for Performance Analysis
Monitor your scraping applications using these useful console commands:
# Monitor network connections and bandwidth usage
netstat -i 1
# Track HTTP response times with curl
curl -w "@curl-format.txt" -o /dev/null -s "http://example.com"
# Monitor system resources during scraping
top -p $(pgrep -d, -f python)
# Analyze HTTP traffic with tcpdump
tcpdump -i any -s 0 -w scraping-traffic.pcap port 80 or port 443
# Real-time monitoring with htop
htop -p $(pgrep -d, -f "your-scraper-process")
Create a curl timing format file:
# Create curl-format.txt
cat > curl-format.txt << 'EOF'
time_namelookup: %{time_namelookup}\n
time_connect: %{time_connect}\n
time_appconnect: %{time_appconnect}\n
time_pretransfer: %{time_pretransfer}\n
time_redirect: %{time_redirect}\n
time_starttransfer: %{time_starttransfer}\n
----------\n
time_total: %{time_total}\n
EOF
Conclusion
Effective HTTP performance monitoring is essential for maintaining robust web scraping operations. By implementing comprehensive metrics collection, analyzing performance data, and optimizing based on insights, you can ensure your scraping applications remain efficient, reliable, and scalable. The monitoring strategies and code examples provided in this guide offer a solid foundation for building production-ready performance monitoring systems for your web scraping projects.
Remember to regularly review and update your monitoring approach as your scraping requirements evolve, and always consider the impact of monitoring overhead on your application's performance.