How do you implement API logging for debugging scraping issues?
API logging is crucial for debugging web scraping applications, providing visibility into request/response cycles, error patterns, and performance bottlenecks. Proper logging helps identify issues like rate limiting, authentication failures, and data extraction problems before they impact your scraping operations.
Understanding API Logging Fundamentals
API logging involves capturing detailed information about HTTP requests and responses, including headers, status codes, response times, and error messages. This data becomes invaluable when troubleshooting scraping issues or optimizing performance.
Key Components to Log
- Request details: URL, method, headers, parameters, body
- Response information: Status code, headers, body size, content type
- Timing metrics: Request duration, connection time, DNS lookup time
- Error information: Exception details, stack traces, retry attempts
- Context data: User agent, proxy information, session details
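To make these components concrete, here is a minimal sketch (standard library only; the field names are illustrative rather than a fixed schema) of a single structured log record that covers the items above:

import json
import time
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class RequestLogRecord:
    # Request details
    method: str
    url: str
    headers: dict = field(default_factory=dict)
    # Response information
    status_code: Optional[int] = None
    response_bytes: Optional[int] = None
    content_type: Optional[str] = None
    # Timing metrics (seconds)
    duration: Optional[float] = None
    # Error information
    error: Optional[str] = None
    retry_count: int = 0
    # Context data
    user_agent: Optional[str] = None
    proxy: Optional[str] = None
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

A record like this can be serialized to one JSON line per request, which keeps the log easy to grep and easy to load into analysis tools later.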
Python Implementation with Requests Library
Here's a comprehensive logging implementation using Python's requests library:
import logging
import requests
import time
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('scraping.log'),
        logging.StreamHandler()
    ]
)

class ScrapingLogger:
    def __init__(self, name='web_scraper'):
        self.logger = logging.getLogger(name)

    def log_request(self, method, url, headers=None, params=None, data=None):
        """Log outgoing request details"""
        self.logger.info(f"REQUEST: {method} {url}")
        if headers:
            self.logger.debug(f"Headers: {headers}")
        if params:
            self.logger.debug(f"Params: {params}")
        if data:
            self.logger.debug(f"Data: {str(data)[:200]}...")

    def log_response(self, response, duration):
        """Log response details and metrics"""
        self.logger.info(
            f"RESPONSE: {response.status_code} {response.reason} "
            f"({duration:.2f}s) - {len(response.content)} bytes"
        )
        self.logger.debug(f"Response Headers: {dict(response.headers)}")
        if response.status_code >= 400:
            self.logger.error(f"Error Response Body: {response.text[:500]}")

    def log_error(self, error, url):
        """Log errors with context"""
        self.logger.error(f"Error scraping {url}: {str(error)}", exc_info=True)

class LoggingSession:
    def __init__(self):
        self.session = requests.Session()
        self.logger = ScrapingLogger()
        # Configure retry strategy
        retry_strategy = Retry(
            total=3,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504]
        )
        adapter = HTTPAdapter(max_retries=retry_strategy)
        self.session.mount("http://", adapter)
        self.session.mount("https://", adapter)

    def request(self, method, url, **kwargs):
        """Make request with comprehensive logging"""
        start_time = time.time()
        # Log request
        self.logger.log_request(
            method, url,
            headers=kwargs.get('headers'),
            params=kwargs.get('params'),
            data=kwargs.get('data')
        )
        try:
            response = self.session.request(method, url, **kwargs)
            duration = time.time() - start_time
            # Log response
            self.logger.log_response(response, duration)
            return response
        except Exception as e:
            duration = time.time() - start_time
            self.logger.log_error(e, url)
            raise

# Usage example
scraper = LoggingSession()

try:
    response = scraper.request(
        'GET',
        'https://api.example.com/data',
        headers={'User-Agent': 'MyScraperBot/1.0'},
        params={'page': 1, 'limit': 50}
    )
    if response.status_code == 200:
        data = response.json()
        logging.info(f"Successfully extracted {len(data)} items")
except requests.exceptions.RequestException as e:
    logging.error(f"Scraping failed: {e}")
JavaScript/Node.js Implementation
For JavaScript applications, here's a logging implementation using Axios:
const axios = require('axios');
const winston = require('winston');
const { performance } = require('perf_hooks');

// Configure Winston logger
const logger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  transports: [
    new winston.transports.File({ filename: 'scraping-error.log', level: 'error' }),
    new winston.transports.File({ filename: 'scraping.log' }),
    new winston.transports.Console()
  ]
});

class ScrapingClient {
  constructor() {
    this.client = axios.create({
      timeout: 30000,
      maxRedirects: 5
    });
    this.setupInterceptors();
  }

  setupInterceptors() {
    // Request interceptor
    this.client.interceptors.request.use(
      (config) => {
        config.metadata = { startTime: performance.now() };
        logger.info('API Request', {
          method: config.method.toUpperCase(),
          url: config.url,
          headers: config.headers,
          params: config.params
        });
        return config;
      },
      (error) => {
        logger.error('Request Error', { error: error.message });
        return Promise.reject(error);
      }
    );

    // Response interceptor
    this.client.interceptors.response.use(
      (response) => {
        const duration = performance.now() - response.config.metadata.startTime;
        logger.info('API Response', {
          status: response.status,
          statusText: response.statusText,
          url: response.config.url,
          duration: `${duration.toFixed(2)}ms`,
          size: JSON.stringify(response.data).length,
          headers: response.headers
        });
        return response;
      },
      (error) => {
        const duration = error.config && error.config.metadata ?
          performance.now() - error.config.metadata.startTime : 0;
        logger.error('API Error', {
          message: error.message,
          status: error.response?.status,
          statusText: error.response?.statusText,
          url: error.config?.url,
          duration: `${duration.toFixed(2)}ms`,
          responseData: error.response?.data
        });
        return Promise.reject(error);
      }
    );
  }

  async scrapeData(url, options = {}) {
    try {
      const response = await this.client.get(url, options);
      logger.info('Scraping Success', {
        url,
        dataLength: response.data.length,
        contentType: response.headers['content-type']
      });
      return response.data;
    } catch (error) {
      logger.error('Scraping Failed', {
        url,
        error: error.message,
        stack: error.stack
      });
      throw error;
    }
  }
}

// Usage example
const scraper = new ScrapingClient();

(async () => {
  try {
    const data = await scraper.scrapeData('https://api.example.com/data', {
      headers: { 'User-Agent': 'NodeScraperBot/1.0' },
      params: { page: 1, limit: 50 }
    });
    logger.info(`Successfully scraped ${data.length} items`);
  } catch (error) {
    logger.error('Main scraping process failed', { error: error.message });
  }
})();
Advanced Logging Strategies
Structured Logging with Context
Implement structured logging to make log analysis easier:
import uuid
import structlog

# Configure structured logging
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ],
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
    cache_logger_on_first_use=True,
)

def generate_session_id():
    """Create a unique identifier for this scraping session"""
    return uuid.uuid4().hex

class ContextualLogger:
    def __init__(self, scraper_id, target_site):
        self.logger = structlog.get_logger()
        self.context = {
            'scraper_id': scraper_id,
            'target_site': target_site,
            'session_id': generate_session_id()
        }

    def log_with_context(self, level, message, **kwargs):
        log_data = {**self.context, **kwargs}
        getattr(self.logger, level)(message, **log_data)

    def log_request_cycle(self, request_data, response_data, metrics):
        self.log_with_context('info', 'request_cycle_complete',
            request=request_data,
            response=response_data,
            metrics=metrics
        )
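A possible way to use this class (the scraper ID and site values are placeholders) is to create one ContextualLogger per scraping job so that every event automatically carries the same identifiers:

ctx_logger = ContextualLogger(scraper_id='scraper-01', target_site='api.example.com')

ctx_logger.log_request_cycle(
    request_data={'method': 'GET', 'url': 'https://api.example.com/data', 'params': {'page': 1}},
    response_data={'status_code': 200, 'bytes': 18432},
    metrics={'duration_seconds': 0.82, 'retries': 0}
)

# Ad-hoc events reuse the same bound context automatically
ctx_logger.log_with_context('warning', 'rate_limit_header_seen', remaining_requests=3)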
Performance Monitoring Integration
Combine logging with performance monitoring for comprehensive debugging:
import logging
import threading
import time
from datetime import datetime

import psutil

class PerformanceLogger:
    def __init__(self):
        self.logger = logging.getLogger('performance')
        self.start_monitoring()

    def start_monitoring(self):
        """Start background performance monitoring"""
        def monitor():
            while True:
                memory_usage = psutil.virtual_memory().percent
                cpu_usage = psutil.cpu_percent()
                if memory_usage > 80 or cpu_usage > 90:
                    self.logger.warning('High resource usage detected',
                        extra={
                            'memory_percent': memory_usage,
                            'cpu_percent': cpu_usage,
                            'timestamp': datetime.now().isoformat()
                        })
                time.sleep(30)

        monitor_thread = threading.Thread(target=monitor, daemon=True)
        monitor_thread.start()

    def log_scraping_metrics(self, url, start_time, end_time, success):
        duration = end_time - start_time
        self.logger.info('Scraping metrics',
            extra={
                'url': url,
                'duration_seconds': duration,
                'success': success,
                'memory_usage': psutil.virtual_memory().percent,
                'timestamp': datetime.now().isoformat()
            })
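One way to combine this with the LoggingSession shown earlier (a sketch, assuming both classes live in or are imported into the same module) is to record per-request metrics alongside the request/response logs:

perf_logger = PerformanceLogger()
scraper = LoggingSession()

start = time.time()
success = False
try:
    response = scraper.request('GET', 'https://api.example.com/data')
    success = response.status_code == 200
finally:
    perf_logger.log_scraping_metrics(
        url='https://api.example.com/data',
        start_time=start,
        end_time=time.time(),
        success=success
    )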
Console Commands for Log Analysis
Use these command-line tools to analyze your scraping logs:
# Filter client-error responses (4xx)
grep -E "RESPONSE: 4[0-9][0-9]" scraping.log
# Count successful vs failed requests
grep -c "RESPONSE: 2[0-9][0-9]" scraping.log
grep -c "RESPONSE: [45][0-9][0-9]" scraping.log
# Find slow requests (5 seconds or longer)
grep -E "RESPONSE:.*\(([5-9]|[1-9][0-9]+)\.[0-9]+s\)" scraping.log
# Extract and count unique error patterns
grep "Error scraping" scraping.log | sed 's/.*Error scraping //' | sort | uniq -c | sort -rn
# Monitor logs in real-time
tail -f scraping.log | grep --color=always -E "(ERROR|WARNING)"
# List all response times in seconds, slowest last
grep -oE "\([0-9]+\.[0-9]+s\)" scraping.log | tr -d '()s' | sort -n
Integration with External Tools
ELK Stack Integration
For production environments, integrate with Elasticsearch, Logstash, and Kibana:
from pythonjsonlogger import jsonlogger
import logging

# Configure JSON logging for ELK
json_handler = logging.StreamHandler()
formatter = jsonlogger.JsonFormatter(
    '%(asctime)s %(name)s %(levelname)s %(message)s'
)
json_handler.setFormatter(formatter)

elk_logger = logging.getLogger('scraping')
elk_logger.addHandler(json_handler)
elk_logger.setLevel(logging.INFO)

# Usage with additional fields
# (url, response, duration, headers and proxy_info come from your request cycle)
elk_logger.info('Request completed', extra={
    'url': url,
    'status_code': response.status_code,
    'response_time': duration,
    'user_agent': headers.get('User-Agent'),
    'proxy_used': proxy_info
})
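The StreamHandler above emits JSON to stdout. If your shipping agent (for example Filebeat or a Logstash file input, an assumption about your setup) tails files instead, you can attach a FileHandler that reuses the same JSON formatter so each line lands in the file as a single JSON object:

# Also write the JSON lines to a dedicated file for the log shipper to tail
file_handler = logging.FileHandler('scraping-elk.log')
file_handler.setFormatter(formatter)
elk_logger.addHandler(file_handler)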
Best Practices for Scraping Logs
Log Level Management
Implement appropriate log levels for different scenarios:
- DEBUG: Detailed request/response bodies, headers
- INFO: Successful operations, basic metrics
- WARNING: Retry attempts, rate limit warnings
- ERROR: Failed requests, parsing errors
- CRITICAL: System failures, authentication issues
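A minimal way to wire these levels up (handler names and file paths are illustrative) is with logging.config.dictConfig, keeping noisy DEBUG output in a file while the console stays at INFO:

import logging.config

LOGGING_CONFIG = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'standard': {
            'format': '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
        }
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'level': 'INFO',
            'formatter': 'standard'
        },
        'debug_file': {
            'class': 'logging.FileHandler',
            'filename': 'scraping-debug.log',
            'level': 'DEBUG',
            'formatter': 'standard'
        }
    },
    'loggers': {
        'web_scraper': {
            'handlers': ['console', 'debug_file'],
            'level': 'DEBUG',
            'propagate': False
        }
    }
}

logging.config.dictConfig(LOGGING_CONFIG)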
Security Considerations
Avoid logging sensitive information:
import logging
import re

class SecureLogger:
    SENSITIVE_PATTERNS = [
        r'api[_-]?key=[\w-]+',
        r'password=[\w-]+',
        r'token=[\w.-]+',
        r'authorization:\s*bearer\s+[\w.-]+'
    ]

    def __init__(self, name='secure_scraper'):
        self.logger = logging.getLogger(name)

    def sanitize_data(self, data):
        """Remove sensitive information from log data"""
        if isinstance(data, str):
            for pattern in self.SENSITIVE_PATTERNS:
                data = re.sub(pattern, '[REDACTED]', data, flags=re.IGNORECASE)
        return data

    def safe_log(self, level, message, data=None):
        if data:
            data = self.sanitize_data(str(data))
        self.logger.log(level, message, extra={'data': data})
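For example (the URL and key are placeholders), a query string containing an API key is redacted before it reaches the log record:

secure = SecureLogger()
secure.safe_log(
    logging.INFO,
    'Fetched page',
    data='https://api.example.com/data?page=1&api_key=abc123secret'
)
# The 'data' extra now contains: https://api.example.com/data?page=1&[REDACTED]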
When debugging complex scraping operations, proper API logging becomes essential for understanding request patterns and identifying bottlenecks. This comprehensive logging approach, combined with tools for monitoring network requests in Puppeteer, provides complete visibility into your scraping infrastructure.
Conclusion
Implementing comprehensive API logging transforms debugging from guesswork into systematic problem-solving. By capturing detailed request/response cycles, performance metrics, and error patterns, you can quickly identify and resolve scraping issues. Remember to balance logging detail with performance impact, and always sanitize sensitive information from your logs.
The logging strategies outlined here provide a solid foundation for debugging any web scraping application, whether you're dealing with simple HTTP requests or complex browser automation scenarios where you also need to handle errors in Puppeteer.