How can I manage HTTP connection limits and timeouts?
Managing HTTP connection limits and timeouts is crucial for building robust web scraping applications that can handle high-volume requests efficiently while avoiding server overload and connection failures. Proper configuration prevents bottlenecks, reduces resource consumption, and ensures reliable data extraction.
Understanding HTTP Connection Management
HTTP connection management involves controlling how your application establishes, maintains, and reuses connections to web servers. Key concepts include the following; a short sketch after the list shows connection reuse in practice:
- Connection Pooling: Reusing existing connections instead of creating new ones for each request
- Connection Limits: Maximum number of concurrent connections to prevent resource exhaustion
- Timeouts: Time limits for connection establishment and data transfer
- Keep-Alive: Maintaining connections open for multiple requests
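To see why connection reuse matters, here is a minimal sketch (using a placeholder URL and an arbitrary count of 10 requests) that contrasts one-off calls, each of which may pay a fresh TCP/TLS handshake, with a pooled requests.Session that keeps the connection alive:

import time
import requests

URL = 'https://example.com'  # placeholder target
N = 10

# One-off requests: each call may open a new TCP/TLS connection
start = time.time()
for _ in range(N):
    requests.get(URL, timeout=10)
print(f"Without pooling: {time.time() - start:.2f}s")

# Pooled session: keep-alive lets all requests reuse one connection
start = time.time()
with requests.Session() as session:
    for _ in range(N):
        session.get(URL, timeout=10)
print(f"With pooling:    {time.time() - start:.2f}s")

Against keep-alive-friendly servers, the pooled loop typically finishes noticeably faster because the handshake cost is paid only once.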
Connection Pooling and Limits
Python with requests and urllib3
Python's requests library uses urllib3 for connection pooling. Here's how to configure it:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Create a session with custom connection pooling
session = requests.Session()

# Retry policy for failed requests (3 attempts with exponential backoff)
retries = Retry(total=3, backoff_factor=0.5, status_forcelist=[429, 500, 502, 503, 504])

# Configure HTTPAdapter with connection pooling
adapter = HTTPAdapter(
    pool_connections=10,   # Number of connection pools to cache
    pool_maxsize=20,       # Maximum connections in each pool
    max_retries=retries,   # Retry failed requests
    pool_block=False       # Don't block when the pool is full
)
session.mount('http://', adapter)
session.mount('https://', adapter)

# Example request with the configured session
response = session.get('https://example.com')
For more advanced control with httpx:
import httpx
import asyncio

# Synchronous client with connection limits
with httpx.Client(
    limits=httpx.Limits(
        max_keepalive_connections=10,
        max_connections=50,
        keepalive_expiry=30.0
    ),
    timeout=httpx.Timeout(30.0)
) as client:
    response = client.get('https://example.com')

# Asynchronous client for high-performance scraping
async def fetch_urls(urls):
    limits = httpx.Limits(
        max_keepalive_connections=20,
        max_connections=100
    )
    async with httpx.AsyncClient(limits=limits) as client:
        tasks = [client.get(url) for url in urls]
        responses = await asyncio.gather(*tasks, return_exceptions=True)
        return responses
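The gather call above launches every request at once and relies on the connection-pool limits for back-pressure. A common refinement, sketched below with an assumed cap of 10 in-flight requests (fetch_urls_bounded and max_in_flight are illustrative names, not httpx API), is to add an asyncio.Semaphore so concurrency is throttled explicitly:

import asyncio
import httpx

async def fetch_urls_bounded(urls, max_in_flight=10):
    # Cap the number of requests running at any one time
    semaphore = asyncio.Semaphore(max_in_flight)
    limits = httpx.Limits(max_keepalive_connections=20, max_connections=100)

    async with httpx.AsyncClient(limits=limits, timeout=30.0) as client:
        async def fetch_one(url):
            async with semaphore:
                return await client.get(url)

        return await asyncio.gather(
            *(fetch_one(url) for url in urls),
            return_exceptions=True
        )

# Example usage with placeholder URLs:
# responses = asyncio.run(fetch_urls_bounded(['https://example.com'] * 25))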
JavaScript with axios and Node.js
Configure connection pooling in Node.js applications:
const axios = require('axios');
const http = require('http');
const https = require('https');

// Create HTTP agents with connection pooling
// Note: freeSocketTimeout is provided by the agentkeepalive package,
// not the built-in Agent, so it is omitted here.
const httpAgent = new http.Agent({
  keepAlive: true,
  maxSockets: 50,      // Max connections per host
  maxFreeSockets: 10,  // Max idle connections per host
  timeout: 60000,      // Socket timeout in milliseconds
});

const httpsAgent = new https.Agent({
  keepAlive: true,
  maxSockets: 50,
  maxFreeSockets: 10,
  timeout: 60000,
});

// Configure axios with custom agents
const client = axios.create({
  httpAgent: httpAgent,
  httpsAgent: httpsAgent,
  timeout: 30000, // Request timeout
});

// Example usage: allSettled captures both successes and failures
async function scrapeUrls(urls) {
  const results = await Promise.allSettled(urls.map(url => client.get(url)));
  return results;
}
For the built-in fetch API (Node.js 18+), undici provides custom connection management:
// Using undici for better connection management
const { Agent, setGlobalDispatcher } = require('undici');

// Create a global agent with connection limits
const agent = new Agent({
  connections: 50,          // Max connections per origin
  pipelining: 1,            // HTTP pipelining factor
  keepAliveTimeout: 30000,  // How long idle sockets stay open (ms)
  keepAliveMaxTimeout: 600000
});

setGlobalDispatcher(agent);

// Use with fetch
async function fetchWithLimits(url) {
  try {
    const response = await fetch(url, {
      signal: AbortSignal.timeout(30000) // 30-second timeout
    });
    return await response.text();
  } catch (error) {
    console.error(`Failed to fetch ${url}:`, error.message);
    return null;
  }
}
Timeout Configuration
Connection vs Request Timeouts
Different types of timeouts serve different purposes:
import httpx

# Comprehensive timeout configuration
timeout = httpx.Timeout(
    connect=10.0,  # Time to establish a connection
    read=30.0,     # Time to read response data
    write=10.0,    # Time to send request data
    pool=5.0       # Time to acquire a connection from the pool
)

client = httpx.Client(timeout=timeout)

# Per-request timeout override
response = client.get(
    'https://slow-api.example.com',
    timeout=60.0  # Override the default timeout
)
Dynamic Timeout Adjustment
Implement adaptive timeouts based on response patterns:
import time
import requests
from statistics import mean

class AdaptiveHttpClient:
    def __init__(self):
        self.response_times = []
        self.base_timeout = 30.0

    def calculate_timeout(self):
        if len(self.response_times) < 5:
            return self.base_timeout
        avg_time = mean(self.response_times[-10:])  # Last 10 requests
        return min(avg_time * 3, 120.0)  # 3x average, capped at 2 minutes

    def request(self, url):
        timeout = self.calculate_timeout()
        start_time = time.time()
        try:
            response = requests.get(url, timeout=timeout)
            response_time = time.time() - start_time
            self.response_times.append(response_time)
            return response
        except requests.exceptions.Timeout:
            # Record the timeout so subsequent timeouts grow for slow endpoints
            self.response_times.append(timeout)
            raise
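A brief usage sketch for the adaptive client above, using placeholder URLs; each successful call feeds the rolling average, so the computed timeout tightens or relaxes with the endpoint's observed latency:

adaptive_client = AdaptiveHttpClient()

for url in ['https://example.com/a', 'https://example.com/b']:
    try:
        response = adaptive_client.request(url)
        print(url, response.status_code, f"next timeout ~{adaptive_client.calculate_timeout():.1f}s")
    except requests.exceptions.Timeout:
        print(url, "timed out; future timeouts will grow toward the 120s cap")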
Browser-Based Scraping Timeout Management
When working with browser automation tools, timeout management becomes even more critical: careful configuration prevents hanging browser processes and runaway resource consumption.
// Puppeteer timeout configuration
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    args: ['--no-sandbox', '--disable-dev-shm-usage']
  });
  const page = await browser.newPage();

  // Set various timeouts
  page.setDefaultTimeout(60000);           // Default timeout for all operations
  page.setDefaultNavigationTimeout(30000); // Navigation-specific timeout

  // Per-operation timeouts
  await page.goto('https://example.com', {
    waitUntil: 'networkidle2',
    timeout: 45000
  });

  await page.waitForSelector('.dynamic-content', {
    timeout: 20000
  });

  await browser.close();
})();
Production-Ready Connection Management
Load Balancing and Circuit Breaker Pattern
Implement circuit breakers to handle failing endpoints gracefully:
import time
import requests
from enum import Enum
from dataclasses import dataclass

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    timeout: int = 60  # Seconds to wait before probing a failed endpoint again

    def __post_init__(self):
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise Exception("Circuit breaker is OPEN")
        try:
            result = func(*args, **kwargs)
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

# Usage with an HTTP client
circuit_breaker = CircuitBreaker()

def safe_request(url):
    return circuit_breaker.call(requests.get, url, timeout=30)
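Failure patterns usually differ per host, so one useful extension, sketched here on top of the CircuitBreaker class above with placeholder hostnames, is to keep a separate breaker per host so one failing API does not block requests to healthy ones:

from urllib.parse import urlparse

# One breaker per host, created lazily
breakers = {}

def request_with_breaker(url, **kwargs):
    host = urlparse(url).netloc
    breaker = breakers.setdefault(host, CircuitBreaker(failure_threshold=5, timeout=60))
    kwargs.setdefault('timeout', 30)
    return breaker.call(requests.get, url, **kwargs)

# Usage
# response = request_with_breaker('https://api.example.com/data')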
Connection Pool Monitoring
Monitor connection pool health and performance:
import threading
import time
from urllib3.poolmanager import PoolManager

class MonitoredPoolManager(PoolManager):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stats = {
            'active_connections': 0,
            'total_requests': 0,
            'failed_requests': 0
        }
        self._lock = threading.Lock()

    def urlopen(self, method, url, *args, **kwargs):
        with self._lock:
            self.stats['active_connections'] += 1
            self.stats['total_requests'] += 1
        try:
            return super().urlopen(method, url, *args, **kwargs)
        except Exception:
            with self._lock:
                self.stats['failed_requests'] += 1
            raise
        finally:
            with self._lock:
                self.stats['active_connections'] -= 1

    def get_stats(self):
        with self._lock:
            return self.stats.copy()

# Usage
pool = MonitoredPoolManager(
    num_pools=10,
    maxsize=20,
    retries=3
)

# Monitor pool stats in a background thread
def monitor_pool():
    while True:
        stats = pool.get_stats()
        print(f"Pool stats: {stats}")
        time.sleep(10)

threading.Thread(target=monitor_pool, daemon=True).start()
Error Handling and Retry Logic
Implement exponential backoff for failed connections:
import random
import time
import requests
from functools import wraps

def retry_with_backoff(max_retries=3, base_delay=1, max_delay=60):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except (requests.exceptions.ConnectionError,
                        requests.exceptions.Timeout):
                    if attempt == max_retries:
                        raise
                    # Exponential backoff with jitter
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    jitter = random.uniform(0, delay * 0.1)
                    time.sleep(delay + jitter)
        return wrapper
    return decorator

@retry_with_backoff(max_retries=5, base_delay=2)
def reliable_request(url):
    return requests.get(url, timeout=30)
Advanced Connection Management Strategies
Connection Pool Warming
Pre-establish connections to improve initial request performance:
import concurrent.futures
import requests

class WarmConnectionPool:
    def __init__(self, hosts, pool_size=10):
        self.hosts = hosts
        self.session = requests.Session()
        # Configure connection pooling
        adapter = requests.adapters.HTTPAdapter(
            pool_connections=len(hosts),
            pool_maxsize=pool_size
        )
        self.session.mount('http://', adapter)
        self.session.mount('https://', adapter)
        self.warm_connections()

    def warm_connections(self):
        """Pre-establish connections to all hosts."""
        def make_head_request(host):
            try:
                self.session.head(host, timeout=5)
            except Exception:
                pass  # Ignore errors during warming

        with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
            executor.map(make_head_request, self.hosts)

    def get(self, url, **kwargs):
        return self.session.get(url, **kwargs)

# Usage
hosts = ['https://api1.example.com', 'https://api2.example.com']
pool = WarmConnectionPool(hosts)
Per-Host Connection Limits
Implement different connection limits for different hosts:
import requests
from urllib.parse import urlparse
from requests.adapters import HTTPAdapter

class PerHostConnectionManager:
    def __init__(self):
        self.sessions = {}
        self.host_timeouts = {}
        self.host_configs = {
            'api.example.com': {'pool_maxsize': 50, 'timeout': 30},
            'slow-api.example.com': {'pool_maxsize': 10, 'timeout': 120},
            'default': {'pool_maxsize': 20, 'timeout': 60}
        }

    def get_session(self, url):
        host = urlparse(url).netloc
        if host not in self.sessions:
            config = self.host_configs.get(host, self.host_configs['default'])
            session = requests.Session()
            adapter = HTTPAdapter(
                pool_maxsize=config['pool_maxsize'],
                pool_connections=1
            )
            session.mount('http://', adapter)
            session.mount('https://', adapter)
            # requests.Session has no timeout attribute, so track it per host
            self.host_timeouts[host] = config['timeout']
            self.sessions[host] = session
        return self.sessions[host]

    def request(self, method, url, **kwargs):
        session = self.get_session(url)
        kwargs.setdefault('timeout', self.host_timeouts[urlparse(url).netloc])
        return session.request(method, url, **kwargs)

# Usage
manager = PerHostConnectionManager()
response = manager.request('GET', 'https://api.example.com/data')
Best Practices for Production
- Set Appropriate Limits: Don't overwhelm target servers with too many concurrent connections
- Monitor Performance: Track connection pool utilization and response times
- Implement Graceful Degradation: Handle connection failures without crashing the application
- Use Connection Pooling: Reuse connections to improve performance and reduce overhead
- Configure Realistic Timeouts: Balance between reliability and performance
- Implement Rate Limiting: Respect server resources and API limits (see the token-bucket sketch after this list)
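Rate limiting is listed above but not demonstrated elsewhere in this section, so here is a minimal token-bucket sketch (the limit of 5 requests per second is an arbitrary example, and TokenBucket/limited_get are illustrative names) that can sit in front of any of the clients shown earlier:

import time
import threading
import requests

class TokenBucket:
    """Simple token-bucket rate limiter: at most `rate` requests per second."""
    def __init__(self, rate=5.0, capacity=5):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self._lock = threading.Lock()

    def acquire(self):
        with self._lock:
            now = time.monotonic()
            # Refill tokens based on elapsed time, up to capacity
            self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens < 1:
                # Sleep just long enough for one token to become available
                wait = (1 - self.tokens) / self.rate
                time.sleep(wait)
                self.last_refill = time.monotonic()
                self.tokens = 0
            else:
                self.tokens -= 1

bucket = TokenBucket(rate=5.0)

def limited_get(url, **kwargs):
    bucket.acquire()  # Blocks until a request slot is available
    kwargs.setdefault('timeout', 30)
    return requests.get(url, **kwargs)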
For complex scraping scenarios involving browser session management, these connection management principles become even more important as browser instances consume significant resources.
Monitoring and Debugging
Connection Pool Metrics
Track important metrics to optimize performance:
# Monitor system-level connection stats
netstat -an | grep :80 | wc -l # Count HTTP connections
netstat -an | grep :443 | wc -l # Count HTTPS connections
ss -tuln # Show listening sockets
Application-Level Monitoring
import threading
import time
import psutil

def monitor_connections():
    """Monitor the application's own connection usage."""
    process = psutil.Process()
    while True:
        connections = process.connections()
        established = len([c for c in connections if c.status == 'ESTABLISHED'])
        time_wait = len([c for c in connections if c.status == 'TIME_WAIT'])
        print(f"Established: {established}, TIME_WAIT: {time_wait}")
        time.sleep(10)

# Run monitoring in a background thread
monitor_thread = threading.Thread(target=monitor_connections, daemon=True)
monitor_thread.start()
By properly managing HTTP connection limits and timeouts, your web scraping applications will be more reliable, efficient, and respectful of target server resources. Regular monitoring and adjustment of these parameters based on actual performance data ensures optimal results in production environments.