How can I manage HTTP connection limits and timeouts?
Managing HTTP connection limits and timeouts is crucial for building robust web scraping applications that can handle high-volume requests efficiently while avoiding server overload and connection failures. Proper configuration prevents bottlenecks, reduces resource consumption, and ensures reliable data extraction.
Understanding HTTP Connection Management
HTTP connection management involves controlling how your application establishes, maintains, and reuses connections to web servers. Key concepts include the following; a short sketch after the list shows connection reuse in practice:
- Connection Pooling: Reusing existing connections instead of creating new ones for each request
- Connection Limits: Maximum number of concurrent connections to prevent resource exhaustion
- Timeouts: Time limits for connection establishment and data transfer
- Keep-Alive: Maintaining connections open for multiple requests
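To see why connection reuse matters, here is a minimal sketch (using a placeholder URL and an arbitrary count of 10 requests) that contrasts one-off calls, each of which may pay a fresh TCP/TLS handshake, with a pooled requests.Session that keeps the connection alive:

import time
import requests

URL = 'https://example.com'  # placeholder target
N = 10

# One-off requests: each call may open a new TCP/TLS connection
start = time.time()
for _ in range(N):
    requests.get(URL, timeout=10)
print(f"Without pooling: {time.time() - start:.2f}s")

# Pooled session: keep-alive lets all requests reuse one connection
start = time.time()
with requests.Session() as session:
    for _ in range(N):
        session.get(URL, timeout=10)
print(f"With pooling:    {time.time() - start:.2f}s")

Against keep-alive-friendly servers, the pooled loop typically finishes noticeably faster because the handshake cost is paid only once.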
Connection Pooling and Limits
Python with requests and urllib3
Python's requests library uses urllib3 for connection pooling. Here's how to configure it:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Create a session with custom connection pooling
session = requests.Session()

# Retry policy for failed requests (3 attempts with exponential backoff)
retries = Retry(total=3, backoff_factor=0.5, status_forcelist=[429, 500, 502, 503, 504])

# Configure HTTPAdapter with connection pooling
adapter = HTTPAdapter(
    pool_connections=10,   # Number of connection pools to cache
    pool_maxsize=20,       # Maximum connections in each pool
    max_retries=retries,   # Retry failed requests
    pool_block=False       # Don't block when the pool is full
)
session.mount('http://', adapter)
session.mount('https://', adapter)

# Example request with the configured session
response = session.get('https://example.com')
For more advanced control with httpx:
import httpx
import asyncio

# Synchronous client with connection limits
with httpx.Client(
    limits=httpx.Limits(
        max_keepalive_connections=10,
        max_connections=50,
        keepalive_expiry=30.0
    ),
    timeout=httpx.Timeout(30.0)
) as client:
    response = client.get('https://example.com')

# Asynchronous client for high-performance scraping
async def fetch_urls(urls):
    limits = httpx.Limits(
        max_keepalive_connections=20,
        max_connections=100
    )
    async with httpx.AsyncClient(limits=limits) as client:
        tasks = [client.get(url) for url in urls]
        responses = await asyncio.gather(*tasks, return_exceptions=True)
        return responses
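The gather call above launches every request at once and relies on the connection-pool limits for back-pressure. A common refinement, sketched below with an assumed cap of 10 in-flight requests (fetch_urls_bounded and max_in_flight are illustrative names, not httpx API), is to add an asyncio.Semaphore so concurrency is throttled explicitly:

import asyncio
import httpx

async def fetch_urls_bounded(urls, max_in_flight=10):
    # Cap the number of requests running at any one time
    semaphore = asyncio.Semaphore(max_in_flight)
    limits = httpx.Limits(max_keepalive_connections=20, max_connections=100)

    async with httpx.AsyncClient(limits=limits, timeout=30.0) as client:
        async def fetch_one(url):
            async with semaphore:
                return await client.get(url)

        return await asyncio.gather(
            *(fetch_one(url) for url in urls),
            return_exceptions=True
        )

# Example usage with placeholder URLs:
# responses = asyncio.run(fetch_urls_bounded(['https://example.com'] * 25))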
JavaScript with axios and Node.js
Configure connection pooling in Node.js applications:
const axios = require('axios');
const http = require('http');
const https = require('https');

// Create HTTP agents with connection pooling
// Note: freeSocketTimeout is provided by the agentkeepalive package,
// not the built-in Agent, so it is omitted here.
const httpAgent = new http.Agent({
  keepAlive: true,
  maxSockets: 50,      // Max connections per host
  maxFreeSockets: 10,  // Max idle connections per host
  timeout: 60000,      // Socket timeout in milliseconds
});

const httpsAgent = new https.Agent({
  keepAlive: true,
  maxSockets: 50,
  maxFreeSockets: 10,
  timeout: 60000,
});

// Configure axios with custom agents
const client = axios.create({
  httpAgent: httpAgent,
  httpsAgent: httpsAgent,
  timeout: 30000, // Request timeout
});

// Example usage: allSettled captures both successes and failures
async function scrapeUrls(urls) {
  const results = await Promise.allSettled(urls.map(url => client.get(url)));
  return results;
}
For the built-in fetch API (Node.js 18+), undici provides custom connection management:
// Using undici for better connection management
const { Agent, setGlobalDispatcher } = require('undici');

// Create a global agent with connection limits
const agent = new Agent({
  connections: 50,          // Max connections per origin
  pipelining: 1,            // HTTP pipelining factor
  keepAliveTimeout: 30000,  // How long idle sockets stay open (ms)
  keepAliveMaxTimeout: 600000
});

setGlobalDispatcher(agent);

// Use with fetch
async function fetchWithLimits(url) {
  try {
    const response = await fetch(url, {
      signal: AbortSignal.timeout(30000) // 30-second timeout
    });
    return await response.text();
  } catch (error) {
    console.error(`Failed to fetch ${url}:`, error.message);
    return null;
  }
}
Timeout Configuration
Connection vs Request Timeouts
Different types of timeouts serve different purposes:
import httpx

# Comprehensive timeout configuration
timeout = httpx.Timeout(
    connect=10.0,  # Time to establish a connection
    read=30.0,     # Time to read response data
    write=10.0,    # Time to send request data
    pool=5.0       # Time to acquire a connection from the pool
)

client = httpx.Client(timeout=timeout)

# Per-request timeout override
response = client.get(
    'https://slow-api.example.com',
    timeout=60.0  # Override the default timeout
)
Dynamic Timeout Adjustment
Implement adaptive timeouts based on response patterns:
import time
import requests
from statistics import mean

class AdaptiveHttpClient:
    def __init__(self):
        self.response_times = []
        self.base_timeout = 30.0

    def calculate_timeout(self):
        if len(self.response_times) < 5:
            return self.base_timeout
        avg_time = mean(self.response_times[-10:])  # Last 10 requests
        return min(avg_time * 3, 120.0)  # 3x average, capped at 2 minutes

    def request(self, url):
        timeout = self.calculate_timeout()
        start_time = time.time()
        try:
            response = requests.get(url, timeout=timeout)
            response_time = time.time() - start_time
            self.response_times.append(response_time)
            return response
        except requests.exceptions.Timeout:
            # Record the timeout so subsequent timeouts grow for slow endpoints
            self.response_times.append(timeout)
            raise
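A brief usage sketch for the adaptive client above, using placeholder URLs; each successful call feeds the rolling average, so the computed timeout tightens or relaxes with the endpoint's observed latency:

adaptive_client = AdaptiveHttpClient()

for url in ['https://example.com/a', 'https://example.com/b']:
    try:
        response = adaptive_client.request(url)
        print(url, response.status_code, f"next timeout ~{adaptive_client.calculate_timeout():.1f}s")
    except requests.exceptions.Timeout:
        print(url, "timed out; future timeouts will grow toward the 120s cap")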
Browser-Based Scraping Timeout Management
When working with browser automation tools, timeout management becomes even more critical: careful configuration prevents hanging browser processes and runaway resource consumption.
// Puppeteer timeout configuration
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    args: ['--no-sandbox', '--disable-dev-shm-usage']
  });
  const page = await browser.newPage();

  // Set various timeouts
  page.setDefaultTimeout(60000);           // Default timeout for all operations
  page.setDefaultNavigationTimeout(30000); // Navigation-specific timeout

  // Per-operation timeouts
  await page.goto('https://example.com', {
    waitUntil: 'networkidle2',
    timeout: 45000
  });

  await page.waitForSelector('.dynamic-content', {
    timeout: 20000
  });

  await browser.close();
})();
Production-Ready Connection Management
Load Balancing and Circuit Breaker Pattern
Implement circuit breakers to handle failing endpoints gracefully:
import time
import requests
from enum import Enum
from dataclasses import dataclass

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    timeout: int = 60  # Seconds to wait before probing a failed endpoint again

    def __post_init__(self):
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise Exception("Circuit breaker is OPEN")
        try:
            result = func(*args, **kwargs)
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

# Usage with an HTTP client
circuit_breaker = CircuitBreaker()

def safe_request(url):
    return circuit_breaker.call(requests.get, url, timeout=30)
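Failure patterns usually differ per host, so one useful extension, sketched here on top of the CircuitBreaker class above with placeholder hostnames, is to keep a separate breaker per host so one failing API does not block requests to healthy ones:

from urllib.parse import urlparse

# One breaker per host, created lazily
breakers = {}

def request_with_breaker(url, **kwargs):
    host = urlparse(url).netloc
    breaker = breakers.setdefault(host, CircuitBreaker(failure_threshold=5, timeout=60))
    kwargs.setdefault('timeout', 30)
    return breaker.call(requests.get, url, **kwargs)

# Usage
# response = request_with_breaker('https://api.example.com/data')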
Connection Pool Monitoring
Monitor connection pool health and performance:
import threading
import time
from urllib3.poolmanager import PoolManager

class MonitoredPoolManager(PoolManager):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stats = {
            'active_connections': 0,
            'total_requests': 0,
            'failed_requests': 0
        }
        self._lock = threading.Lock()

    def urlopen(self, method, url, *args, **kwargs):
        with self._lock:
            self.stats['active_connections'] += 1
            self.stats['total_requests'] += 1
        try:
            return super().urlopen(method, url, *args, **kwargs)
        except Exception:
            with self._lock:
                self.stats['failed_requests'] += 1
            raise
        finally:
            with self._lock:
                self.stats['active_connections'] -= 1

    def get_stats(self):
        with self._lock:
            return self.stats.copy()

# Usage
pool = MonitoredPoolManager(
    num_pools=10,
    maxsize=20,
    retries=3
)

# Monitor pool stats in a background thread
def monitor_pool():
    while True:
        stats = pool.get_stats()
        print(f"Pool stats: {stats}")
        time.sleep(10)

threading.Thread(target=monitor_pool, daemon=True).start()
Error Handling and Retry Logic
Implement exponential backoff for failed connections:
import random
import time
import requests
from functools import wraps

def retry_with_backoff(max_retries=3, base_delay=1, max_delay=60):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except (requests.exceptions.ConnectionError,
                        requests.exceptions.Timeout):
                    if attempt == max_retries:
                        raise
                    # Exponential backoff with jitter
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    jitter = random.uniform(0, delay * 0.1)
                    time.sleep(delay + jitter)
        return wrapper
    return decorator

@retry_with_backoff(max_retries=5, base_delay=2)
def reliable_request(url):
    return requests.get(url, timeout=30)
Advanced Connection Management Strategies
Connection Pool Warming
Pre-establish connections to improve initial request performance:
import concurrent.futures
import requests

class WarmConnectionPool:
    def __init__(self, hosts, pool_size=10):
        self.hosts = hosts
        self.session = requests.Session()
        # Configure connection pooling
        adapter = requests.adapters.HTTPAdapter(
            pool_connections=len(hosts),
            pool_maxsize=pool_size
        )
        self.session.mount('http://', adapter)
        self.session.mount('https://', adapter)
        self.warm_connections()

    def warm_connections(self):
        """Pre-establish connections to all hosts."""
        def make_head_request(host):
            try:
                self.session.head(host, timeout=5)
            except Exception:
                pass  # Ignore errors during warming

        with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
            executor.map(make_head_request, self.hosts)

    def get(self, url, **kwargs):
        return self.session.get(url, **kwargs)

# Usage
hosts = ['https://api1.example.com', 'https://api2.example.com']
pool = WarmConnectionPool(hosts)
Per-Host Connection Limits
Implement different connection limits for different hosts:
import requests
from urllib.parse import urlparse
from requests.adapters import HTTPAdapter

class PerHostConnectionManager:
    def __init__(self):
        self.sessions = {}
        self.host_timeouts = {}
        self.host_configs = {
            'api.example.com': {'pool_maxsize': 50, 'timeout': 30},
            'slow-api.example.com': {'pool_maxsize': 10, 'timeout': 120},
            'default': {'pool_maxsize': 20, 'timeout': 60}
        }

    def get_session(self, url):
        host = urlparse(url).netloc
        if host not in self.sessions:
            config = self.host_configs.get(host, self.host_configs['default'])
            session = requests.Session()
            adapter = HTTPAdapter(
                pool_maxsize=config['pool_maxsize'],
                pool_connections=1
            )
            session.mount('http://', adapter)
            session.mount('https://', adapter)
            # requests.Session has no timeout attribute, so track it per host
            self.host_timeouts[host] = config['timeout']
            self.sessions[host] = session
        return self.sessions[host]

    def request(self, method, url, **kwargs):
        session = self.get_session(url)
        kwargs.setdefault('timeout', self.host_timeouts[urlparse(url).netloc])
        return session.request(method, url, **kwargs)

# Usage
manager = PerHostConnectionManager()
response = manager.request('GET', 'https://api.example.com/data')
Best Practices for Production
- Set Appropriate Limits: Don't overwhelm target servers with too many concurrent connections
- Monitor Performance: Track connection pool utilization and response times
- Implement Graceful Degradation: Handle connection failures without crashing the application
- Use Connection Pooling: Reuse connections to improve performance and reduce overhead
- Configure Realistic Timeouts: Balance between reliability and performance
- Implement Rate Limiting: Respect server resources and API limits (see the token-bucket sketch after this list)
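Rate limiting is listed above but not demonstrated elsewhere in this section, so here is a minimal token-bucket sketch (the limit of 5 requests per second is an arbitrary example, and TokenBucket/limited_get are illustrative names) that can sit in front of any of the clients shown earlier:

import time
import threading
import requests

class TokenBucket:
    """Simple token-bucket rate limiter: at most `rate` requests per second."""
    def __init__(self, rate=5.0, capacity=5):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self._lock = threading.Lock()

    def acquire(self):
        with self._lock:
            now = time.monotonic()
            # Refill tokens based on elapsed time, up to capacity
            self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens < 1:
                # Sleep just long enough for one token to become available
                wait = (1 - self.tokens) / self.rate
                time.sleep(wait)
                self.last_refill = time.monotonic()
                self.tokens = 0
            else:
                self.tokens -= 1

bucket = TokenBucket(rate=5.0)

def limited_get(url, **kwargs):
    bucket.acquire()  # Blocks until a request slot is available
    kwargs.setdefault('timeout', 30)
    return requests.get(url, **kwargs)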
For complex scraping scenarios involving browser session management, these connection management principles become even more important as browser instances consume significant resources.
Monitoring and Debugging
Connection Pool Metrics
Track important metrics to optimize performance:
# Monitor system-level connection stats
netstat -an | grep :80 | wc -l # Count HTTP connections
netstat -an | grep :443 | wc -l # Count HTTPS connections
ss -tuln # Show listening sockets
Application-Level Monitoring
import threading
import time
import psutil

def monitor_connections():
    """Monitor the application's own connection usage."""
    process = psutil.Process()
    while True:
        connections = process.connections()
        established = len([c for c in connections if c.status == 'ESTABLISHED'])
        time_wait = len([c for c in connections if c.status == 'TIME_WAIT'])
        print(f"Established: {established}, TIME_WAIT: {time_wait}")
        time.sleep(10)

# Run monitoring in a background thread
monitor_thread = threading.Thread(target=monitor_connections, daemon=True)
monitor_thread.start()
By properly managing HTTP connection limits and timeouts, your web scraping applications will be more reliable, efficient, and respectful of target server resources. Regular monitoring and adjustment of these parameters based on actual performance data ensures optimal results in production environments.