How do I work with persistent connections in urllib3?

Persistent connections in urllib3 allow you to reuse HTTP connections across multiple requests, significantly improving performance by avoiding the overhead of establishing new connections. This is achieved through connection pooling, which is enabled by default in urllib3.

Understanding Connection Pooling

Connection pooling maintains a pool of open connections that can be reused for subsequent requests to the same host. This reduces:

  • Connection establishment time
  • TCP handshake overhead
  • SSL/TLS negotiation time (for HTTPS)
  • Overall latency
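
To see the reuse directly, here is a minimal sketch. It assumes httpbin.org is reachable, and it peeks at num_connections, an internal urllib3 counter of opened TCP connections, purely for illustration:

import urllib3

http = urllib3.PoolManager()

# Two requests to the same host through one PoolManager
http.request('GET', 'https://httpbin.org/get')
http.request('GET', 'https://httpbin.org/headers')

# connection_from_url returns the per-host pool that served those requests;
# num_connections is an internal counter, shown here only as a debugging aid
pool = http.connection_from_url('https://httpbin.org')
print(f"TCP connections opened: {pool.num_connections}")  # typically 1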

Basic Usage with PoolManager

Creating a PoolManager

The PoolManager class handles connection pooling automatically:

import urllib3

# Create a PoolManager instance
http = urllib3.PoolManager()

Making Requests

Once created, the PoolManager reuses connections automatically:

# First request - establishes connection
response1 = http.request('GET', 'https://httpbin.org/ip')
print(f"Response 1: {response1.data.decode()}")

# Second request - reuses existing connection
response2 = http.request('GET', 'https://httpbin.org/user-agent')
print(f"Response 2: {response2.data.decode()}")

# Third request to same host - connection reused again
response3 = http.request('POST', 'https://httpbin.org/post', 
                        fields={'key': 'value'})
print(f"Response 3 status: {response3.status}")

Advanced Configuration

Customizing Pool Parameters

You can fine-tune the connection pool behavior:

import urllib3
from urllib3.util.retry import Retry

# Advanced PoolManager configuration
http = urllib3.PoolManager(
    num_pools=10,           # Number of per-host pools to cache (least recently used are closed)
    maxsize=20,             # Max connections to keep open per host
    block=False,            # Don't block when all pooled connections are in use
    retries=Retry(
        total=3,            # Total retry attempts
        backoff_factor=0.3, # Exponential backoff factor between retries
        status_forcelist=[500, 502, 503, 504]  # HTTP status codes to retry
    ),
    ),
    timeout=urllib3.Timeout(
        connect=5.0,       # Connection timeout
        read=30.0          # Read timeout
    )
)
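
These settings act as defaults for the whole pool; individual requests can override them. A small sketch using the manager configured above:

# Per-request overrides leave the pool-wide defaults untouched
response = http.request(
    'GET',
    'https://httpbin.org/delay/1',
    timeout=urllib3.Timeout(connect=2.0, read=5.0),  # tighter timeout for this call only
    retries=False                                    # disable retries for this call only
)
print(response.status)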

Working with Multiple Hosts

import urllib3
import time

http = urllib3.PoolManager(num_pools=3, maxsize=5)

# Requests to different hosts - each gets its own pool
hosts = [
    'https://httpbin.org',
    'https://jsonplaceholder.typicode.com',
    'https://api.github.com'
]

for host in hosts:
    start_time = time.time()

    # First request to each host
    response = http.request('GET', f'{host}/headers' if 'httpbin' in host else host)
    first_request_time = time.time() - start_time

    # Second request to same host (should be faster due to connection reuse)
    start_time = time.time()
    response = http.request('GET', f'{host}/ip' if 'httpbin' in host else host)
    second_request_time = time.time() - start_time

    print(f"Host: {host}")
    print(f"First request: {first_request_time:.3f}s")
    print(f"Second request: {second_request_time:.3f}s")
    print(f"Speed improvement: {((first_request_time - second_request_time) / first_request_time * 100):.1f}%\n")

HTTPConnectionPool and HTTPSConnectionPool for Single Hosts

For applications that primarily communicate with a single host, use HTTPConnectionPool (or HTTPSConnectionPool for HTTPS hosts, as below) directly:

import urllib3

# Create a pool for a specific HTTPS host
pool = urllib3.HTTPSConnectionPool('httpbin.org', port=443,
                                   maxsize=10, block=True)

# Make requests using the pool
response = pool.request('GET', '/json')
print(response.data.decode())

# Clean up
pool.close()
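
Connection pools also support the context manager protocol, which closes the pool automatically; a brief sketch:

import urllib3

# The pool is closed when the with-block exits
with urllib3.HTTPSConnectionPool('httpbin.org', maxsize=10) as pool:
    response = pool.request('GET', '/get')
    print(response.status)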

Connection Pool Management

Monitoring Pool Status

import urllib3

http = urllib3.PoolManager(maxsize=5)

# Make some requests
for i in range(10):
    response = http.request('GET', f'https://httpbin.org/delay/{i%3}')

# Check pool statistics (http.pools is an internal LRU container whose
# __iter__ is intentionally disabled, so inspect it via keys())
for pool_key in http.pools.keys():
    pool = http.pools[pool_key]
    print(f"Pool {pool_key}:")
    print(f"  Slots available in queue: {pool.pool.qsize()}")
    print(f"  Pool maxsize: {pool.maxsize}")

Proper Cleanup

Always clean up resources when done:

import urllib3
import atexit

http = urllib3.PoolManager()

# Register cleanup function
def cleanup():
    http.clear()
    print("Connection pools cleared")

atexit.register(cleanup)

# Your application code here
response = http.request('GET', 'https://httpbin.org/get')

Error Handling with Persistent Connections

import urllib3
from urllib3.exceptions import MaxRetryError, NewConnectionError, TimeoutError

http = urllib3.PoolManager(
    retries=urllib3.Retry(total=3, backoff_factor=0.3),
    timeout=urllib3.Timeout(connect=5.0, read=10.0)
)

try:
    response = http.request('GET', 'https://httpbin.org/delay/2')
    print(f"Status: {response.status}")
    print(f"Data: {response.data.decode()}")

except MaxRetryError as e:
    print(f"Max retries exceeded: {e}")
except NewConnectionError as e:
    print(f"Connection failed: {e}")
except TimeoutError as e:
    print(f"Request timed out: {e}")

Performance Comparison

Here's a practical example showing the performance benefits:

import urllib3
import time
import requests  # For comparison

def test_urllib3_with_pooling():
    http = urllib3.PoolManager()
    start = time.time()

    for i in range(10):
        response = http.request('GET', 'https://httpbin.org/uuid')

    return time.time() - start

def test_requests_without_session():
    start = time.time()

    for i in range(10):
        response = requests.get('https://httpbin.org/uuid')

    return time.time() - start

# Run tests
urllib3_time = test_urllib3_with_pooling()
requests_time = test_requests_without_session()

print(f"urllib3 with pooling: {urllib3_time:.2f}s")
print(f"requests without session: {requests_time:.2f}s")
print(f"Performance improvement: {((requests_time - urllib3_time) / requests_time * 100):.1f}%")

Best Practices

  1. Reuse PoolManager instances: Create one PoolManager and reuse it throughout your application
  2. Configure appropriate pool sizes: Set maxsize based on your concurrent request needs
  3. Handle timeouts: Always set reasonable connection and read timeouts
  4. Implement retry logic: Use urllib3.Retry for robust error handling
  5. Clean up resources: Call clear() when shutting down your application
  6. Monitor pool usage: Keep track of pool statistics in production applications
  7. Use context managers: Consider wrapping PoolManager usage in context managers for automatic cleanup, as sketched below
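
For practice 7, a minimal sketch: PoolManager supports the context manager protocol, clearing its pools when the block exits.

import urllib3

# All pooled connections are closed when the with-block exits
with urllib3.PoolManager() as http:
    response = http.request('GET', 'https://httpbin.org/get')
    print(response.status)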

By leveraging persistent connections in urllib3, you can significantly improve the performance of your HTTP-based applications while maintaining clean, maintainable code.
