How do I set connection and read timeouts separately?
Setting connection and read timeouts separately in Python's Requests library is crucial for building robust web scraping applications. This granular control allows you to handle different types of network delays more effectively and prevents your applications from hanging indefinitely.
Understanding Connection vs Read Timeouts
Before diving into implementation, it's important to understand the difference between these two timeout types:
- Connection timeout: The maximum time to wait for establishing a connection to the server
- Read timeout: The maximum time to wait for the server to send data after the connection is established
These timeouts serve different purposes and should be configured based on your specific use case and network conditions.
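To see the two phases fail independently, here is a minimal sketch using the tuple syntax covered in the next section. It assumes httpbin.org is reachable, and it uses the non-routable address 10.255.255.1 purely as a trick to force a connect timeout:

```python
import requests
from requests.exceptions import ConnectTimeout, ReadTimeout

# 10.255.255.1 is non-routable, so the TCP handshake never completes
# and the 2-second connect limit fires
try:
    requests.get('http://10.255.255.1', timeout=(2, 10))
except ConnectTimeout:
    print('connect phase timed out')

# httpbin delays its response for 10 seconds, so the connection succeeds
# but the 3-second read limit fires
try:
    requests.get('https://httpbin.org/delay/10', timeout=(5, 3))
except ReadTimeout:
    print('read phase timed out')
```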
Basic Syntax for Separate Timeouts
The Requests library accepts a tuple for the `timeout` parameter, where the first value is the connection timeout and the second is the read timeout:
```python
import requests

# Basic syntax: timeout=(connect_timeout, read_timeout)
response = requests.get('https://example.com', timeout=(5, 30))
```
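A single number applies the same limit to both phases, so the tuple form is the only way to tune them independently. Also note that the read timeout limits the wait between bytes received from the server, not the total download time:

```python
import requests

# One value: 10 seconds to connect AND 10 seconds between bytes read
response = requests.get('https://example.com', timeout=10)

# Tuple: 5 seconds to connect, 30 seconds between bytes read
response = requests.get('https://example.com', timeout=(5, 30))
```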
Practical Examples
Simple GET Request with Separate Timeouts
```python
import requests
from requests.exceptions import ConnectTimeout, ReadTimeout, RequestException

def fetch_with_timeouts(url):
    try:
        # 3 seconds to connect, 15 seconds to read
        response = requests.get(url, timeout=(3, 15))
        return response
    except ConnectTimeout:
        print("Connection timeout: Server took too long to establish connection")
    except ReadTimeout:
        print("Read timeout: Server took too long to send data")
    except RequestException as e:
        print(f"Request failed: {e}")
    return None

# Usage
url = "https://httpbin.org/delay/2"
response = fetch_with_timeouts(url)
if response:
    print(f"Status: {response.status_code}")
```
Session-based Requests with Timeouts
For multiple requests, a session is more efficient because it reuses connections. Note, however, that `requests.Session` does not honor a `timeout` attribute; a default timeout has to be applied through a custom transport adapter (or passed explicitly on every call):
```python
import requests
from requests.adapters import HTTPAdapter
from requests.exceptions import ConnectTimeout, ReadTimeout

class TimeoutHTTPAdapter(HTTPAdapter):
    """Transport adapter that applies a default timeout to every request."""

    def __init__(self, *args, timeout=(5, 30), **kwargs):
        self.timeout = timeout
        super().__init__(*args, **kwargs)

    def send(self, request, **kwargs):
        # Apply the default only when the caller did not pass a timeout
        if kwargs.get("timeout") is None:
            kwargs["timeout"] = self.timeout
        return super().send(request, **kwargs)

def create_session_with_timeouts(connect_timeout=5, read_timeout=30):
    session = requests.Session()
    adapter = TimeoutHTTPAdapter(timeout=(connect_timeout, read_timeout))
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

# Usage
session = create_session_with_timeouts(connect_timeout=3, read_timeout=20)
urls = [
    'https://httpbin.org/delay/1',
    'https://httpbin.org/delay/2',
    'https://httpbin.org/delay/3'
]
for url in urls:
    try:
        # The adapter default applies; pass timeout=... here to override it
        response = session.get(url)
        print(f"URL: {url}, Status: {response.status_code}")
    except (ConnectTimeout, ReadTimeout) as e:
        print(f"Timeout error for {url}: {e}")
```
Advanced Timeout Configuration
For more complex scenarios, you can create a wrapper class that handles different timeout strategies:
```python
import requests
import time
from typing import Optional
from requests.exceptions import ConnectTimeout, ReadTimeout, RequestException

class TimeoutRequestsWrapper:
    def __init__(self, default_connect_timeout=5, default_read_timeout=30):
        self.default_connect_timeout = default_connect_timeout
        self.default_read_timeout = default_read_timeout
        self.session = requests.Session()

    def get(self, url: str,
            connect_timeout: Optional[float] = None,
            read_timeout: Optional[float] = None,
            **kwargs) -> Optional[requests.Response]:
        # Use custom timeouts or fall back to defaults; compare against
        # None so an explicit 0 is not silently replaced
        conn_timeout = (connect_timeout if connect_timeout is not None
                        else self.default_connect_timeout)
        read_timeout_val = (read_timeout if read_timeout is not None
                            else self.default_read_timeout)
        timeout_tuple = (conn_timeout, read_timeout_val)
        try:
            start_time = time.time()
            response = self.session.get(url, timeout=timeout_tuple, **kwargs)
            elapsed_time = time.time() - start_time
            print(f"Request completed in {elapsed_time:.2f} seconds")
            return response
        except ConnectTimeout:
            print(f"Connection timeout ({conn_timeout}s) reached for {url}")
        except ReadTimeout:
            print(f"Read timeout ({read_timeout_val}s) reached for {url}")
        except RequestException as e:
            print(f"Request failed: {e}")
        return None

# Usage example
wrapper = TimeoutRequestsWrapper(default_connect_timeout=2, default_read_timeout=10)

# Use default timeouts
response1 = wrapper.get('https://httpbin.org/delay/1')

# Override timeouts for a specific request
response2 = wrapper.get(
    'https://httpbin.org/delay/5',
    connect_timeout=1,
    read_timeout=20
)
```
JavaScript/Node.js Equivalent
For JavaScript developers using libraries like Axios, similar timeout control is available:
```javascript
const axios = require('axios');
const http = require('http');
const https = require('https');

// Axios's `timeout` option covers the whole request: the timer runs from
// when the request is sent until the response arrives, so it is not a
// pure read timeout.
const client = axios.create({
  timeout: 30000,
});

// For more granular control, use custom agents. The Agent `timeout`
// option sets a socket-level idle timeout, which also catches
// connections that take too long to establish.
const httpAgent = new http.Agent({
  timeout: 5000,
});
const httpsAgent = new https.Agent({
  timeout: 5000,
});

const customClient = axios.create({
  timeout: 30000, // overall request timeout
  httpAgent: httpAgent,
  httpsAgent: httpsAgent
});
async function fetchWithTimeouts(url) {
  try {
    const response = await customClient.get(url);
    return response;
  } catch (error) {
    if (error.code === 'ECONNABORTED') {
      console.log('Request timeout');
    } else if (error.code === 'ECONNREFUSED') {
      console.log('Connection refused');
    } else {
      console.log('Request failed:', error.message);
    }
    return null;
  }
}
```
Best Practices and Recommendations
Choosing Appropriate Timeout Values
- Connection timeout: usually shorter (3-10 seconds)
  - Should reflect expected network latency and server availability
  - Connections that need longer than this usually point to server or network problems
- Read timeout: can be longer (15-60 seconds)
  - Depends on how long the server needs to generate the response
  - Consider the complexity of the requested resource

One way to encode guidelines like these in code is shown in the sketch below.
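This sketch is purely illustrative; the profile names and the specific values are assumptions, not a standard API:

```python
import requests

# Hypothetical timeout profiles; names and values are illustrative only
TIMEOUT_PROFILES = {
    'api': (3, 15),       # fast endpoints: fail fast in both phases
    'download': (5, 60),  # large responses: allow long reads
    'scraping': (5, 30),  # general-purpose default
}

def timeout_for(profile: str) -> tuple:
    # Fall back to the general-purpose default for unknown profiles
    return TIMEOUT_PROFILES.get(profile, (5, 30))

response = requests.get('https://example.com', timeout=timeout_for('api'))
```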
Environment-Specific Configuration
```python
import os
import requests

class EnvironmentAwareTimeouts:
    def __init__(self):
        # Different timeouts for different environments
        env = os.getenv('ENVIRONMENT', 'development')
        if env == 'production':
            self.connect_timeout = 5
            self.read_timeout = 30
        elif env == 'testing':
            self.connect_timeout = 2
            self.read_timeout = 10
        else:  # development
            self.connect_timeout = 10
            self.read_timeout = 60

    def make_request(self, url, **kwargs):
        timeout = (self.connect_timeout, self.read_timeout)
        return requests.get(url, timeout=timeout, **kwargs)

# Usage
timeout_manager = EnvironmentAwareTimeouts()
response = timeout_manager.make_request('https://api.example.com/data')
```
Error Handling and Retry Logic
When working with timeouts, implementing proper retry logic is essential:
```python
import requests
import time
from typing import Optional
from requests.exceptions import ConnectTimeout, ReadTimeout, RequestException

def request_with_retry(url: str,
                       max_retries: int = 3,
                       connect_timeout: float = 5,
                       read_timeout: float = 30,
                       backoff_factor: float = 1) -> Optional[requests.Response]:
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=(connect_timeout, read_timeout))
            return response
        except ConnectTimeout:
            print(f"Attempt {attempt + 1}: Connection timeout")
        except ReadTimeout:
            print(f"Attempt {attempt + 1}: Read timeout")
        except RequestException as e:
            print(f"Attempt {attempt + 1}: Request failed - {e}")
        if attempt < max_retries - 1:
            # Exponential backoff: wait longer after each failed attempt
            wait_time = backoff_factor * (2 ** attempt)
            print(f"Retrying in {wait_time} seconds...")
            time.sleep(wait_time)
    return None

# Usage
response = request_with_retry(
    'https://unreliable-api.example.com/data',
    max_retries=3,
    connect_timeout=3,
    read_timeout=15,
    backoff_factor=1.5
)
```
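Instead of a hand-rolled loop, retries with backoff can also be configured at the transport layer using urllib3's Retry class, which Requests uses internally. A minimal sketch:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=3,
    backoff_factor=1.5,  # exponential backoff between attempts
    status_forcelist=[500, 502, 503, 504],  # also retry these status codes
)
session = requests.Session()
session.mount('https://', HTTPAdapter(max_retries=retry))
session.mount('http://', HTTPAdapter(max_retries=retry))

# Timeouts still need to be passed per request (or via a custom adapter)
response = session.get('https://example.com', timeout=(3, 15))
```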
Integration with Web Scraping Workflows
When building web scraping applications, timeout configuration becomes even more critical. For complex scenarios involving dynamic content that loads after page interactions, you might need to combine Requests with browser automation tools.
```python
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List, Dict, Any
from requests.exceptions import ConnectTimeout, ReadTimeout

class WebScrapingTimeoutManager:
    def __init__(self, max_workers: int = 5):
        self.max_workers = max_workers
        self.session = requests.Session()
        # Reasonable defaults for web scraping: 5s connect, 30s read.
        # Sessions do not honor a `timeout` attribute, so the tuple is
        # stored here and passed explicitly on each request.
        self.timeout = (5, 30)
        # Add common headers to avoid detection
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })
    def scrape_urls(self, urls: List[str]) -> List[Dict[str, Any]]:
        results = []
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            # Submit all requests
            future_to_url = {
                executor.submit(self._fetch_single_url, url): url
                for url in urls
            }
            # Process completed requests
            for future in as_completed(future_to_url):
                url = future_to_url[future]
                try:
                    result = future.result()
                    results.append(result)
                except Exception as e:
                    print(f"Failed to process {url}: {e}")
                    results.append({
                        'url': url,
                        'status': 'error',
                        'error': str(e)
                    })
        return results
    def _fetch_single_url(self, url: str) -> Dict[str, Any]:
        try:
            response = self.session.get(url, timeout=self.timeout)
            return {
                'url': url,
                'status_code': response.status_code,
                'content_length': len(response.content),
                'response_time': response.elapsed.total_seconds(),
                'status': 'success'
            }
        except (ConnectTimeout, ReadTimeout) as e:
            return {
                'url': url,
                'status': 'timeout',
                'error': str(e)
            }
# Usage
scraper = WebScrapingTimeoutManager(max_workers=10)
urls = [
    'https://example1.com',
    'https://example2.com',
    'https://example3.com'
]
results = scraper.scrape_urls(urls)
for result in results:
    print(f"URL: {result['url']}, Status: {result['status']}")
```
Monitoring and Debugging Timeouts
To better understand timeout behavior in your applications, implement logging and monitoring:
```python
import requests
import logging
import time
from contextlib import contextmanager
from requests.exceptions import ConnectTimeout, ReadTimeout

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@contextmanager
def timeout_monitor(url: str):
    start_time = time.time()
    try:
        yield
    except ConnectTimeout:
        elapsed = time.time() - start_time
        logger.warning(f"Connection timeout for {url} after {elapsed:.2f}s")
        raise
    except ReadTimeout:
        elapsed = time.time() - start_time
        logger.warning(f"Read timeout for {url} after {elapsed:.2f}s")
        raise
    else:
        elapsed = time.time() - start_time
        logger.info(f"Request to {url} completed in {elapsed:.2f}s")

def monitored_request(url: str, connect_timeout: float = 5, read_timeout: float = 30):
    with timeout_monitor(url):
        return requests.get(url, timeout=(connect_timeout, read_timeout))

# Usage
response = monitored_request('https://httpbin.org/delay/2')
```
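Because Requests delegates its networking to urllib3, turning on that library's DEBUG logging is another quick way to watch connection attempts without writing custom instrumentation:

```python
import logging
import requests

# urllib3 (the transport layer under Requests) logs connection activity
# at DEBUG level
logging.basicConfig(level=logging.DEBUG)
logging.getLogger('urllib3').setLevel(logging.DEBUG)

response = requests.get('https://httpbin.org/delay/1', timeout=(5, 30))
# Expect output along the lines of:
#   DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): httpbin.org:443
```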
Command Line Testing
You can test timeout behavior using curl to understand how different timeouts affect your requests:
```bash
# Test connection timeout (time to establish connection)
curl --connect-timeout 5 https://example.com

# Test max-time (overall request timeout including reading)
curl --max-time 30 https://example.com

# Combine both for comprehensive timeout control
curl --connect-timeout 5 --max-time 30 https://example.com
```
Conclusion
Setting connection and read timeouts separately provides fine-grained control over your HTTP requests and is essential for building robust web scraping applications. By understanding the difference between these timeout types and implementing appropriate error handling, you can create more reliable and predictable network operations.
Remember to choose timeout values based on your specific use case, network conditions, and performance requirements. For applications that need to handle complex authentication flows or dynamic content, consider combining timeout strategies with appropriate retry logic and monitoring to ensure optimal performance and reliability.