How do you handle API timeouts and connection issues?
API timeouts and connection issues are inevitable challenges in web scraping and API integration. Network instability, server overload, and temporary outages can cause requests to fail or hang indefinitely. Implementing robust error handling and timeout management is crucial for building reliable scraping applications that can gracefully recover from these issues.
Understanding API Timeouts
API timeouts occur when a request takes longer than the specified time limit to complete. There are typically two types of timeouts to consider:
- Connection timeout: Time limit for establishing a connection to the server
- Read timeout: Time limit for receiving a response after the connection is established
Basic Timeout Configuration
Python with Requests
import requests
from requests.exceptions import Timeout, ConnectionError, RequestException

def make_request_with_timeout(url, timeout=(5, 30)):
    """Fetch a URL with separate connection and read timeouts."""
    try:
        response = requests.get(
            url,
            timeout=timeout,  # (connection_timeout, read_timeout)
            headers={'User-Agent': 'Your-App/1.0'}
        )
        response.raise_for_status()
        return response
    except Timeout:
        print(f"Request timed out for {url}")
        raise
    except ConnectionError:
        print(f"Connection error for {url}")
        raise
    except RequestException as e:
        print(f"Request failed: {e}")
        raise
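A quick usage sketch (the URL is a placeholder):

response = make_request_with_timeout("https://api.example.com/data")
data = response.json()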
JavaScript with Fetch
async function makeRequestWithTimeout(url, timeoutMs = 30000) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), timeoutMs);

  try {
    const response = await fetch(url, {
      signal: controller.signal,
      headers: {
        'User-Agent': 'Your-App/1.0'
      }
    });
    clearTimeout(timeout);

    if (!response.ok) {
      throw new Error(`HTTP ${response.status}: ${response.statusText}`);
    }
    return response;
  } catch (error) {
    clearTimeout(timeout);
    if (error.name === 'AbortError') {
      throw new Error('Request timed out');
    }
    throw error;
  }
}
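Note that newer runtimes also provide AbortSignal.timeout(timeoutMs) as a built-in shorthand for this controller-plus-setTimeout pattern (though the resulting error is named TimeoutError rather than AbortError).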
Implementing Retry Logic
Retry logic is essential for handling temporary failures. The key is to distinguish between retryable and non-retryable errors.
Python Retry Implementation
import time
import random
from functools import wraps
from requests.exceptions import Timeout, ConnectionError, RequestException

def retry_with_backoff(max_retries=3, base_delay=1, max_delay=60, backoff_factor=2):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            retries = 0
            while retries < max_retries:
                try:
                    return func(*args, **kwargs)
                except (Timeout, ConnectionError):
                    retries += 1
                    if retries >= max_retries:
                        print(f"Max retries ({max_retries}) exceeded")
                        raise
                    # Calculate delay with exponential backoff and jitter
                    delay = min(base_delay * (backoff_factor ** (retries - 1)), max_delay)
                    jitter = random.uniform(0.1, 0.3) * delay
                    total_delay = delay + jitter
                    print(f"Retry {retries}/{max_retries} after {total_delay:.2f}s")
                    time.sleep(total_delay)
                except RequestException as e:
                    # Check if the error is retryable; guard against e.response being None
                    if e.response is not None and e.response.status_code in (429, 502, 503, 504):
                        retries += 1
                        if retries >= max_retries:
                            raise
                        # Handle rate limiting with the server-suggested delay
                        if e.response.status_code == 429:
                            retry_after = e.response.headers.get('Retry-After', 60)
                            time.sleep(int(retry_after))
                        else:
                            delay = min(base_delay * (backoff_factor ** (retries - 1)), max_delay)
                            time.sleep(delay)
                    else:
                        # Non-retryable error
                        raise
            return None
        return wrapper
    return decorator

@retry_with_backoff(max_retries=3)
def fetch_api_data(url):
    return make_request_with_timeout(url)
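One caveat: per the HTTP spec, Retry-After may be either a number of seconds or an HTTP date, so the bare int(retry_after) above can raise on date-formatted values. A minimal helper that handles both forms (the name parse_retry_after is illustrative) could replace that call:

from email.utils import parsedate_to_datetime
from datetime import datetime, timezone

def parse_retry_after(value, default=60):
    """Parse a Retry-After header that may be delta-seconds or an HTTP date."""
    # Delta-seconds form, e.g. "120"
    try:
        return max(0, int(value))
    except (TypeError, ValueError):
        pass
    # HTTP-date form, e.g. "Wed, 21 Oct 2025 07:28:00 GMT"
    try:
        retry_at = parsedate_to_datetime(value)
        return max(0.0, (retry_at - datetime.now(timezone.utc)).total_seconds())
    except (TypeError, ValueError):
        return default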
JavaScript Retry Implementation
async function retryWithBackoff(
  fn,
  maxRetries = 3,
  baseDelay = 1000,
  maxDelay = 60000,
  backoffFactor = 2
) {
  let retries = 0;
  while (retries < maxRetries) {
    try {
      return await fn();
    } catch (error) {
      retries++;

      // Check if error is retryable
      const isRetryable =
        error.name === 'AbortError' ||
        error.message.includes('network') ||
        error.message.includes('timeout') ||
        (error.status && [429, 502, 503, 504].includes(error.status));

      if (!isRetryable || retries >= maxRetries) {
        throw error;
      }

      // Calculate delay with exponential backoff and jitter
      const delay = Math.min(
        baseDelay * Math.pow(backoffFactor, retries - 1),
        maxDelay
      );
      const jitter = Math.random() * 0.3 * delay;
      const totalDelay = delay + jitter;

      console.log(`Retry ${retries}/${maxRetries} after ${totalDelay}ms`);
      await new Promise(resolve => setTimeout(resolve, totalDelay));
    }
  }
}

// Usage
async function fetchApiData(url) {
  return await retryWithBackoff(
    () => makeRequestWithTimeout(url),
    3,     // max retries
    1000,  // base delay (1 second)
    30000  // max delay (30 seconds)
  );
}
Circuit Breaker Pattern
The circuit breaker pattern prevents cascading failures by temporarily stopping requests to a failing service.
Python Circuit Breaker
import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                print("Circuit breaker half-open: testing service")
            else:
                raise Exception("Circuit breaker is OPEN")
        try:
            result = func(*args, **kwargs)
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failure_count = 0
        if self.state == CircuitState.HALF_OPEN:
            self.state = CircuitState.CLOSED
            print("Circuit breaker closed: service recovered")

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
            print(f"Circuit breaker opened after {self.failure_count} failures")

# Usage
breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=60)

def protected_api_call(url):
    return breaker.call(fetch_api_data, url)
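When the breaker is open, calls fail immediately instead of hanging on a dead service. A short sketch of how this might look in a scraping loop (the URLs and fallback behavior are illustrative):

urls = [f"https://api.example.com/items/{i}" for i in range(100)]
for url in urls:
    try:
        data = protected_api_call(url).json()
    except Exception as e:
        # While the breaker is open, this fails fast without touching the network
        print(f"Skipping {url}: {e}")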
Advanced Connection Management
Connection Pooling and Session Management
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

class RobustAPIClient:
    def __init__(self, base_url, max_retries=3, pool_connections=10, pool_maxsize=10):
        self.base_url = base_url
        self.session = requests.Session()

        # Configure retry strategy
        retry_strategy = Retry(
            total=max_retries,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["HEAD", "GET", "OPTIONS"],  # method_whitelist in urllib3 < 1.26
            backoff_factor=1,
            raise_on_status=False
        )

        # Configure HTTP adapter with connection pooling
        adapter = HTTPAdapter(
            max_retries=retry_strategy,
            pool_connections=pool_connections,
            pool_maxsize=pool_maxsize
        )
        self.session.mount("http://", adapter)
        self.session.mount("https://", adapter)

        # Set default headers
        self.session.headers.update({
            'User-Agent': 'RobustAPIClient/1.0',
            'Accept': 'application/json',
            'Connection': 'keep-alive'
        })

    def get(self, endpoint, **kwargs):
        url = f"{self.base_url.rstrip('/')}/{endpoint.lstrip('/')}"
        # Set default timeout if not provided
        kwargs.setdefault('timeout', (5, 30))
        try:
            response = self.session.get(url, **kwargs)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}")
            raise

    def close(self):
        self.session.close()

# Usage
client = RobustAPIClient("https://api.example.com")
try:
    response = client.get("/data")
    data = response.json()
finally:
    client.close()
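If you prefer, the class can also support the with statement so the session is always closed; a small, optional addition to the RobustAPIClient class above:

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()

# Usage: the session closes automatically, even on errors
with RobustAPIClient("https://api.example.com") as client:
    data = client.get("/data").json()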
JavaScript with Axios and Interceptors
import axios from 'axios';

class RobustAPIClient {
  constructor(baseURL, options = {}) {
    this.client = axios.create({
      baseURL,
      timeout: options.timeout || 30000,
      headers: {
        'User-Agent': 'RobustAPIClient/1.0',
        'Accept': 'application/json'
      }
    });
    this.maxRetries = options.maxRetries || 3;
    this.setupInterceptors();
  }

  setupInterceptors() {
    // Request interceptor for debugging
    this.client.interceptors.request.use(
      config => {
        console.log(`Making request to: ${config.url}`);
        return config;
      },
      error => Promise.reject(error)
    );

    // Response interceptor for error handling
    this.client.interceptors.response.use(
      response => response,
      async error => {
        const config = error.config;

        // Initialize retry count
        config.__retryCount = config.__retryCount || 0;

        // Check if we should retry
        const shouldRetry =
          config.__retryCount < this.maxRetries &&
          (error.code === 'ECONNABORTED' ||
            error.response?.status >= 500 ||
            error.response?.status === 429);

        if (shouldRetry) {
          config.__retryCount++;

          // Calculate delay
          const delay = Math.pow(2, config.__retryCount) * 1000;
          console.log(`Retrying request (${config.__retryCount}/${this.maxRetries}) after ${delay}ms`);

          await new Promise(resolve => setTimeout(resolve, delay));
          return this.client(config);
        }
        return Promise.reject(error);
      }
    );
  }

  async get(endpoint, config = {}) {
    try {
      const response = await this.client.get(endpoint, config);
      return response.data;
    } catch (error) {
      console.error(`API request failed: ${error.message}`);
      throw error;
    }
  }
}
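Usage mirrors the Python client: instantiate it with a base URL and call get(); transient failures are retried transparently by the response interceptor before the error ever reaches your code.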
Monitoring and Logging
Effective monitoring helps identify patterns in API failures and optimize your retry strategies.
Python Logging Implementation
import logging
import time
from contextlib import contextmanager

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

@contextmanager
def api_request_context(url, method='GET'):
    start_time = time.time()
    logger.info(f"Starting {method} request to {url}")
    try:
        yield
        duration = time.time() - start_time
        logger.info(f"Request completed successfully in {duration:.2f}s")
    except Exception as e:
        duration = time.time() - start_time
        logger.error(f"Request failed after {duration:.2f}s: {str(e)}")
        raise

def monitored_api_call(url):
    with api_request_context(url):
        return fetch_api_data(url)
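To actually spot failure patterns, it helps to aggregate outcomes rather than read individual log lines. A minimal sketch using collections.Counter (the per-host bucketing scheme is illustrative):

from collections import Counter
from urllib.parse import urlparse

outcome_counts = Counter()

def record_outcome(url, error=None):
    # Bucket results per host and error type, e.g. "api.example.com:Timeout"
    host = urlparse(url).netloc
    label = type(error).__name__ if error else "success"
    outcome_counts[f"{host}:{label}"] += 1

# Periodically inspect the counters to tune timeouts and retry budgets
print(outcome_counts.most_common(5))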
Command Line Testing Tools
Using cURL with Timeout Options
# Set connection timeout (5s) and max time (30s)
curl --connect-timeout 5 --max-time 30 https://api.example.com/data
# Retry failed requests with delays
curl --retry 3 --retry-delay 2 --retry-max-time 60 https://api.example.com/data
# Show detailed timing information
curl -w "@curl-format.txt" https://api.example.com/data
Create a curl-format.txt file for detailed timing:
time_namelookup: %{time_namelookup}s\n
time_connect: %{time_connect}s\n
time_appconnect: %{time_appconnect}s\n
time_pretransfer: %{time_pretransfer}s\n
time_redirect: %{time_redirect}s\n
time_starttransfer: %{time_starttransfer}s\n
----------\n
time_total: %{time_total}s\n
Using HTTPie for Testing
# Test with timeout
http --timeout=30 GET https://api.example.com/data
# Test with retry logic using shell scripting
for i in {1..3}; do
  http GET https://api.example.com/data && break
  echo "Attempt $i failed, retrying..."
  sleep $((2**i))
done
Best Practices Summary
- Set appropriate timeouts: Use both connection and read timeouts
- Implement exponential backoff: Reduce server load during retry attempts
- Add jitter to delays: Prevent thundering herd problems
- Use circuit breakers: Prevent cascading failures
- Monitor and log: Track failure patterns and performance metrics
- Handle rate limiting: Respect Retry-After headers and implement proper delays
- Distinguish error types: Only retry transient failures
- Use connection pooling: Improve performance and resource utilization
When dealing with complex scraping scenarios, you might also need to consider how to handle timeouts in Puppeteer for browser-based scraping, or implement proper error handling strategies for headless browser automation.
By implementing these strategies, you can build resilient API clients that gracefully handle network issues and provide reliable data extraction capabilities for your web scraping applications.