HTTP request timeouts are a frequent challenge in web scraping, typically caused by server overload, network congestion, or slow responses. This guide covers practical strategies and code examples for handling timeouts effectively across different programming languages.
Understanding Timeout Types
Connection Timeout: Time to establish a connection to the server
Read Timeout: Time to receive response data after connection is established
Total Timeout: Maximum time for the entire request-response cycle
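To make these three knobs concrete, here is a minimal sketch using aiohttp, whose ClientTimeout exposes all of them directly (the values are illustrative; the requests-based examples later in this guide use a (connect, read) tuple instead):

import asyncio
import aiohttp

async def fetch(url):
    # Illustrative values; tune them per target site.
    timeout = aiohttp.ClientTimeout(
        total=30,      # total timeout: the whole request-response cycle
        connect=10,    # connection timeout: acquiring/establishing the connection
        sock_read=20,  # read timeout: waiting for response data
    )
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(url) as response:
            return await response.text()

# asyncio.run(fetch("https://example.com"))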
Core Strategies
1. Intelligent Retry Mechanisms
Implement smart retry logic with exponential backoff to handle temporary failures without overwhelming servers.
2. Timeout Configuration
Set appropriate timeout values based on your target websites and network conditions.
3. Circuit Breaker Pattern
Temporarily stop requests to failing endpoints to prevent resource waste.
4. Request Optimization
Use techniques like proxy rotation and User-Agent randomization to reduce blocking.
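The first three strategies are demonstrated in the code samples below. The fourth is not, so here is a brief sketch of User-Agent and proxy rotation with requests; the proxy URLs and User-Agent strings are placeholders, not working values:

import random
import requests

# Placeholder pools -- substitute real proxies and User-Agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

def fetch_with_rotation(url, timeout=(10, 30)):
    """Pick a random User-Agent and proxy for each request."""
    proxy = random.choice(PROXIES)
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(
        url,
        headers=headers,
        proxies={"http": proxy, "https": proxy},
        timeout=timeout,
    )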
Python Implementation
Basic Timeout Handling with Requests
import requests
import time
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
def create_robust_session():
    """Create a session with a comprehensive retry strategy."""
    retry_strategy = Retry(
        total=5,           # Total number of retries
        backoff_factor=2,  # Exponential backoff between retries (delays double each time)
        status_forcelist=[429, 500, 502, 503, 504, 520, 521, 522, 524],
        allowed_methods=["HEAD", "GET", "POST", "PUT", "DELETE", "OPTIONS", "TRACE"]
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
def scrape_with_timeout(url, timeout=(10, 30)):
    """
    Scrape URL with connection and read timeouts.
    timeout tuple: (connection_timeout, read_timeout)
    """
    session = create_robust_session()
    try:
        response = session.get(
            url,
            timeout=timeout,
            headers={'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)'}
        )
        response.raise_for_status()
        return response.text
    except requests.exceptions.ConnectTimeout:
        print(f"Connection timeout for {url}")
        return None
    except requests.exceptions.ReadTimeout:
        print(f"Read timeout for {url}")
        return None
    except requests.exceptions.Timeout:
        print(f"General timeout for {url}")
        return None
    except requests.exceptions.RequestException as e:
        print(f"Request failed for {url}: {e}")
        return None
# Usage example
url = "https://example.com/slow-endpoint"
content = scrape_with_timeout(url, timeout=(5, 15))
if content:
    print("Successfully scraped content")
Advanced Circuit Breaker Implementation
from datetime import datetime, timedelta
from collections import defaultdict
class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout_duration=300):
        self.failure_threshold = failure_threshold
        self.timeout_duration = timeout_duration
        self.failure_counts = defaultdict(int)
        self.last_failure_time = defaultdict(lambda: None)

    def can_request(self, domain):
        """Check if requests to domain are allowed"""
        if self.failure_counts[domain] < self.failure_threshold:
            return True
        if self.last_failure_time[domain]:
            time_since_failure = datetime.now() - self.last_failure_time[domain]
            if time_since_failure > timedelta(seconds=self.timeout_duration):
                # Reset circuit breaker
                self.failure_counts[domain] = 0
                self.last_failure_time[domain] = None
                return True
        return False

    def record_failure(self, domain):
        """Record a failure for the domain"""
        self.failure_counts[domain] += 1
        self.last_failure_time[domain] = datetime.now()

    def record_success(self, domain):
        """Reset failure count on success"""
        self.failure_counts[domain] = 0
        self.last_failure_time[domain] = None
# Usage with circuit breaker
circuit_breaker = CircuitBreaker()
def scrape_with_circuit_breaker(url):
    from urllib.parse import urlparse
    domain = urlparse(url).netloc

    if not circuit_breaker.can_request(domain):
        print(f"Circuit breaker OPEN for {domain}")
        return None

    try:
        response = requests.get(url, timeout=(10, 30))
        response.raise_for_status()
        circuit_breaker.record_success(domain)
        return response.text
    except requests.exceptions.Timeout:
        circuit_breaker.record_failure(domain)
        print(f"Timeout recorded for {domain}")
        return None
JavaScript/Node.js Implementation
Axios with Custom Retry Logic
const axios = require('axios');
class TimeoutHandler {
  constructor(maxRetries = 3, baseDelay = 1000) {
    this.maxRetries = maxRetries;
    this.baseDelay = baseDelay;
  }

  async fetchWithRetry(url, options = {}) {
    const config = {
      timeout: 10000, // 10 second timeout
      headers: {
        'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)'
      },
      ...options
    };

    for (let attempt = 0; attempt <= this.maxRetries; attempt++) {
      try {
        const response = await axios.get(url, config);
        return response.data;
      } catch (error) {
        const isTimeout = error.code === 'ECONNABORTED' ||
                          error.code === 'ETIMEDOUT';

        if (isTimeout && attempt < this.maxRetries) {
          const delay = this.baseDelay * Math.pow(2, attempt);
          console.log(`Timeout on attempt ${attempt + 1}. Retrying in ${delay}ms...`);
          await this.sleep(delay);
          continue;
        }

        if (isTimeout) {
          throw new Error(`Request timed out after ${this.maxRetries + 1} attempts`);
        }
        throw error;
      }
    }
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}
// Usage
const handler = new TimeoutHandler(5, 1000);
async function scrapeUrl(url) {
  try {
    const content = await handler.fetchWithRetry(url, {
      timeout: 15000 // 15 second timeout
    });
    console.log('Successfully scraped content');
    return content;
  } catch (error) {
    console.error('Failed to scrape:', error.message);
    return null;
  }
}
// Example usage
scrapeUrl('https://example.com/api/data');
Fetch API with AbortController
class FetchTimeout {
  static async fetchWithTimeout(url, options = {}, timeoutMs = 10000) {
    const controller = new AbortController();
    const timeoutId = setTimeout(() => controller.abort(), timeoutMs);

    try {
      const response = await fetch(url, {
        ...options,
        signal: controller.signal
      });
      clearTimeout(timeoutId);

      if (!response.ok) {
        throw new Error(`HTTP ${response.status}: ${response.statusText}`);
      }
      return await response.text();
    } catch (error) {
      clearTimeout(timeoutId);
      if (error.name === 'AbortError') {
        throw new Error(`Request timed out after ${timeoutMs}ms`);
      }
      throw error;
    }
  }
}
// Usage
async function scrapeWithAbort(url) {
  try {
    const content = await FetchTimeout.fetchWithTimeout(url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)'
      }
    }, 8000); // 8 second timeout
    return content;
  } catch (error) {
    console.error('Scraping failed:', error.message);
    return null;
  }
}
Java Implementation
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutScraper {
    private final HttpClient client;

    public TimeoutScraper() {
        this.client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .build();
    }

    public CompletableFuture<String> scrapeAsync(String url) {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .timeout(Duration.ofSeconds(30))
                .header("User-Agent", "Mozilla/5.0 (compatible; WebScraper/1.0)")
                .build();

        return client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                .thenApply(HttpResponse::body)
                .exceptionally(throwable -> {
                    System.err.println("Request failed: " + throwable.getMessage());
                    return null;
                });
    }

    public String scrapeWithRetry(String url, int maxRetries) throws InterruptedException {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return scrapeAsync(url).get(35, TimeUnit.SECONDS);
            } catch (TimeoutException | ExecutionException e) {
                if (attempt == maxRetries) {
                    System.err.println("Max retries exceeded for: " + url);
                    return null;
                }
                long delay = (long) Math.pow(2, attempt) * 1000;
                System.out.println("Retrying in " + delay + "ms...");
                Thread.sleep(delay);
            }
        }
        return null;
    }
}
cURL Advanced Usage
#!/bin/bash
# Function to scrape with comprehensive timeout handling
scrape_with_timeout() {
    local url="$1"
    local max_attempts=3
    local base_delay=2

    for ((attempt=1; attempt<=max_attempts; attempt++)); do
        echo "Attempt $attempt for $url"

        # --max-time: total timeout, --connect-timeout: connection timeout,
        # --retry 0: disable curl's built-in retry, --fail: fail on HTTP errors,
        # --silent --show-error: quiet mode but still report errors
        if curl \
            --max-time 30 \
            --connect-timeout 10 \
            --retry 0 \
            --fail \
            --silent \
            --show-error \
            --user-agent "Mozilla/5.0 (compatible; WebScraper/1.0)" \
            "$url" -o response.html; then
            echo "Successfully scraped $url"
            return 0
        fi

        if [ $attempt -lt $max_attempts ]; then
            local delay=$((base_delay ** attempt))
            echo "Request failed. Retrying in ${delay}s..."
            sleep $delay
        fi
    done

    echo "Failed to scrape $url after $max_attempts attempts"
    return 1
}

# Usage
scrape_with_timeout "https://example.com/api/data"
Best Practices
1. Timeout Configuration Guidelines
- Connection timeout: 5-15 seconds for most websites
- Read timeout: 15-60 seconds depending on expected response size
- Total timeout: Should accommodate both connection and read timeouts
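One way to follow these guidelines is to keep per-site timeout profiles with a conservative default; a minimal sketch (the hostnames and values are assumptions, not measurements):

# Hypothetical per-site timeout profiles: (connection_timeout, read_timeout) in seconds.
TIMEOUT_PROFILES = {
    "fast-api.example.com": (5, 15),
    "slow-reports.example.com": (15, 60),
}
DEFAULT_TIMEOUT = (10, 30)

def timeout_for(url):
    from urllib.parse import urlparse
    return TIMEOUT_PROFILES.get(urlparse(url).netloc, DEFAULT_TIMEOUT)

# requests.get(url, timeout=timeout_for(url))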
2. Retry Strategy Recommendations
- Use exponential backoff starting from 1-2 seconds
- Limit total retry attempts to 3-5 to avoid excessive delays
- Include jitter to prevent thundering herd effects
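None of the earlier examples add jitter, so here is a minimal sketch of exponential backoff with full jitter (delay values are illustrative):

import random
import time
import requests

def fetch_with_jittered_backoff(url, max_retries=4, base_delay=1.0):
    """Retry with exponential backoff plus random jitter."""
    for attempt in range(max_retries + 1):
        try:
            response = requests.get(url, timeout=(10, 30))
            response.raise_for_status()
            return response.text
        except requests.exceptions.RequestException:
            if attempt == max_retries:
                raise
            # Full jitter: sleep a random amount up to the exponential cap.
            delay = random.uniform(0, base_delay * (2 ** attempt))
            time.sleep(delay)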
3. Monitoring and Logging
import logging
# Configure logging for timeout tracking
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def log_timeout_metrics(url, attempt, timeout_type, duration):
    logger.info(f"Timeout metrics - URL: {url}, Attempt: {attempt}, "
                f"Type: {timeout_type}, Duration: {duration}s")
4. Resource Management
- Use connection pooling to reduce connection overhead (see the sketch after this list)
- Implement proper session management
- Set reasonable limits on concurrent requests
- Monitor memory usage for large-scale scraping
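As a rough sketch of the first three points, the following combines a shared pooled session with a bounded thread pool (pool sizes and worker counts are illustrative):

from concurrent.futures import ThreadPoolExecutor
import requests
from requests.adapters import HTTPAdapter

# One shared session with an explicit connection pool.
session = requests.Session()
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=10)
session.mount("http://", adapter)
session.mount("https://", adapter)

def fetch(url):
    return session.get(url, timeout=(10, 30)).text

def scrape_all(urls, max_workers=5):
    """Bound concurrency so slow or timing-out hosts don't exhaust resources."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, urls))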
Common Pitfalls to Avoid
- Setting timeouts too low: May cause unnecessary failures
- Infinite retries: Can lead to resource exhaustion
- Ignoring different timeout types: Connection vs read timeouts serve different purposes
- Not implementing circuit breakers: Can overwhelm failing servers
- Uniform timeout values: Different websites may need different timeout configurations
Remember to always respect robots.txt, implement reasonable delays between requests, and follow website terms of service when scraping data.