# How do I set a timeout for requests to prevent hanging?
Setting proper timeouts for HTTP requests is crucial for building robust web scraping applications. Without timeouts, your requests can hang indefinitely, causing your application to freeze or consume excessive resources. This guide covers how to implement timeouts across different programming languages and libraries.
## Understanding Request Timeouts

A timeout defines the maximum amount of time your application will wait for a response before giving up. There are typically two types of timeouts:

- **Connection timeout**: the time allowed to establish a connection to the server
- **Read timeout**: the time to wait for data after the connection is established
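To make the distinction concrete, here is a minimal sketch at the raw socket level using Python's standard `socket` module (the host and values are illustrative): the connect call is governed by the connection timeout, while each subsequent read is governed by the read timeout. HTTP libraries expose these same two knobs, as the examples below show.

```python
import socket

sock = None
try:
    # Connection timeout: the TCP handshake must complete within 5 seconds
    sock = socket.create_connection(("example.com", 80), timeout=5)
    # Read timeout: each recv() must return data within 10 seconds
    sock.settimeout(10)
    sock.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
    print(sock.recv(4096)[:80])
except socket.timeout:
    print("Socket operation timed out")
finally:
    if sock is not None:
        sock.close()
```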
## Python Requests Library

The Python `requests` library provides several ways to set timeouts.

### Basic Timeout
```python
import requests
from requests.exceptions import Timeout, RequestException

try:
    # Set timeout to 10 seconds for both connection and read
    response = requests.get('https://example.com', timeout=10)
    print(response.status_code)
except Timeout:
    print("Request timed out")
except RequestException as e:
    print(f"Request failed: {e}")
```
### Separate Connection and Read Timeouts
```python
import requests

try:
    # Connection timeout: 5 seconds, read timeout: 10 seconds
    response = requests.get(
        'https://example.com',
        timeout=(5, 10)
    )
    print(response.text)
except requests.exceptions.ConnectTimeout:
    print("Connection timeout occurred")
except requests.exceptions.ReadTimeout:
    print("Read timeout occurred")
except requests.exceptions.Timeout:
    print("Request timed out")
```
### Session-Level Timeouts
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Create a session; note that requests sessions have no built-in default
# timeout, so the timeout must still be passed on each request
session = requests.Session()

# Configure automatic retries for transient failures
retry_strategy = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504],
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("http://", adapter)
session.mount("https://", adapter)

try:
    response = session.get('https://example.com', timeout=15)
    print(response.status_code)
except Exception as e:
    print(f"Request failed: {e}")
finally:
    session.close()
```
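Because `requests.Session` has no built-in default timeout, forgetting the `timeout` argument on a single call silently reintroduces the hanging problem. One common workaround, sketched here as one possible approach (the `TimeoutHTTPAdapter` name is our own), is an `HTTPAdapter` subclass that injects a default whenever the caller does not supply one:

```python
import requests
from requests.adapters import HTTPAdapter

class TimeoutHTTPAdapter(HTTPAdapter):
    """HTTPAdapter that applies a default timeout when none is given."""

    def __init__(self, *args, timeout=15, **kwargs):
        self.timeout = timeout
        super().__init__(*args, **kwargs)

    def send(self, request, **kwargs):
        # Inject the default only when the caller did not set one explicitly
        if kwargs.get("timeout") is None:
            kwargs["timeout"] = self.timeout
        return super().send(request, **kwargs)

session = requests.Session()
adapter = TimeoutHTTPAdapter(timeout=15)  # also accepts max_retries=...
session.mount("http://", adapter)
session.mount("https://", adapter)
```

Mounted this way, every request through the session gets the 15-second default unless a call passes its own `timeout`.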
## JavaScript Fetch API

Modern JavaScript provides timeout functionality through `AbortController`.

### Basic Fetch with Timeout
```javascript
async function fetchWithTimeout(url, timeout = 10000) {
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), timeout);

  try {
    const response = await fetch(url, {
      signal: controller.signal,
      headers: {
        'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)'
      }
    });
    clearTimeout(timeoutId);

    if (!response.ok) {
      throw new Error(`HTTP error! status: ${response.status}`);
    }
    return await response.text();
  } catch (error) {
    if (error.name === 'AbortError') {
      throw new Error('Request timed out');
    }
    throw error;
  }
}

// Usage
fetchWithTimeout('https://example.com', 5000)
  .then(data => console.log(data))
  .catch(error => console.error('Error:', error.message));
```
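Note that newer runtimes (Node.js 17.3+ and current browsers) also provide `AbortSignal.timeout(ms)`, which lets you pass `signal: AbortSignal.timeout(5000)` directly to `fetch` without managing a controller and timer yourself.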
## Node.js with Axios
```javascript
const axios = require('axios');

// Create an axios instance with a default timeout
const client = axios.create({
  timeout: 10000, // 10 seconds
  headers: {
    'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)'
  }
});

async function scrapeWithTimeout(url) {
  try {
    const response = await client.get(url, {
      timeout: 15000 // Override the default timeout for this request
    });
    return response.data;
  } catch (error) {
    if (error.code === 'ECONNABORTED') {
      console.error('Request timed out');
    } else {
      console.error('Request failed:', error.message);
    }
    throw error;
  }
}
```
## cURL Command Line

Set timeouts directly in cURL commands:
```bash
# Connection timeout: 10 seconds, max total time: 30 seconds
curl --connect-timeout 10 --max-time 30 https://example.com

# Abort stalled transfers: give up if speed stays below 1000 bytes/s for 15 seconds
curl --speed-limit 1000 --speed-time 15 --connect-timeout 10 https://example.com

# With retry on failure
curl --retry 3 --retry-delay 2 --connect-timeout 10 --max-time 30 https://example.com
```
## PHP with cURL
```php
<?php
function fetchWithTimeout($url, $timeout = 30) {
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS => 5,
        CURLOPT_TIMEOUT => $timeout,        // Total timeout in seconds
        CURLOPT_CONNECTTIMEOUT => 10,       // Connection timeout in seconds
        CURLOPT_DNS_CACHE_TIMEOUT => 120,   // Keep DNS entries cached for 120 s
        CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; WebScraper/1.0)',
        CURLOPT_SSL_VERIFYPEER => true,
        CURLOPT_SSL_VERIFYHOST => 2,
    ]);

    $response = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $error = curl_error($ch);
    curl_close($ch);

    if ($error) {
        throw new Exception("cURL error: " . $error);
    }
    if ($httpCode >= 400) {
        throw new Exception("HTTP error: " . $httpCode);
    }
    return $response;
}

try {
    $content = fetchWithTimeout('https://example.com', 20);
    echo $content;
} catch (Exception $e) {
    echo "Error: " . $e->getMessage();
}
?>
```
## Go HTTP Client
```go
package main

import (
    "context"
    "fmt"
    "io"
    "net"
    "net/http"
    "time"
)

func fetchWithTimeout(url string, timeout time.Duration) (string, error) {
    // Create a context that cancels the request after the overall timeout
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    // Create the request with the context attached
    req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
    if err != nil {
        return "", err
    }
    req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; WebScraper/1.0)")

    // Create a client with per-phase transport timeouts
    client := &http.Client{
        Timeout: timeout,
        Transport: &http.Transport{
            DialContext: (&net.Dialer{
                Timeout: 10 * time.Second, // Connection (dial) timeout
            }).DialContext,
            TLSHandshakeTimeout:   10 * time.Second,
            ResponseHeaderTimeout: 10 * time.Second,
            ExpectContinueTimeout: 1 * time.Second,
        },
    }

    resp, err := client.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return "", err
    }
    return string(body), nil
}

func main() {
    content, err := fetchWithTimeout("https://example.com", 15*time.Second)
    if err != nil {
        fmt.Printf("Error: %v\n", err)
        return
    }
    fmt.Println(content)
}
```
## Best Practices for Timeout Configuration

### 1. Choose Appropriate Timeout Values

- **Fast APIs**: 5-10 seconds
- **Standard web pages**: 15-30 seconds
- **Large file downloads**: 60+ seconds
- **Connection timeout**: usually 5-10 seconds
### 2. Implement Exponential Backoff
```python
import time
import random
import requests

def fetch_with_retry(url, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=(5, 15))
            return response
        except requests.exceptions.Timeout:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Timeout on attempt {attempt + 1}, retrying in {delay:.2f}s")
            time.sleep(delay)
```
### 3. Different Timeouts for Different Scenarios
```python
import requests

class WebScrapingClient:
    def __init__(self):
        self.session = requests.Session()

    def quick_check(self, url):
        """Fast timeout for health checks"""
        return self.session.get(url, timeout=(2, 5))

    def standard_scrape(self, url):
        """Standard timeout for regular scraping"""
        return self.session.get(url, timeout=(5, 15))

    def large_download(self, url):
        """Extended timeout for large files"""
        return self.session.get(url, timeout=(10, 120))
```
## Integration with Web Scraping Tools
When working with browser automation tools, timeout configuration becomes even more critical. For comprehensive timeout handling in browser-based scraping, consider exploring how to handle timeouts in Puppeteer for advanced scenarios involving JavaScript rendering and dynamic content.
Additionally, when dealing with complex page interactions, understanding how to handle AJAX requests using Puppeteer can help you implement proper timeout strategies for asynchronous operations.
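Browser-level timeouts follow the same pattern regardless of the tool. As a rough illustration in this guide's primary language, here is how default and per-operation timeouts look in Playwright for Python, used as a stand-in for the Puppeteer workflows mentioned above (this assumes the `playwright` package and its browsers are installed; the URL, selector, and values are illustrative):

```python
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeoutError

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Default timeout for all page operations (milliseconds)
    page.set_default_timeout(15000)
    try:
        # Per-call override: allow slow pages up to 30 seconds to load
        page.goto("https://example.com", timeout=30000)
        # Waiting for dynamic content gets its own, shorter timeout
        page.wait_for_selector("h1", timeout=5000)
        print(page.title())
    except PlaywrightTimeoutError:
        print("Browser operation timed out")
    finally:
        browser.close()
```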
## Monitoring and Logging Timeouts
```python
import logging
import time
import requests

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def monitored_request(url, timeout=30):
    start_time = time.time()
    try:
        response = requests.get(url, timeout=timeout)
        duration = time.time() - start_time
        logger.info(f"Request to {url} completed in {duration:.2f}s")
        return response
    except requests.exceptions.Timeout:
        duration = time.time() - start_time
        logger.warning(f"Request to {url} timed out after {duration:.2f}s")
        raise
    except Exception as e:
        duration = time.time() - start_time
        logger.error(f"Request to {url} failed after {duration:.2f}s: {e}")
        raise
```
## Conclusion
Proper timeout configuration is essential for reliable web scraping. Start with conservative timeout values and adjust based on your specific requirements and target websites. Always implement proper error handling and consider using retry mechanisms with exponential backoff for improved reliability. Remember that different types of requests may require different timeout strategies, so design your timeout configuration accordingly.
Regular monitoring and logging of timeout occurrences will help you optimize your timeout values and identify problematic endpoints that may require special handling or alternative approaches.