How do I set a timeout for requests to prevent hanging?

Setting proper timeouts for HTTP requests is crucial for building robust web scraping applications. Without timeouts, your requests can hang indefinitely, causing your application to freeze or consume excessive resources. This guide covers how to implement timeouts across different programming languages and libraries.

Understanding Request Timeouts

A timeout defines the maximum amount of time your application will wait for a response before giving up. There are typically two types of timeouts:

  • Connection timeout: Time to establish a connection to the server
  • Read timeout: Time to wait for data after the connection is established
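
Conceptually, the two phases map onto separate socket operations. As a minimal sketch using Python's standard socket module (example.com is a placeholder host):

import socket

# Connection timeout: bounds how long the TCP handshake may take
sock = socket.create_connection(("example.com", 80), timeout=5)

# Read timeout: bounds each subsequent blocking read on the open socket
sock.settimeout(10)
sock.sendall(b"HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n")
data = sock.recv(4096)  # raises a timeout error if no data arrives within 10 seconds
sock.close()

The HTTP clients below expose the same two knobs at a higher level of abstraction.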

Python Requests Library

The Python requests library applies no timeout by default (timeout=None means wait indefinitely), so you should always pass one explicitly. It provides several ways to do so:

Basic Timeout

import requests
from requests.exceptions import Timeout, RequestException

try:
    # Set timeout to 10 seconds for both connection and read
    response = requests.get('https://example.com', timeout=10)
    print(response.status_code)
except Timeout:
    print("Request timed out")
except RequestException as e:
    print(f"Request failed: {e}")

Separate Connection and Read Timeouts

import requests

try:
    # Connection timeout: 5 seconds, Read timeout: 10 seconds
    response = requests.get(
        'https://example.com',
        timeout=(5, 10)
    )
    print(response.text)
except requests.exceptions.ConnectTimeout:
    print("Connection timeout occurred")
except requests.exceptions.ReadTimeout:
    print("Read timeout occurred")
except requests.exceptions.Timeout:
    print("Request timed out")

Session-Level Timeouts

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Create a session (requests has no built-in session-wide default timeout,
# so pass timeout per request or use a custom adapter as sketched below)
session = requests.Session()

# Configure retry strategy with timeout
retry_strategy = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504],
)

adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("http://", adapter)
session.mount("https://", adapter)

try:
    response = session.get('https://example.com', timeout=15)
    print(response.status_code)
except Exception as e:
    print(f"Request failed: {e}")
finally:
    session.close()
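
Because requests lacks a true session-wide default timeout, the timeout above still has to be passed on each call. A common workaround is to subclass HTTPAdapter so that every request sent through the session inherits a default. The TimeoutHTTPAdapter name below is illustrative, not part of the library; treat this as a sketch:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Illustrative adapter: injects a default timeout into every request
# sent through the session unless the caller overrides it explicitly
class TimeoutHTTPAdapter(HTTPAdapter):
    def __init__(self, *args, default_timeout=15, **kwargs):
        self.default_timeout = default_timeout
        super().__init__(*args, **kwargs)

    def send(self, request, **kwargs):
        if kwargs.get("timeout") is None:
            kwargs["timeout"] = self.default_timeout
        return super().send(request, **kwargs)

session = requests.Session()
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
adapter = TimeoutHTTPAdapter(default_timeout=15, max_retries=retries)
session.mount("http://", adapter)
session.mount("https://", adapter)

# No timeout argument needed; the adapter applies the 15-second default
response = session.get('https://example.com')
print(response.status_code)

Since the adapter forwards its remaining arguments to HTTPAdapter, the retry strategy from the previous example can be attached to it in the same step.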

JavaScript Fetch API

The Fetch API has no built-in timeout option, so the standard approach is to cancel the request with an AbortController (newer runtimes also offer the AbortSignal.timeout() shortcut):

Basic Fetch with Timeout

async function fetchWithTimeout(url, timeout = 10000) {
    const controller = new AbortController();
    const timeoutId = setTimeout(() => controller.abort(), timeout);

    try {
        const response = await fetch(url, {
            signal: controller.signal,
            headers: {
                'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)'
            }
        });

        if (!response.ok) {
            throw new Error(`HTTP error! status: ${response.status}`);
        }

        return await response.text();
    } catch (error) {
        if (error.name === 'AbortError') {
            throw new Error('Request timed out');
        }
        throw error;
    } finally {
        // Always clear the timer so it cannot fire after the request settles
        clearTimeout(timeoutId);
    }
}

// Usage
fetchWithTimeout('https://example.com', 5000)
    .then(data => console.log(data))
    .catch(error => console.error('Error:', error.message));

Node.js with Axios

const axios = require('axios');

// Create axios instance with default timeout
const client = axios.create({
    timeout: 10000, // 10 seconds
    headers: {
        'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)'
    }
});

async function scrapeWithTimeout(url) {
    try {
        const response = await client.get(url, {
            timeout: 15000 // Override default timeout for this request
        });
        return response.data;
    } catch (error) {
        if (error.code === 'ECONNABORTED') {
            console.error('Request timed out');
        } else {
            console.error('Request failed:', error.message);
        }
        throw error;
    }
}

cURL Command Line

Set timeouts directly in cURL commands:

# Connection timeout: 10 seconds, Max time: 30 seconds
curl --connect-timeout 10 --max-time 30 https://example.com

# Abort if the transfer speed stays below 1000 bytes/second for 15 seconds
curl --speed-limit 1000 --speed-time 15 --connect-timeout 10 --max-time 30 https://example.com

# With retry on failure
curl --retry 3 --retry-delay 2 --connect-timeout 10 --max-time 30 https://example.com

PHP with cURL

<?php
function fetchWithTimeout($url, $timeout = 30) {
    $ch = curl_init();

    curl_setopt_array($ch, [
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS => 5,
        CURLOPT_TIMEOUT => $timeout,           // Total timeout
        CURLOPT_CONNECTTIMEOUT => 10,          // Connection timeout
        CURLOPT_DNS_CACHE_TIMEOUT => 120,      // DNS cache timeout
        CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; WebScraper/1.0)',
        CURLOPT_SSL_VERIFYPEER => true,
        CURLOPT_SSL_VERIFYHOST => 2,
    ]);

    $response = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $error = curl_error($ch);

    curl_close($ch);

    if ($error) {
        throw new Exception("cURL error: " . $error);
    }

    if ($httpCode >= 400) {
        throw new Exception("HTTP error: " . $httpCode);
    }

    return $response;
}

try {
    $content = fetchWithTimeout('https://example.com', 20);
    echo $content;
} catch (Exception $e) {
    echo "Error: " . $e->getMessage();
}
?>

Go HTTP Client

package main

import (
    "context"
    "fmt"
    "io"
    "net"
    "net/http"
    "time"
)

func fetchWithTimeout(url string, timeout time.Duration) (string, error) {
    // Create context with timeout
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    // Create request with context
    req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
    if err != nil {
        return "", err
    }

    // Set headers
    req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; WebScraper/1.0)")

    // Create client with transport timeouts
    client := &http.Client{
        Timeout: timeout,
        Transport: &http.Transport{
            DialContext: (&net.Dialer{
                Timeout: 10 * time.Second, // connection (dial) timeout
            }).DialContext,
            TLSHandshakeTimeout:   10 * time.Second,
            ResponseHeaderTimeout: 10 * time.Second,
            ExpectContinueTimeout: 1 * time.Second,
        },
    }

    resp, err := client.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return "", err
    }

    return string(body), nil
}

func main() {
    content, err := fetchWithTimeout("https://example.com", 15*time.Second)
    if err != nil {
        fmt.Printf("Error: %v\n", err)
        return
    }

    fmt.Println(content)
}

Best Practices for Timeout Configuration

1. Choose Appropriate Timeout Values

  • Fast APIs: 5-10 seconds
  • Standard web pages: 15-30 seconds
  • Large file downloads: 60+ seconds
  • Connection timeout: Usually 5-10 seconds
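
One lightweight way to apply these guidelines consistently is to keep the (connect, read) pairs in a single mapping; the names and exact values below are illustrative:

import requests

# Illustrative (connect, read) timeout pairs based on the guidelines above
TIMEOUTS = {
    "fast_api": (5, 10),
    "web_page": (5, 30),
    "large_download": (10, 120),
}

response = requests.get("https://example.com", timeout=TIMEOUTS["web_page"])
print(response.status_code)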

2. Implement Exponential Backoff

import time
import random
import requests

def fetch_with_retry(url, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=(5, 15))
            return response
        except requests.exceptions.Timeout:
            if attempt == max_retries - 1:
                raise

            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Timeout on attempt {attempt + 1}, retrying in {delay:.2f}s")
            time.sleep(delay)
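
Usage is straightforward; the URL is a placeholder, and once the final attempt times out the exception propagates to the caller:

# Up to 3 attempts with exponentially growing delays between them
try:
    response = fetch_with_retry("https://example.com", max_retries=3)
    print(response.status_code)
except requests.exceptions.Timeout:
    print("All retry attempts timed out")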

3. Different Timeouts for Different Scenarios

import requests

class WebScrapingClient:
    def __init__(self):
        self.session = requests.Session()

    def quick_check(self, url):
        """Fast timeout for health checks"""
        return self.session.get(url, timeout=(2, 5))

    def standard_scrape(self, url):
        """Standard timeout for regular scraping"""
        return self.session.get(url, timeout=(5, 15))

    def large_download(self, url):
        """Extended timeout for large files"""
        return self.session.get(url, timeout=(10, 120))

Integration with Web Scraping Tools

When working with browser automation tools, timeout configuration becomes even more critical. For comprehensive timeout handling in browser-based scraping, consider exploring how to handle timeouts in Puppeteer for advanced scenarios involving JavaScript rendering and dynamic content.

Additionally, when dealing with complex page interactions, understanding how to handle AJAX requests using Puppeteer can help you implement proper timeout strategies for asynchronous operations.

Monitoring and Logging Timeouts

import logging
import time
import requests

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def monitored_request(url, timeout=30):
    start_time = time.time()

    try:
        response = requests.get(url, timeout=timeout)
        duration = time.time() - start_time

        logger.info(f"Request to {url} completed in {duration:.2f}s")
        return response

    except requests.exceptions.Timeout:
        duration = time.time() - start_time
        logger.warning(f"Request to {url} timed out after {duration:.2f}s")
        raise
    except Exception as e:
        duration = time.time() - start_time
        logger.error(f"Request to {url} failed after {duration:.2f}s: {e}")
        raise

Conclusion

Proper timeout configuration is essential for reliable web scraping. Start with conservative timeout values and adjust based on your specific requirements and target websites. Always implement proper error handling and consider using retry mechanisms with exponential backoff for improved reliability. Remember that different types of requests may require different timeout strategies, so design your timeout configuration accordingly.

Regular monitoring and logging of timeout occurrences will help you optimize your timeout values and identify problematic endpoints that may require special handling or alternative approaches.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
