How do I handle timeouts and connection errors in Python web scraping?
Timeouts and connection errors are common challenges in web scraping that can cause your scripts to hang indefinitely or crash unexpectedly. Python provides several built-in and third-party solutions to handle these issues gracefully, ensuring your scraping operations are robust and reliable.
Understanding Common Timeout and Connection Issues
Before implementing solutions, it's important to understand the types of errors you might encounter; the sketch after this list shows how each typically surfaces as a requests exception:
- Connection timeouts: When the initial connection to the server takes too long
- Read timeouts: When the server doesn't respond within a specified time after connection
- DNS resolution failures: When domain names can't be resolved to IP addresses
- Connection refused: When the target server actively refuses connections
- SSL/TLS errors: When secure connection establishment fails
- Network unreachable: When there's no route to the target host
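As a rough guide, these failure modes surface through the requests library roughly as follows (a minimal sketch; the exact exception types can vary with the underlying urllib3 version):
import requests

# Rough mapping of the failure modes above to requests exceptions
try:
    requests.get("https://example.com", timeout=(5, 10))
except requests.exceptions.ConnectTimeout:
    print("Connection timeout: the TCP/TLS handshake took too long")
except requests.exceptions.ReadTimeout:
    print("Read timeout: connected, but the server responded too slowly")
except requests.exceptions.SSLError:
    print("SSL/TLS error: secure connection could not be established")
except requests.exceptions.ConnectionError:
    print("Connection error: DNS failure, connection refused, or network unreachable")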
Using the requests Library with Timeout Handling
The requests library is the most popular choice for HTTP operations in Python. Here's how to implement comprehensive timeout and error handling:
Basic Timeout Configuration
import requests
from requests.exceptions import RequestException, Timeout, ConnectionError
import time
def fetch_with_timeout(url, timeout=10):
"""
Fetch URL with timeout handling
"""
try:
response = requests.get(url, timeout=timeout)
response.raise_for_status() # Raises HTTPError for bad responses
return response
except Timeout:
print(f"Timeout occurred for {url}")
return None
except ConnectionError:
print(f"Connection error occurred for {url}")
return None
except RequestException as e:
print(f"Request failed for {url}: {e}")
return None
# Usage
url = "https://example.com"
response = fetch_with_timeout(url, timeout=15)
if response:
print(response.text)
Advanced Timeout Configuration
You can specify separate timeouts for connection and read operations:
import requests
import time
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
def create_session_with_retries():
"""
Create a requests session with retry strategy and timeout configuration
"""
session = requests.Session()
# Configure retry strategy
retry_strategy = Retry(
total=3, # Total number of retries
status_forcelist=[429, 500, 502, 503, 504], # HTTP status codes to retry
        allowed_methods=["HEAD", "GET", "OPTIONS"],  # HTTP methods to retry (urllib3 >= 1.26; formerly method_whitelist)
backoff_factor=1 # Backoff factor for retry delays
)
# Mount adapter with retry strategy
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("http://", adapter)
session.mount("https://", adapter)
return session
def robust_fetch(url, max_retries=3):
"""
Fetch URL with comprehensive error handling and retries
"""
session = create_session_with_retries()
for attempt in range(max_retries):
try:
# Separate connection and read timeouts
response = session.get(
url,
timeout=(5, 10), # (connection_timeout, read_timeout)
headers={
'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)'
}
)
response.raise_for_status()
return response
except requests.exceptions.ConnectTimeout:
print(f"Connection timeout on attempt {attempt + 1}")
except requests.exceptions.ReadTimeout:
print(f"Read timeout on attempt {attempt + 1}")
except requests.exceptions.ConnectionError as e:
print(f"Connection error on attempt {attempt + 1}: {e}")
except requests.exceptions.HTTPError as e:
print(f"HTTP error on attempt {attempt + 1}: {e}")
break # Don't retry on HTTP errors like 404, 403
except Exception as e:
print(f"Unexpected error on attempt {attempt + 1}: {e}")
if attempt < max_retries - 1:
wait_time = 2 ** attempt # Exponential backoff
print(f"Retrying in {wait_time} seconds...")
time.sleep(wait_time)
return None
# Usage
response = robust_fetch("https://example.com")
if response:
print("Successfully fetched data")
print(response.text)
Handling Timeouts with urllib3
For more granular control, you can use urllib3 directly:
import urllib3
from urllib3.exceptions import TimeoutError, NewConnectionError, MaxRetryError
import time
def fetch_with_urllib3(url, retries=3):
"""
Fetch URL using urllib3 with timeout and retry handling
"""
http = urllib3.PoolManager(
timeout=urllib3.Timeout(connect=5.0, read=10.0),
retries=urllib3.Retry(
total=retries,
connect=retries,
read=retries,
status=retries,
status_forcelist=[500, 502, 503, 504],
backoff_factor=0.3
)
)
try:
response = http.request('GET', url)
return response
except TimeoutError:
print(f"Timeout error for {url}")
return None
except NewConnectionError:
print(f"Connection error for {url}")
return None
except MaxRetryError as e:
print(f"Max retries exceeded for {url}: {e}")
return None
except Exception as e:
print(f"Unexpected error for {url}: {e}")
return None
# Usage
response = fetch_with_urllib3("https://httpbin.org/delay/2")
if response:
print(f"Status: {response.status}")
print(response.data.decode('utf-8'))
Asynchronous Web Scraping with aiohttp
For high-performance scraping, asynchronous programming with aiohttp provides excellent timeout handling:
import asyncio
import aiohttp
from aiohttp import ClientTimeout, ClientError
import time
async def fetch_async(session, url, semaphore):
"""
Asynchronous fetch with timeout and error handling
"""
async with semaphore: # Limit concurrent connections
try:
async with session.get(url) as response:
                response.raise_for_status()  # Not a coroutine in aiohttp, so no await
return await response.text()
except asyncio.TimeoutError:
print(f"Timeout for {url}")
return None
except aiohttp.ClientConnectionError:
print(f"Connection error for {url}")
return None
except aiohttp.ClientResponseError as e:
print(f"HTTP error for {url}: {e}")
return None
except Exception as e:
print(f"Unexpected error for {url}: {e}")
return None
async def scrape_multiple_urls(urls, max_concurrent=10):
"""
Scrape multiple URLs concurrently with timeout handling
"""
# Configure timeout settings
timeout = ClientTimeout(total=30, connect=5)
semaphore = asyncio.Semaphore(max_concurrent)
connector = aiohttp.TCPConnector(
limit=100, # Total connection pool size
limit_per_host=10, # Connections per host
ttl_dns_cache=300, # DNS cache TTL
use_dns_cache=True
)
async with aiohttp.ClientSession(
timeout=timeout,
connector=connector,
headers={'User-Agent': 'AsyncScraper/1.0'}
) as session:
tasks = [fetch_async(session, url, semaphore) for url in urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
return results
# Usage
async def main():
urls = [
"https://httpbin.org/delay/1",
"https://httpbin.org/delay/2",
"https://httpbin.org/status/500",
"https://httpbin.org/delay/3"
]
results = await scrape_multiple_urls(urls)
for i, result in enumerate(results):
if isinstance(result, Exception):
print(f"URL {i} failed: {result}")
elif result:
print(f"URL {i} succeeded: {len(result)} characters")
# Run the async function
if __name__ == "__main__":
asyncio.run(main())
Implementing Circuit Breaker Pattern
For production systems, consider implementing a circuit breaker pattern to prevent cascading failures:
import time
from enum import Enum
from typing import Callable, Any
import requests
class CircuitState(Enum):
CLOSED = "closed"
OPEN = "open"
HALF_OPEN = "half_open"
class CircuitBreaker:
"""
Circuit breaker implementation for web scraping
"""
def __init__(self, failure_threshold=5, recovery_timeout=60, expected_exception=Exception):
self.failure_threshold = failure_threshold
self.recovery_timeout = recovery_timeout
self.expected_exception = expected_exception
self.failure_count = 0
self.last_failure_time = None
self.state = CircuitState.CLOSED
def call(self, func: Callable, *args, **kwargs) -> Any:
"""
Execute function with circuit breaker logic
"""
if self.state == CircuitState.OPEN:
if self._should_attempt_reset():
self.state = CircuitState.HALF_OPEN
else:
raise Exception("Circuit breaker is OPEN")
try:
result = func(*args, **kwargs)
self._on_success()
return result
except self.expected_exception as e:
self._on_failure()
raise e
def _should_attempt_reset(self) -> bool:
return (time.time() - self.last_failure_time) >= self.recovery_timeout
def _on_success(self):
self.failure_count = 0
self.state = CircuitState.CLOSED
def _on_failure(self):
self.failure_count += 1
self.last_failure_time = time.time()
if self.failure_count >= self.failure_threshold:
self.state = CircuitState.OPEN
# Usage with circuit breaker: create the breaker once at module level so its
# failure state persists across calls (recreating it inside the function would
# reset the failure count each time and defeat the pattern)
circuit_breaker = CircuitBreaker(
    failure_threshold=3,
    recovery_timeout=30,
    expected_exception=requests.RequestException
)

def scrape_with_circuit_breaker(url):
    """
    Scraping function with circuit breaker protection
    """
    def _fetch():
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text

    try:
        return circuit_breaker.call(_fetch)
    except Exception as e:
        print(f"Circuit breaker prevented call or request failed: {e}")
        return None
Best Practices for Production Systems
1. Implement Exponential Backoff
import random
import time
def exponential_backoff_retry(func, max_retries=3, base_delay=1, max_delay=60):
"""
Retry function with exponential backoff and jitter
"""
for attempt in range(max_retries):
try:
return func()
except Exception as e:
if attempt == max_retries - 1:
raise e
# Calculate delay with exponential backoff and jitter
delay = min(base_delay * (2 ** attempt), max_delay)
jittered_delay = delay * (0.5 + random.random() * 0.5)
print(f"Attempt {attempt + 1} failed, retrying in {jittered_delay:.2f}s")
time.sleep(jittered_delay)
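For illustration, here is a hypothetical usage of this helper with requests (the fetch_page function is an assumed example, not part of the original code):
import requests

def fetch_page():
    # Any zero-argument callable that may raise can be wrapped and retried
    response = requests.get("https://example.com", timeout=(5, 10))
    response.raise_for_status()
    return response.text

# Retries up to 4 times with exponentially growing, jittered delays
html = exponential_backoff_retry(fetch_page, max_retries=4, base_delay=2)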
2. Monitor and Log Errors
import logging
import requests
from functools import wraps
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def log_errors(func):
"""
Decorator to log errors and timeouts
"""
@wraps(func)
def wrapper(*args, **kwargs):
try:
return func(*args, **kwargs)
except requests.exceptions.Timeout as e:
logger.error(f"Timeout in {func.__name__}: {e}")
raise
except requests.exceptions.ConnectionError as e:
logger.error(f"Connection error in {func.__name__}: {e}")
raise
except Exception as e:
logger.error(f"Unexpected error in {func.__name__}: {e}")
raise
return wrapper
@log_errors
def fetch_data(url):
response = requests.get(url, timeout=10)
return response.text
Alternative Solutions
For complex scenarios, consider specialized tools that handle timeouts for you. Just as Puppeteer offers its own timeout handling for JavaScript-based scraping, Python developers can pair Selenium with explicit timeout configuration for JavaScript-heavy sites.
When dealing with dynamic content that requires waiting for specific elements, combining explicit waits with solid retry logic for failed requests is crucial for maintaining scraping reliability; a Selenium sketch follows below.
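A minimal sketch of that approach, assuming Chrome and a hypothetical #content selector (adjust both to your target site):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException

options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
driver.set_page_load_timeout(30)  # Fail if the page takes longer than 30s to load

try:
    driver.get("https://example.com")
    # Wait up to 10 seconds for a specific element before giving up
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "#content"))
    )
    print(element.text)
except TimeoutException:
    print("Page load or element wait timed out")
except WebDriverException as e:
    print(f"Browser-level error: {e}")
finally:
    driver.quit()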
Conclusion
Handling timeouts and connection errors effectively is crucial for building robust web scraping applications. The key strategies include:
- Set appropriate timeouts for both connection and read operations
- Implement retry logic with exponential backoff
- Use circuit breakers to prevent cascading failures
- Handle specific exceptions rather than catching all errors
- Monitor and log errors for debugging and optimization
- Consider asynchronous approaches for high-performance requirements
By implementing these patterns, your Python web scraping applications will be more resilient to network issues and provide better reliability in production environments. Remember to always respect rate limits and robots.txt files when implementing retry mechanisms to maintain ethical scraping practices.