What are the common HTTP status codes you encounter when scraping APIs?
When scraping APIs, understanding HTTP status codes is crucial for building robust and reliable scrapers. These three-digit codes provide immediate feedback about the success or failure of your requests, enabling you to implement appropriate error handling and retry logic. This comprehensive guide covers the most common HTTP status codes you'll encounter during API scraping and how to handle them effectively.
Understanding HTTP Status Code Categories
HTTP status codes are organized into five categories, each indicating a different type of response:
- 1xx (Informational): Request received, continuing process
- 2xx (Success): Request successfully received, understood, and accepted
- 3xx (Redirection): Further action must be taken to complete the request
- 4xx (Client Error): Request contains bad syntax or cannot be fulfilled
- 5xx (Server Error): Server failed to fulfill an apparently valid request
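Because the category is determined by the first digit, a scraper can branch on it before looking at the exact code. The following is a minimal sketch; the function name `status_category` and the category labels are illustrative choices, not part of any library.

```python
def status_category(status_code: int) -> str:
    """Return the category name for an HTTP status code (hypothetical helper)."""
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client_error",
        5: "server_error",
    }
    if not 100 <= status_code <= 599:
        raise ValueError(f"Not a valid HTTP status code: {status_code}")
    # Integer division by 100 isolates the leading digit, e.g. 404 -> 4.
    return categories[status_code // 100]
```

A caller might use `status_category(response.status_code)` to decide between "process", "fix the request", and "retry later" before handling individual codes.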
Common Success Status Codes (2xx)
200 OK
The most common success code, indicating that the request was successful and the server returned the requested data.
import requests
response = requests.get('https://api.example.com/data')
if response.status_code == 200:
    data = response.json()
    print("Data retrieved successfully:", data)
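fetch('https://api.example.com/data')
  .then(response => {
    if (response.status === 200) {
      return response.json();
    }
    throw new Error(`HTTP ${response.status}`);
  })
  .then(data => console.log('Data retrieved:', data))
  // Without a catch, a non-200 response becomes an unhandled rejection
  .catch(error => console.error('Request failed:', error));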
201 Created
Indicates that a new resource was successfully created, typically returned after POST requests.
import requests
data = {'name': 'New Item', 'value': 123}
response = requests.post('https://api.example.com/items', json=data)
if response.status_code == 201:
    print("Resource created successfully")
    new_item = response.json()
204 No Content
The request was successful, but there's no content to return. Common with DELETE operations or updates that don't return data.
import requests
response = requests.delete('https://api.example.com/items/123')
if response.status_code == 204:
    # A 204 response has no body, so do not call response.json() here
    print("Item deleted successfully")
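Since a 204 response has no body, calling a JSON decoder on it raises an error. One way to guard against that is a small wrapper that returns `None` when there is nothing to decode. This is a sketch under the assumption that the API returns JSON whenever it returns a body; `parse_json_body` is a hypothetical helper name.

```python
import json
from typing import Any, Optional


def parse_json_body(status_code: int, body: bytes) -> Optional[Any]:
    """Decode a JSON body, returning None when there is nothing to decode."""
    # 204 never carries a body; other success codes may also arrive empty.
    if status_code == 204 or not body.strip():
        return None
    return json.loads(body)
```

With the requests library this could be called as `parse_json_body(response.status_code, response.content)`.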
Redirection Status Codes (3xx)
301 Moved Permanently
The resource has been permanently moved to a new location. Most HTTP libraries handle this automatically.
import requests
# requests library follows redirects automatically
response = requests.get('https://api.example.com/old-endpoint')
print(f"Final URL: {response.url}")
print(f"Redirect history: {[r.url for r in response.history]}")
302 Found (Temporary Redirect)
Similar to 301, but the move is temporary, so keep using the original URL for future requests rather than storing the new one. HTTP libraries follow it the same way as a 301.
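If you would rather inspect redirects than have them followed silently, you can disable automatic following (in requests, `allow_redirects=False`) and resolve the `Location` header yourself. The sketch below assumes you want to persist a new URL only for permanent redirects; the helper names and the two code sets are illustrative.

```python
from typing import Optional
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}
PERMANENT_REDIRECTS = {301, 308}


def redirect_target(request_url: str, status_code: int,
                    location: Optional[str]) -> Optional[str]:
    """Return the absolute URL a redirect points to, or None if not a redirect."""
    if status_code not in REDIRECT_CODES or not location:
        return None
    # Location may be relative, so resolve it against the requested URL.
    return urljoin(request_url, location)


def should_update_stored_url(status_code: int) -> bool:
    """Only permanent redirects justify replacing a saved endpoint URL."""
    return status_code in PERMANENT_REDIRECTS
```

Typical usage would pass `response.status_code` and `response.headers.get('Location')` from a request made with redirects disabled.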
304 Not Modified
Used with conditional requests to indicate cached content is still valid. Implement caching to leverage this:
import requests
headers = {'If-None-Match': 'stored-etag-value'}
response = requests.get('https://api.example.com/data', headers=headers)
if response.status_code == 304:
    print("Use cached data")
else:
    # Process new data and store new ETag
    new_etag = response.headers.get('ETag')
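To make the conditional request useful, the ETag and the payload it describes have to be stored together. The following is a minimal in-memory sketch; the class `ETagCache` and its method names are hypothetical, and a real scraper would likely persist the entries to disk or a database.

```python
from typing import Any, Dict, Optional, Tuple


class ETagCache:
    """In-memory cache mapping a URL to its (etag, payload) pair."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[str, Any]] = {}

    def conditional_headers(self, url: str) -> Dict[str, str]:
        """Headers to send: If-None-Match when an ETag is cached, else none."""
        entry = self._entries.get(url)
        return {"If-None-Match": entry[0]} if entry else {}

    def resolve(self, url: str, status_code: int,
                etag: Optional[str], payload: Any) -> Any:
        """Return the payload to use for this response and update the cache."""
        if status_code == 304:
            # The server confirmed the cached copy is current. A 304 is only
            # expected after sending If-None-Match, so the entry should exist.
            return self._entries[url][1]
        if status_code == 200 and etag:
            self._entries[url] = (etag, payload)
        return payload
```

The intended flow is: build headers with `conditional_headers(url)`, send the request, then pass the status code, the `ETag` response header, and the decoded body (or `None` on a 304) to `resolve`.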
Client Error Status Codes (4xx)
400 Bad Request
The request syntax is malformed or invalid. Check your request parameters, headers, and body format.
import requests
try:
    response = requests.post('https://api.example.com/data', json={'invalid': 'data'})
    if response.status_code == 400:
        error_details = response.json()
        print(f"Bad request: {error_details}")
        # Fix the request based on error details
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
401 Unauthorized
Authentication is required or has failed. Implement proper authentication handling:
import requests
from requests.auth import HTTPBasicAuth
# Basic authentication
response = requests.get(
    'https://api.example.com/protected',
    auth=HTTPBasicAuth('username', 'password')
)
# Token-based authentication
headers = {'Authorization': 'Bearer your-access-token'}
response = requests.get('https://api.example.com/protected', headers=headers)
if response.status_code == 401:
    print("Authentication failed - refresh token or check credentials")
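Refreshing a token after a 401 can be sketched as "try once, refresh, try once more". The version below keeps the HTTP call and the token logic behind injected callables so it makes no assumption about how a particular API issues tokens; `send`, `get_token`, and `refresh_token` are all hypothetical names.

```python
from typing import Any, Callable, Tuple


def request_with_token_refresh(
    send: Callable[[str], Tuple[int, Any]],
    get_token: Callable[[], str],
    refresh_token: Callable[[], str],
) -> Tuple[int, Any]:
    """Call send(token); on a 401, refresh the token once and retry."""
    status, body = send(get_token())
    if status == 401:
        # Retry exactly once so bad credentials cannot cause an endless loop.
        status, body = send(refresh_token())
    return status, body
```

Here `send` could wrap a `requests.get` call that sets the `Authorization: Bearer ...` header and returns `(response.status_code, response.json())`. If the second attempt also returns 401, the caller sees that status and can treat the credentials as invalid.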
403 Forbidden
The server understands the request but refuses to authorize it. This often indicates insufficient permissions or blocked access.
def handle_forbidden_response(response):
    if response.status_code == 403:
        print("Access forbidden - check API permissions or rate limits")
        # Implement backoff strategy or alternative approach
        return False
    return True
404 Not Found
The requested resource doesn't exist. Common when scraping APIs with dynamic endpoints.
import requests
def safe_api_request(url):
    try:
        response = requests.get(url)
        if response.status_code == 404:
            print(f"Resource not found: {url}")
            return None
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Request error: {e}")
        return None
429 Too Many Requests
The server is rate limiting your client. Honor the Retry-After header when the response includes one, and otherwise back off exponentially between attempts:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
def create_session_with_retries():
    session = requests.Session()
    # Configure retry strategy
    retry_strategy = Retry(
        total=3,
        status_forcelist=[429, 500, 502, 503, 504],
        backoff_factor=1,
        respect_retry_after_header=True
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
# Usage
session = create_session_with_retries()
response = session.get('https://api.example.com/data')
async function fetchWithRetry(url, options = {}, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const response = await fetch(url, options);
      if (response.status === 429) {
        // Retry-After may be missing or an HTTP date; fall back to exponential backoff
        const retryAfter = parseInt(response.headers.get('Retry-After'), 10);
        const delay = Number.isNaN(retryAfter) ? Math.pow(2, attempt) * 1000 : retryAfter * 1000;
        console.log(`Rate limited. Waiting ${delay}ms before retry ${attempt}`);
        await new Promise(resolve => setTimeout(resolve, delay));
        continue;
      }
      return response;
    } catch (error) {
      if (attempt === maxRetries) throw error;
      await new Promise(resolve => setTimeout(resolve, Math.pow(2, attempt) * 1000));
    }
  }
  // Every attempt was rate limited; fail loudly instead of returning undefined
  throw new Error(`Still rate limited after ${maxRetries} attempts: ${url}`);
}
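The Retry-After header can carry either a number of seconds or an HTTP date, so converting it with a plain integer parse can fail. The sketch below handles both forms using only the standard library; the function name `retry_after_seconds` and the 60-second fallback are assumptions made for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: Optional[str],
                        now: Optional[datetime] = None,
                        default: float = 60.0) -> float:
    """Convert a Retry-After header into a non-negative delay in seconds."""
    if not header_value:
        return default
    value = header_value.strip()
    if value.isdigit():
        # Delta-seconds form, e.g. "120".
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the wait is already over.
    return max(0.0, (retry_at - now).total_seconds())
```

The `now` parameter exists only to make the function easy to test; in normal use it can be omitted.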
Server Error Status Codes (5xx)
500 Internal Server Error
Generic server error. Implement retry logic with exponential backoff:
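import time
import random
def exponential_backoff_retry(func, max_retries=3, base_delay=1):
    """Call func() until it returns a non-5xx response or retries run out."""
    response = None
    for attempt in range(max_retries):
        try:
            response = func()
            if response.status_code < 500:
                return response
        except Exception:
            if attempt == max_retries - 1:
                raise
        if attempt < max_retries - 1:
            # Exponential backoff with jitter (skipped after the final attempt)
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
    # Retries exhausted: return the last 5xx response so the caller can inspect it
    return response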
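502 Bad Gateway / 503 Service Unavailable / 504 Gateway Timeout
These usually indicate temporary problems: a 502 means a gateway or proxy received an invalid response from the upstream server, a 503 means the server is overloaded or down for maintenance (it may include a Retry-After header), and a 504 means the upstream server did not respond in time. Handle them with retry logic similar to 500 errors.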
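One way to keep the retry decision and the wait time in one place is a pair of small helpers. This is a sketch, not a prescribed policy: the set of retryable codes mirrors the ones used elsewhere in this guide, and the 30-second cap is an arbitrary assumption to stop delays from growing without bound.

```python
import random

RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def is_retryable(status_code: int) -> bool:
    """True for status codes that usually signal a temporary condition."""
    return status_code in RETRYABLE_STATUS


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Delay in seconds for a zero-based attempt: capped exponential plus jitter."""
    exponential = min(cap, base * (2 ** attempt))
    # Up to one second of random jitter spreads out clients retrying together.
    return exponential + random.uniform(0, 1)
```

A retry loop can then call `time.sleep(backoff_delay(attempt))` whenever `is_retryable(response.status_code)` is true and attempts remain.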
Comprehensive Error Handling Strategy
Here's a complete example that handles multiple status codes appropriately:
import requests
import time
import logging
from typing import Optional, Dict, Any
# Custom exceptions
class AuthenticationError(Exception):
    pass
class ServerError(Exception):
    pass
class NetworkError(Exception):
    pass
class APIClient:
    def __init__(self, base_url: str, api_key: Optional[str] = None):
        self.base_url = base_url.rstrip('/')
        self.session = requests.Session()
        if api_key:
            self.session.headers.update({'Authorization': f'Bearer {api_key}'})
    def make_request(self, endpoint: str, method: str = 'GET', **kwargs) -> Optional[Dict[Any, Any]]:
        url = f"{self.base_url}/{endpoint.lstrip('/')}"
        max_retries = 3
        for attempt in range(max_retries):
            try:
                response = self.session.request(method, url, **kwargs)
                if response.status_code == 200:
                    return response.json()
                elif response.status_code == 201:
                    logging.info("Resource created successfully")
                    return response.json() if response.content else {}
                elif response.status_code == 204:
                    logging.info("Operation completed successfully")
                    return {}
                elif response.status_code == 401:
                    logging.error("Authentication failed")
                    raise AuthenticationError("Invalid credentials")
                elif response.status_code == 403:
                    logging.error("Access forbidden")
                    raise PermissionError("Insufficient permissions")
                elif response.status_code == 404:
                    logging.warning(f"Resource not found: {url}")
                    return None
                elif response.status_code == 429:
                    # Retry-After may be missing or an HTTP date; default to 60 seconds
                    header = response.headers.get('Retry-After', '')
                    retry_after = int(header) if header.isdigit() else 60
                    logging.warning(f"Rate limited. Waiting {retry_after} seconds")
                    time.sleep(retry_after)
                    continue
                elif 500 <= response.status_code < 600:
                    if attempt < max_retries - 1:
                        delay = 2 ** attempt
                        logging.warning(f"Server error {response.status_code}. Retrying in {delay}s")
                        time.sleep(delay)
                        continue
                    else:
                        logging.error(f"Server error {response.status_code} after {max_retries} attempts")
                        raise ServerError(f"Server error: {response.status_code}")
                else:
                    logging.error(f"Unexpected status code: {response.status_code}")
                    response.raise_for_status()
                    # Unhandled non-error status (for example 202 or an unfollowed 3xx)
                    return None
            except requests.exceptions.HTTPError:
                # Raised by raise_for_status() for unhandled 4xx codes; retrying will not help
                raise
            except requests.exceptions.RequestException as e:
                if attempt < max_retries - 1:
                    time.sleep(2 ** attempt)
                    continue
                raise NetworkError(f"Network error: {e}") from e
        # Reached only when every attempt was rate limited
        return None
Testing Status Code Handling
When developing scrapers, it's important to test how your code handles different status codes:
import unittest
from unittest.mock import Mock, patch
# Assumes APIClient from the example above is defined or imported in this module
class TestAPIClient(unittest.TestCase):
    def setUp(self):
        self.client = APIClient('https://api.example.com', 'test-key')
    @patch('requests.Session.request')
    def test_429_retry_logic(self, mock_request):
        # Mock 429 response then success
        mock_429 = Mock()
        mock_429.status_code = 429
        mock_429.headers = {'Retry-After': '1'}
        mock_200 = Mock()
        mock_200.status_code = 200
        mock_200.json.return_value = {'data': 'success'}
        mock_request.side_effect = [mock_429, mock_200]
        result = self.client.make_request('/test')
        self.assertEqual(result, {'data': 'success'})
        self.assertEqual(mock_request.call_count, 2)
if __name__ == '__main__':
    unittest.main()
Best Practices for Status Code Handling
- Always check status codes explicitly rather than assuming success
- Implement appropriate retry logic for temporary failures (5xx, 429)
- Respect rate limits by honoring Retry-After headers and backing off exponentially
- Log status codes and errors for debugging and monitoring
- Handle authentication errors by refreshing tokens when possible
- Use libraries that support automatic retries when available
- Test your error handling code with mocked responses
Understanding and properly handling HTTP status codes is essential for building reliable API scrapers. When dealing with complex scraping scenarios involving dynamic content, you might also want to explore how to handle AJAX requests using Puppeteer for JavaScript-heavy applications, or learn about handling timeouts in Puppeteer when working with slower-responding APIs.
By implementing comprehensive status code handling, you'll create more robust scrapers that can gracefully handle various API responses and continue operating reliably even when encountering errors or temporary issues.