What are the most important HTTP status codes for web scraping?
HTTP status codes are essential indicators that tell you whether your web scraping request was successful or encountered an issue. Understanding these codes is crucial for building robust scrapers that can handle various server responses appropriately. This guide covers the most important HTTP status codes you'll encounter during web scraping and how to handle them effectively.
Understanding HTTP Status Code Categories
HTTP status codes are organized into five categories, each serving a specific purpose:
- 1xx (Informational): Request received, continuing process
- 2xx (Success): Request was successfully received, understood, and accepted
- 3xx (Redirection): Further action needs to be taken to complete the request
- 4xx (Client Error): Request contains bad syntax or cannot be fulfilled
- 5xx (Server Error): Server failed to fulfill an apparently valid request
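Because the first digit of the code identifies its category, a small helper (a sketch, not part of any library) can classify any response:

def status_category(status_code):
    # Map the leading digit of the status code to its category name
    categories = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    return categories.get(status_code // 100, "Unknown")

print(status_category(404))  # Client Error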
Most Important Success Codes (2xx)
200 OK
The most common and important status code for web scraping. It indicates that the request was successful and the server has returned the requested content.
import requests

response = requests.get('https://example.com')

if response.status_code == 200:
    print("Success! Content retrieved:")
    print(response.text)
else:
    print(f"Request failed with status code: {response.status_code}")
// Using the fetch API
fetch('https://example.com')
  .then(response => {
    if (response.status === 200) {
      console.log('Success! Content retrieved');
      return response.text();
    } else {
      console.log(`Request failed with status code: ${response.status}`);
    }
  })
  .then(content => {
    if (content) console.log(content);
  });
201 Created
Indicates that a new resource has been successfully created. This is common when scraping APIs that accept POST requests for data submission.
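A minimal sketch, assuming a hypothetical API endpoint that accepts JSON submissions:

import requests

# api.example.com/items is a placeholder endpoint, not a real API
response = requests.post('https://api.example.com/items', json={'name': 'example'})
if response.status_code == 201:
    print("Resource created:", response.json())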
204 No Content
The request was successful, but there's no content to return. This might occur when scraping endpoints that perform actions without returning data.
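A 204 response carries an empty body, so check the status code rather than trying to parse content (the endpoint below is a placeholder):

import requests

response = requests.delete('https://api.example.com/items/123')  # placeholder endpoint
if response.status_code == 204:
    print("Action succeeded, no content returned")  # response.text will be empty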
Critical Redirection Codes (3xx)
301 Moved Permanently
The requested resource has been permanently moved to a new URL. Your scraper should update its references to use the new URL.
import requests

# Requests automatically follows redirects by default
response = requests.get('https://example.com/old-page')
print(f"Final URL: {response.url}")
print(f"Status code: {response.status_code}")

# To handle redirects manually
response = requests.get('https://example.com/old-page', allow_redirects=False)
if response.status_code == 301:
    new_url = response.headers['Location']
    print(f"Page permanently moved to: {new_url}")
302 Found (Temporary Redirect)
The resource is temporarily available at a different URL. Unlike 301, you shouldn't update your permanent references.
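Handling mirrors the 301 example above; with allow_redirects=False you can inspect the temporary target without storing it as the canonical URL (the path below is a placeholder):

import requests

response = requests.get('https://example.com/temporary', allow_redirects=False)
if response.status_code == 302:
    temp_url = response.headers['Location']
    print(f"Temporarily redirected to: {temp_url}")  # Don't save this as the permanent URL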
304 Not Modified
Used with conditional requests. The resource hasn't changed since the last request, so the cached version can be used.
import requests

headers = {
    'If-Modified-Since': 'Wed, 21 Oct 2023 07:28:00 GMT'
}
response = requests.get('https://example.com', headers=headers)

if response.status_code == 304:
    print("Content not modified, use cached version")
Essential Client Error Codes (4xx)
400 Bad Request
The server cannot process the request due to invalid syntax. Check your request parameters, headers, and data format.
import requests

try:
    # Example of a potentially malformed request
    response = requests.post('https://api.example.com/data',
                             json={'invalid': 'data'})
    if response.status_code == 400:
        print("Bad request - check your data format")
        print(response.text)  # Often contains error details
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
401 Unauthorized
Authentication is required or has failed. You need to provide valid credentials.
import requests

# Example with basic authentication
response = requests.get('https://api.example.com/protected',
                        auth=('username', 'password'))

if response.status_code == 401:
    print("Authentication failed - check credentials")
403 Forbidden
The server understood the request but refuses to authorize it. This often indicates:
- IP blocking
- Rate limiting
- Insufficient permissions
- Anti-bot measures
// Handling 403 with retry logic
async function scrapeWithRetry(url, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      const response = await fetch(url);
      if (response.status === 403) {
        console.log(`Access forbidden (attempt ${i + 1})`);
        if (i < maxRetries - 1) {
          // Wait before retry (exponential backoff)
          await new Promise(resolve =>
            setTimeout(resolve, Math.pow(2, i) * 1000));
          continue;
        }
      }
      return response;
    } catch (error) {
      console.error('Request failed:', error);
    }
  }
  throw new Error('Max retries exceeded');
}
404 Not Found
The requested resource doesn't exist. This is important for scrapers that crawl multiple pages.
import requests

def safe_scrape(url):
    try:
        response = requests.get(url)
        if response.status_code == 404:
            print(f"Page not found: {url}")
            return None
        elif response.status_code == 200:
            return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error scraping {url}: {e}")
        return None
429 Too Many Requests
Rate limiting is in effect. You're making requests too quickly and need to slow down.
import requests
import time

def scrape_with_rate_limit(urls):
    for url in urls:
        response = requests.get(url)
        if response.status_code == 429:
            retry_after = int(response.headers.get('Retry-After', 60))
            print(f"Rate limited. Waiting {retry_after} seconds...")
            time.sleep(retry_after)
            response = requests.get(url)  # Retry after waiting
        if response.status_code == 200:
            yield response.text
        # Add delay between requests to avoid rate limiting
        time.sleep(1)
Important Server Error Codes (5xx)
500 Internal Server Error
The server encountered an unexpected condition. This is often temporary, so implementing retry logic is recommended.
502 Bad Gateway
The server received an invalid response from an upstream server. Common with load balancers and proxy servers.
503 Service Unavailable
The server is temporarily unable to handle requests, often due to maintenance or overload.
import requests
import time

def robust_scrape(url, max_retries=3):
    retry_codes = [500, 502, 503, 504]
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code == 200:
                return response.text
            elif response.status_code in retry_codes:
                wait_time = 2 ** attempt  # Exponential backoff
                print(f"Server error {response.status_code}. "
                      f"Retrying in {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                print(f"Non-retryable error: {response.status_code}")
                break
        except requests.exceptions.RequestException as e:
            print(f"Request exception: {e}")
    return None
Advanced Status Code Handling Strategies
Implementing Comprehensive Error Handling
import requests
from enum import Enum

class ScrapingResult(Enum):
    SUCCESS = "success"
    TEMPORARY_ERROR = "temporary_error"
    PERMANENT_ERROR = "permanent_error"
    RATE_LIMITED = "rate_limited"

def categorize_response(status_code):
    if 200 <= status_code < 300:
        return ScrapingResult.SUCCESS
    elif status_code == 429:
        # Check 429 before the generic retryable list so it isn't swallowed
        return ScrapingResult.RATE_LIMITED
    elif status_code in [500, 502, 503, 504]:
        return ScrapingResult.TEMPORARY_ERROR
    else:
        return ScrapingResult.PERMANENT_ERROR

def advanced_scraper(url):
    response = requests.get(url)
    result = categorize_response(response.status_code)
    if result == ScrapingResult.SUCCESS:
        return response.text
    elif result == ScrapingResult.RATE_LIMITED:
        # Implement exponential backoff
        return handle_rate_limit(url, response)
    elif result == ScrapingResult.TEMPORARY_ERROR:
        # Retry with backoff
        return retry_request(url)
    else:
        # Log permanent error and skip
        print(f"Permanent error {response.status_code} for {url}")
        return None
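handle_rate_limit and retry_request are referenced above but not defined; a minimal sketch of what they might look like, reusing the Retry-After and exponential-backoff patterns shown earlier (the names and behavior are assumptions, not a fixed API):

import time
import requests

def handle_rate_limit(url, response):
    # Hypothetical helper: wait for the server-suggested delay, then retry once
    retry_after = int(response.headers.get('Retry-After', 60))
    time.sleep(retry_after)
    retry = requests.get(url)
    return retry.text if retry.status_code == 200 else None

def retry_request(url, max_retries=3):
    # Hypothetical helper: exponential backoff for transient 5xx errors
    for attempt in range(max_retries):
        time.sleep(2 ** attempt)
        retry = requests.get(url)
        if retry.status_code == 200:
            return retry.text
    return None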
Monitoring Status Codes in Production
from collections import defaultdict
import logging

class StatusCodeMonitor:
    def __init__(self):
        self.status_counts = defaultdict(int)
        self.logger = logging.getLogger(__name__)

    def record_status(self, status_code, url):
        self.status_counts[status_code] += 1
        if status_code >= 400:
            self.logger.warning(f"HTTP {status_code} for {url}")

    def get_statistics(self):
        total_requests = sum(self.status_counts.values())
        stats = {}
        for status, count in self.status_counts.items():
            percentage = (count / total_requests) * 100
            stats[status] = {
                'count': count,
                'percentage': round(percentage, 2)
            }
        return stats
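A brief usage sketch (the URLs are placeholders; the statistics you see depend on the responses you actually receive):

import requests

monitor = StatusCodeMonitor()
for url in ['https://example.com', 'https://example.com/missing-page']:
    response = requests.get(url)
    monitor.record_status(response.status_code, url)

print(monitor.get_statistics())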
Best Practices for Status Code Management
1. Always Check Status Codes
Never assume a request was successful without checking the status code.
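With requests, a compact way to enforce this is response.raise_for_status(), which raises an HTTPError for any 4xx or 5xx response:

import requests

response = requests.get('https://example.com')
try:
    response.raise_for_status()  # Raises requests.exceptions.HTTPError for 4xx/5xx
    print(response.text)
except requests.exceptions.HTTPError as e:
    print(f"Request failed: {e}")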
2. Implement Appropriate Retry Logic
Retry temporary errors (5xx, 429) but not permanent ones (4xx except 429).
3. Respect Rate Limits
When you encounter a 429, honor the Retry-After header if present.
4. Log Status Codes
Keep track of status code patterns to identify issues with your scraping targets. When handling errors in web automation tools, proper status code handling becomes even more critical for maintaining reliable scrapers.
5. Handle Redirects Appropriately
Decide whether to follow redirects automatically or handle them manually based on your use case.
Testing Status Code Handling
# Using curl to test different status codes
curl -I https://httpstat.us/200 # Test 200 OK
curl -I https://httpstat.us/404 # Test 404 Not Found
curl -I https://httpstat.us/500 # Test 500 Internal Server Error
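The same service can be exercised from Python to verify your handlers end to end (assuming httpstat.us is reachable from your environment):

import requests

for code in (200, 404, 429, 500, 503):
    response = requests.get(f'https://httpstat.us/{code}')
    print(code, '->', response.status_code)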
Conclusion
Understanding HTTP status codes is fundamental to building reliable web scrapers. The key is to implement appropriate handling for each category of status codes: celebrate success codes, follow redirects intelligently, retry temporary errors, and gracefully handle permanent failures. When combined with proper timeout handling and error management, comprehensive status code handling ensures your web scraping operations remain robust and efficient.
Remember that different websites may use status codes differently, so always test your scrapers against your target sites and monitor status code patterns in production to identify potential issues early.