What are the most important HTTP status codes for web scraping?
HTTP status codes are essential indicators that tell you whether your web scraping request was successful or encountered an issue. Understanding these codes is crucial for building robust scrapers that can handle various server responses appropriately. This guide covers the most important HTTP status codes you'll encounter during web scraping and how to handle them effectively.
Understanding HTTP Status Code Categories
HTTP status codes are organized into five categories, each serving a specific purpose:
- 1xx (Informational): Request received, continuing process
- 2xx (Success): Request was successfully received, understood, and accepted
- 3xx (Redirection): Further action needs to be taken to complete the request
- 4xx (Client Error): Request contains bad syntax or cannot be fulfilled
- 5xx (Server Error): Server failed to fulfill an apparently valid request
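Because the first digit of the code identifies its category, a small helper (a sketch, not part of any library) can classify any response:

def status_category(status_code):
    # Map the leading digit of the status code to its category name
    categories = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    return categories.get(status_code // 100, "Unknown")

print(status_category(404))  # Client Error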
Most Important Success Codes (2xx)
200 OK
The most common and important status code for web scraping. It indicates that the request was successful and the server has returned the requested content.
import requests

response = requests.get('https://example.com')

if response.status_code == 200:
    print("Success! Content retrieved:")
    print(response.text)
else:
    print(f"Request failed with status code: {response.status_code}")
// Using the fetch API
fetch('https://example.com')
  .then(response => {
    if (response.status === 200) {
      console.log('Success! Content retrieved');
      return response.text();
    } else {
      console.log(`Request failed with status code: ${response.status}`);
    }
  })
  .then(content => {
    if (content) console.log(content);
  });
201 Created
Indicates that a new resource has been successfully created. This is common when scraping APIs that accept POST requests for data submission.
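A minimal sketch, assuming a hypothetical API endpoint that accepts JSON submissions:

import requests

# api.example.com/items is a placeholder endpoint, not a real API
response = requests.post('https://api.example.com/items', json={'name': 'example'})
if response.status_code == 201:
    print("Resource created:", response.json())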
204 No Content
The request was successful, but there's no content to return. This might occur when scraping endpoints that perform actions without returning data.
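A 204 response carries an empty body, so check the status code rather than trying to parse content (the endpoint below is a placeholder):

import requests

response = requests.delete('https://api.example.com/items/123')  # placeholder endpoint
if response.status_code == 204:
    print("Action succeeded, no content returned")  # response.text will be empty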
Critical Redirection Codes (3xx)
301 Moved Permanently
The requested resource has been permanently moved to a new URL. Your scraper should update its references to use the new URL.
import requests

# Requests automatically follows redirects by default
response = requests.get('https://example.com/old-page')
print(f"Final URL: {response.url}")
print(f"Status code: {response.status_code}")

# To handle redirects manually
response = requests.get('https://example.com/old-page', allow_redirects=False)
if response.status_code == 301:
    new_url = response.headers['Location']
    print(f"Page permanently moved to: {new_url}")
302 Found (Temporary Redirect)
The resource is temporarily available at a different URL. Unlike 301, you shouldn't update your permanent references.
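Handling mirrors the 301 example above; with allow_redirects=False you can inspect the temporary target without storing it as the canonical URL (the path below is a placeholder):

import requests

response = requests.get('https://example.com/temporary', allow_redirects=False)
if response.status_code == 302:
    temp_url = response.headers['Location']
    print(f"Temporarily redirected to: {temp_url}")  # Don't save this as the permanent URL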
304 Not Modified
Used with conditional requests. The resource hasn't changed since the last request, so the cached version can be used.
import requests

headers = {
    'If-Modified-Since': 'Wed, 21 Oct 2023 07:28:00 GMT'
}
response = requests.get('https://example.com', headers=headers)

if response.status_code == 304:
    print("Content not modified, use cached version")
Essential Client Error Codes (4xx)
400 Bad Request
The server cannot process the request due to invalid syntax. Check your request parameters, headers, and data format.
import requests

try:
    # Example of a potentially malformed request
    response = requests.post('https://api.example.com/data',
                             json={'invalid': 'data'})
    if response.status_code == 400:
        print("Bad request - check your data format")
        print(response.text)  # Often contains error details
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
401 Unauthorized
Authentication is required or has failed. You need to provide valid credentials.
import requests

# Example with basic authentication
response = requests.get('https://api.example.com/protected',
                        auth=('username', 'password'))

if response.status_code == 401:
    print("Authentication failed - check credentials")
403 Forbidden
The server understood the request but refuses to authorize it. This often indicates:
- IP blocking
- Rate limiting
- Insufficient permissions
- Anti-bot measures
// Handling 403 with retry logic
async function scrapeWithRetry(url, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      const response = await fetch(url);
      if (response.status === 403) {
        console.log(`Access forbidden (attempt ${i + 1})`);
        if (i < maxRetries - 1) {
          // Wait before retry (exponential backoff)
          await new Promise(resolve =>
            setTimeout(resolve, Math.pow(2, i) * 1000));
          continue;
        }
      }
      return response;
    } catch (error) {
      console.error('Request failed:', error);
    }
  }
  throw new Error('Max retries exceeded');
}
404 Not Found
The requested resource doesn't exist. This is important for scrapers that crawl multiple pages.
import requests

def safe_scrape(url):
    try:
        response = requests.get(url)
        if response.status_code == 404:
            print(f"Page not found: {url}")
            return None
        elif response.status_code == 200:
            return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error scraping {url}: {e}")
        return None
429 Too Many Requests
Rate limiting is in effect. You're making requests too quickly and need to slow down.
import requests
import time

def scrape_with_rate_limit(urls):
    for url in urls:
        response = requests.get(url)
        if response.status_code == 429:
            retry_after = int(response.headers.get('Retry-After', 60))
            print(f"Rate limited. Waiting {retry_after} seconds...")
            time.sleep(retry_after)
            response = requests.get(url)  # Retry after waiting
        if response.status_code == 200:
            yield response.text
        # Add delay between requests to avoid rate limiting
        time.sleep(1)
Important Server Error Codes (5xx)
500 Internal Server Error
The server encountered an unexpected condition. This is often temporary, so implementing retry logic is recommended.
502 Bad Gateway
The server received an invalid response from an upstream server. Common with load balancers and proxy servers.
503 Service Unavailable
The server is temporarily unable to handle requests, often due to maintenance or overload.
import requests
import time

def robust_scrape(url, max_retries=3):
    retry_codes = [500, 502, 503, 504]
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code == 200:
                return response.text
            elif response.status_code in retry_codes:
                wait_time = 2 ** attempt  # Exponential backoff
                print(f"Server error {response.status_code}. "
                      f"Retrying in {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                print(f"Non-retryable error: {response.status_code}")
                break
        except requests.exceptions.RequestException as e:
            print(f"Request exception: {e}")
    return None
Advanced Status Code Handling Strategies
Implementing Comprehensive Error Handling
import requests
from enum import Enum

class ScrapingResult(Enum):
    SUCCESS = "success"
    TEMPORARY_ERROR = "temporary_error"
    PERMANENT_ERROR = "permanent_error"
    RATE_LIMITED = "rate_limited"

def categorize_response(status_code):
    if 200 <= status_code < 300:
        return ScrapingResult.SUCCESS
    elif status_code == 429:
        # Check 429 before the generic retryable list so it isn't swallowed
        return ScrapingResult.RATE_LIMITED
    elif status_code in [500, 502, 503, 504]:
        return ScrapingResult.TEMPORARY_ERROR
    else:
        return ScrapingResult.PERMANENT_ERROR

def advanced_scraper(url):
    response = requests.get(url)
    result = categorize_response(response.status_code)
    if result == ScrapingResult.SUCCESS:
        return response.text
    elif result == ScrapingResult.RATE_LIMITED:
        # Implement exponential backoff
        return handle_rate_limit(url, response)
    elif result == ScrapingResult.TEMPORARY_ERROR:
        # Retry with backoff
        return retry_request(url)
    else:
        # Log permanent error and skip
        print(f"Permanent error {response.status_code} for {url}")
        return None
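handle_rate_limit and retry_request are referenced above but not defined; a minimal sketch of what they might look like, reusing the Retry-After and exponential-backoff patterns shown earlier (the names and behavior are assumptions, not a fixed API):

import time
import requests

def handle_rate_limit(url, response):
    # Hypothetical helper: wait for the server-suggested delay, then retry once
    retry_after = int(response.headers.get('Retry-After', 60))
    time.sleep(retry_after)
    retry = requests.get(url)
    return retry.text if retry.status_code == 200 else None

def retry_request(url, max_retries=3):
    # Hypothetical helper: exponential backoff for transient 5xx errors
    for attempt in range(max_retries):
        time.sleep(2 ** attempt)
        retry = requests.get(url)
        if retry.status_code == 200:
            return retry.text
    return None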
Monitoring Status Codes in Production
from collections import defaultdict
import logging

class StatusCodeMonitor:
    def __init__(self):
        self.status_counts = defaultdict(int)
        self.logger = logging.getLogger(__name__)

    def record_status(self, status_code, url):
        self.status_counts[status_code] += 1
        if status_code >= 400:
            self.logger.warning(f"HTTP {status_code} for {url}")

    def get_statistics(self):
        total_requests = sum(self.status_counts.values())
        stats = {}
        for status, count in self.status_counts.items():
            percentage = (count / total_requests) * 100
            stats[status] = {
                'count': count,
                'percentage': round(percentage, 2)
            }
        return stats
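A brief usage sketch (the URLs are placeholders; the statistics you see depend on the responses you actually receive):

import requests

monitor = StatusCodeMonitor()
for url in ['https://example.com', 'https://example.com/missing-page']:
    response = requests.get(url)
    monitor.record_status(response.status_code, url)

print(monitor.get_statistics())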
Best Practices for Status Code Management
1. Always Check Status Codes
Never assume a request was successful without checking the status code.
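With requests, a compact way to enforce this is response.raise_for_status(), which raises an HTTPError for any 4xx or 5xx response:

import requests

response = requests.get('https://example.com')
try:
    response.raise_for_status()  # Raises requests.exceptions.HTTPError for 4xx/5xx
    print(response.text)
except requests.exceptions.HTTPError as e:
    print(f"Request failed: {e}")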
2. Implement Appropriate Retry Logic
Retry temporary errors (5xx, 429) but not permanent ones (4xx except 429).
3. Respect Rate Limits
When you encounter a 429, honor the Retry-After header if present.
4. Log Status Codes
Keep track of status code patterns to identify issues with your scraping targets. When handling errors in web automation tools, proper status code handling becomes even more critical for maintaining reliable scrapers.
5. Handle Redirects Appropriately
Decide whether to follow redirects automatically or handle them manually based on your use case.
Testing Status Code Handling
# Using curl to test different status codes
curl -I https://httpstat.us/200 # Test 200 OK
curl -I https://httpstat.us/404 # Test 404 Not Found
curl -I https://httpstat.us/500 # Test 500 Internal Server Error
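The same service can be exercised from Python to verify your handlers end to end (assuming httpstat.us is reachable from your environment):

import requests

for code in (200, 404, 429, 500, 503):
    response = requests.get(f'https://httpstat.us/{code}')
    print(code, '->', response.status_code)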
Conclusion
Understanding HTTP status codes is fundamental to building reliable web scrapers. The key is to implement appropriate handling for each category of status codes: celebrate success codes, follow redirects intelligently, retry temporary errors, and gracefully handle permanent failures. When combined with proper timeout handling and error management, comprehensive status code handling ensures your web scraping operations remain robust and efficient.
Remember that different websites may use status codes differently, so always test your scrapers against your target sites and monitor status code patterns in production to identify potential issues early.