What are the common HTTP status codes you encounter when scraping APIs?

When scraping APIs, understanding HTTP status codes is crucial for building robust and reliable scrapers. These three-digit codes provide immediate feedback about the success or failure of your requests, enabling you to implement appropriate error handling and retry logic. This comprehensive guide covers the most common HTTP status codes you'll encounter during API scraping and how to handle them effectively.

Understanding HTTP Status Code Categories

HTTP status codes are organized into five categories, each indicating a different type of response:

  • 1xx (Informational): Request received, continuing process
  • 2xx (Success): Request successfully received, understood, and accepted
  • 3xx (Redirection): Further action must be taken to complete the request
  • 4xx (Client Error): Request contains bad syntax or cannot be fulfilled
  • 5xx (Server Error): Server failed to fulfill an apparently valid request
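
Because the first digit carries the category, you can classify any code with integer division instead of matching individual values. A minimal sketch (the helper name is illustrative):

```python
def status_category(status_code):
    """Map an HTTP status code to its category name via the leading digit."""
    categories = {
        1: 'informational',
        2: 'success',
        3: 'redirection',
        4: 'client error',
        5: 'server error',
    }
    # 404 // 100 == 4, 503 // 100 == 5, etc.
    return categories.get(status_code // 100, 'unknown')
```

This is handy for coarse branching (e.g. "retry on any 5xx") before handling specific codes.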

Common Success Status Codes (2xx)

200 OK

The most common success code, indicating that the request was successful and the server returned the requested data.

Python:

import requests

response = requests.get('https://api.example.com/data')
if response.status_code == 200:
    data = response.json()
    print("Data retrieved successfully:", data)

JavaScript:

fetch('https://api.example.com/data')
  .then(response => {
    if (response.status === 200) {
      return response.json();
    }
    throw new Error(`HTTP ${response.status}`);
  })
  .then(data => console.log('Data retrieved:', data))
  .catch(error => console.error('Request failed:', error));

201 Created

Indicates that a new resource was successfully created, typically returned after POST requests.

import requests

data = {'name': 'New Item', 'value': 123}
response = requests.post('https://api.example.com/items', json=data)
if response.status_code == 201:
    print("Resource created successfully")
    new_item = response.json()

204 No Content

The request was successful, but there's no content to return. Common with DELETE operations or updates that don't return data.

import requests

response = requests.delete('https://api.example.com/items/123')
if response.status_code == 204:
    print("Item deleted successfully")

Redirection Status Codes (3xx)

301 Moved Permanently

The resource has been permanently moved to a new location. Most HTTP libraries handle this automatically.

import requests

# requests library follows redirects automatically
response = requests.get('https://api.example.com/old-endpoint')
print(f"Final URL: {response.url}")
print(f"Redirect history: {[r.url for r in response.history]}")

302 Found

Indicates a temporary redirect: keep using the original URL for future requests. Handle it the same way as 301; most HTTP libraries follow it automatically. The related 307 Temporary Redirect additionally guarantees that the request method is preserved across the redirect.
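
If you disable automatic redirects (in requests, pass `allow_redirects=False`) you can decide manually how to treat each redirect code, for example to update stored URLs only on permanent moves. A small sketch (the helper name is illustrative):

```python
def classify_redirect(status_code):
    """Return 'permanent', 'temporary', or None for non-redirect codes.

    Permanent redirects (301, 308) mean you should update any stored URLs;
    temporary ones (302, 303, 307) mean keep requesting the original URL.
    """
    if status_code in (301, 308):
        return 'permanent'
    if status_code in (302, 303, 307):
        return 'temporary'
    return None
```

With `allow_redirects=False`, the new location is available in the response's `Location` header.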

304 Not Modified

Used with conditional requests to indicate cached content is still valid. Implement caching to leverage this:

import requests

headers = {'If-None-Match': 'stored-etag-value'}
response = requests.get('https://api.example.com/data', headers=headers)
if response.status_code == 304:
    print("Use cached data")
else:
    # Process new data and store new ETag
    new_etag = response.headers.get('ETag')

Client Error Status Codes (4xx)

400 Bad Request

The request syntax is malformed or invalid. Check your request parameters, headers, and body format.

import requests

try:
    response = requests.post('https://api.example.com/data', json={'invalid': 'data'})
    if response.status_code == 400:
        error_details = response.json()
        print(f"Bad request: {error_details}")
        # Fix the request based on error details
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")

401 Unauthorized

Authentication is required or has failed. Implement proper authentication handling:

import requests
from requests.auth import HTTPBasicAuth

# Basic authentication
response = requests.get(
    'https://api.example.com/protected',
    auth=HTTPBasicAuth('username', 'password')
)

# Token-based authentication
headers = {'Authorization': 'Bearer your-access-token'}
response = requests.get('https://api.example.com/protected', headers=headers)

if response.status_code == 401:
    print("Authentication failed - refresh token or check credentials")

403 Forbidden

The server understands the request but refuses to authorize it. This often indicates insufficient permissions or blocked access.

def handle_forbidden_response(response):
    if response.status_code == 403:
        print("Access forbidden - check API permissions or rate limits")
        # Implement backoff strategy or alternative approach
        return False
    return True

404 Not Found

The requested resource doesn't exist. Common when scraping APIs with dynamic endpoints.

import requests

def safe_api_request(url):
    try:
        response = requests.get(url)
        if response.status_code == 404:
            print(f"Resource not found: {url}")
            return None
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Request error: {e}")
        return None

429 Too Many Requests

Rate limiting is in effect. Implement exponential backoff and respect rate limits:

import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session_with_retries():
    session = requests.Session()

    # Configure retry strategy
    retry_strategy = Retry(
        total=3,
        status_forcelist=[429, 500, 502, 503, 504],
        backoff_factor=1,
        respect_retry_after_header=True
    )

    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)

    return session

# Usage
session = create_session_with_retries()
response = session.get('https://api.example.com/data')

The same retry pattern in JavaScript:

async function fetchWithRetry(url, options = {}, maxRetries = 3) {
    for (let attempt = 1; attempt <= maxRetries; attempt++) {
        try {
            const response = await fetch(url, options);

            if (response.status === 429) {
                const retryAfter = response.headers.get('Retry-After');
                const delay = retryAfter ? parseInt(retryAfter) * 1000 : Math.pow(2, attempt) * 1000;

                console.log(`Rate limited. Waiting ${delay}ms before retry ${attempt}`);
                await new Promise(resolve => setTimeout(resolve, delay));
                continue;
            }

            return response;
        } catch (error) {
            if (attempt === maxRetries) throw error;
            await new Promise(resolve => setTimeout(resolve, Math.pow(2, attempt) * 1000));
        }
    }
}

Server Error Status Codes (5xx)

500 Internal Server Error

Generic server error. Implement retry logic with exponential backoff:

import time
import random

def exponential_backoff_retry(func, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            response = func()
            if response.status_code < 500:
                return response
        except Exception:
            if attempt == max_retries - 1:
                raise

        if attempt < max_retries - 1:
            # Exponential backoff with jitter to avoid synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

    return None

502 Bad Gateway / 503 Service Unavailable / 504 Gateway Timeout

These indicate temporary server issues. Handle them with retry logic similar to 500 errors.
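
503 responses in particular often carry a Retry-After header telling you how long to wait. A small helper (the function name is illustrative) that prefers the server's hint and falls back to exponential backoff:

```python
def retry_delay_for(status_code, retry_after_header=None, attempt=0, base_delay=1):
    """Return seconds to wait before retrying, or None if the code isn't retryable."""
    if status_code not in (500, 502, 503, 504):
        return None
    # Prefer the server's Retry-After hint (503 commonly sets it)
    if retry_after_header is not None:
        try:
            return int(retry_after_header)
        except ValueError:
            pass  # Retry-After may also be an HTTP date; fall back to backoff
    return base_delay * (2 ** attempt)
```

Use the returned delay with `time.sleep()` inside your retry loop.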

Comprehensive Error Handling Strategy

Here's a complete example that handles multiple status codes appropriately:

import requests
import time
import logging
from typing import Optional, Dict, Any

class APIClient:
    def __init__(self, base_url: str, api_key: str = None):
        self.base_url = base_url
        self.session = requests.Session()
        if api_key:
            self.session.headers.update({'Authorization': f'Bearer {api_key}'})

    def make_request(self, endpoint: str, method: str = 'GET', **kwargs) -> Optional[Dict[Any, Any]]:
        url = f"{self.base_url}/{endpoint.lstrip('/')}"
        max_retries = 3

        for attempt in range(max_retries):
            try:
                response = self.session.request(method, url, **kwargs)

                if response.status_code == 200:
                    return response.json()
                elif response.status_code == 201:
                    logging.info("Resource created successfully")
                    return response.json() if response.content else {}
                elif response.status_code == 204:
                    logging.info("Operation completed successfully")
                    return {}
                elif response.status_code == 401:
                    logging.error("Authentication failed")
                    raise AuthenticationError("Invalid credentials")
                elif response.status_code == 403:
                    logging.error("Access forbidden")
                    raise PermissionError("Insufficient permissions")
                elif response.status_code == 404:
                    logging.warning(f"Resource not found: {url}")
                    return None
                elif response.status_code == 429:
                    retry_after = int(response.headers.get('Retry-After', 60))
                    logging.warning(f"Rate limited. Waiting {retry_after} seconds")
                    time.sleep(retry_after)
                    continue
                elif 500 <= response.status_code < 600:
                    if attempt < max_retries - 1:
                        delay = 2 ** attempt
                        logging.warning(f"Server error {response.status_code}. Retrying in {delay}s")
                        time.sleep(delay)
                        continue
                    else:
                        logging.error(f"Server error {response.status_code} after {max_retries} attempts")
                        raise ServerError(f"Server error: {response.status_code}")
                else:
                    logging.error(f"Unexpected status code: {response.status_code}")
                    response.raise_for_status()

            except requests.exceptions.RequestException as e:
                if attempt < max_retries - 1:
                    time.sleep(2 ** attempt)
                    continue
                raise NetworkError(f"Network error: {e}")

        return None

# Custom exceptions
class AuthenticationError(Exception):
    pass

class ServerError(Exception):
    pass

class NetworkError(Exception):
    pass

Testing Status Code Handling

When developing scrapers, it's important to test how your code handles different status codes:

import unittest
from unittest.mock import Mock, patch

class TestAPIClient(unittest.TestCase):
    def setUp(self):
        self.client = APIClient('https://api.example.com', 'test-key')

    @patch('requests.Session.request')
    def test_429_retry_logic(self, mock_request):
        # Mock 429 response then success
        mock_429 = Mock()
        mock_429.status_code = 429
        mock_429.headers = {'Retry-After': '1'}

        mock_200 = Mock()
        mock_200.status_code = 200
        mock_200.json.return_value = {'data': 'success'}

        mock_request.side_effect = [mock_429, mock_200]

        result = self.client.make_request('/test')
        self.assertEqual(result, {'data': 'success'})
        self.assertEqual(mock_request.call_count, 2)

Best Practices for Status Code Handling

  1. Always check status codes explicitly rather than assuming success
  2. Implement appropriate retry logic for temporary failures (5xx, 429)
  3. Respect rate limits by implementing exponential backoff
  4. Log status codes and errors for debugging and monitoring
  5. Handle authentication errors by refreshing tokens when possible
  6. Use libraries that support automatic retries when available
  7. Test your error handling code with mocked responses
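
For point 5, a common pattern is to refresh the token once on a 401 and replay the request. A minimal sketch, assuming you supply a `refresh_token()` callable (the function and parameter names are illustrative):

```python
def request_with_refresh(session, url, refresh_token):
    """GET a URL; on a 401, refresh the bearer token once and retry."""
    response = session.get(url)
    if response.status_code == 401:
        new_token = refresh_token()  # your token-refresh logic
        session.headers['Authorization'] = f'Bearer {new_token}'
        response = session.get(url)  # replay the original request once
    return response
```

Limiting the refresh to a single attempt avoids an infinite loop when credentials are genuinely invalid.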

Understanding and properly handling HTTP status codes is essential for building reliable API scrapers. When dealing with complex scraping scenarios involving dynamic content, you might also want to explore how to handle AJAX requests using Puppeteer for JavaScript-heavy applications, or learn about handling timeouts in Puppeteer when working with slower-responding APIs.

By implementing comprehensive status code handling, you'll create more robust scrapers that can gracefully handle various API responses and continue operating reliably even when encountering errors or temporary issues.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
