How do I handle HTTP status codes effectively with urllib3?

HTTP status codes tell you how a server handled your request. When you use urllib3 for web scraping or API calls, handling them properly lets your application distinguish success from failure, retry transient errors, and report permanent ones clearly. This guide covers basic checks, a dispatch-based handler, retry configuration, and best practices for handling status codes with urllib3.

Understanding HTTP Status Codes

HTTP status codes are three-digit numbers that indicate the outcome of HTTP requests. They're grouped into five categories:

  • 1xx (Informational): Request received, continuing process
  • 2xx (Success): Request was successfully received, understood, and accepted
  • 3xx (Redirection): Further action needs to be taken to complete the request
  • 4xx (Client Error): Request contains bad syntax or cannot be fulfilled
  • 5xx (Server Error): Server failed to fulfill an apparently valid request
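
As a minimal sketch of this grouping, the category can be derived from the first digit of the code. The function name `status_category` and the category labels are illustrative choices, not part of urllib3:

```python
def status_category(status: int) -> str:
    """Map an HTTP status code to its category using the leading digit."""
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client_error",
        5: "server_error",
    }
    # Integer division by 100 isolates the first digit of a three-digit code
    return categories.get(status // 100, "unknown")


for code in (200, 301, 404, 503):
    print(code, status_category(code))
```

Checking the category first and the exact code second keeps handlers short, because most codes within a category are treated the same way.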

Basic Status Code Handling with urllib3

Here's how to check and handle HTTP status codes with urllib3:

import urllib3
from urllib3.exceptions import HTTPError

# Create a PoolManager instance
http = urllib3.PoolManager()

def make_request_with_status_handling(url):
    try:
        response = http.request('GET', url)

        # Check status code
        if response.status == 200:
            print("Success! Data retrieved successfully")
            return response.data.decode('utf-8')
        elif response.status == 404:
            print("Error: Resource not found")
            return None
        elif response.status == 403:
            print("Error: Access forbidden")
            return None
        elif response.status == 500:
            print("Error: Internal server error")
            return None
        else:
            print(f"Unexpected status code: {response.status}")
            return None

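    except HTTPError as e:
        # urllib3 does not normally raise for 4xx/5xx responses; HTTPError is
        # the base class for transport failures (connection errors, timeouts)
        print(f"HTTP error occurred: {e}")
        return None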
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

# Usage example
url = "https://httpbin.org/status/200"
result = make_request_with_status_handling(url)

Comprehensive Status Code Handling Strategy

For production applications, implement a more comprehensive approach:

import urllib3
from urllib3.exceptions import MaxRetryError, TimeoutError

class StatusCodeHandler:
    def __init__(self, max_retries=3, backoff_factor=1):
        self.http = urllib3.PoolManager(
            retries=urllib3.Retry(
                total=max_retries,
                backoff_factor=backoff_factor,
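                # Retried for idempotent methods; once retries run out, urllib3
                # raises MaxRetryError (raise_on_status defaults to True)
                status_forcelist=[429, 500, 502, 503, 504]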
            )
        )

    def handle_response(self, response):
        """Handle different HTTP status codes"""
        status_handlers = {
            200: self._handle_success,
            201: self._handle_created,
            204: self._handle_no_content,
            301: self._handle_redirect,
            302: self._handle_redirect,
            400: self._handle_bad_request,
            401: self._handle_unauthorized,
            403: self._handle_forbidden,
            404: self._handle_not_found,
            429: self._handle_rate_limit,
            500: self._handle_server_error,
            502: self._handle_bad_gateway,
            503: self._handle_service_unavailable,
        }

        handler = status_handlers.get(response.status, self._handle_unknown)
        return handler(response)

    def _handle_success(self, response):
        return {
            'success': True,
            'data': response.data.decode('utf-8'),
            'status': response.status
        }

    def _handle_created(self, response):
        return {
            'success': True,
            'message': 'Resource created successfully',
            'status': response.status
        }

    def _handle_no_content(self, response):
        return {
            'success': True,
            'message': 'Operation completed successfully',
            'status': response.status
        }

    def _handle_redirect(self, response):
        return {
            'success': False,
            'error': 'Redirect not followed automatically',
            'location': response.headers.get('Location'),
            'status': response.status
        }

    def _handle_bad_request(self, response):
        return {
            'success': False,
            'error': 'Bad request - check your parameters',
            'status': response.status
        }

    def _handle_unauthorized(self, response):
        return {
            'success': False,
            'error': 'Authentication required',
            'status': response.status
        }

    def _handle_forbidden(self, response):
        return {
            'success': False,
            'error': 'Access forbidden - insufficient permissions',
            'status': response.status
        }

    def _handle_not_found(self, response):
        return {
            'success': False,
            'error': 'Resource not found',
            'status': response.status
        }

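    def _handle_rate_limit(self, response):
        # Retry-After may be a number of seconds or an HTTP date;
        # fall back to 60 seconds when it is missing or not numeric
        try:
            retry_after = int(response.headers.get('Retry-After', 60))
        except ValueError:
            retry_after = 60
        return {
            'success': False,
            'error': f'Rate limited - retry after {retry_after} seconds',
            'retry_after': retry_after,
            'status': response.status
        }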

    def _handle_server_error(self, response):
        return {
            'success': False,
            'error': 'Internal server error',
            'status': response.status
        }

    def _handle_bad_gateway(self, response):
        return {
            'success': False,
            'error': 'Bad gateway - server acting as proxy received invalid response',
            'status': response.status
        }

    def _handle_service_unavailable(self, response):
        return {
            'success': False,
            'error': 'Service temporarily unavailable',
            'status': response.status
        }

    def _handle_unknown(self, response):
        return {
            'success': False,
            'error': f'Unknown status code: {response.status}',
            'status': response.status
        }

# Usage example
handler = StatusCodeHandler()

def make_robust_request(url, method='GET', **kwargs):
    try:
        response = handler.http.request(method, url, **kwargs)
        return handler.handle_response(response)
    except MaxRetryError as e:
        return {
            'success': False,
            'error': f'Max retries exceeded: {e}',
            'status': None
        }
    except TimeoutError as e:
        return {
            'success': False,
            'error': f'Request timeout: {e}',
            'status': None
        }
    except Exception as e:
        return {
            'success': False,
            'error': f'Unexpected error: {e}',
            'status': None
        }

# Test the implementation
result = make_robust_request('https://httpbin.org/status/404')
print(result)
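
One caveat about the 301/302 handlers: urllib3 follows redirects by default, so a redirect response only reaches your code when redirect following is turned off. The sketch below passes `redirect=False` to `request()` for that purpose. The helper name `fetch_without_redirects` is hypothetical, and `pool` is assumed to be a `urllib3.PoolManager` or an object with a compatible `request()` method:

```python
def fetch_without_redirects(pool, url, method="GET"):
    """Return (status, Location header) without following redirects.

    `pool` is assumed to be a urllib3.PoolManager or a compatible object.
    """
    # redirect=False makes urllib3 hand back the 3xx response itself
    response = pool.request(method, url, redirect=False)
    return response.status, response.headers.get("Location")


# Usage sketch (performs a real request, so it is left commented out):
# import urllib3
# status, location = fetch_without_redirects(
#     urllib3.PoolManager(), "https://httpbin.org/redirect/1"
# )
```

Inspecting the `Location` header yourself is useful when you want to log redirect chains or refuse redirects to other hosts.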

Implementing Retry Logic for Specific Status Codes

urllib3 provides built-in retry functionality that can be customized for specific status codes:

import urllib3
from urllib3.util.retry import Retry

# Configure retry strategy
retry_strategy = Retry(
    total=5,  # Total number of retries
    status_forcelist=[429, 500, 502, 503, 504],  # Status codes to retry
    allowed_methods=["HEAD", "GET", "OPTIONS"],  # HTTP methods to retry (named method_whitelist before urllib3 1.26)
    backoff_factor=1,  # Backoff factor for retry delays
    raise_on_status=False  # Return the last response instead of raising when status retries run out
)

# Create PoolManager with retry configuration
http = urllib3.PoolManager(retries=retry_strategy)

def make_request_with_retries(url):
    try:
        response = http.request('GET', url)

        if response.status in [200, 201, 204]:
            return {
                'success': True,
                'data': response.data.decode('utf-8'),
                'status': response.status
            }
        else:
            return {
                'success': False,
                'error': f'Request failed with status {response.status}',
                'status': response.status
            }

    except urllib3.exceptions.MaxRetryError as e:
        return {
            'success': False,
            'error': f'Max retries exceeded: {e}',
            'status': None
        }

# Usage example
result = make_request_with_retries('https://httpbin.org/status/503')
print(result)
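
To get a feel for the delays that `backoff_factor` produces, the sketch below approximates the schedule described in the urllib3 `Retry` documentation: no delay before the first retry, then `backoff_factor * 2 ** (n - 1)` seconds before the n-th retry, capped at a maximum. `backoff_schedule` is a hypothetical helper, not a urllib3 function. The 120-second default cap is an assumption to check against your installed version, and the sketch ignores jitter and `Retry-After` headers:

```python
def backoff_schedule(backoff_factor, total, backoff_max=120.0):
    """Approximate the sleep (in seconds) before each of `total` retries."""
    delays = []
    for attempt in range(1, total + 1):
        if attempt == 1:
            # The first retry is assumed to happen immediately
            delays.append(0.0)
        else:
            delay = backoff_factor * 2 ** (attempt - 1)
            delays.append(float(min(backoff_max, delay)))
    return delays


print(backoff_schedule(backoff_factor=1, total=5))
```

Under these assumptions, `backoff_factor=1` with `total=5` gives delays of 0, 2, 4, 8, and 16 seconds, so a request that keeps failing occupies the caller for about half a minute before it gives up.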

Advanced Status Code Handling with Context Managers

For better resource management and consistent error handling:

import urllib3
from contextlib import contextmanager
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@contextmanager
def http_client(timeout=30, retries=3):
    """Context manager for urllib3 HTTP client with proper cleanup"""
    retry_config = urllib3.Retry(
        total=retries,
        status_forcelist=[429, 500, 502, 503, 504],
        backoff_factor=0.3
    )

    http = urllib3.PoolManager(
        timeout=urllib3.Timeout(connect=timeout, read=timeout),
        retries=retry_config
    )

    try:
        yield http
    finally:
        http.clear()

class HTTPStatusHandler:
    @staticmethod
    def is_success(status_code):
        """Check if status code indicates success"""
        return 200 <= status_code < 300

    @staticmethod
    def is_client_error(status_code):
        """Check if status code indicates client error"""
        return 400 <= status_code < 500

    @staticmethod
    def is_server_error(status_code):
        """Check if status code indicates server error"""
        return 500 <= status_code < 600

    @staticmethod
    def should_retry(status_code):
        """Determine if request should be retried based on status code"""
        return status_code in [429, 500, 502, 503, 504]

    @staticmethod
    def log_status(url, status_code, method='GET'):
        """Log HTTP status information"""
        if HTTPStatusHandler.is_success(status_code):
            logger.info(f"{method} {url} - Success ({status_code})")
        elif HTTPStatusHandler.is_client_error(status_code):
            logger.warning(f"{method} {url} - Client Error ({status_code})")
        elif HTTPStatusHandler.is_server_error(status_code):
            logger.error(f"{method} {url} - Server Error ({status_code})")

def fetch_with_comprehensive_handling(url, method='GET', **kwargs):
    """Fetch URL with comprehensive status code handling"""
    with http_client() as http:
        try:
            response = http.request(method, url, **kwargs)

            # Log the status
            HTTPStatusHandler.log_status(url, response.status, method)

            # Handle based on status code category
            if HTTPStatusHandler.is_success(response.status):
                return {
                    'success': True,
                    'data': response.data.decode('utf-8'),
                    'status': response.status,
                    'headers': dict(response.headers)
                }

            elif HTTPStatusHandler.is_client_error(response.status):
                error_messages = {
                    400: "Bad Request - Invalid parameters",
                    401: "Unauthorized - Authentication required",
                    403: "Forbidden - Access denied",
                    404: "Not Found - Resource doesn't exist",
                    409: "Conflict - Resource conflict",
                    422: "Unprocessable Entity - Validation failed",
                    429: "Too Many Requests - Rate limited"
                }

                error_msg = error_messages.get(
                    response.status, 
                    f"Client error ({response.status})"
                )

                return {
                    'success': False,
                    'error': error_msg,
                    'status': response.status,
                    'retry_recommended': response.status == 429
                }

            elif HTTPStatusHandler.is_server_error(response.status):
                return {
                    'success': False,
                    'error': f"Server error ({response.status})",
                    'status': response.status,
                    'retry_recommended': HTTPStatusHandler.should_retry(response.status)
                }

            else:
                return {
                    'success': False,
                    'error': f"Unexpected status code: {response.status}",
                    'status': response.status,
                    'retry_recommended': False
                }

        except Exception as e:
            logger.error(f"Request failed: {e}")
            return {
                'success': False,
                'error': str(e),
                'status': None,
                'retry_recommended': True
            }

# Usage example
result = fetch_with_comprehensive_handling('https://httpbin.org/status/200')
print(result)
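
Note that `fetch_with_comprehensive_handling` builds and tears down a pool on every call, which gives up connection reuse. A possible alternative, sketched here, shares one pool across several requests by using `PoolManager` directly in a `with` statement, which clears the pool on exit. The names `build_client` and `fetch_statuses` are illustrative, and the retry and timeout values simply mirror the ones above:

```python
import urllib3


def build_client(timeout=30.0, retries=3):
    """Build a PoolManager with the retry and timeout settings used above."""
    retry_config = urllib3.Retry(
        total=retries,
        status_forcelist=[429, 500, 502, 503, 504],
        backoff_factor=0.3,
    )
    return urllib3.PoolManager(
        timeout=urllib3.Timeout(connect=timeout, read=timeout),
        retries=retry_config,
    )


def fetch_statuses(urls):
    """Fetch several URLs over one shared pool and return {url: status}."""
    results = {}
    with build_client() as http:  # leaving the block clears the pool
        for url in urls:
            try:
                results[url] = http.request("GET", url).status
            except urllib3.exceptions.HTTPError:
                # Transport failure or exhausted retries: no status available
                results[url] = None
    return results
```

Calling `fetch_statuses` with a list of URLs on the same host lets urllib3 reuse the underlying connection instead of reconnecting for each request.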

Best Practices for HTTP Status Code Handling

1. Always Check Status Codes

Never assume a request succeeded without checking the status code:

response = http.request('GET', url)
if not 200 <= response.status < 300:
    # Handle non-2xx responses appropriately
    # (handle_error is a placeholder for your own error handling)
    handle_error(response.status, response.data)
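
One way to make this check hard to forget is to turn unexpected statuses into exceptions. urllib3 responses do not raise for 4xx/5xx on their own, so the sketch below defines a hypothetical `ensure_success` helper and `UnexpectedStatusError` exception. Both names are illustrative, and the helper assumes only that the response has `status` and `data` attributes:

```python
class UnexpectedStatusError(Exception):
    """Raised when a response falls outside the 2xx range."""

    def __init__(self, status, body):
        super().__init__(f"Unexpected HTTP status {status}")
        self.status = status
        self.body = body


def ensure_success(response):
    """Return the response unchanged if it is 2xx, otherwise raise."""
    if not 200 <= response.status < 300:
        raise UnexpectedStatusError(response.status, response.data)
    return response


# Usage sketch, assuming `http` is a urllib3.PoolManager:
# response = ensure_success(http.request("GET", url))
```

Raising keeps the success path free of status checks and lets callers decide in one place how to report or retry failures.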

2. Implement Proper Logging

Log different status codes at appropriate levels:

import logging

def log_response(response, url):
    if 200 <= response.status < 300:
        logging.info(f"Success: {url} returned {response.status}")
    elif 400 <= response.status < 500:
        logging.warning(f"Client error: {url} returned {response.status}")
    elif 500 <= response.status < 600:
        logging.error(f"Server error: {url} returned {response.status}")

3. Handle Rate Limiting Gracefully

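Respect rate limits by honoring the Retry-After header before retrying:

import time

def handle_rate_limit(response):
    if response.status == 429:
        # Retry-After is usually a number of seconds; fall back to 60
        # when the header is missing or is not a plain integer
        try:
            retry_after = int(response.headers.get('Retry-After', 60))
        except ValueError:
            retry_after = 60
        print(f"Rate limited. Waiting {retry_after} seconds...")
        time.sleep(retry_after)
        return True
    return False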
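
The `Retry-After` header can carry either a number of seconds or an HTTP date, so a plain `int()` conversion does not cover every server. The following sketch handles both forms with the standard library. `parse_retry_after` is a hypothetical helper, and the 60-second default is an arbitrary assumption:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, default=60.0, now=None):
    """Convert a Retry-After header value into a delay in seconds."""
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        # Delay-seconds form, e.g. "120"
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the caller may retry immediately
    return max(0.0, (retry_at - now).total_seconds())
```

The optional `now` parameter exists only to make the function easy to test with a fixed clock.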

4. Distinguish Between Retryable and Non-Retryable Errors

Not all errors should trigger retries:

RETRYABLE_STATUS_CODES = [429, 500, 502, 503, 504]
NON_RETRYABLE_STATUS_CODES = [400, 401, 403, 404, 422]

def should_retry_request(status_code):
    return status_code in RETRYABLE_STATUS_CODES
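
When you retry by hand rather than through `urllib3.Retry`, the same distinction can drive a small loop. In this sketch, `send` is a hypothetical zero-argument callable that performs the request and returns an object with a `status` attribute, and the delay doubles after each retryable failure. The function name and parameters are assumptions for illustration:

```python
import time

RETRYABLE_STATUS_CODES = frozenset({429, 500, 502, 503, 504})


def request_with_retry_policy(send, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call send() until the status is not retryable or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        response = send()
        if response.status not in RETRYABLE_STATUS_CODES:
            # Success or a permanent error: retrying would not help
            return response
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    # Retries exhausted: return the last retryable response to the caller
    return response


# Usage sketch, assuming `http` is a urllib3.PoolManager created with retries=False:
# response = request_with_retry_policy(lambda: http.request("GET", url))
```

A 404 or 401 returns after a single call, while a 503 is retried up to `max_attempts` times.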

Integration with Web Scraping Workflows

Status code handling matters especially in web scrapers, where blocked requests (403), rate limits (429), and missing pages (404) are routine. urllib3 handles only the HTTP layer, so JavaScript-heavy sites, authentication flows, and other dynamic content may require browser automation tools in addition to urllib3.

Conclusion

Effective HTTP status code handling with urllib3 involves understanding the meaning of different status codes, implementing appropriate retry logic, and building robust error handling mechanisms. By following the patterns shown in this guide, you can create reliable applications that gracefully handle various server responses and network conditions.

Remember to always log relevant information, respect rate limits, and implement proper retry strategies for transient errors. This approach ensures your applications are resilient and provide meaningful feedback when issues occur.
