How can I handle HTTP 404 Not Found errors gracefully?

HTTP 404 "Not Found" errors are among the most common issues encountered in web scraping and API development. These errors occur when a requested resource doesn't exist on the server, and handling them gracefully is crucial for building robust applications. This comprehensive guide covers various strategies and implementation techniques for managing 404 errors effectively.

Understanding HTTP 404 Errors

A 404 error indicates that the server cannot find the requested resource. This can happen for several reasons:

  • The URL is incorrect or has been changed
  • The resource has been moved or deleted
  • The server is misconfigured
  • The server intentionally returns 404 instead of 403 to hide a resource you are not permitted to access

Basic Error Detection and Handling

Python with Requests Library

import requests
from requests.exceptions import RequestException
import time
import logging

def handle_404_gracefully(url, max_retries=3, retry_delay=1):
    """
    Handle 404 errors with retry logic and graceful degradation
    """
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)

            if response.status_code == 404:
                logging.warning(f"404 Not Found: {url}")
                return {
                    'success': False,
                    'error': '404_not_found',
                    'message': f'Resource not found at {url}',
                    'status_code': 404
                }

            # Check for other HTTP errors
            response.raise_for_status()

            return {
                'success': True,
                'data': response.text,
                'status_code': response.status_code
            }

        except requests.exceptions.Timeout:
            logging.error(f"Timeout error for {url}")
            if attempt < max_retries - 1:
                time.sleep(retry_delay * (attempt + 1))
                continue

        except RequestException as e:
            logging.error(f"Request error for {url}: {str(e)}")
            if attempt < max_retries - 1:
                time.sleep(retry_delay * (attempt + 1))

    return {
        'success': False,
        'error': 'max_retries_exceeded',
        'message': f'Failed to retrieve {url} after {max_retries} attempts'
    }

# Usage example (process_data and use_fallback_data are your own application code)
url = "https://example.com/api/resource"
result = handle_404_gracefully(url)

if result['success']:
    print("Data retrieved successfully")
    process_data(result['data'])
else:
    if result.get('error') == '404_not_found':
        print("Resource not found, using fallback data")
        use_fallback_data()
    else:
        print(f"Error: {result['message']}")

JavaScript with Fetch API

async function handle404Gracefully(url, options = {}) {
    const {
        maxRetries = 3,
        retryDelay = 1000,
        fallbackUrl = null
    } = options;

    for (let attempt = 0; attempt < maxRetries; attempt++) {
        try {
            const response = await fetch(url, {
                // fetch() has no "timeout" option; abort the request via a signal instead
                signal: AbortSignal.timeout(10000),
                headers: {
                    // Browsers ignore a custom User-Agent; this takes effect in Node.js
                    'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)'
                }
            });

            if (response.status === 404) {
                console.warn(`404 Not Found: ${url}`);

                // Try fallback URL if provided
                if (fallbackUrl && attempt === 0) {
                    console.log(`Trying fallback URL: ${fallbackUrl}`);
                    return await handle404Gracefully(fallbackUrl, {
                        ...options,
                        fallbackUrl: null // Prevent infinite recursion
                    });
                }

                return {
                    success: false,
                    error: '404_not_found',
                    message: `Resource not found at ${url}`,
                    statusCode: 404
                };
            }

            if (!response.ok) {
                throw new Error(`HTTP ${response.status}: ${response.statusText}`);
            }

            const data = await response.text();
            return {
                success: true,
                data: data,
                statusCode: response.status
            };

        } catch (error) {
            console.error(`Attempt ${attempt + 1} failed for ${url}:`, error.message);

            if (attempt < maxRetries - 1) {
                await new Promise(resolve => 
                    setTimeout(resolve, retryDelay * (attempt + 1))
                );
                continue;
            }
        }
    }

    return {
        success: false,
        error: 'max_retries_exceeded',
        message: `Failed to retrieve ${url} after ${maxRetries} attempts`
    };
}

// Usage with async/await
async function scrapeWithErrorHandling() {
    const url = "https://example.com/api/data";
    const result = await handle404Gracefully(url, {
        maxRetries: 3,
        retryDelay: 2000,
        fallbackUrl: "https://example.com/api/backup-data"
    });

    if (result.success) {
        console.log("Data retrieved successfully");
        return processData(result.data);
    } else if (result.error === '404_not_found') {
        console.log("Resource not found, using cached data");
        return getCachedData();
    } else {
        console.error("Failed to retrieve data:", result.message);
        throw new Error(result.message);
    }
}

Advanced Error Handling Strategies

URL Validation and Correction

import requests
from urllib.parse import urljoin, urlparse

class URLHandler:
    def __init__(self, base_url=None):
        self.base_url = base_url

    def validate_and_correct_url(self, url):
        """
        Validate URL and attempt common corrections for 404 errors
        """
        parsed = urlparse(url)

        # Check if URL is malformed
        if not parsed.scheme or not parsed.netloc:
            if self.base_url:
                url = urljoin(self.base_url, url)
            else:
                return None

        # Try common URL variations
        variations = [
            url,
            url.rstrip('/'),  # Remove trailing slash
            url + '/',        # Add trailing slash
            url.replace('/api/v1/', '/api/v2/'),  # Version upgrade
            url.replace('/api/v2/', '/api/v1/'),  # Version downgrade
        ]

        return variations

    def try_url_variations(self, original_url):
        """
        Try multiple URL variations to handle common 404 scenarios
        """
        variations = self.validate_and_correct_url(original_url)
        if not variations:
            return None

        for url in variations:
            try:
                response = requests.head(url, timeout=5)
                if response.status_code == 200:
                    return url
            except requests.RequestException:
                continue

        return None

# Usage
handler = URLHandler("https://api.example.com")
working_url = handler.try_url_variations("https://api.example.com/v1/users/123")

Implementing Circuit Breaker Pattern

import time
import requests
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = CircuitState.CLOSED

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time < self.recovery_timeout:
                raise Exception("Circuit breaker is OPEN")
            else:
                self.state = CircuitState.HALF_OPEN

        try:
            result = func(*args, **kwargs)
            self.on_success()
            return result
        except Exception as e:
            self.on_failure()
            raise e

    def on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()

        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

# Usage with 404 handling
# Create the breaker once so failures accumulate across calls instead of
# resetting on every request
breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=30)

def fetch_with_circuit_breaker(url):
    def fetch_data():
        response = requests.get(url, timeout=10)
        if response.status_code == 404:
            raise Exception("404 Not Found")
        response.raise_for_status()
        return response.json()

    try:
        return breaker.call(fetch_data)
    except Exception as e:
        if "404" in str(e):
            # Handle 404 specifically (handle_404_fallback is your own fallback routine)
            return handle_404_fallback(url)
        raise

Web Scraping Specific Strategies

When scraping websites, 404 errors can be particularly challenging, especially for dynamic content rendered in a browser. Selenium, for example, does not expose HTTP status codes directly, so you typically have to detect error pages from the rendered content itself:

Selenium WebDriver Error Handling

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class WebScraperWith404Handling:
    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def scrape_with_404_handling(self, url, fallback_urls=None):
        fallback_urls = fallback_urls or []
        urls_to_try = [url] + fallback_urls

        for current_url in urls_to_try:
            try:
                self.driver.get(current_url)

                # Check if page contains 404 indicators
                if self.is_404_page():
                    print(f"404 detected on {current_url}")
                    continue

                # Wait for content to load
                self.wait.until(
                    EC.presence_of_element_located((By.TAG_NAME, "body"))
                )

                return self.extract_data()

            except TimeoutException:
                print(f"Timeout loading {current_url}")
                continue
            except WebDriverException as e:
                print(f"WebDriver error for {current_url}: {str(e)}")
                continue

        raise Exception("All URLs failed, including fallbacks")

    def is_404_page(self):
        """
        Detect 404 pages by common indicators
        """
        indicators = [
            "404",
            "not found",
            "page not found",
            "error 404"
        ]

        page_text = self.driver.page_source.lower()
        title = self.driver.title.lower()

        return any(indicator in page_text or indicator in title 
                  for indicator in indicators)

    def extract_data(self):
        # Your data extraction logic here
        return {"data": "extracted_content"}

    def close(self):
        # Release the browser once scraping is finished
        self.driver.quit()

Logging and Monitoring

Proper logging is essential for tracking 404 errors and identifying patterns:

import logging
from datetime import datetime

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('scraper_errors.log'),
        logging.StreamHandler()
    ]
)

class ErrorTracker:
    def __init__(self):
        self.error_counts = {}
        self.failed_urls = []

    def log_404_error(self, url, context=None):
        """
        Log 404 error with context information
        """
        timestamp = datetime.now().isoformat()

        error_info = {
            'timestamp': timestamp,
            'url': url,
            'error_type': '404_not_found',
            'context': context or {}
        }

        # Log to file
        logging.error(f"404 Error: {url} - Context: {context}")

        # Track for analytics
        self.failed_urls.append(error_info)
        self.error_counts[url] = self.error_counts.get(url, 0) + 1

    def get_error_report(self):
        """
        Generate error report for analysis
        """
        return {
            'total_404_errors': len(self.failed_urls),
            'unique_failed_urls': len(self.error_counts),
            'most_failed_urls': sorted(
                self.error_counts.items(), 
                key=lambda x: x[1], 
                reverse=True
            )[:10]
        }

# Usage
tracker = ErrorTracker()

def scrape_with_tracking(url):
    try:
        response = requests.get(url)
        if response.status_code == 404:
            tracker.log_404_error(url, {'user_agent': 'scraper/1.0'})
            return None
        return response.text
    except Exception as e:
        logging.error(f"Unexpected error for {url}: {str(e)}")
        return None

Best Practices for 404 Error Handling

1. Implement Exponential Backoff

import random

def exponential_backoff_retry(func, max_retries=5, base_delay=1):
    """
    Retry func() with exponential backoff and jitter. func is expected to raise
    requests.exceptions.HTTPError on bad statuses (e.g. by calling raise_for_status()).
    """
    for attempt in range(max_retries):
        try:
            return func()
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 404:
                # Don't retry 404s after first attempt
                if attempt == 0:
                    time.sleep(base_delay)
                    continue
                else:
                    raise e

            if attempt < max_retries - 1:
                delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
                time.sleep(delay)
            else:
                raise e
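
A quick usage sketch, assuming a fetch helper that calls raise_for_status() so a 404 surfaces as requests.exceptions.HTTPError:

def fetch_page():
    response = requests.get("https://example.com/api/resource", timeout=10)
    response.raise_for_status()  # raises HTTPError on 404/5xx
    return response.text

try:
    content = exponential_backoff_retry(fetch_page, max_retries=4, base_delay=1)
except requests.exceptions.HTTPError as e:
    print(f"Giving up: {e}")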

2. Use Caching for Known Good URLs

import os
import pickle

class URLCache:
    def __init__(self, cache_file='url_cache.pkl'):
        self.cache_file = cache_file
        self.valid_urls = self.load_cache()
        self.invalid_urls = set()

    def load_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, 'rb') as f:
                return pickle.load(f)
        return set()

    def save_cache(self):
        with open(self.cache_file, 'wb') as f:
            pickle.dump(self.valid_urls, f)

    def is_url_known_invalid(self, url):
        return url in self.invalid_urls

    def mark_url_invalid(self, url):
        self.invalid_urls.add(url)

    def mark_url_valid(self, url):
        self.valid_urls.add(url)
        self.save_cache()
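
A brief sketch of wiring the cache into a fetch helper, reusing the handle_404_gracefully function from earlier in this guide:

cache = URLCache()

def fetch_if_not_known_404(url):
    # Skip URLs we have already seen fail with a 404
    if cache.is_url_known_invalid(url):
        return None

    result = handle_404_gracefully(url)
    if result['success']:
        cache.mark_url_valid(url)
        return result['data']
    if result.get('error') == '404_not_found':
        cache.mark_url_invalid(url)
    return None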

3. Graceful Degradation Strategies

def fetch_with_fallback(primary_url, fallback_strategies):
    """
    Implement multiple fallback strategies for 404 errors
    """
    # get_cached_version, get_similar_content and return_default_content are
    # placeholders for your own application-specific fallbacks
    strategies = [
        lambda: requests.get(primary_url),
        lambda: requests.get(fallback_strategies.get('archive_url')),
        lambda: get_cached_version(primary_url),
        lambda: get_similar_content(primary_url),
        lambda: return_default_content()
    ]

    for i, strategy in enumerate(strategies):
        try:
            result = strategy()
            if hasattr(result, 'status_code') and result.status_code == 404:
                continue
            return result
        except Exception as e:
            logging.warning(f"Strategy {i} failed: {str(e)}")
            continue

    raise Exception("All fallback strategies failed")

Using WebScraping.AI for Robust Error Handling

When dealing with 404 errors in production web scraping applications, consider using specialized services like WebScraping.AI that handle many of these error scenarios automatically:

curl -X GET "https://api.webscraping.ai/html" \
     -H "api-key: YOUR_API_KEY" \
     -G \
     -d "url=https://example.com/might-be-404" \
     -d "error_on_404=false" \
     -d "timeout=15000"

This approach provides built-in error handling, retry logic, and fallback mechanisms without requiring custom implementation.
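
The same request from Python, mirroring the parameters in the curl call above (a sketch; consult the current API documentation for exact parameter names and plug in your own API key):

import requests

response = requests.get(
    "https://api.webscraping.ai/html",
    headers={"api-key": "YOUR_API_KEY"},
    params={
        "url": "https://example.com/might-be-404",
        "error_on_404": "false",
        "timeout": 15000,
    },
    timeout=30,
)

if response.ok:
    html = response.text
else:
    print(f"API returned {response.status_code}")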

Testing 404 Error Handling

import unittest
from unittest.mock import Mock, patch
# handle_404_gracefully is the helper defined earlier in this guide

class Test404Handling(unittest.TestCase):

    def test_404_error_handling(self):
        with patch('requests.get') as mock_get:
            # Mock 404 response
            mock_response = Mock()
            mock_response.status_code = 404
            mock_get.return_value = mock_response

            result = handle_404_gracefully('http://test.com/missing')

            self.assertFalse(result['success'])
            self.assertEqual(result['error'], '404_not_found')

    def test_fallback_url_success(self):
        with patch('requests.get') as mock_get:
            # First call returns 404, second call succeeds
            mock_404 = Mock()
            mock_404.status_code = 404

            mock_200 = Mock()
            mock_200.status_code = 200
            mock_200.text = "fallback content"

            mock_get.side_effect = [mock_404, mock_200]

            # The primary URL reports a 404, so the caller switches to the fallback
            primary = handle_404_gracefully('http://test.com/missing')
            self.assertEqual(primary.get('error'), '404_not_found')

            fallback = handle_404_gracefully('http://test.com/backup')
            self.assertTrue(fallback['success'])
            self.assertEqual(fallback['data'], "fallback content")

if __name__ == '__main__':
    unittest.main()

Conclusion

Handling HTTP 404 errors gracefully is essential for building robust web scraping and API integration systems. The key strategies include:

  • Implementing proper error detection with status code checking
  • Using retry logic with exponential backoff for transient issues
  • Providing fallback mechanisms such as alternative URLs or cached data
  • Logging and monitoring 404 errors for pattern analysis
  • Implementing circuit breakers to prevent cascading failures
  • Validating and correcting URLs automatically when possible

By combining these techniques, you can create resilient applications that handle 404 errors gracefully while maintaining good user experience and system stability. Remember to always respect robots.txt files and implement appropriate rate limiting to avoid overwhelming target servers.
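
As a minimal illustration of that last point, here is a sketch using only the standard library and requests (the user agent string and fixed delay are assumptions, not requirements):

import time
import urllib.robotparser

import requests

robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

def polite_get(url, delay=1.0):
    # Skip URLs that robots.txt disallows for our user agent
    if not robots.can_fetch("MyScraper/1.0", url):
        return None
    time.sleep(delay)  # simple fixed-delay rate limiting between requests
    return requests.get(url, headers={"User-Agent": "MyScraper/1.0"}, timeout=10)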

For complex scraping scenarios involving dynamic content, consider implementing proper error handling strategies in Puppeteer alongside these HTTP-level techniques.

The strategies outlined in this guide will help you build more reliable web scraping and API integration solutions that can handle the inevitable 404 errors that occur in real-world scenarios.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
