What is the proper way to use HTTP status codes for error handling in web scrapers?

When developing web scrapers, proper handling of HTTP status codes is essential both for respectful scraping and for effective error handling. HTTP status codes are returned by the server to indicate the result of a client's request. Here is a guide to handling these codes properly in your web scraping projects:

Common HTTP Status Codes and Their Meanings

  • 200 OK: The request was successful, and the content can be scraped.
  • 301 Moved Permanently / 302 Found: These indicate redirection. Your scraper should follow the new location provided in the Location header.
  • 400 Bad Request: The server could not understand the request due to invalid syntax.
  • 401 Unauthorized: Authentication is required to access the resource.
  • 403 Forbidden: The server understood the request but refuses to authorize it. This could mean that scraping is not allowed.
  • 404 Not Found: The resource was not found. Your scraper should handle this gracefully.
  • 429 Too Many Requests: You are being rate-limited. Implement a delay or back-off strategy (see the sketch after this list).
  • 500 Internal Server Error: The server encountered an unexpected condition.
  • 503 Service Unavailable: The server is not ready to handle the request, perhaps due to maintenance or overload. You may retry after some time.
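
For example, on a 429 or 503 a common strategy is to honor the server's Retry-After header when it is present and fall back to exponential back-off otherwise. The sketch below is a minimal illustration of that idea; the function name fetch_with_backoff and the retry limits are hypothetical choices, not part of any library:

import time
import requests

def fetch_with_backoff(url, max_retries=3, base_delay=2.0):
    """Retry on 429/503, honoring the Retry-After header when present."""
    response = None
    for attempt in range(max_retries + 1):
        response = requests.get(url, timeout=10)
        if response.status_code not in (429, 503):
            return response
        # Prefer the server's hint; otherwise back off exponentially
        retry_after = response.headers.get("Retry-After", "")
        delay = int(retry_after) if retry_after.isdigit() else base_delay * (2 ** attempt)
        time.sleep(delay)
    return response  # Still rate-limited after all retries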

Handling Status Codes in Python with requests

Here's an example using the requests library in Python to handle different HTTP status codes:

import requests
from requests.exceptions import HTTPError, RequestException

url = "http://example.com/some-page"

try:
    # A timeout keeps the scraper from hanging on an unresponsive server
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Raises HTTPError for 4xx/5xx status codes

    # If we get here, the request succeeded (2xx); proceed to scrape
    # Your scraping logic here

except HTTPError as http_err:
    # Specific handling for different status codes
    status = http_err.response.status_code
    if status == 404:
        print("Resource not found.")
    elif status == 403:
        print("Access forbidden. The website may not allow scraping.")
    elif status == 429:
        print("Rate limit exceeded. Try slowing down your requests.")
    else:
        print(f"HTTP error occurred: {http_err}")
except RequestException as err:
    # Connection errors, timeouts, and other request-level failures
    print(f"An error occurred: {err}")

Handling Status Codes in JavaScript with axios

Similarly, in JavaScript, you might use the axios library to make HTTP requests and handle responses:

const axios = require('axios');

const url = "http://example.com/some-page";

axios.get(url)
    .then(response => {
        // axios only resolves for 2xx responses by default,
        // so it is safe to scrape here
        // Your scraping logic here
    })
    .catch(error => {
        if (error.response) {
            // The server responded with a status code outside the 2xx range
            switch (error.response.status) {
                case 404:
                    console.error("Resource not found.");
                    break;
                case 403:
                    console.error("Access forbidden. The website may not allow scraping.");
                    break;
                case 429:
                    console.error("Rate limit exceeded. Try slowing down your requests.");
                    break;
                default:
                    console.error(`HTTP error occurred: ${error.response.status}`);
            }
        } else if (error.request) {
            // The request was made but no response was received
            console.error("No response received from the server.");
        } else {
            // Something else triggered an error
            console.error("Error", error.message);
        }
    });

Best Practices

  • Respect the site's robots.txt: Before scraping, check the site's robots.txt file to ensure you're allowed to scrape the pages in question (see the sketch after this list).
  • Handle redirections: If a 301 or 302 is received, follow the redirect to the URL in the Location header. Note that requests and axios both follow redirects automatically by default, so manual handling is only needed if you disable that behavior (e.g., allow_redirects=False in requests).
  • Be polite with your scraping: Implement delays between requests or comply with the site's rate limiting by respecting the Retry-After header when you receive a 429 status code.
  • Use custom headers: Sometimes, sending a realistic User-Agent header or other custom headers (for example, requests.get(url, headers={"User-Agent": "MyScraperBot/1.0"})) can prevent a 403 Forbidden error.
  • Graceful degradation: If you receive a 404, log it and move on to the next resource instead of treating it as a fatal error.
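
As a concrete version of the robots.txt check mentioned above, Python's standard library ships urllib.robotparser. This is a minimal sketch; the user agent string and URLs are hypothetical placeholders:

from urllib.robotparser import RobotFileParser

user_agent = "MyScraperBot"  # Hypothetical user agent for illustration
parser = RobotFileParser()
parser.set_url("http://example.com/robots.txt")
parser.read()  # Fetches and parses the robots.txt file

if parser.can_fetch(user_agent, "http://example.com/some-page"):
    print("Allowed to scrape this page.")
else:
    print("robots.txt disallows scraping this page.")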

By carefully handling HTTP status codes, you can create web scrapers that are more robust and respectful of the target sites' rules and limitations.
