When developing web scrapers, proper handling of HTTP status codes is essential for both respectful scraping and effective error handling. HTTP status codes are returned by servers to indicate the result of a client's request. Here is a guide on how to handle these codes properly in your web scraping projects:
Common HTTP Status Codes and Their Meanings
- 200 OK: The request was successful, and the content can be scraped.
- 301 Moved Permanently / 302 Found: These indicate redirection. Your scraper should follow the new location provided in the Location header.
- 400 Bad Request: The server could not understand the request due to invalid syntax.
- 401 Unauthorized: Authentication is required to access the resource.
- 403 Forbidden: The server understood the request but refuses to authorize it. This could mean that scraping is not allowed.
- 404 Not Found: The resource was not found. Your scraper should handle this gracefully.
- 429 Too Many Requests: You are being rate-limited. Implement a delay or back-off strategy.
- 500 Internal Server Error: The server encountered an unexpected condition.
- 503 Service Unavailable: The server is not ready to handle the request, perhaps due to maintenance or overload. You may retry after some time.
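For the redirection codes, note that most HTTP clients can follow redirects for you. As a minimal sketch using Python's requests library (which follows redirects by default), you can inspect the redirect chain after the fact:

import requests

response = requests.get("http://example.com/some-page")  # allow_redirects=True is the default

# response.history holds any intermediate 301/302 responses, in order
for hop in response.history:
    print(f"Redirected with {hop.status_code} from {hop.url}")

print(f"Final URL: {response.url} (status {response.status_code})")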
Handling Status Codes in Python with requests
Here's an example using the requests library in Python to handle different HTTP status codes:
import requests
from requests.exceptions import HTTPError

url = "http://example.com/some-page"

try:
    response = requests.get(url)
    response.raise_for_status()  # Raises an HTTPError for 4xx/5xx status codes

    # If the status code is 200, proceed to scrape the content
    if response.status_code == 200:
        # Your scraping logic here
        pass
except HTTPError as http_err:
    # Specific handling for different status codes can be done here
    if response.status_code == 404:
        print("Resource not found.")
    elif response.status_code == 403:
        print("Access forbidden. The website may not allow scraping.")
    elif response.status_code == 429:
        print("Rate limit exceeded. Try slowing down your requests.")
    else:
        print(f"HTTP error occurred: {http_err}")
except Exception as err:
    print(f"An error occurred: {err}")
Handling Status Codes in JavaScript with axios
Similarly, in JavaScript, you might use the axios library to make HTTP requests and handle responses:
const axios = require('axios');

const url = "http://example.com/some-page";

axios.get(url)
  .then(response => {
    if (response.status === 200) {
      // Your scraping logic here
    }
  })
  .catch(error => {
    if (error.response) {
      // The server responded with a status code outside the 2xx range
      switch (error.response.status) {
        case 404:
          console.error("Resource not found.");
          break;
        case 403:
          console.error("Access forbidden. The website may not allow scraping.");
          break;
        case 429:
          console.error("Rate limit exceeded. Try slowing down your requests.");
          break;
        default:
          console.error(`HTTP error occurred: ${error.response.status}`);
      }
    } else if (error.request) {
      // The request was made but no response was received
      console.error("No response received from the server.");
    } else {
      // Something else triggered the error
      console.error("Error:", error.message);
    }
  });
Best Practices
- Respect the site's robots.txt: Before scraping, check the site's robots.txt file to ensure you're allowed to scrape the pages in question (see the sketch after this list).
- Handle redirections: If a 301 or 302 status code is received, follow the redirection by making a new request to the URL provided in the Location header. Note that requests and axios both follow redirects automatically by default.
- Be polite with your scraping: Implement delays between requests, and comply with the site's rate limiting by respecting the Retry-After header when you receive a 429 status code.
- Use custom headers: Sometimes, including a User-Agent header or other custom headers can prevent a 403 Forbidden error.
- Graceful degradation: If you receive a 404, log it and move on to the next resource instead of treating it as a fatal error.
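As a minimal sketch of the first and fourth points, Python's standard-library urllib.robotparser can check robots.txt before you fetch, and a descriptive User-Agent can be sent with each request. The bot name MyScraperBot/1.0 is an illustrative placeholder:

import requests
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyScraperBot/1.0"  # Illustrative name; identify your scraper honestly
url = "http://example.com/some-page"

# Fetch and parse the site's robots.txt before requesting the page
parser = RobotFileParser()
parser.set_url("http://example.com/robots.txt")
parser.read()

if parser.can_fetch(USER_AGENT, url):
    response = requests.get(url, headers={"User-Agent": USER_AGENT})
    # Proceed with scraping...
else:
    print("robots.txt disallows fetching this URL; skipping.")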
By carefully handling HTTP status codes, you can create web scrapers that are more robust and respectful of the target sites' rules and limitations.