What are the Common HTTP Status Codes I Should Handle in PHP Scraping?
When building web scrapers in PHP, proper HTTP status code handling is crucial for creating robust and reliable applications. Understanding and properly responding to different status codes helps prevent crashes, enables graceful error handling, and improves the overall user experience of your scraping applications.
Understanding HTTP Status Codes
HTTP status codes are three-digit numbers returned by web servers to indicate the result of a client's request. They are grouped into five categories based on their first digit, each representing a different type of response.
Essential HTTP Status Codes for PHP Scraping
1. Success Codes (2xx)
200 OK
The most common success status code indicating that the request was successful and the server returned the requested content.
<?php
function scrapeWithCurl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; PHP Scraper)');

    $response = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch); // always release the handle, on every code path

    if ($response !== false && $httpCode === 200) {
        echo "Success! Content retrieved.\n";
        return $response;
    }
    return false;
}
?>
201 Created
Typically returned after successful POST requests that create new resources.
204 No Content
Indicates a successful request with no content to return. Common with DELETE operations.
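Rather than testing only for 200, it is often safer to treat the entire 2xx range as success. The helpers below are a minimal sketch (the function names are illustrative, not part of any library); note that 204 responses legitimately carry no body, so an empty body should not be mistaken for a failure.

```php
<?php
// Treat any 2xx status as success rather than only 200.
function isSuccess(int $statusCode): bool {
    return $statusCode >= 200 && $statusCode < 300;
}

// 204 (No Content) and 304 (Not Modified) responses carry no body.
function expectsBody(int $statusCode): bool {
    return $statusCode !== 204 && $statusCode !== 304;
}
```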
2. Redirection Codes (3xx)
301 Moved Permanently
The resource has been permanently moved to a new URL. Update your scraper to use the new URL.
302 Found
The resource temporarily resides at a different URL; future requests should still use the original URL. (A redirect that must preserve the request method uses 307 Temporary Redirect instead.)
<?php
function handleRedirects($url, $maxRedirects = 5) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, $maxRedirects);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; PHP Scraper)');

    $response = curl_exec($ch);
    // With CURLOPT_FOLLOWLOCATION enabled, CURLINFO_HTTP_CODE reports the
    // final status code, so a 3xx check will not fire. Check the redirect
    // count instead to learn whether any redirects were followed.
    $redirectCount = curl_getinfo($ch, CURLINFO_REDIRECT_COUNT);
    $finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);

    if ($redirectCount > 0) {
        echo "Followed {$redirectCount} redirect(s) to: " . $finalUrl . "\n";
    }
    return $response;
}
?>
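When a scraper maintains a list of stored URLs, it can be worth detecting 301s explicitly so the list can be updated. The sketch below (function names are illustrative) keeps the decision logic in a small pure function and probes with a HEAD request, redirects disabled, reading the target via CURLINFO_REDIRECT_URL:

```php
<?php
// Return the new URL only for permanent redirects (301/308);
// temporary redirects (302/303/307) should not change stored URLs.
function permanentRedirectTarget(int $httpCode, ?string $location): ?string {
    return in_array($httpCode, [301, 308], true) ? $location : null;
}

// Probe a URL without following redirects or downloading the body.
function checkForPermanentRedirect(string $url): ?string {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($ch, CURLOPT_NOBODY, true); // HEAD request is enough
    curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $location = curl_getinfo($ch, CURLINFO_REDIRECT_URL) ?: null;
    curl_close($ch);
    return permanentRedirectTarget($httpCode, $location);
}
```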
3. Client Error Codes (4xx)
400 Bad Request
The server cannot process the request due to invalid syntax or missing parameters.
401 Unauthorized
Authentication is required but has failed or not been provided.
403 Forbidden
The server understands the request but refuses to authorize it. Often indicates that your scraper has been blocked.
404 Not Found
The requested resource does not exist on the server.
429 Too Many Requests
Rate limiting is in effect: your scraper is making requests too quickly. When present, the Retry-After response header tells you how long to wait before retrying.
<?php
function handleClientErrors($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; PHP Scraper)');

    $response = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch); // close before any exception can be thrown

    switch ($httpCode) {
        case 400:
            throw new Exception("Bad Request: Check your request parameters");
        case 401:
            throw new Exception("Unauthorized: Authentication required");
        case 403:
            echo "Access forbidden. Backing off...\n";
            sleep(5); // the caller should retry after the delay
            return false;
        case 404:
            echo "Resource not found. Skipping...\n";
            return false;
        case 429:
            echo "Rate limited. Waiting 60 seconds...\n";
            sleep(60); // the caller should retry after the delay
            return false;
    }
    return $response;
}
?>
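For 429 responses, servers frequently include a Retry-After header stating how long to wait, and honoring it is better than a fixed 60-second sleep. Below is a sketch of a parser (the function name is illustrative); per RFC 7231 the header can be either a number of delta-seconds or an HTTP-date:

```php
<?php
// Parse a Retry-After header value into seconds to wait.
// Falls back to $default when the header is missing or malformed.
function parseRetryAfter(?string $headerValue, int $default = 60): int {
    if ($headerValue === null || $headerValue === '') {
        return $default;
    }
    if (ctype_digit($headerValue)) {
        return (int) $headerValue; // delta-seconds form, e.g. "120"
    }
    $ts = strtotime($headerValue); // HTTP-date form
    return $ts === false ? $default : max(0, $ts - time());
}
```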
4. Server Error Codes (5xx)
500 Internal Server Error
A generic error message indicating that the server encountered an unexpected condition.
502 Bad Gateway
The server received an invalid response from an upstream server.
503 Service Unavailable
The server is temporarily overloaded or under maintenance.
504 Gateway Timeout
The server did not receive a timely response from an upstream server.
Comprehensive Status Code Handling with Guzzle
For more advanced HTTP handling, consider using Guzzle HTTP client:
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Exception\ClientException;
use GuzzleHttp\Exception\ServerException;
use GuzzleHttp\Exception\RequestException;

class AdvancedScraper {
    private $client;
    private $maxRetries = 3;

    public function __construct() {
        $this->client = new Client([
            'timeout' => 30,
            'headers' => [
                'User-Agent' => 'Mozilla/5.0 (compatible; PHP Scraper)'
            ]
        ]);
    }

    public function scrapeUrl($url) {
        $attempt = 0;
        while ($attempt < $this->maxRetries) {
            try {
                $response = $this->client->get($url);
                if ($response->getStatusCode() === 200) {
                    return $response->getBody()->getContents();
                }
            } catch (ClientException $e) {
                $statusCode = $e->getResponse()->getStatusCode();
                if (!$this->handleClientError($statusCode, $attempt)) {
                    break; // permanent error, no point retrying
                }
            } catch (ServerException $e) {
                $statusCode = $e->getResponse()->getStatusCode();
                $this->handleServerError($statusCode, $attempt);
            } catch (RequestException $e) {
                echo "Request failed: " . $e->getMessage() . "\n";
                break;
            }
            $attempt++;
        }
        return false;
    }

    // Returns true when the error is transient and worth retrying.
    private function handleClientError($statusCode, $attempt) {
        switch ($statusCode) {
            case 404:
                echo "Resource not found. Stopping attempts.\n";
                return false;
            case 403:
                echo "Access forbidden. Attempt {$attempt}. Waiting...\n";
                sleep(pow(2, $attempt)); // exponential backoff
                return true;
            case 429:
                echo "Rate limited. Attempt {$attempt}. Waiting...\n";
                sleep(60 * ($attempt + 1));
                return true;
            default:
                echo "Client error {$statusCode}. Attempt {$attempt}.\n";
                return false;
        }
    }

    private function handleServerError($statusCode, $attempt) {
        echo "Server error {$statusCode}. Attempt {$attempt}. Retrying...\n";
        sleep(5 * ($attempt + 1)); // progressive delay
    }
}

// Usage
$scraper = new AdvancedScraper();
$content = $scraper->scrapeUrl('https://example.com');
?>
Best Practices for Status Code Handling
1. Implement Retry Logic
Not all errors are permanent. Server errors (5xx) and some client errors like 429 (rate limiting) should trigger retry attempts with appropriate delays.
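A small pure helper makes the retry decision explicit and easy to test. This is a sketch, and the set of retryable codes shown (5xx plus 429, and optionally 408) is a common convention rather than a standard:

```php
<?php
// Decide whether a failed request is worth retrying.
// 429 and 5xx errors are usually transient; most other 4xx are permanent.
function isRetryable(int $statusCode): bool {
    if ($statusCode === 429 || $statusCode === 408) {
        return true; // rate limit / request timeout
    }
    return $statusCode >= 500 && $statusCode < 600;
}
```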
2. Use Exponential Backoff
When retrying failed requests, implement exponential backoff to avoid overwhelming the server:
<?php
function exponentialBackoff($attempt, $baseDelay = 1) {
    $delay = $baseDelay * pow(2, $attempt);
    $jitter = rand(0, 1000) / 1000; // add up to 1s of randomness
    // sleep() only accepts whole seconds, so use usleep() to honor
    // the fractional delay introduced by the jitter.
    usleep((int) round(($delay + $jitter) * 1000000));
}
?>
3. Log Status Codes
Keep detailed logs of HTTP status codes for debugging and monitoring:
<?php
function logStatusCode($url, $statusCode, $timestamp = null) {
    $timestamp = $timestamp ?: date('Y-m-d H:i:s');
    $logEntry = "[{$timestamp}] {$url} returned {$statusCode}\n";
    file_put_contents('scraper.log', $logEntry, FILE_APPEND);
}
?>
4. Handle Different Content Types
Check the Content-Type header along with status codes to ensure you're receiving the expected data format.
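This matters because block pages and error pages are often served as HTML with a 200 status. As a sketch (function names illustrative, assuming a JSON endpoint), the format check can live in a pure function so it is testable without a network call:

```php
<?php
// A 200 status alone is not enough: verify the format as well.
function isExpectedJson(int $httpCode, string $contentType): bool {
    return $httpCode === 200
        && stripos($contentType, 'application/json') !== false;
}

// Usage with curl: CURLINFO_CONTENT_TYPE exposes the response's
// Content-Type header after the transfer completes.
function fetchJson(string $url): ?array {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $contentType = (string) curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    curl_close($ch);

    if ($body === false || !isExpectedJson($httpCode, $contentType)) {
        return null;
    }
    $data = json_decode($body, true);
    return is_array($data) ? $data : null;
}
```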
Integration with Web Scraping APIs
When working with professional web scraping services, status code handling becomes even more important. APIs that render JavaScript-heavy, AJAX-driven content on your behalf often define their own status codes to distinguish successful data extraction from various error conditions, so consult the provider's documentation and map those codes into your error-handling logic.
Monitoring and Alerting
Set up monitoring for unusual status code patterns:
<?php
class StatusCodeMonitor {
    private $statusCounts = [];

    public function recordStatus($statusCode) {
        if (!isset($this->statusCounts[$statusCode])) {
            $this->statusCounts[$statusCode] = 0;
        }
        $this->statusCounts[$statusCode]++;
    }

    public function getErrorRate() {
        $total = array_sum($this->statusCounts);
        $errors = 0;
        foreach ($this->statusCounts as $code => $count) {
            if ($code >= 400) {
                $errors += $count;
            }
        }
        return $total > 0 ? ($errors / $total) * 100 : 0;
    }

    public function shouldAlert() {
        return $this->getErrorRate() > 10; // Alert if error rate > 10%
    }
}
?>
Conclusion
Proper HTTP status code handling is fundamental to building reliable PHP web scrapers. By implementing comprehensive error handling, retry logic, and monitoring, you can create robust applications that gracefully handle the various scenarios encountered during web scraping. Remember to always respect robots.txt files and implement appropriate delays to avoid overwhelming target servers.
For complex scenarios involving dynamic content and JavaScript-heavy sites, consider combining PHP scraping with headless browser solutions, particularly when dealing with authentication flows or complex page interactions that require more sophisticated handling than traditional HTTP requests can provide.