Guzzle is a PHP HTTP client that simplifies HTTP requests and web service integration. When web scraping, handling redirects properly is crucial since many websites use redirects for various purposes like URL canonicalization, load balancing, or protocol switching (HTTP to HTTPS).
By default, Guzzle follows up to 5 redirects automatically; if the chain is longer, it throws a `TooManyRedirectsException`. This behavior can be customized with the `allow_redirects` request option.
Default Redirect Behavior
use GuzzleHttp\Client;
$client = new Client();
// By default, Guzzle follows up to 5 redirects automatically
$response = $client->request('GET', 'http://example.com/redirect-me');
echo $response->getBody(); // Content from final destination
Configuring Redirect Behavior
Basic Configuration
// Disable redirects completely
$response = $client->request('GET', 'http://example.com', [
    'allow_redirects' => false
]);

// Enable redirects with default settings (same as omitting the option)
$response = $client->request('GET', 'http://example.com', [
    'allow_redirects' => true
]);
Advanced Redirect Configuration
$response = $client->request('GET', 'http://example.com', [
    'allow_redirects' => [
        'max' => 10,                // Maximum redirects (default: 5)
        'strict' => true,           // RFC compliant redirects
        'referer' => true,          // Add Referer header
        'protocols' => ['https'],   // Allowed protocols
        'track_redirects' => true   // Track redirect history
    ]
]);
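The same array can also be set once as a client-wide default instead of being repeated on every call. The sketch below assumes a Composer-installed Guzzle and uses illustrative values; as far as I can tell, a per-request `allow_redirects` value replaces the client-level one as a whole rather than being merged key by key, so treat the override shown in the comment as an assumption to verify against your Guzzle version.

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

// Redirect policy applied to every request made through this client.
// The values are illustrative, not recommendations.
$client = new Client([
    'allow_redirects' => [
        'max'             => 5,
        'strict'          => false,
        'referer'         => true,
        'protocols'       => ['http', 'https'],
        'track_redirects' => true,
    ],
    'timeout' => 10,
]);

// A single request can still opt out, for example:
// $client->request('GET', 'https://example.com/', ['allow_redirects' => false]);
```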
Redirect Options Explained
| Option | Type | Description |
|--------|------|-------------|
| `max` | integer | Maximum number of redirects to follow (default: 5) |
| `strict` | boolean | Use strict RFC-compliant redirects. When `true`, POST requests maintain their method through redirects |
| `referer` | boolean | Add Referer header when following redirects |
| `protocols` | array | Allowed protocols for redirects (e.g., `['http', 'https']`) |
| `track_redirects` | boolean | Track redirect history in response headers |
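The effect of `strict` on the request method is easy to misread, so here is a small model of the rule as I understand it: a 303 always turns an unsafe method into GET, a 301 or 302 does so only when `strict` is `false`, and 307/308 never change the method. `methodAfterRedirect` is a hypothetical helper written for illustration; it is not part of Guzzle and does not reproduce its source.

```php
<?php
/**
 * Hypothetical helper: predicts the HTTP method of the follow-up request
 * after a redirect, given the original method, the redirect status code,
 * and the value of the "strict" option.
 */
function methodAfterRedirect(string $method, int $status, bool $strict): string
{
    $method = strtoupper($method);
    $safeMethods = ['GET', 'HEAD', 'OPTIONS'];

    // 303 always rewrites; 301/302 rewrite only in non-strict mode.
    $rewrite = $status === 303 || ($status <= 302 && !$strict);

    if ($rewrite && !in_array($method, $safeMethods, true)) {
        return 'GET';
    }

    return $method;
}

echo methodAfterRedirect('POST', 302, false), "\n"; // GET
echo methodAfterRedirect('POST', 302, true), "\n";  // POST
echo methodAfterRedirect('POST', 303, true), "\n";  // GET
echo methodAfterRedirect('POST', 307, false), "\n"; // POST
```

When the method is rewritten to GET, the request body is dropped as well, which matters for form submissions.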
Tracking Redirect History
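When `track_redirects` is enabled, Guzzle records the redirect chain in two headers on the final response: `X-Guzzle-Redirect-History` lists each URL that was redirected to, in order, and `X-Guzzle-Redirect-Status-History` lists the status code that triggered each redirect: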
$response = $client->request('GET', 'http://httpbin.org/redirect/3', [
    'allow_redirects' => [
        'max' => 10,
        'track_redirects' => true
    ]
]);

// The history headers are only present if at least one redirect occurred
if ($response->hasHeader('X-Guzzle-Redirect-History')) {
    // URLs that were redirected to, in order
    $redirectUrls = $response->getHeader('X-Guzzle-Redirect-History');
    // Status code of the response that triggered each redirect
    $redirectCodes = $response->getHeader('X-Guzzle-Redirect-Status-History');

    echo "Redirect chain:\n";
    foreach ($redirectUrls as $index => $url) {
        $statusCode = $redirectCodes[$index] ?? 'Unknown';
        echo sprintf("%d. %s (Status: %s)\n", $index + 1, $url, $statusCode);
    }

    // Guzzle sets no "effective URL" header; the last history entry is the final URL
    echo "Final URL: " . end($redirectUrls) . "\n";
}
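To see what the tracking headers contain without touching the network, the redirect chain can be simulated with Guzzle's `MockHandler`. This is a minimal sketch assuming a Composer-installed Guzzle; the `example.test` URLs and the queued responses are made up for illustration.

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Psr7\Response;

// Queue of canned responses: two redirects, then the final page.
$mock = new MockHandler([
    new Response(301, ['Location' => 'https://example.test/step-1']),
    new Response(302, ['Location' => '/step-2']), // relative Location
    new Response(200, [], 'done'),
]);

// HandlerStack::create() adds the default middleware, including redirects.
$client = new Client(['handler' => HandlerStack::create($mock)]);

$response = $client->request('GET', 'http://example.test/start', [
    'allow_redirects' => ['track_redirects' => true],
]);

$history  = $response->getHeader('X-Guzzle-Redirect-History');
$statuses = $response->getHeader('X-Guzzle-Redirect-Status-History');

print_r($history);   // the two URLs that were redirected to
print_r($statuses);  // the status codes that caused each hop
```

Note that the history holds redirect targets, not the URL originally requested, so `http://example.test/start` does not appear in it.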
Manual Redirect Handling
For complete control over redirects, disable automatic following and handle them manually:
use GuzzleHttp\Exception\RequestException;
use GuzzleHttp\Psr7\Uri;
use GuzzleHttp\Psr7\UriResolver;

function followRedirectsManually(Client $client, string $url, int $maxRedirects = 5) {
    $redirectCount = 0;

    while (true) {
        try {
            $response = $client->request('GET', $url, [
                'allow_redirects' => false
            ]);
        } catch (RequestException $e) {
            // Keep the original exception as "previous" so the cause is not lost
            throw new \Exception("Request failed: " . $e->getMessage(), 0, $e);
        }

        $statusCode = $response->getStatusCode();

        // Anything other than a redirect status is the final response
        if (!in_array($statusCode, [301, 302, 303, 307, 308], true)) {
            return $response;
        }

        // Stop before following more than $maxRedirects redirects
        if ($redirectCount >= $maxRedirects) {
            throw new \Exception("Too many redirects");
        }

        $location = $response->getHeaderLine('Location');
        if ($location === '') {
            throw new \Exception("Redirect without Location header");
        }

        // Resolve relative Location values (e.g. "/login" or "../page")
        // against the current URL, keeping scheme, host, and port
        $next = (string) UriResolver::resolve(new Uri($url), new Uri($location));

        echo "Redirecting from {$url} to {$next} (Status: {$statusCode})\n";

        $url = $next;
        $redirectCount++;
    }
}

// Usage
$client = new Client();
$response = followRedirectsManually($client, 'http://httpbin.org/redirect/3');
echo $response->getBody();
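Relative `Location` values are the part of manual redirect handling that is easiest to get wrong, because a path-relative value such as `details` or `../cart` must be resolved against the directory of the current URL, not the site root. The sketch below uses `UriResolver` from the `guzzlehttp/psr7` package (installed alongside Guzzle) to show how a few forms resolve; the base URL and the `Location` values are invented for illustration.

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Psr7\Uri;
use GuzzleHttp\Psr7\UriResolver;

// URL of the request that returned the redirect (illustrative).
$base = new Uri('http://example.com/shop/items?page=1');

// Different shapes a Location header can take.
$locations = [
    '/login',                     // absolute path
    'details',                    // path relative to /shop/
    '../cart',                    // parent directory
    '?page=2',                    // query only
    '//cdn.example.com/img.png',  // scheme-relative
    'https://other.example/',     // already absolute
];

$resolved = [];
foreach ($locations as $location) {
    $resolved[$location] = (string) UriResolver::resolve($base, new Uri($location));
    echo $location, ' => ', $resolved[$location], "\n";
}
// /login  => http://example.com/login
// details => http://example.com/shop/details
// ../cart => http://example.com/cart
```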
Common Web Scraping Scenarios
Handling HTTPS Redirects
Many sites redirect HTTP to HTTPS. Configure Guzzle to handle this securely:
$response = $client->request('GET', 'http://example.com', [
    'allow_redirects' => [
        'max' => 3,
        'protocols' => ['https'], // Only allow HTTPS redirects
        'strict' => true
    ]
]);
Preventing Infinite Redirects
Some misconfigured sites can cause redirect loops. Protect against this:
$response = $client->request('GET', 'http://example.com', [
    'allow_redirects' => [
        'max' => 3, // Low limit to prevent loops
        'track_redirects' => true
    ],
    'timeout' => 10 // Timeout in seconds, applied to each request in the chain
]);
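A low `max` stops a loop eventually, but it does not tell you that the chain was actually looping. One way to check is to record every URL visited (for example while following redirects manually) and look for a repeat. `firstRepeatedUrl` below is a hypothetical helper, and it compares URLs verbatim, which is an assumption: two spellings of the same resource would count as different.

```php
<?php
/**
 * Hypothetical helper: returns the first URL that appears a second time
 * in a redirect chain, or null if every URL in the chain is unique.
 */
function firstRepeatedUrl(array $chain): ?string
{
    $seen = [];

    foreach ($chain as $url) {
        // A URL we have already visited means the chain has looped back.
        if (isset($seen[$url])) {
            return $url;
        }
        $seen[$url] = true;
    }

    return null;
}

$chain = [
    'http://example.test/a',
    'http://example.test/b',
    'http://example.test/a', // back where we started
];

$loopAt = firstRepeatedUrl($chain);
echo $loopAt === null ? "No loop detected\n" : "Loop detected at {$loopAt}\n";
```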
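Handling POST Requests That Redirect
Form submissions often answer with a redirect. With `strict` set to `false` (the default), Guzzle follows a 301 or 302 with a GET request and drops the POST body, as browsers do. Set `strict` to `true` if the form data must be re-sent to the redirect target:
$response = $client->request('POST', 'http://example.com/form', [
    'form_params' => [
        'username' => 'user',
        'password' => 'pass'
    ],
    'allow_redirects' => [
        'max' => 2,
        'strict' => false, // A 301/302 after POST is followed with GET (body dropped)
        'referer' => true  // Send a Referer header on the follow-up request
    ]
]);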
Error Handling
use GuzzleHttp\Exception\TooManyRedirectsException;
use GuzzleHttp\Exception\RequestException;

try {
    $response = $client->request('GET', 'http://example.com', [
        'allow_redirects' => ['max' => 3]
    ]);
} catch (TooManyRedirectsException $e) {
    echo "Too many redirects: " . $e->getMessage();
} catch (RequestException $e) {
    echo "Request failed: " . $e->getMessage();
}
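The redirect-limit branch is hard to exercise against a real site, so the sketch below triggers it offline with Guzzle's `MockHandler`: two made-up `example.test` pages redirect to each other, and `max` is set to 2. It assumes a Composer-installed Guzzle; the URLs and queued responses are illustrative only.

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Exception\TooManyRedirectsException;
use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Psr7\Response;

// Simulated redirect loop: /a -> /b -> /a -> /b ...
$mock = new MockHandler([
    new Response(302, ['Location' => 'http://example.test/b']),
    new Response(302, ['Location' => 'http://example.test/a']),
    new Response(302, ['Location' => 'http://example.test/b']),
]);
$client = new Client(['handler' => HandlerStack::create($mock)]);

$caught = null;
try {
    $client->request('GET', 'http://example.test/a', [
        'allow_redirects' => ['max' => 2],
    ]);
} catch (TooManyRedirectsException $e) {
    $caught = $e;
    // The exception carries the request whose response would have
    // required one redirect too many.
    echo "Gave up at: " . $e->getRequest()->getUri() . "\n";
    echo "Reason: " . $e->getMessage() . "\n";
}
```

Because `TooManyRedirectsException` extends `RequestException`, it has to be caught first when both are handled separately.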
Best Practices
- Set reasonable redirect limits - Use `max` between 3-10 to prevent infinite loops
- Use HTTPS-only for security - Set `'protocols' => ['https']` for sensitive operations
- Track redirects for debugging - Enable `track_redirects` during development
- Handle relative URLs - When manually following redirects, resolve relative URLs properly
- Implement timeouts - Always set request timeouts to prevent hanging
- Log redirect chains - For debugging, log the complete redirect path
These techniques provide comprehensive control over redirect handling in Guzzle, essential for robust web scraping applications.