Yes. Goutte, a screen-scraping and web-crawling library for PHP, follows redirects automatically. Goutte's client extends Symfony BrowserKit, and in Goutte 3.x it sends requests through Guzzle (Goutte 4 switched to Symfony's HttpClient). Redirect following is enabled by default.
When you make a request with Goutte and the response is a redirect (typically status 301, 302, 303, 307, or 308), the client automatically requests the URL given in the Location header and repeats this until it reaches a non-redirect response, unless you have explicitly disabled this behavior.
Here's an example of using Goutte to make a request that might involve redirects:
require 'vendor/autoload.php';
use Goutte\Client;
$client = new Client();
// By default, redirects will be followed.
$crawler = $client->request('GET', 'http://example.com/some-redirecting-url');
// You can access the status code of the final response (after all redirects).
// getStatusCode() is the current BrowserKit method; very old versions only had getStatus().
$status_code = $client->getResponse()->getStatusCode();
// You can also access the final URL after all redirects.
$final_url = $client->getRequest()->getUri();
echo "Final URL: $final_url\n";
echo "Status code: $status_code\n";
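If you would rather cap the number of redirects than follow them indefinitely, one option is BrowserKit's setMaxRedirects(), which Goutte's client inherits. The sketch below is illustrative only: the helper name fetchWithRedirectCap and its parameters are made up for this example, and it assumes that BrowserKit signals an exceeded cap with a LogicException.

```php
<?php
// Assumes Composer's autoloader is already loaded (require 'vendor/autoload.php').
use Goutte\Client;

/**
 * Hypothetical helper: fetch a URL while capping how many redirects are followed.
 * Returns the final URL, or null if the cap was exceeded.
 */
function fetchWithRedirectCap(Client $client, string $url, int $maxRedirects): ?string
{
    // Inherited from Symfony BrowserKit.
    $client->setMaxRedirects($maxRedirects);

    try {
        $client->request('GET', $url);
    } catch (\LogicException $e) {
        // Assumption: BrowserKit throws LogicException once the cap is exceeded.
        return null;
    }

    // URI of the last request made, i.e. the end of the redirect chain.
    return $client->getRequest()->getUri();
}
```

For example, fetchWithRedirectCap(new Client(), 'http://example.com/some-redirecting-url', 3) would give up on chains longer than the cap instead of following them to the end.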
If for some reason you want to disable automatic redirect following, turn it off on the Goutte client itself. In Goutte 3.x, passing 'allow_redirects' => false to the underlying Guzzle client is not enough on its own, because redirects are followed by the BrowserKit layer that Goutte's client extends rather than by Guzzle:
$client = new Client();
// Tell the BrowserKit layer not to follow redirects.
$client->followRedirects(false);
// Now redirects will not be followed.
$crawler = $client->request('GET', 'http://example.com/some-redirecting-url');
Keep in mind that if you disable redirects, the $crawler object will contain the contents of the initial response, which is a redirect response, not the page you might be expecting after the redirect. Additionally, if you need to handle redirects manually for some reason, you can inspect the response headers and status code, and then make a new request to the URL specified in the Location header of the response.
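As a rough sketch of that manual approach, the helper below disables automatic following, reads the Location header after each response, and then follows one hop at a time with BrowserKit's followRedirect(). The function names isRedirectStatus and traceRedirects and the $maxHops parameter are invented for illustration, and the code assumes a BrowserKit version that provides getStatusCode() (older releases used getStatus()).

```php
<?php
// Assumes Composer's autoloader is already loaded (require 'vendor/autoload.php').
use Goutte\Client;

/** Hypothetical helper: true for the redirect status codes listed earlier. */
function isRedirectStatus(int $status): bool
{
    return in_array($status, [301, 302, 303, 307, 308], true);
}

/**
 * Hypothetical helper: follow redirects one hop at a time.
 * Returns the Location header values in the order they were followed.
 */
function traceRedirects(Client $client, string $url, int $maxHops = 5): array
{
    // Stop the client from following redirects on its own.
    $client->followRedirects(false);
    $client->request('GET', $url);

    $hops = [];
    while (count($hops) < $maxHops) {
        $response = $client->getInternalResponse();
        $location = $response->getHeader('Location');

        // Stop at the first response that is not a redirect with a target.
        if (!isRedirectStatus($response->getStatusCode()) || !$location) {
            break;
        }

        $hops[] = $location;
        // Issue the next request to the Location target.
        $client->followRedirect();
    }

    return $hops;
}
```

Calling traceRedirects(new Client(), 'http://example.com/some-redirecting-url') returns the chain of Location values, after which $client->getResponse() holds the last response that was fetched. Stopping at $maxHops is a design choice to guard against redirect loops.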
Always remember to respect the robots.txt file and terms of service of the websites you are scraping, and ensure you are not violating any laws or regulations related to web scraping.