Goutte is a screen scraping and web crawling library for PHP. When using Goutte for web scraping, managing cookies is essential for maintaining sessions, handling authentication, and preserving state across different requests.
Here's how you can manage cookies in a Goutte web scraping session:
Setting Cookies
To set a cookie manually in Goutte, create a Cookie object and pass it to the set method on the CookieJar associated with the client. Here's an example:
use Goutte\Client;
$client = new Client();
$cookieJar = $client->getCookieJar();
$cookieJar->set(new \Symfony\Component\BrowserKit\Cookie('cookie_name', 'cookie_value'));
// Now, when you make a request, the cookie will be included
$crawler = $client->request('GET', 'http://example.com');
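A cookie created with only a name and a value has no domain or path restriction. The sketch below shows one way to scope a cookie more tightly, assuming Composer's autoloader lives at the usual vendor/autoload.php location; the /account path is a hypothetical value chosen for illustration. It then uses allValues to check which URLs the cookie would be sent to, without making a network request.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // assumes a standard Composer install

use Goutte\Client;
use Symfony\Component\BrowserKit\Cookie;

$client = new Client();
$jar = $client->getCookieJar();

// Cookie(name, value, expires, path, domain); expires is a Unix timestamp string.
$jar->set(new Cookie(
    'cookie_name',
    'cookie_value',
    (string) strtotime('+1 day'),
    '/account',      // hypothetical path
    'example.com'
));

// allValues() returns the name => value pairs that match a given URL.
var_dump($jar->allValues('http://example.com/account/settings'));
var_dump($jar->allValues('http://example.com/'));
```

The first call should include cookie_name because the URL falls under /account, while the second should not, since the path does not match.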
Getting Cookies
To retrieve cookies that have been set in a Goutte session, you can again use the CookieJar:
$cookies = $cookieJar->all();
foreach ($cookies as $cookie) {
    echo $cookie->getName() . ': ' . $cookie->getValue() . "\n";
}
Using Cookies from a Response
After you make a request, cookies may be set by the server. Goutte automatically handles these cookies for you. However, if you need to inspect them, you can do so like this:
$crawler = $client->request('GET', 'http://example.com');
$responseCookies = $client->getCookieJar()->all();
foreach ($responseCookies as $cookie) {
    echo $cookie->getName() . ': ' . $cookie->getValue() . "\n";
}
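When only one cookie matters, such as a session identifier, looping over the whole jar is unnecessary. The following sketch looks a cookie up by name with get. To stay runnable offline it feeds the jar a Set-Cookie value through updateFromSetCookie instead of making a real request; the cookie name session_id and its value are hypothetical, and the autoloader path is assumed.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // assumes a standard Composer install

use Goutte\Client;

$client = new Client();
$jar = $client->getCookieJar();

// Stand-in for a real response: hand the jar a Set-Cookie header value directly.
$jar->updateFromSetCookie(['session_id=abc123; path=/'], 'http://example.com/');

// get(name, path, domain) returns a Cookie object, or null if nothing matches.
$cookie = $jar->get('session_id', '/', 'example.com');

if ($cookie !== null) {
    echo $cookie->getName() . ': ' . $cookie->getValue() . "\n";
} else {
    echo "No session_id cookie found\n";
}
```

In a real session the updateFromSetCookie call would be replaced by the request itself, since the client updates the jar from each response.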
Clearing Cookies
To clear all cookies, you can use the clear method:
$cookieJar->clear();
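The clear method takes no arguments and always empties the whole jar, so it cannot target a single domain. To remove only the cookies for a specific domain, expire them one by one:
foreach ($cookieJar->all() as $cookie) {
    if (ltrim($cookie->getDomain(), '.') === 'example.com') {
        $cookieJar->expire($cookie->getName(), $cookie->getPath(), $cookie->getDomain());
    }
}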
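To remove one cookie by name and leave the rest of the jar intact, the jar's expire method accepts a cookie name plus an optional path and domain. A minimal sketch, with hypothetical cookie names and an assumed autoloader path:

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // assumes a standard Composer install

use Goutte\Client;
use Symfony\Component\BrowserKit\Cookie;

$client = new Client();
$jar = $client->getCookieJar();

// Two session cookies (expires = null) scoped to example.com.
$jar->set(new Cookie('keep_me', '1', null, '/', 'example.com'));
$jar->set(new Cookie('drop_me', '1', null, '/', 'example.com'));

// Remove only drop_me; name, path, and domain identify the stored cookie.
$jar->expire('drop_me', '/', 'example.com');

foreach ($jar->all() as $cookie) {
    echo $cookie->getName() . "\n";
}
```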
Example of a Session with Cookies
Here's a more concrete example where we log in to a website using Goutte, and we need to maintain the session cookies:
$client = new Client();
// Go to the login page
$crawler = $client->request('GET', 'http://example.com/login');
// Submit the login form with credentials
// (the button label and field names must match the target site's form)
$form = $crawler->selectButton('Login')->form();
$crawler = $client->submit($form, ['username' => 'user', 'password' => 'pass']);
// If the login succeeded, the session cookie is now stored in the CookieJar
// You can make further requests that require authentication
$crawler = $client->request('GET', 'http://example.com/protected-page');
// Cookies are handled automatically, but you can still access them if needed
$cookies = $client->getCookieJar()->all();
// Do something with the cookies
In the above example, the session cookies that are set upon login will be used for subsequent requests, allowing you to access pages that require authentication.
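The CookieJar lives in memory, so a logged-in session is lost when the script ends. One possible way to reuse it across runs is to dump the cookies to a JSON file and load them back later. The helpers saveCookies and loadCookies below are hypothetical names, not part of Goutte, and the file location and autoloader path are assumptions.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // assumes a standard Composer install

use Goutte\Client;
use Symfony\Component\BrowserKit\Cookie;
use Symfony\Component\BrowserKit\CookieJar;

// Hypothetical helper: write every cookie in the jar to a JSON file.
function saveCookies(CookieJar $jar, string $file): void
{
    $rows = [];
    foreach ($jar->all() as $cookie) {
        $rows[] = [
            'name'     => $cookie->getName(),
            'value'    => $cookie->getValue(),
            'expires'  => $cookie->getExpiresTime(), // null for session cookies
            'path'     => $cookie->getPath(),
            'domain'   => $cookie->getDomain(),
            'secure'   => $cookie->isSecure(),
            'httponly' => $cookie->isHttpOnly(),
        ];
    }
    file_put_contents($file, json_encode($rows));
}

// Hypothetical helper: rebuild cookies from the JSON file and add them to the jar.
function loadCookies(CookieJar $jar, string $file): void
{
    $rows = json_decode(file_get_contents($file), true) ?? [];
    foreach ($rows as $row) {
        $jar->set(new Cookie(
            $row['name'],
            $row['value'],
            $row['expires'],
            $row['path'],
            $row['domain'],
            $row['secure'],
            $row['httponly']
        ));
    }
}

$client = new Client();
$file = sys_get_temp_dir() . '/goutte_cookies.json'; // hypothetical location

if (is_file($file)) {
    loadCookies($client->getCookieJar(), $file);
}

// ... log in and make requests here ...

saveCookies($client->getCookieJar(), $file);
```

The saved file contains live session credentials, so it deserves the same care as a password.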
Remember that while managing cookies is important for many scraping tasks, you should always ensure that you are in compliance with the website's terms of service and applicable laws regarding data privacy and protection.