What is Guzzle?
Guzzle is a powerful PHP HTTP client library that simplifies making HTTP requests and integrating with web services. While not exclusively a web scraping tool, Guzzle is widely used as the foundation for web scraping projects in PHP.
Key Features:
- Simple, intuitive API for HTTP requests
- Comprehensive error handling
- Built-in cookie management
- Request/response middleware
- Asynchronous request support
- Stream handling for large files
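Of the features above, middleware is the least self-explanatory. As a minimal sketch (assuming Guzzle 7, where HandlerStack and Middleware ship with the library), a retry middleware can be attached to a client like this:

<?php
use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;

// Build a handler stack and push a retry middleware onto it
$stack = HandlerStack::create();
$stack->push(Middleware::retry(
    // Decider: retry up to 3 times on connection errors or 5xx responses
    function (int $retries, RequestInterface $request, ?ResponseInterface $response = null, $exception = null) {
        if ($retries >= 3) {
            return false;
        }
        return $exception !== null || ($response && $response->getStatusCode() >= 500);
    }
));

// Every request made through this client now passes through the middleware
$client = new Client(['handler' => $stack]);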
Installing Guzzle
Install Guzzle via Composer:
composer require guzzlehttp/guzzle
Basic Web Scraping with Guzzle
1. Simple GET Request
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;
$client = new Client();
try {
    $response = $client->request('GET', 'https://example.com');
    $html = $response->getBody()->getContents();

    echo "Status: " . $response->getStatusCode() . "\n";
    echo "Content: " . substr($html, 0, 200) . "...\n";
} catch (RequestException $e) {
    echo "Error: " . $e->getMessage() . "\n";
}
2. Setting Headers and User Agent
<?php
use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;

$client = new Client();

$options = [
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language' => 'en-US,en;q=0.5',
        'Accept-Encoding' => 'gzip, deflate',
        'Referer' => 'https://google.com'
    ],
    'timeout' => 30
];

try {
    $response = $client->request('GET', 'https://example.com', $options);
    $html = $response->getBody()->getContents();
} catch (RequestException $e) {
    echo "Request failed: " . $e->getMessage();
}
3. Handling Cookies
<?php
use GuzzleHttp\Client;
use GuzzleHttp\Cookie\CookieJar;
$cookieJar = new CookieJar();
$client = new Client(['cookies' => $cookieJar]);
// First request - cookies will be stored
$response1 = $client->request('GET', 'https://example.com/login');
// Second request - cookies will be sent automatically
$response2 = $client->request('GET', 'https://example.com/dashboard');
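If cookies need to survive between script runs (for example, to stay logged in across scraping sessions), Guzzle also ships a FileCookieJar that persists cookies to disk. A minimal sketch, with the file path being an arbitrary choice for illustration:

<?php
use GuzzleHttp\Client;
use GuzzleHttp\Cookie\FileCookieJar;

// Cookies are loaded from this file if it exists and written back when the jar is destroyed;
// the second argument controls whether session cookies are persisted too
$cookieJar = new FileCookieJar(__DIR__ . '/cookies.json', true);
$client = new Client(['cookies' => $cookieJar]);

$response = $client->request('GET', 'https://example.com/dashboard');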
4. POST Requests with Form Data
<?php
use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;

$client = new Client();

$formData = [
    'username' => 'your_username',
    'password' => 'your_password'
];

try {
    $response = $client->request('POST', 'https://example.com/login', [
        'form_params' => $formData,
        'allow_redirects' => true
    ]);
    $html = $response->getBody()->getContents();
} catch (RequestException $e) {
    echo "Login failed: " . $e->getMessage();
}
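Many sites and APIs expect a JSON body rather than form-encoded data. Guzzle's json request option handles the encoding and the Content-Type header for you; the endpoint below is just a placeholder:

<?php
use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;

$client = new Client();

try {
    // The 'json' option encodes the array and sets Content-Type: application/json
    $response = $client->request('POST', 'https://example.com/api/login', [
        'json' => [
            'username' => 'your_username',
            'password' => 'your_password'
        ]
    ]);
    $data = json_decode($response->getBody()->getContents(), true);
} catch (RequestException $e) {
    echo "Request failed: " . $e->getMessage();
}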
Parsing HTML with Guzzle and DOMCrawler
Combine Guzzle with Symfony's DOMCrawler for powerful HTML parsing:
composer require symfony/dom-crawler symfony/css-selector
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;
$client = new Client();
$response = $client->request('GET', 'https://news.ycombinator.com');
$html = $response->getBody()->getContents();
$crawler = new Crawler($html);

// Extract all story titles
// Note: CSS selectors depend on the site's current markup; Hacker News has changed
// its class names over time, so adjust the selector if it no longer matches anything.
$titles = $crawler->filter('.storylink')->each(function (Crawler $node) {
    return [
        'title' => $node->text(),
        'url' => $node->attr('href')
    ];
});

foreach ($titles as $story) {
    echo $story['title'] . " - " . $story['url'] . "\n";
}
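Scraped links are often relative. If you pass the page's URL as the Crawler's second constructor argument, DomCrawler can resolve them to absolute URIs through its link() helper; a small sketch:

<?php
use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client();
$url = 'https://news.ycombinator.com';

$response = $client->request('GET', $url);
$html = $response->getBody()->getContents();

// Passing the base URI lets the crawler resolve relative hrefs
$crawler = new Crawler($html, $url);

$links = $crawler->filter('a')->each(function (Crawler $node) {
    return $node->link()->getUri(); // absolute URL
});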
Advanced Features
1. Asynchronous Requests
<?php
use GuzzleHttp\Client;
use GuzzleHttp\Promise\Utils;

$client = new Client();

$urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3'
];

// Create promises for concurrent requests
$promises = [];
foreach ($urls as $url) {
    $promises[] = $client->getAsync($url);
}

// Wait for all requests to complete; settle() never throws, each result carries its own state
$responses = Utils::settle($promises)->wait();

foreach ($responses as $i => $response) {
    if ($response['state'] === 'fulfilled') {
        echo "URL {$urls[$i]}: " . $response['value']->getStatusCode() . "\n";
    } else {
        echo "URL {$urls[$i]} failed: " . $response['reason']->getMessage() . "\n";
    }
}
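When you have many URLs, firing every promise at once can overwhelm both your machine and the target server. Guzzle's Pool lets you cap how many requests are in flight at a time; a minimal sketch with an arbitrary concurrency limit of 5:

<?php
use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
use Psr\Http\Message\ResponseInterface;

$client = new Client();
$urls = ['https://example.com/page1', 'https://example.com/page2', 'https://example.com/page3'];

// Generator that yields one Request object per URL
$requests = function (array $urls) {
    foreach ($urls as $url) {
        yield new Request('GET', $url);
    }
};

$pool = new Pool($client, $requests($urls), [
    'concurrency' => 5, // at most 5 requests in flight at a time
    'fulfilled' => function (ResponseInterface $response, $index) use ($urls) {
        echo "URL {$urls[$index]}: " . $response->getStatusCode() . "\n";
    },
    'rejected' => function ($reason, $index) use ($urls) {
        echo "URL {$urls[$index]} failed: " . $reason->getMessage() . "\n";
    },
]);

// Start the transfers and wait for the pool to finish
$pool->promise()->wait();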
2. Handling Different Response Types
<?php
use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;

$client = new Client();

try {
    $response = $client->request('GET', 'https://api.example.com/data.json');
    $contentType = $response->getHeaderLine('Content-Type');

    if (str_contains($contentType, 'application/json')) {
        $data = json_decode($response->getBody(), true);
        print_r($data);
    } elseif (str_contains($contentType, 'text/html')) {
        $html = $response->getBody()->getContents();
        // Parse HTML...
    }
} catch (RequestException $e) {
    echo "Error: " . $e->getMessage();
}
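By default Guzzle throws an exception for 4xx/5xx responses. When scraping, it is often more convenient to disable that behaviour with the http_errors option and branch on the status code yourself; a short sketch with a placeholder URL:

<?php
use GuzzleHttp\Client;

$client = new Client();

// With 'http_errors' => false, 4xx/5xx responses are returned instead of thrown
$response = $client->request('GET', 'https://example.com/maybe-missing', [
    'http_errors' => false
]);

if ($response->getStatusCode() === 404) {
    echo "Page not found, skipping\n";
} elseif ($response->getStatusCode() === 200) {
    $html = $response->getBody()->getContents();
    // Parse HTML...
}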
3. Rate Limiting and Delays
<?php
use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;

$client = new Client();
$urls = ['url1', 'url2', 'url3'];

foreach ($urls as $url) {
    try {
        $response = $client->request('GET', $url);
        // Process response...

        // Add delay to be respectful
        sleep(1);
    } catch (RequestException $e) {
        echo "Failed to fetch $url: " . $e->getMessage() . "\n";
    }
}
Best Practices for Web Scraping
- Respect robots.txt: Always check the website's robots.txt file
- Use appropriate delays: Don't overwhelm servers with rapid requests
- Handle errors gracefully: Implement retry logic with exponential backoff
- Set realistic timeouts: Prevent hanging requests
- Rotate User-Agents: Vary your request headers between requests (see the sketch after this list)
- Respect rate limits: Monitor response headers for rate limiting info
- Cache responses: Store frequently accessed data locally
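As a sketch of the User-Agent rotation mentioned above (the strings here are only examples; keep your own pool realistic and current):

<?php
use GuzzleHttp\Client;

// Example pool of User-Agent strings to rotate through
$userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
];

$client = new Client();

$response = $client->request('GET', 'https://example.com', [
    'headers' => [
        // Pick a random User-Agent for each request
        'User-Agent' => $userAgents[array_rand($userAgents)]
    ]
]);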
Error Handling Example
<?php
use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;
use GuzzleHttp\Exception\ConnectException;

function scrapeWithRetry(Client $client, string $url, int $maxRetries = 3): string
{
    $attempt = 0;

    while ($attempt < $maxRetries) {
        try {
            $response = $client->request('GET', $url, [
                'timeout' => 30,
                'connect_timeout' => 10
            ]);
            return $response->getBody()->getContents();
        } catch (ConnectException $e) {
            echo "Connection failed (attempt " . ($attempt + 1) . "): " . $e->getMessage() . "\n";
        } catch (RequestException $e) {
            if ($e->getResponse() && $e->getResponse()->getStatusCode() === 429) {
                echo "Rate limited, waiting before retry...\n";
                sleep(60); // Wait 1 minute for rate limit reset
            } else {
                echo "Request failed: " . $e->getMessage() . "\n";
            }
        }

        $attempt++;
        if ($attempt < $maxRetries) {
            sleep(pow(2, $attempt)); // Exponential backoff
        }
    }

    throw new Exception("Failed to fetch $url after $maxRetries attempts");
}
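Using the helper is then straightforward:

<?php
use GuzzleHttp\Client;

$client = new Client();

try {
    $html = scrapeWithRetry($client, 'https://example.com');
    echo "Fetched " . strlen($html) . " bytes\n";
} catch (Exception $e) {
    echo $e->getMessage() . "\n";
}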
Conclusion
Guzzle provides a robust foundation for web scraping in PHP. While it handles the HTTP communication layer, you'll typically combine it with HTML parsing libraries like DOMCrawler or simple_html_dom for complete scraping solutions. Always scrape responsibly by respecting website terms of service, implementing rate limiting, and following ethical scraping practices.