When scraping many pages with Guzzle, performance optimization is what makes large-scale runs practical. Here are proven strategies to improve your scraping speed and resource utilization:
1. Concurrent Requests with Promises
Performance Gain: 5-10x faster
Replace sequential requests with concurrent ones using Guzzle's promise-based async API:
use GuzzleHttp\Client;
use GuzzleHttp\Promise;
use GuzzleHttp\Exception\RequestException;
$client = new Client();
$urls = [
'http://example.com/page1',
'http://example.com/page2',
'http://example.com/page3',
// Add more URLs...
];
// Create async promises for all requests
$promises = [];
foreach ($urls as $url) {
$promises[] = $client->getAsync($url);
}
// Execute all requests concurrently
try {
$responses = Promise\Utils::settle($promises)->wait();
// Process successful responses
foreach ($responses as $index => $response) {
if ($response['state'] === 'fulfilled') {
$body = $response['value']->getBody()->getContents();
echo "Page {$index}: " . strlen($body) . " bytes\n";
} else {
echo "Failed to fetch page {$index}: " . $response['reason']->getMessage() . "\n";
}
}
} catch (Exception $e) {
echo "Error: " . $e->getMessage();
}
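A small variation, if you need to map each result back to its source URL: Utils::settle() preserves array keys, so you can key the promise array by URL instead of relying on numeric indexes. A minimal sketch using the same $client and $urls as above:
// Key promises by URL so results map back to their source page
$promises = [];
foreach ($urls as $url) {
    $promises[$url] = $client->getAsync($url);
}
$results = Promise\Utils::settle($promises)->wait();
foreach ($results as $url => $result) {
    if ($result['state'] === 'fulfilled') {
        echo "{$url}: HTTP " . $result['value']->getStatusCode() . "\n";
    } else {
        echo "{$url} failed: " . $result['reason']->getMessage() . "\n";
    }
}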
2. Connection Pooling for Scalability
Best for: 100+ concurrent requests
Use Guzzle's Pool class to manage concurrent connections efficiently:
use GuzzleHttp\Pool;
use GuzzleHttp\Client;
use GuzzleHttp\Psr7\Request;
$client = new Client([
'timeout' => 30,
'connect_timeout' => 10,
]);
$urls = [
'http://example.com/page1',
'http://example.com/page2',
// ... up to thousands of URLs
];
$requests = function ($urls) {
foreach ($urls as $url) {
yield new Request('GET', $url);
}
};
$pool = new Pool($client, $requests($urls), [
'concurrency' => 10, // Adjust based on server capacity
'fulfilled' => function ($response, $index) {
// Process successful response
$content = $response->getBody()->getContents();
file_put_contents("page_{$index}.html", $content);
},
'rejected' => function ($reason, $index) {
// Handle failed requests
echo "Request {$index} failed: " . $reason->getMessage() . "\n";
},
]);
$promise = $pool->promise();
$promise->wait();
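If you would rather collect all results into an array than work with callbacks, Guzzle's Pool::batch() helper wraps the same pooling mechanism. A minimal sketch, reusing $client and the $requests generator from above:
// Pool::batch() returns an array keyed like the input: Response objects for
// successful requests, exception objects for failed ones
$results = Pool::batch($client, $requests($urls), ['concurrency' => 10]);
foreach ($results as $index => $result) {
    if ($result instanceof \Psr\Http\Message\ResponseInterface) {
        file_put_contents("page_{$index}.html", (string) $result->getBody());
    } else {
        echo "Request {$index} failed: " . $result->getMessage() . "\n";
    }
}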
3. Response Caching
Best for: Repeated requests to same URLs
Cache responses so repeated requests to the same URL are served locally instead of being re-fetched:
use GuzzleHttp\Client;
// Simple file-based cache implementation
class SimpleCache {
private $cacheDir;
public function __construct($cacheDir = './cache') {
$this->cacheDir = $cacheDir;
if (!is_dir($cacheDir)) {
mkdir($cacheDir, 0755, true);
}
}
public function get($key) {
$file = $this->cacheDir . '/' . md5($key) . '.cache';
if (file_exists($file) && (time() - filemtime($file)) < 3600) {
return unserialize(file_get_contents($file));
}
return null;
}
public function set($key, $value) {
$file = $this->cacheDir . '/' . md5($key) . '.cache';
file_put_contents($file, serialize($value));
}
}
$cache = new SimpleCache();
$client = new Client();
function fetchWithCache($url, $client, $cache) {
$cached = $cache->get($url);
if ($cached) {
return $cached;
}
$response = $client->get($url);
$content = $response->getBody()->getContents();
$cache->set($url, $content);
return $content;
}
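A simple usage sketch of the helper above, assuming $urls is the same list of pages from the earlier examples (this path is synchronous, so it suits smaller or repeated crawls):
foreach ($urls as $url) {
    // Served from ./cache if fetched within the last hour, otherwise fetched fresh
    $content = fetchWithCache($url, $client, $cache);
    echo "{$url}: " . strlen($content) . " bytes\n";
}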
4. Optimize Client Configuration
Critical settings for performance:
$client = new Client([
// Connection settings
'timeout' => 30, // Total request timeout
'connect_timeout' => 10, // Connection establishment timeout
'read_timeout' => 20, // Data read timeout
// HTTP settings
'version' => 2.0, // Use HTTP/2 for better performance (the Guzzle option name is 'version')
'allow_redirects' => [
'max' => 3, // Limit redirects
'strict' => true,
'referer' => true
],
// SSL settings
'verify' => true, // Keep SSL verification enabled
// Compression
'decode_content' => true, // Automatically decode gzipped content
// Headers
'headers' => [
'User-Agent' => 'Mozilla/5.0 (compatible; Scraper/1.0)',
'Accept-Encoding' => 'gzip, deflate',
'Accept' => '*/*',
// No Connection header needed: curl reuses connections automatically, and HTTP/2 disallows it
]
]);
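Note that the 'version' option is a request, not a guarantee: curl silently falls back to HTTP/1.1 if it was built without HTTP/2 support or the server does not offer it (and plain http:// URLs generally will not negotiate HTTP/2). You can check what was actually used on a response:
// PSR-7 responses expose the negotiated protocol version ("2" or "1.1")
$response = $client->get('https://example.com/page1');
echo 'Protocol: HTTP/' . $response->getProtocolVersion() . "\n";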
5. Implement Rate Limiting
Prevent getting blocked while maintaining speed:
use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;
class RateLimitMiddleware {
private $lastRequest = 0;
private $minDelay;
public function __construct($requestsPerSecond = 10) {
$this->minDelay = 1000000 / $requestsPerSecond; // microseconds
}
public function __invoke(callable $handler) {
return function ($request, array $options) use ($handler) {
$now = microtime(true) * 1000000;
$elapsed = $now - $this->lastRequest;
if ($elapsed < $this->minDelay) {
usleep((int) ($this->minDelay - $elapsed));
}
$this->lastRequest = microtime(true) * 1000000;
return $handler($request, $options);
};
}
}
$stack = HandlerStack::create();
$stack->push(new RateLimitMiddleware(5)); // 5 requests per second
$client = new Client(['handler' => $stack]);
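Every request sent through this $client now passes through the middleware before it reaches the handler, so the throttle applies to synchronous and asynchronous calls alike. For example, assuming $urls from the earlier sections:
foreach ($urls as $url) {
    // The middleware sleeps as needed to keep this under ~5 requests per second
    $response = $client->get($url);
    echo "{$url}: HTTP " . $response->getStatusCode() . "\n";
}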
6. Memory Management
Keep memory usage bounded in long-running scripts by processing URLs in batches and releasing responses as you go:
use GuzzleHttp\Client;
use GuzzleHttp\Promise;
$client = new Client();
// Pass the client in explicitly: PHP functions do not inherit variables from the outer scope
function processLargeDataset(Client $client, array $urls) {
$batchSize = 50;
$batches = array_chunk($urls, $batchSize);
foreach ($batches as $batch) {
$promises = [];
foreach ($batch as $url) {
$promises[] = $client->getAsync($url);
}
$responses = Promise\Utils::settle($promises)->wait();
// Process responses immediately
foreach ($responses as $response) {
if ($response['state'] === 'fulfilled') {
$content = $response['value']->getBody()->getContents();
// Process content here
processContent($content);
}
}
// Clear variables to free memory
unset($promises, $responses);
// Optional: force a garbage-collection cycle between batches
gc_collect_cycles();
// Small delay between batches
usleep(100000); // 0.1 seconds
}
}
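A usage sketch, assuming processContent() is your own parsing routine and the URL pattern below is purely illustrative:
$allUrls = array_map(
    fn ($i) => "http://example.com/page{$i}", // hypothetical URL pattern
    range(1, 5000)
);
processLargeDataset($client, $allUrls);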
7. Error Handling and Retry Logic
Robust error handling for production environments:
use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;
use GuzzleHttp\Exception\ConnectException;
function fetchWithRetry($url, $maxRetries = 3) {
$client = new Client(['timeout' => 30]);
for ($attempt = 1; $attempt <= $maxRetries; $attempt++) {
try {
$response = $client->get($url);
return $response->getBody()->getContents();
} catch (ConnectException $e) {
if ($attempt === $maxRetries) {
throw $e;
}
sleep(pow(2, $attempt)); // Exponential backoff
} catch (RequestException $e) {
if ($e->getResponse() && $e->getResponse()->getStatusCode() >= 500) {
if ($attempt === $maxRetries) {
throw $e;
}
sleep(pow(2, $attempt));
} else {
throw $e; // Don't retry client errors (4xx)
}
}
}
}
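A usage sketch of the retry helper; failed URLs are logged and skipped so one bad page does not abort the whole run:
foreach ($urls as $url) {
    try {
        $html = fetchWithRetry($url, 3);
        echo "Fetched {$url} (" . strlen($html) . " bytes)\n";
    } catch (Exception $e) {
        echo "Giving up on {$url}: " . $e->getMessage() . "\n";
    }
}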
Performance Monitoring
Track your scraping performance:
$startTime = microtime(true);
$requestCount = 0;
// Your scraping code here...
$endTime = microtime(true);
$duration = $endTime - $startTime;
$requestsPerSecond = $requestCount / $duration;
echo "Scraped {$requestCount} pages in {$duration:.2f} seconds\n";
echo "Average: {$requestsPerSecond:.2f} requests/second\n";
Best Practices Summary
- Start with concurrency: Use promises or pools for 5-10x performance improvement
- Optimize concurrency level: Test different values (5-50) based on target server capacity
- Implement caching: Avoid duplicate requests to the same URLs
- Use HTTP/2: Better multiplexing and compression
- Monitor memory usage: Process data in batches for large datasets
- Respect rate limits: Implement delays to avoid getting blocked
- Handle errors gracefully: Implement retry logic with exponential backoff
By implementing these strategies, you can achieve significant performance improvements while maintaining stability and respecting target websites' resources.