How can I improve the performance of Guzzle when scraping multiple pages?

When scraping multiple pages with Guzzle, most of the time is usually spent waiting on the network, so the biggest wins come from sending requests concurrently and avoiding unnecessary work. Here are proven strategies to improve your scraping speed and resource utilization:

1. Concurrent Requests with Promises

Performance Gain: typically 5-10x faster than sequential requests, depending on latency and concurrency

Replace sequential requests with concurrent ones using Guzzle's promise-based async API:

use GuzzleHttp\Client;
use GuzzleHttp\Promise\Utils;

$client = new Client();
$urls = [
    'http://example.com/page1',
    'http://example.com/page2',
    'http://example.com/page3',
    // Add more URLs...
];

// Create async promises for all requests
$promises = [];
foreach ($urls as $url) {
    $promises[] = $client->getAsync($url);
}

// Execute all requests concurrently. Utils::settle() waits for every promise
// and never rejects, so each result carries its own 'fulfilled'/'rejected' state.
$results = Utils::settle($promises)->wait();

foreach ($results as $index => $result) {
    if ($result['state'] === 'fulfilled') {
        $body = $result['value']->getBody()->getContents();
        echo "Page {$index}: " . strlen($body) . " bytes\n";
    } else {
        echo "Failed to fetch page {$index}: " . $result['reason']->getMessage() . "\n";
    }
}

2. Connection Pooling for Scalability

Best for: 100+ concurrent requests

Use Guzzle's Pool class to manage concurrent connections efficiently:

use GuzzleHttp\Pool;
use GuzzleHttp\Client;
use GuzzleHttp\Psr7\Request;

$client = new Client([
    'timeout' => 30,
    'connect_timeout' => 10,
]);

$urls = [
    'http://example.com/page1',
    'http://example.com/page2',
    // ... up to thousands of URLs
];

$requests = function ($urls) {
    foreach ($urls as $url) {
        yield new Request('GET', $url);
    }
};

$pool = new Pool($client, $requests($urls), [
    'concurrency' => 10, // Adjust based on server capacity
    'fulfilled' => function ($response, $index) {
        // Process successful response
        $content = $response->getBody()->getContents();
        file_put_contents("page_{$index}.html", $content);
    },
    'rejected' => function ($reason, $index) {
        // Handle failed requests
        echo "Request {$index} failed: " . $reason->getMessage() . "\n";
    },
]);

$promise = $pool->promise();
$promise->wait();

3. Response Caching

Best for: Repeated requests to same URLs

Implement intelligent caching to avoid duplicate requests:

use GuzzleHttp\Client;

// Simple file-based cache implementation
class SimpleCache {
    private $cacheDir;

    public function __construct($cacheDir = './cache') {
        $this->cacheDir = $cacheDir;
        if (!is_dir($cacheDir)) {
            mkdir($cacheDir, 0755, true);
        }
    }

    public function get($key) {
        $file = $this->cacheDir . '/' . md5($key) . '.cache';
        if (file_exists($file) && (time() - filemtime($file)) < 3600) {
            return unserialize(file_get_contents($file));
        }
        return null;
    }

    public function set($key, $value) {
        $file = $this->cacheDir . '/' . md5($key) . '.cache';
        file_put_contents($file, serialize($value));
    }
}

$cache = new SimpleCache();
$client = new Client();

function fetchWithCache($url, $client, $cache) {
    $cached = $cache->get($url);
    if ($cached !== null) {
        return $cached;
    }

    $response = $client->get($url);
    $content = $response->getBody()->getContents();
    $cache->set($url, $content);

    return $content;
}
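
If you prefer caching to happen transparently for every request, you can also hook it into Guzzle's handler stack. Below is a minimal sketch (reusing the SimpleCache class above; the $cacheMiddleware name and the blanket 200 response are illustrative choices) that short-circuits cache hits and stores bodies as responses come back:

use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Promise\FulfilledPromise;
use GuzzleHttp\Psr7\Response;
use Psr\Http\Message\RequestInterface;

$cache = new SimpleCache();

$cacheMiddleware = function (callable $handler) use ($cache) {
    return function (RequestInterface $request, array $options) use ($handler, $cache) {
        $key = (string) $request->getUri();

        // Cache hit: return a synthetic response without touching the network
        if (($cached = $cache->get($key)) !== null) {
            return new FulfilledPromise(new Response(200, [], $cached));
        }

        // Cache miss: forward the request and store the body when it resolves
        return $handler($request, $options)->then(function ($response) use ($cache, $key) {
            $cache->set($key, (string) $response->getBody());
            $response->getBody()->rewind(); // keep the stream readable for the caller
            return $response;
        });
    };
};

$stack = HandlerStack::create();
$stack->push($cacheMiddleware);

$client = new Client(['handler' => $stack]);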

4. Optimize Client Configuration

Critical settings for performance:

$client = new Client([
    // Connection settings
    'timeout' => 30,              // Total request timeout
    'connect_timeout' => 10,      // Connection establishment timeout
    'read_timeout' => 20,         // Data read timeout

    // HTTP settings
    'version' => 2.0,             // Request HTTP/2 (requires libcurl built with nghttp2)
    'allow_redirects' => [
        'max' => 3,               // Limit redirects
        'strict' => true,
        'referer' => true
    ],

    // SSL settings
    'verify' => true,             // Keep SSL verification enabled

    // Compression
    'decode_content' => true,     // Automatically decode gzipped content

    // Headers
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (compatible; Scraper/1.0)',
        'Accept-Encoding' => 'gzip, deflate',
        'Accept' => '*/*',
        'Connection' => 'keep-alive'
    ]
]);
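
Note that requesting HTTP/2 only helps if the libcurl build behind PHP's curl extension includes nghttp2 support. A quick capability check (plain PHP, nothing Guzzle-specific):

// Check whether the local libcurl build advertises HTTP/2 support
$curl = curl_version();
if (defined('CURL_VERSION_HTTP2') && ($curl['features'] & CURL_VERSION_HTTP2)) {
    echo "HTTP/2 is available\n";
} else {
    echo "This libcurl build has no HTTP/2 support\n";
}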

5. Implement Rate Limiting

Prevent getting blocked while maintaining speed:

use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;

class RateLimitMiddleware {
    private $lastRequest = 0;
    private $minDelay;

    public function __construct($requestsPerSecond = 10) {
        $this->minDelay = 1000000 / $requestsPerSecond; // microseconds
    }

    public function __invoke(callable $handler) {
        return function ($request, array $options) use ($handler) {
            $now = microtime(true) * 1000000;
            $elapsed = $now - $this->lastRequest;

            if ($elapsed < $this->minDelay) {
                usleep((int) ($this->minDelay - $elapsed));
            }

            $this->lastRequest = microtime(true) * 1000000;
            return $handler($request, $options);
        };
    }
}

$stack = HandlerStack::create();
$stack->push(new RateLimitMiddleware(5)); // 5 requests per second

$client = new Client(['handler' => $stack]);

6. Memory Management

Prevent memory leaks in long-running scripts:

use GuzzleHttp\Client;
use GuzzleHttp\Promise\Utils;

$client = new Client();

function processLargeDataset(array $urls, Client $client) {
    $batchSize = 50;
    $batches = array_chunk($urls, $batchSize);

    foreach ($batches as $batch) {
        $promises = [];

        foreach ($batch as $url) {
            $promises[] = $client->getAsync($url);
        }

        $responses = Utils::settle($promises)->wait();

        // Process responses immediately
        foreach ($responses as $response) {
            if ($response['state'] === 'fulfilled') {
                $content = $response['value']->getBody()->getContents();
                // Process content here
                processContent($content);
            }
        }

        // Clear variables to free memory
        unset($promises, $responses);

        // Optional: Force garbage collection
        if (function_exists('gc_collect_cycles')) {
            gc_collect_cycles();
        }

        // Small delay between batches
        usleep(100000); // 0.1 seconds
    }
}
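
If the pages themselves are large, Guzzle's sink request option is another memory lever: it streams each response body to a file (or any writable stream) instead of buffering it in memory. A short sketch, assuming a pre-created downloads/ directory and the $urls array from above:

use GuzzleHttp\Client;
use GuzzleHttp\Promise\Utils;

$client = new Client();
$promises = [];

foreach ($urls as $index => $url) {
    // Write each body straight to disk instead of holding it in memory
    $promises[] = $client->getAsync($url, [
        'sink' => "downloads/page_{$index}.html",
    ]);
}

Utils::settle($promises)->wait();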

7. Error Handling and Retry Logic

Robust error handling for production environments:

use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;
use GuzzleHttp\Exception\ConnectException;

function fetchWithRetry($url, $maxRetries = 3) {
    $client = new Client(['timeout' => 30]);

    for ($attempt = 1; $attempt <= $maxRetries; $attempt++) {
        try {
            $response = $client->get($url);
            return $response->getBody()->getContents();

        } catch (ConnectException $e) {
            if ($attempt === $maxRetries) {
                throw $e;
            }
            sleep(pow(2, $attempt)); // Exponential backoff

        } catch (RequestException $e) {
            if ($e->getResponse() && $e->getResponse()->getStatusCode() >= 500) {
                if ($attempt === $maxRetries) {
                    throw $e;
                }
                sleep(pow(2, $attempt));
            } else {
                throw $e; // Don't retry client errors (4xx)
            }
        }
    }
}
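
Guzzle also ships a generic retry middleware (GuzzleHttp\Middleware::retry) that composes with the async and pool approaches above. A minimal sketch implementing the same policy, retrying connection errors and 5xx responses up to three times with exponential backoff:

use GuzzleHttp\Client;
use GuzzleHttp\Exception\ConnectException;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;

$stack = HandlerStack::create();
$stack->push(Middleware::retry(
    // Decider: should this request be retried?
    function ($retries, RequestInterface $request, ?ResponseInterface $response = null, $exception = null) {
        if ($retries >= 3) {
            return false;
        }
        if ($exception instanceof ConnectException) {
            return true;
        }
        return $response !== null && $response->getStatusCode() >= 500;
    },
    // Delay in milliseconds before the next attempt (exponential backoff)
    function ($retries) {
        return 1000 * (2 ** $retries);
    }
));

$client = new Client(['handler' => $stack, 'timeout' => 30]);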

Performance Monitoring

Track your scraping performance:

$startTime = microtime(true);
$requestCount = 0;

// Your scraping code here (increment $requestCount for each completed request)...

$endTime = microtime(true);
$duration = $endTime - $startTime;
$requestsPerSecond = $requestCount / $duration;

echo "Scraped {$requestCount} pages in {$duration:.2f} seconds\n";
echo "Average: {$requestsPerSecond:.2f} requests/second\n";

Best Practices Summary

  1. Start with concurrency: Use promises or pools; this typically yields a 5-10x improvement over sequential requests
  2. Optimize concurrency level: Test different values (5-50) based on target server capacity
  3. Implement caching: Avoid duplicate requests to the same URLs
  4. Use HTTP/2: Better multiplexing and compression
  5. Monitor memory usage: Process data in batches for large datasets
  6. Respect rate limits: Implement delays to avoid getting blocked
  7. Handle errors gracefully: Implement retry logic with exponential backoff

By implementing these strategies, you can achieve significant performance improvements while maintaining stability and respecting target websites' resources.
