What is the best way to deal with rate limits in Guzzle while scraping?

Rate limiting is a critical aspect of responsible web scraping with Guzzle. Servers impose rate limits to control traffic and prevent abuse, typically returning HTTP 429 (Too Many Requests) status codes when limits are exceeded. Here are proven strategies to handle rate limits effectively while maintaining scraping efficiency.
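
Before adding any throttling, it helps to see what a rate-limited response looks like. The snippet below is a minimal sketch of detecting a 429 directly; it assumes the http_errors option is disabled so Guzzle returns the response instead of throwing an exception, and the URL is a placeholder.

<?php
$client = new GuzzleHttp\Client(['http_errors' => false]);

$response = $client->request('GET', 'https://example.com/page'); // placeholder URL

if ($response->getStatusCode() === 429) {
    // Retry-After may contain a delay in seconds or an HTTP date
    $retryAfter = $response->getHeaderLine('Retry-After');
    echo "Rate limited. Server suggests waiting: {$retryAfter}\n";
}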

1. Request Throttling

The simplest approach is to introduce delays between requests to stay within acceptable limits. Calculate the optimal sleep time based on the target's rate limit policy.

<?php
$client = new GuzzleHttp\Client();
$requestsPerMinute = 10; // Adjust based on target's rate limit
$sleepTime = 60 / $requestsPerMinute; // 6 seconds between requests

foreach ($urls as $url) {
    try {
        $response = $client->request('GET', $url);
        // Process the response...
        echo "Successfully scraped: " . $url . "\n";

        // Throttle requests to respect rate limits;
        // usleep() takes microseconds, so fractional delays also work
        usleep((int) ($sleepTime * 1000000));
    } catch (Exception $e) {
        echo "Error: " . $e->getMessage() . "\n";
    }
}

2. Middleware for Automatic Rate Limit Handling

Guzzle's middleware system provides a powerful way to handle rate limits automatically. This middleware detects 429 responses and implements retry logic with proper delays.

<?php
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use GuzzleHttp\Exception\ConnectException;
use Psr\Http\Message\ResponseInterface;

$stack = HandlerStack::create();

// Add retry middleware with rate limit handling
$stack->push(Middleware::retry(function ($retry, $request, $response, $exception) {
    // Don't retry if we've exceeded max attempts
    if ($retry >= 3) {
        return false;
    }

    // Handle rate limit responses
    if ($response instanceof ResponseInterface && $response->getStatusCode() === 429) {
        $retryAfter = $response->getHeaderLine('Retry-After');

        if ($retryAfter !== '') {
            if (is_numeric($retryAfter)) {
                // Either a delay in seconds or a Unix timestamp
                $waitTime = (int) ($retryAfter > time() ? $retryAfter - time() : $retryAfter);
            } else {
                // Retry-After may also be an HTTP date
                $waitTime = max(0, strtotime($retryAfter) - time());
            }

            echo "Rate limited. Waiting {$waitTime} seconds...\n";
            sleep($waitTime);
        } else {
            // Default exponential backoff if no Retry-After header
            $waitTime = pow(2, $retry); // 1, 2, 4 seconds
            echo "Rate limited. Waiting {$waitTime} seconds...\n";
            sleep($waitTime);
        }

        return true; // Retry the request
    }

    // Retry on connection errors (timeouts, DNS failures, etc.)
    if ($exception instanceof ConnectException) {
        $waitTime = pow(2, $retry);
        sleep($waitTime);
        return true;
    }

    return false;
}));

$client = new GuzzleHttp\Client(['handler' => $stack]);
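
With this middleware in place, requests can be issued as usual; retries happen transparently, and anything that still fails after the maximum attempts surfaces as an exception. A short usage sketch, assuming $urls is defined as in the earlier examples:

foreach ($urls as $url) {
    try {
        $response = $client->request('GET', $url);
        echo "Scraped: {$url}\n";
    } catch (Exception $e) {
        // Raised only after the retry middleware has given up
        echo "Failed: {$url} - " . $e->getMessage() . "\n";
    }
}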

3. Dynamic Rate Limit Monitoring

Monitor response headers to dynamically adjust your scraping speed based on real-time rate limit information.

<?php
class RateLimitMonitor {
    private $client;
    private $lastRequestTime;
    private $remainingRequests;
    private $resetTime;

    public function __construct() {
        $this->client = new GuzzleHttp\Client();
        $this->lastRequestTime = 0;
    }

    public function makeRequest($url) {
        // Check if we need to wait before making another request
        $this->waitIfNecessary();

        $response = $this->client->request('GET', $url);
        $this->updateRateLimitInfo($response);

        return $response;
    }

    private function updateRateLimitInfo($response) {
        // getHeaderLine() is case-insensitive; header names vary between APIs
        $remaining = $response->getHeaderLine('X-RateLimit-Remaining');
        if ($remaining === '') {
            $remaining = $response->getHeaderLine('X-Rate-Limit-Remaining');
        }
        $this->remainingRequests = ($remaining !== '') ? (int) $remaining : null;

        $reset = $response->getHeaderLine('X-RateLimit-Reset');
        if ($reset === '') {
            $reset = $response->getHeaderLine('X-Rate-Limit-Reset');
        }
        $this->resetTime = ($reset !== '') ? (int) $reset : null;

        $this->lastRequestTime = time();

        echo "Remaining requests: {$this->remainingRequests}\n";
    }

    private function waitIfNecessary() {
        if ($this->remainingRequests !== null && $this->remainingRequests < 2) {
            if ($this->resetTime) {
                // X-RateLimit-Reset is usually a Unix timestamp of when the window resets
                $waitTime = max(0, $this->resetTime - time());
                echo "Low rate limit remaining. Waiting {$waitTime} seconds...\n";
                sleep($waitTime);
            }
        }
    }
}

// Usage
$monitor = new RateLimitMonitor();
foreach ($urls as $url) {
    try {
        $response = $monitor->makeRequest($url);
        // Process response...
    } catch (Exception $e) {
        echo "Error: " . $e->getMessage() . "\n";
    }
}

4. Exponential Backoff Strategy

Implement a robust exponential backoff strategy that increases wait times progressively when rate limits are hit.

<?php
class ExponentialBackoff {
    private $client;
    private $maxRetries;
    private $baseDelay;

    public function __construct($maxRetries = 5, $baseDelay = 1) {
        $this->client = new GuzzleHttp\Client();
        $this->maxRetries = $maxRetries;
        $this->baseDelay = $baseDelay;
    }

    public function requestWithBackoff($url) {
        $attempt = 0;

        while ($attempt < $this->maxRetries) {
            try {
                $response = $this->client->request('GET', $url);

                // Request succeeded; return the response
                return $response;

            } catch (GuzzleHttp\Exception\ClientException $e) {
                if ($e->getResponse()->getStatusCode() === 429) {
                    $attempt++;

                    if ($attempt >= $this->maxRetries) {
                        throw new Exception("Max retries exceeded for rate limits");
                    }

                    // Exponential backoff: 1, 2, 4, 8 seconds with the default of 5 attempts
                    $waitTime = $this->baseDelay * pow(2, $attempt - 1);

                    // Add jitter to prevent thundering herd
                    $jitter = mt_rand(0, 1000) / 1000; // 0-1 second
                    $waitTime += $jitter;

                    echo "Rate limited. Attempt {$attempt}, waiting {$waitTime} seconds...\n";
                    // usleep() takes microseconds, preserving the fractional jitter
                    usleep((int) ($waitTime * 1000000));

                } else {
                    throw $e; // Re-throw non-rate-limit errors
                }
            }
        }
    }
}

// Usage
$backoff = new ExponentialBackoff();
foreach ($urls as $url) {
    try {
        $response = $backoff->requestWithBackoff($url);
        echo "Success: " . $url . "\n";
    } catch (Exception $e) {
        echo "Failed: " . $url . " - " . $e->getMessage() . "\n";
    }
}

5. Advanced: Concurrent Pool with Rate Limiting

For high-volume scraping, use Guzzle's concurrent requests with built-in rate limiting.

<?php
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
use GuzzleHttp\Exception\RequestException;

$client = new GuzzleHttp\Client();
$requests = function ($urls) {
    foreach ($urls as $url) {
        yield new Request('GET', $url);
    }
};

$pool = new Pool($client, $requests($urls), [
    'concurrency' => 5, // Limit concurrent requests
    'fulfilled' => function ($response, $index) {
        echo "Request {$index} completed\n";
    },
    'rejected' => function ($reason, $index) {
        // Only RequestExceptions carry a response; connection failures do not
        if ($reason instanceof RequestException
            && $reason->getResponse()
            && $reason->getResponse()->getStatusCode() === 429) {
            echo "Request {$index} rate limited\n";
            // Could implement a queue for retry
        }
    },
]);

// Wait for this batch to finish, then pause before starting the next one
$promise = $pool->promise();
$promise->wait();
sleep(2); // Delay between batches
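
To work through a large URL list in several batches, the same pool can be wrapped in a loop. This is a sketch assuming the $client and $requests generator defined above; the chunk size of 25 is arbitrary and should be tuned to the target's limits.

foreach (array_chunk($urls, 25) as $batch) {
    $pool = new Pool($client, $requests($batch), [
        'concurrency' => 5,
        'fulfilled' => function ($response, $index) {
            echo "Request {$index} completed\n";
        },
        'rejected' => function ($reason, $index) {
            echo "Request {$index} failed\n";
        },
    ]);

    $pool->promise()->wait();
    sleep(2); // Pause between batches to stay under the rate limit
}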

6. Queue-Based Approach

For enterprise-level scraping, implement a queue system to manage request rates across multiple workers.

<?php
class RateLimitedQueue {
    private $queue;
    private $requestsPerSecond;
    private $lastRequestTime;

    public function __construct($requestsPerSecond = 1) {
        $this->queue = [];
        $this->requestsPerSecond = $requestsPerSecond;
        $this->lastRequestTime = 0;
    }

    public function addUrl($url) {
        $this->queue[] = $url;
    }

    public function processQueue() {
        $client = new GuzzleHttp\Client();

        while (!empty($this->queue)) {
            $url = array_shift($this->queue);

            // Ensure we don't exceed rate limit
            $timeSinceLastRequest = microtime(true) - $this->lastRequestTime;
            $minInterval = 1 / $this->requestsPerSecond;

            if ($timeSinceLastRequest < $minInterval) {
                $sleepTime = $minInterval - $timeSinceLastRequest;
                usleep((int) ($sleepTime * 1000000)); // Convert to microseconds
            }

            try {
                $response = $client->request('GET', $url);
                $this->lastRequestTime = microtime(true);

                echo "Processed: {$url}\n";

            } catch (GuzzleHttp\Exception\ClientException $e) {
                echo "Error processing {$url}: " . $e->getMessage() . "\n";

                // Re-queue on rate limit and back off before continuing
                if ($e->getResponse()->getStatusCode() === 429) {
                    $this->queue[] = $url; // Add back to queue
                    sleep(60); // Wait before retrying
                }
            } catch (Exception $e) {
                echo "Error processing {$url}: " . $e->getMessage() . "\n";
            }
        }
    }
}
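
Usage mirrors the earlier examples; a short sketch, again assuming $urls holds the list of pages to scrape:

// Usage
$queue = new RateLimitedQueue(2); // Allow at most 2 requests per second
foreach ($urls as $url) {
    $queue->addUrl($url);
}
$queue->processQueue();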

Best Practices Summary

  1. Start Conservative: Begin with generous delays and gradually optimize
  2. Monitor Headers: Always check for rate limit headers in responses
  3. Implement Retries: Use exponential backoff with jitter for failed requests
  4. Respect Retry-After: Honor server-provided retry delays
  5. Log Everything: Track rate limit hits and adjust strategies accordingly (see the logging sketch after this list)
  6. Use Middleware: Leverage Guzzle's middleware for automatic handling
  7. Test Thoroughly: Validate your rate limiting under various conditions
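
As a concrete example of the logging point above, a small response-mapping middleware can record rate limit information on every response. This is a sketch: the X-RateLimit-* header names vary between servers, and error_log() is just a stand-in for whatever logger you use.

<?php
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use Psr\Http\Message\ResponseInterface;

$stack = HandlerStack::create();

// Record rate limit details on every response so limits can be tuned over time
$stack->push(Middleware::mapResponse(function (ResponseInterface $response) {
    $remaining = $response->getHeaderLine('X-RateLimit-Remaining');

    if ($response->getStatusCode() === 429 || $remaining !== '') {
        error_log(sprintf(
            '[rate-limit] status=%d remaining=%s retry-after=%s',
            $response->getStatusCode(),
            $remaining !== '' ? $remaining : 'n/a',
            $response->getHeaderLine('Retry-After') ?: 'n/a'
        ));
    }

    return $response; // Pass the response through unchanged
}));

$client = new GuzzleHttp\Client(['handler' => $stack]);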

Remember to always check the target website's robots.txt and terms of service to ensure compliance with their scraping policies.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
