How Can I Implement Rate Limiting in PHP Web Scraping Scripts?

Rate limiting is a crucial technique in web scraping that controls the frequency of requests sent to a target website. Implementing proper rate limiting helps you avoid being blocked, reduces server load on the target site, and ensures your scraping activities remain respectful and sustainable.

Why Rate Limiting is Essential

Rate limiting serves several important purposes:

  • Prevents IP blocking: Websites often block IPs that make too many requests in a short time
  • Respects server resources: Reduces load on target websites
  • Improves stability: Prevents overwhelming your own system with concurrent requests
  • Legal compliance: Demonstrates good faith efforts to scrape responsibly
  • Better success rates: Steady, controlled requests often yield more reliable results

Basic Rate Limiting with PHP sleep() Function

The simplest approach to rate limiting in PHP is using the built-in sleep() function:

<?php
// Basic rate limiting with sleep
function scrapeWithDelay($urls, $delaySeconds = 1) {
    $results = [];

    foreach ($urls as $url) {
        // Make the request (file_get_contents returns false on failure)
        $content = file_get_contents($url);
        $results[] = $content;

        echo "Scraped: $url (waiting {$delaySeconds}s)\n";

        // Add delay between requests
        sleep($delaySeconds);
    }

    return $results;
}

$urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3'
];

scrapeWithDelay($urls, 2); // 2-second delay between requests
?>

For more precise timing, use usleep() for microsecond delays:

<?php
// Microsecond precision rate limiting
function preciseDelay($milliseconds) {
    usleep($milliseconds * 1000); // Convert to microseconds
}

foreach ($urls as $url) {
    $content = file_get_contents($url);
    preciseDelay(500); // 500ms delay
}
?>

Advanced Rate Limiting with cURL and Custom Headers

When using cURL for more control over HTTP requests, you can implement sophisticated rate limiting:

<?php
class RateLimitedScraper {
    private $requests = [];
    private $maxRequestsPerMinute;
    private $lastRequestTime = 0;
    private $minDelay;

    public function __construct($maxRequestsPerMinute = 60) {
        $this->maxRequestsPerMinute = $maxRequestsPerMinute;
        $this->minDelay = 60 / $maxRequestsPerMinute; // Seconds between requests
    }

    public function makeRequest($url, $options = []) {
        $this->enforceRateLimit();

        $ch = curl_init();
        curl_setopt_array($ch, [
            CURLOPT_URL => $url,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; PHP Scraper)',
            CURLOPT_TIMEOUT => 30,
        ] + $options);

        $response = curl_exec($ch);
        $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        $this->logRequest($url, $httpCode);

        return $response;
    }

    private function enforceRateLimit() {
        $currentTime = microtime(true);
        $timeSinceLastRequest = $currentTime - $this->lastRequestTime;

        if ($timeSinceLastRequest < $this->minDelay) {
            $sleepTime = $this->minDelay - $timeSinceLastRequest;
            usleep((int) ($sleepTime * 1000000)); // Convert to microseconds (usleep expects an int)
        }

        $this->lastRequestTime = microtime(true);
    }

    private function logRequest($url, $httpCode) {
        $this->requests[] = [
            'url' => $url,
            'timestamp' => time(),
            'http_code' => $httpCode
        ];

        // Clean old requests (older than 1 minute)
        $this->requests = array_filter($this->requests, function($req) {
            return time() - $req['timestamp'] < 60;
        });
    }

    public function getRequestCount() {
        return count($this->requests);
    }
}

// Usage example
$scraper = new RateLimitedScraper(30); // 30 requests per minute

$urls = [
    'https://example.com/api/data1',
    'https://example.com/api/data2',
    'https://example.com/api/data3'
];

foreach ($urls as $url) {
    $response = $scraper->makeRequest($url);
    echo "Current request count: " . $scraper->getRequestCount() . "\n";
}
?>

Implementing Token Bucket Algorithm

The token bucket algorithm provides more flexible rate limiting by allowing bursts while maintaining overall rate limits:

<?php
class TokenBucket {
    private $capacity;
    private $tokens;
    private $refillRate;
    private $lastRefill;

    public function __construct($capacity, $refillRate) {
        $this->capacity = $capacity;
        $this->tokens = $capacity;
        $this->refillRate = $refillRate; // tokens per second
        $this->lastRefill = microtime(true);
    }

    public function consume($tokens = 1) {
        $this->refill();

        if ($this->tokens >= $tokens) {
            $this->tokens -= $tokens;
            return true;
        }

        return false;
    }

    private function refill() {
        $now = microtime(true);
        $timePassed = $now - $this->lastRefill;
        $tokensToAdd = $timePassed * $this->refillRate;

        $this->tokens = min($this->capacity, $this->tokens + $tokensToAdd);
        $this->lastRefill = $now;
    }

    public function waitForTokens($tokens = 1) {
        while (!$this->consume($tokens)) {
            usleep(100000); // Wait 100ms before trying again
        }
    }
}

class TokenBucketScraper {
    private $tokenBucket;

    public function __construct($requestsPerSecond = 1, $burstCapacity = 5) {
        $this->tokenBucket = new TokenBucket($burstCapacity, $requestsPerSecond);
    }

    public function scrapeUrl($url) {
        $this->tokenBucket->waitForTokens();

        $ch = curl_init();
        curl_setopt_array($ch, [
            CURLOPT_URL => $url,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT => 30,
        ]);

        $response = curl_exec($ch);
        curl_close($ch);

        return $response;
    }
}

// Usage
$scraper = new TokenBucketScraper(2, 10); // 2 requests/second, burst of 10

$urls = array_fill(0, 20, 'https://httpbin.org/delay/1');

foreach ($urls as $index => $url) {
    $start = microtime(true);
    $response = $scraper->scrapeUrl($url);
    $duration = microtime(true) - $start;

    echo "Request $index completed in " . round($duration, 2) . "s\n";
}
?>

Database-Backed Rate Limiting

For more persistent and distributed rate limiting, use a database to track requests:

<?php
class DatabaseRateLimiter {
    private $pdo;
    private $identifier;

    public function __construct($pdo, $identifier = null) {
        $this->pdo = $pdo;
        $this->identifier = $identifier ?? $_SERVER['REMOTE_ADDR'] ?? 'default';

        // Create table if not exists
        $this->pdo->exec("
            CREATE TABLE IF NOT EXISTS rate_limits (
                identifier VARCHAR(255) PRIMARY KEY,
                request_count INT DEFAULT 0,
                window_start TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                INDEX idx_window_start (window_start)
            )
        ");
    }

    public function isAllowed($maxRequests = 100, $windowMinutes = 60) {
        $this->cleanOldEntries($windowMinutes);

        $stmt = $this->pdo->prepare("
            SELECT request_count, window_start 
            FROM rate_limits 
            WHERE identifier = ?
        ");
        $stmt->execute([$this->identifier]);
        $result = $stmt->fetch(PDO::FETCH_ASSOC);

        if (!$result) {
            $this->initializeEntry();
            return true;
        }

        $windowStart = strtotime($result['window_start']);
        $currentTime = time();

        // Reset counter if window has expired
        if ($currentTime - $windowStart > ($windowMinutes * 60)) {
            $this->resetEntry();
            return true;
        }

        // Check if limit exceeded
        if ($result['request_count'] >= $maxRequests) {
            return false;
        }

        // Increment counter
        $this->incrementCounter();
        return true;
    }

    private function initializeEntry() {
        $stmt = $this->pdo->prepare("
            INSERT INTO rate_limits (identifier, request_count, window_start) 
            VALUES (?, 1, NOW()) 
            ON DUPLICATE KEY UPDATE 
            request_count = 1, window_start = NOW()
        ");
        $stmt->execute([$this->identifier]);
    }

    private function resetEntry() {
        $stmt = $this->pdo->prepare("
            UPDATE rate_limits 
            SET request_count = 1, window_start = NOW() 
            WHERE identifier = ?
        ");
        $stmt->execute([$this->identifier]);
    }

    private function incrementCounter() {
        $stmt = $this->pdo->prepare("
            UPDATE rate_limits 
            SET request_count = request_count + 1 
            WHERE identifier = ?
        ");
        $stmt->execute([$this->identifier]);
    }

    private function cleanOldEntries($windowMinutes) {
        $stmt = $this->pdo->prepare("
            DELETE FROM rate_limits 
            WHERE window_start < DATE_SUB(NOW(), INTERVAL ? MINUTE)
        ");
        $stmt->execute([$windowMinutes * 2]); // Keep double the window for safety
    }
}

// Usage with MySQL (the SQL above uses MySQL-specific syntax such as ON DUPLICATE KEY UPDATE)
try {
    $pdo = new PDO('mysql:host=localhost;dbname=scraper', $username, $password);
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    $rateLimiter = new DatabaseRateLimiter($pdo, 'scraper_bot');

    if ($rateLimiter->isAllowed(1000, 60)) { // 1000 requests per hour
        // Proceed with scraping
        $response = file_get_contents('https://example.com');
        echo "Request successful\n";
    } else {
        echo "Rate limit exceeded. Please wait.\n";
    }
} catch (PDOException $e) {
    echo "Database error: " . $e->getMessage() . "\n";
}
?>

Adaptive Rate Limiting Based on Response Codes

Implement smart rate limiting that adjusts based on server responses:

<?php
class AdaptiveRateLimiter {
    private $baseDelay;
    private $currentDelay;
    private $maxDelay;
    private $backoffMultiplier;

    public function __construct($baseDelay = 1, $maxDelay = 60, $backoffMultiplier = 2) {
        $this->baseDelay = $baseDelay;
        $this->currentDelay = $baseDelay;
        $this->maxDelay = $maxDelay;
        $this->backoffMultiplier = $backoffMultiplier;
    }

    public function makeRequest($url) {
        // currentDelay may be fractional after adjustments, so use usleep
        usleep((int) round($this->currentDelay * 1000000));

        $ch = curl_init();
        curl_setopt_array($ch, [
            CURLOPT_URL => $url,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT => 30,
        ]);

        $response = curl_exec($ch);
        $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        $this->adjustDelay($httpCode);

        return [
            'response' => $response,
            'http_code' => $httpCode,
            'current_delay' => $this->currentDelay
        ];
    }

    private function adjustDelay($httpCode) {
        switch (true) {
            case $httpCode === 429: // Too Many Requests
                $this->currentDelay = min(
                    $this->maxDelay,
                    $this->currentDelay * $this->backoffMultiplier
                );
                echo "Rate limited! Increasing delay to {$this->currentDelay}s\n";
                break;

            case $httpCode >= 500: // Server errors
                $this->currentDelay = min(
                    $this->maxDelay,
                    $this->currentDelay * 1.5
                );
                echo "Server error! Increasing delay to {$this->currentDelay}s\n";
                break;

            case $httpCode === 200: // Success
                // Gradually decrease delay on success
                $this->currentDelay = max(
                    $this->baseDelay,
                    $this->currentDelay * 0.9
                );
                break;
        }
    }
}

// Usage
$scraper = new AdaptiveRateLimiter(1, 30, 2);

$urls = [
    'https://httpbin.org/status/200',
    'https://httpbin.org/status/429',
    'https://httpbin.org/status/200',
];

foreach ($urls as $url) {
    $result = $scraper->makeRequest($url);
    echo "URL: $url, Status: {$result['http_code']}, Delay: {$result['current_delay']}s\n";
}
?>
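The adaptive limiter above chooses its own backoff on a 429, but many servers also send a Retry-After header stating exactly how long to wait. A minimal sketch of reading it (the helper name retryAfterSeconds is mine; this handles only the delay-in-seconds form, not the HTTP-date form the header may also carry):

```php
<?php
// Extract a Retry-After delay (in seconds) from raw response headers.
// Returns null when the header is absent or not a plain number.
function retryAfterSeconds(string $rawHeaders): ?int {
    if (preg_match('/^Retry-After:\s*(\d+)/mi', $rawHeaders, $m)) {
        return (int) $m[1];
    }
    return null; // could also be an HTTP-date; not handled in this sketch
}

// With cURL, set CURLOPT_HEADER => true so headers are prepended to the body,
// then split them off with CURLINFO_HEADER_SIZE:
// $raw = curl_exec($ch);
// $headers = substr($raw, 0, curl_getinfo($ch, CURLINFO_HEADER_SIZE));
// if (curl_getinfo($ch, CURLINFO_HTTP_CODE) === 429) {
//     sleep(retryAfterSeconds($headers) ?? 60); // fall back to a default wait
// }
```

Honoring the server's own hint is usually more accurate than a fixed multiplier, since the server knows when its window resets.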

Handling Multiple Domains with Different Rate Limits

When scraping multiple domains, each may have different rate limiting requirements:

<?php
class MultiDomainRateLimiter {
    private $domainLimits = [];
    private $lastRequests = [];

    public function addDomain($domain, $requestsPerMinute) {
        $this->domainLimits[$domain] = [
            'max_requests' => $requestsPerMinute,
            'min_delay' => 60 / $requestsPerMinute
        ];
        $this->lastRequests[$domain] = 0;
    }

    public function makeRequest($url) {
        $domain = parse_url($url, PHP_URL_HOST);

        if (!isset($this->domainLimits[$domain])) {
            throw new Exception("Rate limit not configured for domain: $domain");
        }

        $this->enforceRateLimit($domain);

        $ch = curl_init();
        curl_setopt_array($ch, [
            CURLOPT_URL => $url,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT => 30,
        ]);

        $response = curl_exec($ch);
        curl_close($ch);

        $this->lastRequests[$domain] = microtime(true);

        return $response;
    }

    private function enforceRateLimit($domain) {
        $config = $this->domainLimits[$domain];
        $lastRequest = $this->lastRequests[$domain];
        $currentTime = microtime(true);

        $timeSinceLastRequest = $currentTime - $lastRequest;

        if ($timeSinceLastRequest < $config['min_delay']) {
            $sleepTime = $config['min_delay'] - $timeSinceLastRequest;
            usleep((int) ($sleepTime * 1000000));
        }
    }
}

// Configuration
$scraper = new MultiDomainRateLimiter();
$scraper->addDomain('api.example.com', 60);      // 60 requests/minute
$scraper->addDomain('data.example.org', 30);     // 30 requests/minute
$scraper->addDomain('slow-api.example.net', 10); // 10 requests/minute

// Usage
$urls = [
    'https://api.example.com/data1',
    'https://data.example.org/info',
    'https://slow-api.example.net/content',
    'https://api.example.com/data2',
];

foreach ($urls as $url) {
    $response = $scraper->makeRequest($url);
    echo "Scraped: $url\n";
}
?>

Best Practices for Rate Limiting

  1. Start Conservative: Begin with longer delays and gradually optimize
  2. Monitor Response Codes: Watch for 429 (Too Many Requests) and 503 (Service Unavailable)
  3. Implement Exponential Backoff: Increase delays progressively when encountering rate limits
  4. Respect robots.txt: Check crawl-delay directives
  5. Use Random Delays: Add randomization to avoid predictable patterns
  6. Log Everything: Track request patterns and response codes
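The random delays from practice #5 can be layered onto any of the fixed-delay approaches shown earlier. A minimal sketch (the helper name jitteredDelayMs is illustrative, not a standard API):

```php
<?php
// Return a randomized delay around a base value, spread +/- jitterFraction.
// Randomized spacing avoids the perfectly regular request pattern that
// anti-bot systems look for.
function jitteredDelayMs(int $baseMs, float $jitterFraction = 0.3): int {
    $spread = (int) round($baseMs * $jitterFraction);
    return random_int($baseMs - $spread, $baseMs + $spread);
}

// Usage between requests: sleep a randomized 700-1300 ms
// usleep(jitteredDelayMs(1000) * 1000);
```

The same helper can replace the fixed sleep() in the basic example or the minimum delay in the cURL-based class.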

Conclusion

Implementing effective rate limiting in PHP web scraping scripts is essential for sustainable and respectful data extraction. Whether you use simple sleep() calls, a token bucket algorithm, or a database-backed solution, the key is finding the right balance between efficiency and compliance with website policies. For more complex scenarios involving JavaScript-heavy sites, consider tools that render dynamic content after page navigation, or add retry logic for failed requests.

Remember to always test your rate limiting implementation thoroughly and adjust parameters based on the specific requirements and behavior of your target websites.
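The retry logic mentioned above can be sketched as a generic wrapper with exponential backoff (the function name withRetries and its parameters are illustrative, not from a library):

```php
<?php
// Retry a callable with exponential backoff. The callable signals failure by
// returning false; waits between attempts are $baseDelaySeconds, 2x, 4x, ...
function withRetries(callable $fn, int $maxAttempts = 3, int $baseDelaySeconds = 1) {
    for ($attempt = 0; $attempt < $maxAttempts; $attempt++) {
        $result = $fn();
        if ($result !== false) {
            return $result;
        }
        if ($attempt < $maxAttempts - 1) {
            sleep($baseDelaySeconds * (2 ** $attempt));
        }
    }
    return false; // all attempts failed
}

// Usage: retry a flaky fetch up to 3 times
// $html = withRetries(fn() => @file_get_contents('https://example.com'));
```

Combined with any of the rate limiters above, this keeps transient failures from derailing a long scraping run.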

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
