How Do I Handle Rate Limiting When Scraping Multiple Pages?

Rate limiting is one of the most common challenges when scraping multiple pages from websites. When you make too many requests too quickly, servers often respond with HTTP 429 (Too Many Requests) errors or temporarily block your IP address. This guide covers practical strategies for handling rate limiting while keeping your scraping efficient.

Understanding Rate Limiting

Rate limiting is a technique used by web servers to control the number of requests a client can make within a specific time window. Common rate limiting patterns include:

  • Request-per-second limits (e.g., 10 requests per second)
  • Request-per-minute limits (e.g., 100 requests per minute)
  • Burst limits (allowing short bursts but limiting sustained traffic)
  • IP-based blocking after threshold violations
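
On the client side, you can mirror these limits with a token bucket: it permits short bursts up to a fixed capacity while capping the sustained request rate. Below is a minimal illustrative sketch; the TokenBucket class and its 50 ms polling interval are arbitrary choices for this guide, not a library API:

<?php
// Minimal client-side token bucket: allows bursts of up to $burst requests,
// then throttles to $ratePerSecond sustained requests per second.
class TokenBucket {
    private $capacity;   // maximum burst size
    private $tokens;     // tokens currently available
    private $refillRate; // tokens added per second
    private $lastRefill; // timestamp of the last refill

    public function __construct($ratePerSecond, $burst) {
        $this->capacity = $burst;
        $this->tokens = $burst;
        $this->refillRate = $ratePerSecond;
        $this->lastRefill = microtime(true);
    }

    // Block until a request "token" is available, then consume it.
    public function waitForToken() {
        while (true) {
            $now = microtime(true);
            $this->tokens = min(
                $this->capacity,
                $this->tokens + ($now - $this->lastRefill) * $this->refillRate
            );
            $this->lastRefill = $now;

            if ($this->tokens >= 1) {
                $this->tokens -= 1;
                return;
            }

            usleep(50000); // wait 50 ms before checking again
        }
    }
}

// Usage: at most 10 requests/second sustained, with bursts of up to 5
$bucket = new TokenBucket(10, 5);
$bucket->waitForToken(); // call before each request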

Basic Delay Implementation

The simplest approach to handle rate limiting is implementing delays between requests. Here's how to do it with Simple HTML DOM Parser in PHP:

<?php
require_once 'simple_html_dom.php';

function scrapeWithDelay($urls, $delaySeconds = 1) {
    $results = [];

    foreach ($urls as $url) {
        // Add delay before each request (except the first one)
        if (!empty($results)) {
            sleep($delaySeconds);
        }

        try {
            $html = file_get_html($url);
            if ($html) {
                // Extract your data here
                // Guard against pages without a <title> element
                $titleNode = $html->find('title', 0);
                $title = $titleNode ? $titleNode->plaintext : 'No title';
                $results[] = [
                    'url' => $url,
                    'title' => $title,
                    'timestamp' => date('Y-m-d H:i:s')
                ];
                $html->clear();
            }
        } catch (Exception $e) {
            error_log("Error scraping $url: " . $e->getMessage());
        }
    }

    return $results;
}

// Usage
$urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3'
];

$results = scrapeWithDelay($urls, 2); // 2-second delay between requests
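
A common refinement is to add random jitter so requests do not land at perfectly regular intervals. A small sketch, where jitteredSleep() is a hypothetical helper rather than part of Simple HTML DOM:

<?php
// Hypothetical helper: sleep for the base delay plus a random extra amount
function jitteredSleep($baseSeconds, $maxJitterSeconds = 1) {
    $jitter = mt_rand(0, (int) ($maxJitterSeconds * 1000)) / 1000;
    usleep((int) (($baseSeconds + $jitter) * 1000000));
}

// Drop-in replacement for sleep($delaySeconds) in scrapeWithDelay()
jitteredSleep(2, 1); // sleeps between 2 and 3 seconds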

Advanced Rate Limiting with Exponential Backoff

For more robust handling, implement exponential backoff when you encounter rate limit errors:

<?php
class RateLimitedScraper {
    private $maxRetries;
    private $baseDelay;
    private $maxDelay;

    public function __construct($maxRetries = 3, $baseDelay = 1, $maxDelay = 60) {
        $this->maxRetries = $maxRetries;
        $this->baseDelay = $baseDelay;
        $this->maxDelay = $maxDelay;
    }

    public function scrapeWithRetry($url) {
        $attempt = 0;

        while ($attempt <= $this->maxRetries) {
            try {
                $context = stream_context_create([
                    'http' => [
                        'timeout' => 30,
                        'user_agent' => 'Mozilla/5.0 (Compatible Scraper)'
                    ]
                ]);

                $html = file_get_html($url, false, $context);

                if ($html === false) {
                    throw new Exception("Failed to fetch HTML");
                }

                return $html;

            } catch (Exception $e) {
                $attempt++;

                // Check if it's a rate limiting error
                if ($this->isRateLimitError($e) && $attempt <= $this->maxRetries) {
                    $delay = min(
                        $this->baseDelay * pow(2, $attempt - 1),
                        $this->maxDelay
                    );

                    echo "Rate limited. Waiting {$delay} seconds before retry {$attempt}...\n";
                    sleep($delay);
                } else {
                    throw $e;
                }
            }
        }

        throw new Exception("Max retries exceeded for URL: $url");
    }

    private function isRateLimitError($exception) {
        // Case-insensitive heuristic: look for rate-limit hints in the message
        $message = strtolower($exception->getMessage());
        return strpos($message, '429') !== false ||
               strpos($message, 'rate limit') !== false ||
               strpos($message, 'too many requests') !== false;
    }
}

// Usage
$scraper = new RateLimitedScraper(3, 2, 30);

$urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3'
];

foreach ($urls as $url) {
    try {
        $html = $scraper->scrapeWithRetry($url);
        // Process the HTML
        $titleNode = $html->find('title', 0);
        $title = $titleNode ? $titleNode->plaintext : 'No title';
        echo "Scraped: $title\n";
        $html->clear();

        // Base delay between successful requests
        sleep(1);
    } catch (Exception $e) {
        echo "Failed to scrape $url: " . $e->getMessage() . "\n";
    }
}
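
One caveat: file_get_html() does not expose the HTTP status code, so the string matching in isRateLimitError() only fires when the failure message happens to contain one of those hints. For reliable 429 detection, one option is to fetch with cURL and hand the body to Simple HTML DOM's str_get_html(); the sketch below could stand in for the file_get_html() call inside scrapeWithRetry():

<?php
// Sketch: fetch with cURL so the HTTP status code is available, then parse
// the body with str_get_html() from Simple HTML DOM.
function fetchWithStatus($url) {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Compatible Scraper)',
    ]);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($status === 429) {
        // Message includes "429", so isRateLimitError() triggers the backoff
        throw new Exception("429 Too Many Requests for $url");
    }

    if ($body === false || $status >= 400) {
        throw new Exception("HTTP $status: failed to fetch $url");
    }

    return str_get_html($body);
}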

JavaScript Implementation with Async/Await

When working with JavaScript and Node.js, you can implement similar rate limiting strategies:

const cheerio = require('cheerio');
const axios = require('axios');

class RateLimitedScraper {
    constructor(maxRetries = 3, baseDelay = 1000, maxDelay = 60000) {
        this.maxRetries = maxRetries;
        this.baseDelay = baseDelay;
        this.maxDelay = maxDelay;
    }

    async delay(ms) {
        return new Promise(resolve => setTimeout(resolve, ms));
    }

    async scrapeWithRetry(url) {
        let attempt = 0;

        while (attempt <= this.maxRetries) {
            try {
                const response = await axios.get(url, {
                    timeout: 30000,
                    headers: {
                        'User-Agent': 'Mozilla/5.0 (Compatible Scraper)'
                    }
                });

                return cheerio.load(response.data);

            } catch (error) {
                attempt++;

                if (this.isRateLimitError(error) && attempt <= this.maxRetries) {
                    const delayMs = Math.min(
                        this.baseDelay * Math.pow(2, attempt - 1),
                        this.maxDelay
                    );

                    console.log(`Rate limited. Waiting ${delayMs}ms before retry ${attempt}...`);
                    await this.delay(delayMs);
                } else {
                    throw error;
                }
            }
        }

        throw new Error(`Max retries exceeded for URL: ${url}`);
    }

    isRateLimitError(error) {
        return error.response && error.response.status === 429;
    }

    async scrapeMultiplePages(urls, delayBetweenRequests = 1000) {
        const results = [];

        for (let i = 0; i < urls.length; i++) {
            try {
                const $ = await this.scrapeWithRetry(urls[i]);
                const title = $('title').text() || 'No title';

                results.push({
                    url: urls[i],
                    title: title,
                    timestamp: new Date().toISOString()
                });

                // Add delay between requests (except for the last one)
                if (i < urls.length - 1) {
                    await this.delay(delayBetweenRequests);
                }

            } catch (error) {
                console.error(`Failed to scrape ${urls[i]}:`, error.message);
            }
        }

        return results;
    }
}

// Usage
async function main() {
    const scraper = new RateLimitedScraper(3, 2000, 30000);

    const urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3'
    ];

    const results = await scraper.scrapeMultiplePages(urls, 2000);
    console.log('Scraping results:', results);
}

main().catch(console.error);

Implementing Request Queues

For large-scale scraping operations, consider implementing a request queue system:

<?php
class ScrapingQueue {
    private $queue = [];
    private $processing = false;
    private $requestsPerMinute;
    private $lastRequestTime;

    public function __construct($requestsPerMinute = 30) {
        $this->requestsPerMinute = $requestsPerMinute;
        $this->lastRequestTime = 0;
    }

    public function addUrl($url, $callback = null) {
        $this->queue[] = [
            'url' => $url,
            'callback' => $callback,
            'attempts' => 0
        ];
    }

    public function processQueue() {
        $this->processing = true;
        $minInterval = 60 / $this->requestsPerMinute; // seconds between requests

        while (!empty($this->queue) && $this->processing) {
            $item = array_shift($this->queue);

            // Ensure minimum interval between requests
            $timeSinceLastRequest = microtime(true) - $this->lastRequestTime;
            if ($timeSinceLastRequest < $minInterval) {
                $sleepTime = $minInterval - $timeSinceLastRequest;
                usleep((int) ($sleepTime * 1000000)); // Convert to microseconds
            }

            try {
                $html = file_get_html($item['url']);
                $this->lastRequestTime = microtime(true);

                // Treat a failed fetch as an error so the retry logic below runs
                if ($html === false) {
                    throw new Exception("Failed to fetch " . $item['url']);
                }

                if ($item['callback']) {
                    call_user_func($item['callback'], $html, $item['url']);
                }

                $html->clear();

            } catch (Exception $e) {
                $item['attempts']++;

                // Retry logic
                if ($item['attempts'] < 3) {
                    // Re-add to end of queue for retry
                    $this->queue[] = $item;
                    // Add extra delay for failed requests
                    sleep(5);
                } else {
                    error_log("Failed to scrape after 3 attempts: " . $item['url']);
                }
            }
        }
    }

    public function stop() {
        $this->processing = false;
    }
}

// Usage
$queue = new ScrapingQueue(20); // 20 requests per minute

// Add URLs to queue
$urls = ['https://example.com/page1', 'https://example.com/page2'];
foreach ($urls as $url) {
    $queue->addUrl($url, function($html, $url) {
        $titleNode = $html->find('title', 0);
        $title = $titleNode ? $titleNode->plaintext : 'No title';
        echo "Scraped $url: $title\n";
    });
}

$queue->processQueue();

Monitoring and Adaptive Rate Limiting

Implement monitoring to automatically adjust your scraping rate based on server responses:

<?php
class AdaptiveRateLimiter {
    private $successCount = 0;
    private $errorCount = 0;
    private $currentDelay = 1;
    private $minDelay = 0.5;
    private $maxDelay = 10;

    public function adjustDelay($success) {
        if ($success) {
            $this->successCount++;
            // Gradually decrease delay on success
            if ($this->successCount % 10 == 0) {
                $this->currentDelay = max($this->minDelay, $this->currentDelay * 0.9);
            }
        } else {
            $this->errorCount++;
            // Increase delay on error
            $this->currentDelay = min($this->maxDelay, $this->currentDelay * 1.5);
            $this->successCount = 0; // Reset success count
        }
    }

    public function getDelay() {
        return $this->currentDelay;
    }

    public function getStats() {
        return [
            'success_count' => $this->successCount,
            'error_count' => $this->errorCount,
            'current_delay' => $this->currentDelay
        ];
    }
}
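
To use the limiter, record each request's outcome and sleep for the current delay before the next one. A minimal usage sketch; the fetch is shown with file_get_html(), but any fetch function works:

<?php
// Usage: feed each request's outcome back into the limiter
$limiter = new AdaptiveRateLimiter();
$urls = ['https://example.com/page1', 'https://example.com/page2'];

foreach ($urls as $url) {
    $html = file_get_html($url);
    $success = ($html !== false);
    $limiter->adjustDelay($success);

    if ($success) {
        // ... extract data here ...
        $html->clear();
    }

    // Sleep for the current adaptive delay (may be fractional seconds)
    usleep((int) ($limiter->getDelay() * 1000000));
}

print_r($limiter->getStats());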

Best Practices for Rate Limiting

  1. Respect robots.txt: Always check the website's robots.txt file for crawl delays and restrictions; a minimal Crawl-delay check is sketched after this list.

  2. Use appropriate User-Agent headers: Identify your scraper properly and include contact information.

  3. Implement graceful degradation: When rate limited, gradually reduce your request rate rather than stopping completely.

  4. Monitor server response times: Slower responses might indicate server stress; adjust your rate accordingly.

  5. Use distributed scraping: For large-scale operations, consider running multiple scrapers in parallel across different IP addresses.

  6. Cache responses: Avoid re-scraping the same content by implementing intelligent caching mechanisms.

  7. Handle different rate limit types: Some sites have different limits for different endpoints or user types.
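
Picking up on the first point, you can read an advertised Crawl-delay directly from robots.txt before choosing your request interval. A simplified sketch: it ignores User-agent sections and Disallow rules, and the one-second default is an assumption:

<?php
// Simplified sketch: return the first Crawl-delay found in robots.txt, or a
// default when none is advertised. A full parser would also respect
// User-agent sections and Disallow rules.
function getCrawlDelay($baseUrl, $defaultSeconds = 1) {
    $robotsTxt = @file_get_contents(rtrim($baseUrl, '/') . '/robots.txt');
    if ($robotsTxt === false) {
        return $defaultSeconds;
    }

    foreach (explode("\n", $robotsTxt) as $line) {
        if (preg_match('/^\s*crawl-delay\s*:\s*([\d.]+)/i', $line, $m)) {
            return (float) $m[1];
        }
    }

    return $defaultSeconds;
}

// Usage
$delay = getCrawlDelay('https://example.com'); // e.g. 10.0 if advertised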

Console Commands for Testing

Test your rate limiting implementation with these commands:

# Test with curl to check rate limiting responses
curl -w "%{http_code}\n" -o /dev/null -s "https://example.com/api/endpoint"

# Monitor your scraping logs
tail -f scraping.log | grep -E "(429|rate|limit)"

# Check your current request rate
watch -n 1 'tail -20 access.log | grep "$(date +%H:%M)" | wc -l'

Advanced Considerations

When dealing with sophisticated anti-bot systems, you might need to combine rate limiting with other techniques like handling timeouts effectively and implementing proper session management.
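
For example, reusing a single cURL handle with a cookie jar keeps session cookies across requests, while explicit timeouts stop a slow server from stalling the whole crawl. A rough sketch; the cookie file path and timeout values are assumptions:

<?php
// Sketch: a reusable cURL handle with a cookie jar (session state persists
// across requests) and explicit connect/read timeouts.
function createSession($cookieFile = '/tmp/scraper_cookies.txt') {
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => 10,          // seconds to establish the connection
        CURLOPT_TIMEOUT        => 30,          // seconds for the whole request
        CURLOPT_COOKIEJAR      => $cookieFile, // write cookies here on close
        CURLOPT_COOKIEFILE     => $cookieFile, // send stored cookies back
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Compatible Scraper)',
    ]);
    return $ch;
}

// Reuse the same handle for every page in the crawl
$session = createSession();
curl_setopt($session, CURLOPT_URL, 'https://example.com/page1');
$html = curl_exec($session);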

For JavaScript-heavy sites that require more complex interactions, consider the additional overhead of browser automation tools and adjust your rate limits accordingly.

Conclusion

Effective rate limiting is crucial for sustainable web scraping operations. By implementing delays, retry logic with exponential backoff, and adaptive rate limiting, you can build robust scrapers that respect server resources while maintaining efficiency. Remember to always monitor your scraping performance and adjust your strategies based on the specific requirements and limitations of each target website.

The key is finding the right balance between scraping speed and server courtesy, ensuring your scraping operations remain both effective and respectful of the target website's resources.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
