How do I implement rate limiting to avoid being blocked while scraping?

Rate limiting is a crucial technique in web scraping that helps you avoid being detected and blocked by target websites. By controlling the frequency of your requests, you can mimic human browsing behavior and maintain a respectful scraping approach that doesn't overwhelm the server.

Understanding Rate Limiting in Web Scraping

Rate limiting involves introducing deliberate delays between HTTP requests to prevent overwhelming the target server. Most websites implement anti-bot measures that detect and block suspicious traffic patterns, including:

  • Too many requests per second from a single IP address
  • Consistent timing patterns that don't match human behavior
  • Lack of variation in request intervals
  • Missing typical browser behaviors like loading images and stylesheets

Implementing Rate Limiting with Alamofire

Basic Fixed Delay Implementation

Here's a simple Swift implementation using Alamofire with fixed delays:

import Alamofire
import Foundation

class RateLimitedScraper {
    private let session: Session
    private let requestDelay: TimeInterval
    private let queue = DispatchQueue(label: "scraping.queue", qos: .background)

    init(requestDelay: TimeInterval = 1.0) {
        self.requestDelay = requestDelay

        // Configure session with custom settings
        let configuration = URLSessionConfiguration.default
        configuration.timeoutIntervalForRequest = 30
        configuration.timeoutIntervalForResource = 60
        self.session = Session(configuration: configuration)
    }

    func scrapeURLs(_ urls: [String], completion: @escaping ([String: Any]) -> Void) {
        var results: [String: Any] = [:]
        let group = DispatchGroup()

        for (index, url) in urls.enumerated() {
            group.enter()

            // Calculate delay based on index
            let delay = Double(index) * requestDelay

            queue.asyncAfter(deadline: .now() + delay) {
                self.session.request(url)
                    .responseString { response in
                        switch response.result {
                        case .success(let html):
                            results[url] = html
                        case .failure(let error):
                            results[url] = "Error: \(error.localizedDescription)"
                        }
                        group.leave()
                    }
            }
        }

        group.notify(queue: .main) {
            completion(results)
        }
    }
}

// Usage
let scraper = RateLimitedScraper(requestDelay: 2.0)
let urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]

scraper.scrapeURLs(urls) { results in
    print("Scraping completed: \(results.keys.count) pages processed")
}

Advanced Random Delay Implementation

To better mimic human behavior, implement random delays with jitter:

class AdvancedRateLimitedScraper {
    private let session: Session
    private let minDelay: TimeInterval
    private let maxDelay: TimeInterval
    private let queue = DispatchQueue(label: "scraping.queue", qos: .background)

    init(minDelay: TimeInterval = 1.0, maxDelay: TimeInterval = 5.0) {
        self.minDelay = minDelay
        self.maxDelay = maxDelay

        let configuration = URLSessionConfiguration.default
        configuration.httpAdditionalHeaders = [
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
        ]
        self.session = Session(configuration: configuration)
    }

    private func randomDelay() -> TimeInterval {
        return TimeInterval.random(in: minDelay...maxDelay)
    }

    func scrapeWithRandomDelay(url: String, completion: @escaping (Result<String, Error>) -> Void) {
        let delay = randomDelay()

        queue.asyncAfter(deadline: .now() + delay) {
            self.session.request(url)
                .validate(statusCode: 200..<300)
                .responseString { response in
                    DispatchQueue.main.async {
                        completion(response.result.mapError { $0 as Error })
                    }
                }
        }
    }
}

Rate Limiting in Other Languages

Python with Requests

import time
import random
import requests
from concurrent.futures import ThreadPoolExecutor
import threading

class RateLimitedScraper:
    def __init__(self, min_delay=1.0, max_delay=3.0, max_workers=3):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.max_workers = max_workers
        self.last_request_time = 0
        self.lock = threading.Lock()

        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })

    def _wait_for_rate_limit(self):
        """Ensure minimum delay between requests"""
        with self.lock:
            current_time = time.time()
            delay = random.uniform(self.min_delay, self.max_delay)
            time_since_last = current_time - self.last_request_time

            if time_since_last < delay:
                sleep_time = delay - time_since_last
                time.sleep(sleep_time)

            self.last_request_time = time.time()

    def scrape_url(self, url):
        """Scrape a single URL with rate limiting"""
        self._wait_for_rate_limit()

        try:
            response = self.session.get(url, timeout=30)
            response.raise_for_status()
            return {'url': url, 'content': response.text, 'status': 'success'}
        except requests.exceptions.RequestException as e:
            return {'url': url, 'error': str(e), 'status': 'error'}

    def scrape_urls(self, urls):
        """Scrape multiple URLs with controlled concurrency"""
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            results = list(executor.map(self.scrape_url, urls))
        return results

# Usage
scraper = RateLimitedScraper(min_delay=1.5, max_delay=4.0, max_workers=2)
urls = ['https://example.com/page1', 'https://example.com/page2']
results = scraper.scrape_urls(urls)

JavaScript with Axios

const axios = require('axios');

class RateLimitedScraper {
    constructor(minDelay = 1000, maxDelay = 3000, maxConcurrent = 3) {
        this.minDelay = minDelay;
        this.maxDelay = maxDelay;
        this.maxConcurrent = maxConcurrent;
        this.lastRequestTime = 0;
        this.activeRequests = 0;

        this.client = axios.create({
            timeout: 30000,
            headers: {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
            }
        });
    }

    async delay() {
        const randomDelay = Math.random() * (this.maxDelay - this.minDelay) + this.minDelay;
        const now = Date.now();

        // Reserve the next slot synchronously (before any await) so that
        // concurrently started requests queue up instead of all firing at once
        const scheduledTime = Math.max(now, this.lastRequestTime + randomDelay);
        this.lastRequestTime = scheduledTime;

        const waitTime = scheduledTime - now;
        if (waitTime > 0) {
            await new Promise(resolve => setTimeout(resolve, waitTime));
        }
    }

    async scrapeUrl(url) {
        // Wait for rate limit
        await this.delay();

        // Control concurrency
        while (this.activeRequests >= this.maxConcurrent) {
            await new Promise(resolve => setTimeout(resolve, 100));
        }

        this.activeRequests++;

        try {
            const response = await this.client.get(url);
            return {
                url,
                content: response.data,
                status: 'success',
                statusCode: response.status
            };
        } catch (error) {
            return {
                url,
                error: error.message,
                status: 'error',
                statusCode: error.response?.status
            };
        } finally {
            this.activeRequests--;
        }
    }

    async scrapeUrls(urls) {
        const promises = urls.map(url => this.scrapeUrl(url));
        return Promise.all(promises);
    }
}

// Usage
const scraper = new RateLimitedScraper(1500, 4000, 2);
const urls = ['https://example.com/page1', 'https://example.com/page2'];

scraper.scrapeUrls(urls)
    .then(results => console.log('Scraping completed:', results))
    .catch(error => console.error('Scraping failed:', error));

Advanced Rate Limiting Strategies

Exponential Backoff

Implement exponential backoff for handling rate limit responses:

class ExponentialBackoffScraper {
    private let session: Session
    private let baseDelay: TimeInterval = 1.0
    private let maxRetries: Int = 3

    init() {
        self.session = Session()
    }

    func scrapeWithBackoff(url: String, attempt: Int = 0, completion: @escaping (Result<String, Error>) -> Void) {
        if attempt >= maxRetries {
            completion(.failure(NSError(domain: "ScrapingError", code: 429, userInfo: [NSLocalizedDescriptionKey: "Max retries exceeded"])))
            return
        }

        session.request(url).responseString { [weak self] response in
            guard let self = self else { return }

            switch response.response?.statusCode {
            case 429?, 503?: // Rate limited or service unavailable
                let delay = self.baseDelay * pow(2.0, Double(attempt))
                DispatchQueue.global().asyncAfter(deadline: .now() + delay) {
                    self.scrapeWithBackoff(url: url, attempt: attempt + 1, completion: completion)
                }
            case (200...299)?:
                completion(response.result.mapError { $0 as Error })
            default:
                if let error = response.error {
                    completion(.failure(error))
                } else {
                    let statusCode = response.response?.statusCode ?? -1
                    completion(.failure(NSError(domain: "ScrapingError", code: statusCode, userInfo: [NSLocalizedDescriptionKey: "Unexpected response"])))
                }
            }
        }
    }
}

Token Bucket Algorithm

For more sophisticated rate limiting, implement a token bucket:

import time
import threading
import requests

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate
        self.last_refill = time.time()
        self.lock = threading.Lock()

    def consume(self, tokens=1):
        with self.lock:
            now = time.time()
            # Add tokens based on time elapsed
            tokens_to_add = (now - self.last_refill) * self.refill_rate
            self.tokens = min(self.capacity, self.tokens + tokens_to_add)
            self.last_refill = now

            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False

    def wait_for_token(self):
        while not self.consume():
            time.sleep(0.1)

class TokenBucketScraper:
    def __init__(self, requests_per_minute=30):
        # Allow burst of 10 requests, refill at specified rate
        self.bucket = TokenBucket(capacity=10, refill_rate=requests_per_minute/60)
        self.session = requests.Session()

    def scrape_url(self, url):
        self.bucket.wait_for_token()
        return self.session.get(url)
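
A quick usage sketch, reusing the classes defined above (the URLs are placeholders):

# Roughly 30 requests per minute, with short bursts of up to 10 allowed
scraper = TokenBucketScraper(requests_per_minute=30)

for url in ["https://example.com/page1", "https://example.com/page2"]:
    response = scraper.scrape_url(url)
    print(url, response.status_code)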

Best Practices for Rate Limiting

1. Respect robots.txt

Always check the website's robots.txt file and implement appropriate crawl delays:

import urllib.robotparser

def get_crawl_delay(base_url):
    """Return the site's Crawl-delay from robots.txt, defaulting to 1 second."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{base_url}/robots.txt")
    rp.read()

    crawl_delay = rp.crawl_delay("*")
    return crawl_delay if crawl_delay else 1.0
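
The returned value can then seed one of the scrapers above, for example the Python RateLimitedScraper (a brief sketch reusing those names):

# Use the site's declared crawl delay as the minimum gap between requests
delay = get_crawl_delay("https://example.com")
scraper = RateLimitedScraper(min_delay=delay, max_delay=delay * 2)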

2. Monitor Response Codes

Watch for rate limiting indicators and adjust accordingly:

func handleRateLimitResponse(_ response: HTTPURLResponse?) -> Bool {
    guard let statusCode = response?.statusCode else { return false }

    switch statusCode {
    case 429: // Too Many Requests
        return true
    case 503: // Service Unavailable
        return true
    case 200...299:
        return false
    default:
        return false
    }
}
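
Many servers also send a Retry-After header with 429 or 503 responses. A minimal Python sketch that honors it when retrying (get_with_retry_after is an illustrative helper, not a library function):

import time
import requests

def get_with_retry_after(session, url, max_retries=3):
    """Retry on 429/503, preferring the server-provided Retry-After delay."""
    response = None
    for attempt in range(max_retries):
        response = session.get(url, timeout=30)
        if response.status_code not in (429, 503):
            return response

        # Use Retry-After when it is a number of seconds; otherwise back off exponentially
        retry_after = response.headers.get("Retry-After")
        wait = float(retry_after) if retry_after and retry_after.isdigit() else 2 ** attempt
        time.sleep(wait)
    return response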

3. Use Distributed Rate Limiting

For large-scale scraping operations, consider implementing distributed rate limiting with Redis or similar systems to coordinate between multiple scraper instances.
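
A minimal sketch of this idea using a fixed-window counter in Redis, assuming a local Redis server and the redis-py package (the key name and limits are placeholders):

import time
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

def acquire_slot(key="scraper:requests", limit=30, window=60):
    """Return True if this instance may send a request in the current window."""
    count = r.incr(key)
    if count == 1:
        r.expire(key, window)  # start the window on the first request
    return count <= limit

def wait_for_slot():
    # Every scraper instance shares the same counter, so the limit is global
    while not acquire_slot():
        time.sleep(1)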

Testing Rate Limiting Effectiveness

Monitor your scraping success rate and adjust parameters:

# Monitor HTTP status codes
curl -w "Status: %{http_code}, Time: %{time_total}s\n" -o /dev/null -s "https://example.com"

# Test with different delay intervals
for delay in 1 2 3 5; do
    echo "Testing with ${delay}s delay"
    # Your scraping command here
    sleep $delay
done

Integration with WebScraping.AI

For production applications, consider using a managed scraping service that handles rate limiting automatically. Whether you are handling browser sessions in Puppeteer or performing other complex scraping tasks, a professional API can manage these concerns for you while providing better reliability and legal compliance.
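
As a rough illustration, a request can be delegated to the API from Python using the question endpoint shown in the Getting Started section below (a sketch; YOUR_API_KEY is a placeholder):

import requests

# The service applies rendering, proxy rotation and rate limiting server-side
response = requests.get(
    "https://api.webscraping.ai/ai/question",
    params={
        "url": "https://example.com",
        "question": "What is the main topic?",
        "api_key": "YOUR_API_KEY",  # placeholder
    },
    timeout=60,
)
print(response.text)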

Conclusion

Effective rate limiting is essential for successful web scraping. Start with conservative delays (2-5 seconds between requests) and adjust based on the target website's response. Always combine rate limiting with other best practices like rotating user agents, handling errors gracefully, and respecting website terms of service.

Remember that rate limiting is not just about avoiding blocks—it's about being a responsible internet citizen and maintaining sustainable scraping operations that don't negatively impact the websites you're accessing.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
