Table of contents

How do I manage rate limiting and delays in Swift web scraping?

Managing rate limiting and implementing delays is crucial for responsible web scraping in Swift. Proper rate limiting helps you avoid being blocked by websites, reduces server load, and ensures your scraping operations remain sustainable over time. This guide covers various strategies and implementation techniques for managing request rates in Swift web scraping applications.

Why Rate Limiting Matters

Rate limiting is essential for several reasons:

  • Avoiding IP bans: Websites often block IP addresses that make too many requests too quickly
  • Server respect: Reduces load on target servers and demonstrates ethical scraping practices
  • Legal compliance: Many websites' terms of service require reasonable request rates
  • Resource management: Prevents overwhelming your own application's network resources

Basic Delay Implementation

The simplest approach to rate limiting is implementing delays between requests using Swift's built-in timing mechanisms:

import Foundation

class WebScraper {
    private let session = URLSession.shared
    private let delayInterval: TimeInterval

    init(delayInterval: TimeInterval = 1.0) {
        self.delayInterval = delayInterval
    }

    func scrapeURLs(_ urls: [URL]) async throws {
        for (index, url) in urls.enumerated() {
            // Add delay before each request (except the first one)
            if index > 0 {
                try await Task.sleep(nanoseconds: UInt64(delayInterval * 1_000_000_000))
            }

            let (data, response) = try await session.data(from: url)

            // Process the response
            await processResponse(data: data, response: response)
        }
    }

    private func processResponse(data: Data, response: URLResponse) async {
        // Your scraping logic here
        print("Scraped data from: \(response.url?.absoluteString ?? "unknown")")
    }
}

Advanced Rate Limiting with Token Bucket Algorithm

For more sophisticated rate limiting, implement a token bucket algorithm that allows bursts while maintaining an average rate:

import Foundation

actor TokenBucket {
    private let capacity: Int
    private let refillRate: Double // tokens per second
    private var tokens: Double
    private var lastRefill: Date

    init(capacity: Int, refillRate: Double) {
        self.capacity = capacity
        self.refillRate = refillRate
        self.tokens = Double(capacity)
        self.lastRefill = Date()
    }

    func consume(tokens: Int = 1) async -> Bool {
        await refill()

        if self.tokens >= Double(tokens) {
            self.tokens -= Double(tokens)
            return true
        }
        return false
    }

    private func refill() {
        let now = Date()
        let elapsed = now.timeIntervalSince(lastRefill)
        let tokensToAdd = elapsed * refillRate

        tokens = min(Double(capacity), tokens + tokensToAdd)
        lastRefill = now
    }
}

class RateLimitedScraper {
    private let tokenBucket: TokenBucket
    private let session = URLSession.shared

    init(requestsPerSecond: Double, burstCapacity: Int = 10) {
        self.tokenBucket = TokenBucket(
            capacity: burstCapacity,
            refillRate: requestsPerSecond
        )
    }

    func scrapeURL(_ url: URL) async throws -> Data {
        // Wait for a token to become available
        while await !tokenBucket.consume() {
            try await Task.sleep(nanoseconds: 100_000_000) // 0.1 second
        }

        let (data, _) = try await session.data(from: url)
        return data
    }
}

Implementing Exponential Backoff

When dealing with temporary failures or rate limit responses, exponential backoff helps you retry requests with increasing delays:

import Foundation

class BackoffScraper {
    private let session = URLSession.shared
    private let maxRetries: Int
    private let baseDelay: TimeInterval

    init(maxRetries: Int = 5, baseDelay: TimeInterval = 1.0) {
        self.maxRetries = maxRetries
        self.baseDelay = baseDelay
    }

    func scrapeWithBackoff(_ url: URL) async throws -> Data {
        var lastError: Error?

        for attempt in 0..<maxRetries {
            do {
                let (data, response) = try await session.data(from: url)

                if let httpResponse = response as? HTTPURLResponse {
                    switch httpResponse.statusCode {
                    case 200...299:
                        return data
                    case 429, 503:
                        // Rate limited or service unavailable
                        throw ScrapingError.rateLimited
                    case 500...599:
                        // Server error, worth retrying
                        throw ScrapingError.serverError
                    default:
                        throw ScrapingError.httpError(httpResponse.statusCode)
                    }
                }

                return data

            } catch {
                lastError = error

                if attempt < maxRetries - 1 {
                    let delay = baseDelay * pow(2.0, Double(attempt))
                    let jitter = Double.random(in: 0...0.1) * delay

                    try await Task.sleep(nanoseconds: UInt64((delay + jitter) * 1_000_000_000))
                }
            }
        }

        throw lastError ?? ScrapingError.maxRetriesExceeded
    }
}

enum ScrapingError: Error {
    case rateLimited
    case serverError
    case httpError(Int)
    case maxRetriesExceeded
}

Respecting robots.txt and Crawl-Delay

Responsible web scraping includes checking and respecting the robots.txt file and any specified crawl delays:

import Foundation

struct RobotsParser {
    let crawlDelay: TimeInterval?
    let allowedPaths: [String]
    let disallowedPaths: [String]

    static func parse(from url: URL) async throws -> RobotsParser? {
        let robotsURL = url.appendingPathComponent("robots.txt")

        do {
            let (data, _) = try await URLSession.shared.data(from: robotsURL)
            let content = String(data: data, encoding: .utf8) ?? ""
            return parseRobotsContent(content)
        } catch {
            // If robots.txt doesn't exist, assume scraping is allowed
            return nil
        }
    }

    private static func parseRobotsContent(_ content: String) -> RobotsParser {
        var crawlDelay: TimeInterval?
        var allowedPaths: [String] = []
        var disallowedPaths: [String] = []
        var isRelevantSection = false

        for line in content.components(separatedBy: .newlines) {
            let trimmed = line.trimmingCharacters(in: .whitespaces).lowercased()

            if trimmed.hasPrefix("user-agent:") {
                let agent = trimmed.replacingOccurrences(of: "user-agent:", with: "").trimmingCharacters(in: .whitespaces)
                isRelevantSection = agent == "*" || agent.contains("your-bot-name")
            } else if isRelevantSection {
                if trimmed.hasPrefix("crawl-delay:") {
                    let delayString = trimmed.replacingOccurrences(of: "crawl-delay:", with: "").trimmingCharacters(in: .whitespaces)
                    crawlDelay = TimeInterval(delayString)
                } else if trimmed.hasPrefix("allow:") {
                    let path = trimmed.replacingOccurrences(of: "allow:", with: "").trimmingCharacters(in: .whitespaces)
                    allowedPaths.append(path)
                } else if trimmed.hasPrefix("disallow:") {
                    let path = trimmed.replacingOccurrences(of: "disallow:", with: "").trimmingCharacters(in: .whitespaces)
                    disallowedPaths.append(path)
                }
            }
        }

        return RobotsParser(crawlDelay: crawlDelay, allowedPaths: allowedPaths, disallowedPaths: disallowedPaths)
    }
}

class RespectfulScraper {
    private var robotsCache: [String: RobotsParser?] = [:]
    private let session = URLSession.shared

    func scrapeRespectfully(_ url: URL) async throws -> Data {
        let host = url.host ?? ""

        // Check robots.txt if not cached
        if robotsCache[host] == nil {
            robotsCache[host] = try await RobotsParser.parse(from: url)
        }

        if let robots = robotsCache[host] {
            // Check if path is allowed
            let path = url.path
            for disallowed in robots?.disallowedPaths ?? [] {
                if path.hasPrefix(disallowed) {
                    throw ScrapingError.robotsDisallowed
                }
            }

            // Apply crawl delay if specified
            if let delay = robots?.crawlDelay {
                try await Task.sleep(nanoseconds: UInt64(delay * 1_000_000_000))
            }
        }

        let (data, _) = try await session.data(from: url)
        return data
    }
}

extension ScrapingError {
    static let robotsDisallowed = ScrapingError.httpError(-1)
}

Concurrent Scraping with Rate Limiting

When scraping multiple URLs concurrently, you need to coordinate rate limiting across all operations:

import Foundation

actor ConcurrentRateLimiter {
    private let semaphore: AsyncSemaphore
    private let minimumDelay: TimeInterval
    private var lastRequestTime: Date = Date.distantPast

    init(maxConcurrentRequests: Int, minimumDelay: TimeInterval) {
        self.semaphore = AsyncSemaphore(value: maxConcurrentRequests)
        self.minimumDelay = minimumDelay
    }

    func performRequest<T>(_ operation: @Sendable () async throws -> T) async throws -> T {
        await semaphore.wait()
        defer { semaphore.signal() }

        // Ensure minimum delay between requests
        let now = Date()
        let timeSinceLastRequest = now.timeIntervalSince(lastRequestTime)
        if timeSinceLastRequest < minimumDelay {
            let sleepTime = minimumDelay - timeSinceLastRequest
            try await Task.sleep(nanoseconds: UInt64(sleepTime * 1_000_000_000))
        }

        lastRequestTime = Date()
        return try await operation()
    }
}

// AsyncSemaphore implementation
actor AsyncSemaphore {
    private var value: Int
    private var waiters: [CheckedContinuation<Void, Never>] = []

    init(value: Int) {
        self.value = value
    }

    func wait() async {
        if value > 0 {
            value -= 1
        } else {
            await withCheckedContinuation { continuation in
                waiters.append(continuation)
            }
        }
    }

    func signal() {
        if waiters.isEmpty {
            value += 1
        } else {
            let waiter = waiters.removeFirst()
            waiter.resume()
        }
    }
}

class ConcurrentScraper {
    private let rateLimiter: ConcurrentRateLimiter
    private let session = URLSession.shared

    init(maxConcurrentRequests: Int = 3, delayBetweenRequests: TimeInterval = 1.0) {
        self.rateLimiter = ConcurrentRateLimiter(
            maxConcurrentRequests: maxConcurrentRequests,
            minimumDelay: delayBetweenRequests
        )
    }

    func scrapeURLsConcurrently(_ urls: [URL]) async throws -> [Data] {
        return try await withThrowingTaskGroup(of: (Int, Data).self) { group in
            for (index, url) in urls.enumerated() {
                group.addTask { [weak self] in
                    guard let self = self else { throw ScrapingError.maxRetriesExceeded }

                    let data = try await self.rateLimiter.performRequest {
                        let (data, _) = try await self.session.data(from: url)
                        return data
                    }

                    return (index, data)
                }
            }

            var results: [(Int, Data)] = []
            for try await result in group {
                results.append(result)
            }

            // Sort results by original index to maintain order
            results.sort { $0.0 < $1.0 }
            return results.map { $0.1 }
        }
    }
}

Monitoring and Adaptive Rate Limiting

Implement monitoring to automatically adjust your rate limits based on server responses:

import Foundation

class AdaptiveRateLimiter {
    private var currentDelay: TimeInterval
    private let minDelay: TimeInterval
    private let maxDelay: TimeInterval
    private var consecutiveSuccesses = 0
    private var consecutiveFailures = 0

    init(initialDelay: TimeInterval = 1.0, minDelay: TimeInterval = 0.5, maxDelay: TimeInterval = 10.0) {
        self.currentDelay = initialDelay
        self.minDelay = minDelay
        self.maxDelay = maxDelay
    }

    func adjustForSuccess() {
        consecutiveSuccesses += 1
        consecutiveFailures = 0

        // Gradually decrease delay after multiple successes
        if consecutiveSuccesses >= 5 {
            currentDelay = max(minDelay, currentDelay * 0.9)
            consecutiveSuccesses = 0
        }
    }

    func adjustForFailure() {
        consecutiveFailures += 1
        consecutiveSuccesses = 0

        // Increase delay immediately on failure
        currentDelay = min(maxDelay, currentDelay * 1.5)
    }

    func getCurrentDelay() -> TimeInterval {
        return currentDelay
    }
}

Best Practices Summary

  1. Start Conservative: Begin with longer delays and reduce them gradually based on server responses
  2. Monitor Response Codes: Watch for 429 (Too Many Requests) and 503 (Service Unavailable) status codes
  3. Implement Jitter: Add random variation to delays to avoid thundering herd problems
  4. Respect robots.txt: Always check and follow robots.txt guidelines
  5. Use Appropriate User-Agents: Identify your scraper appropriately and consider contacting website owners
  6. Log Rate Limit Events: Keep track of when you hit rate limits to optimize your approach

When implementing rate limiting in Swift web scraping, similar principles apply as with handling timeouts in browser automation - both require careful timing management and error handling strategies.

By implementing these rate limiting strategies, you'll create more robust and respectful web scraping applications that are less likely to be blocked and more likely to maintain long-term access to your target websites. Remember that good rate limiting is not just about avoiding blocks - it's about being a responsible citizen of the web.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"

Try in request builder

Related Questions

Get Started Now

WebScraping.AI provides rotating proxies, Chromium rendering and built-in HTML parser for web scraping
Icon