How do I implement concurrent web scraping in Swift?

Concurrent web scraping in Swift allows you to fetch data from multiple URLs simultaneously, significantly improving performance compared to sequential scraping. This guide covers modern Swift concurrency features including async/await, URLSession, and proper error handling for efficient web scraping operations.

Understanding Swift Concurrency for Web Scraping

Swift's modern concurrency model, introduced in Swift 5.5, provides powerful tools for concurrent web scraping:

  • async/await: Simplifies asynchronous code
  • URLSession: Native HTTP client with excellent concurrency support
  • TaskGroup: Manages multiple concurrent tasks
  • Actor: Ensures thread-safe data access
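As a minimal, network-free sketch of how these pieces fit together (the `square` function here is a stand-in for a real scraping call, not part of any API):

```swift
import Foundation

// Stand-in for a scraping call: pretend work, then return a value
func square(_ n: Int) async -> Int {
    try? await Task.sleep(nanoseconds: 10_000_000) // simulate network latency
    return n * n
}

// All four "requests" run concurrently inside the task group
let results = await withTaskGroup(of: Int.self) { group -> [Int] in
    for n in 1...4 {
        group.addTask { await square(n) }
    }
    var collected: [Int] = []
    for await value in group {
        collected.append(value)
    }
    return collected.sorted() // completion order is nondeterministic
}

print(results) // [1, 4, 9, 16]
```

The same shape (add tasks, then iterate the group) is what the real scraper below uses with URLs instead of integers.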

Basic Concurrent Web Scraping Setup

Setting Up URLSession for Concurrent Requests

import Foundation

class WebScraper {
    private let session: URLSession

    init() {
        let config = URLSessionConfiguration.default
        config.timeoutIntervalForRequest = 30
        config.timeoutIntervalForResource = 60
        config.httpMaximumConnectionsPerHost = 10
        self.session = URLSession(configuration: config)
    }

    func scrapeURL(_ url: URL) async throws -> String {
        let (data, response) = try await session.data(from: url)

        guard let httpResponse = response as? HTTPURLResponse,
              200...299 ~= httpResponse.statusCode else {
            throw ScrapingError.invalidResponse
        }

        return String(data: data, encoding: .utf8) ?? ""
    }
}

enum ScrapingError: Error {
    case invalidResponse
    case noData
    case parsingError
}

Implementing Concurrent Scraping with TaskGroup

import Foundation

extension WebScraper {
    func scrapeMultipleURLs(_ urls: [URL]) async -> [String] {
        await withTaskGroup(of: String?.self) { group in
            var results: [String] = []

            // Add tasks to the group
            for url in urls {
                group.addTask {
                    do {
                        return try await self.scrapeURL(url)
                    } catch {
                        print("Failed to scrape \(url): \(error)")
                        return nil
                    }
                }
            }

            // Collect results
            for await result in group {
                if let content = result {
                    results.append(content)
                }
            }

            return results
        }
    }
}

Advanced Concurrent Scraping Techniques

Rate Limiting with an Actor

To avoid overwhelming servers, implement rate limiting:

actor RateLimitedScraper {
    private let maxConcurrentRequests: Int
    private let session: URLSession
    private var activeTasks = 0

    init(maxConcurrentRequests: Int = 5) {
        self.maxConcurrentRequests = maxConcurrentRequests
        let config = URLSessionConfiguration.default
        config.httpMaximumConnectionsPerHost = maxConcurrentRequests
        self.session = URLSession(configuration: config)
    }

    func scrapeWithRateLimit(_ url: URL) async throws -> String {
        await waitForAvailableSlot()

        // Synchronous actor-isolated call; safe to invoke directly from a defer
        defer { releaseSlot() }

        let (data, response) = try await session.data(from: url)

        guard let httpResponse = response as? HTTPURLResponse,
              200...299 ~= httpResponse.statusCode else {
            throw ScrapingError.invalidResponse
        }

        return String(data: data, encoding: .utf8) ?? ""
    }

    private func waitForAvailableSlot() async {
        // Simple polling; a continuation-based queue would be more efficient,
        // but this keeps the example short
        while activeTasks >= maxConcurrentRequests {
            try? await Task.sleep(nanoseconds: 100_000_000) // 0.1 seconds
        }
        activeTasks += 1
    }

    private func releaseSlot() {
        activeTasks = max(0, activeTasks - 1)
    }
}

Batch Processing with Delays

For large-scale scraping, process URLs in batches with delays:

extension WebScraper {
    func scrapeBatches(urls: [URL], batchSize: Int = 10, delayBetweenBatches: TimeInterval = 1.0) async -> [String] {
        var allResults: [String] = []
        let batches = urls.chunked(into: batchSize)

        for (index, batch) in batches.enumerated() {
            let batchResults = await scrapeMultipleURLs(batch)
            allResults.append(contentsOf: batchResults)

            // Delay between batches (skip after the last one)
            if index < batches.count - 1 {
                try? await Task.sleep(nanoseconds: UInt64(delayBetweenBatches * 1_000_000_000))
            }
        }

        return allResults
    }
}

extension Array {
    func chunked(into size: Int) -> [[Element]] {
        return stride(from: 0, to: count, by: size).map {
            Array(self[$0..<Swift.min($0 + size, count)])
        }
    }
}
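A quick sanity check of the `chunked(into:)` helper (the extension is repeated here so the snippet stands alone):

```swift
extension Array {
    func chunked(into size: Int) -> [[Element]] {
        stride(from: 0, to: count, by: size).map {
            Array(self[$0..<Swift.min($0 + size, count)])
        }
    }
}

// Seven elements in batches of three: two full chunks plus a remainder
let items = Array(1...7)
print(items.chunked(into: 3)) // [[1, 2, 3], [4, 5, 6], [7]]
```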

Error Handling and Retry Logic

Implementing Robust Error Handling

extension WebScraper {
    func scrapeWithRetry(_ url: URL, maxRetries: Int = 3) async throws -> String {
        var lastError: Error?

        for attempt in 0..<maxRetries {
            do {
                return try await scrapeURL(url)
            } catch {
                lastError = error
                // Exponential backoff: 1s, 2s, 4s, ... (no sleep after the final attempt)
                if attempt < maxRetries - 1 {
                    let delay = pow(2.0, Double(attempt))
                    try? await Task.sleep(nanoseconds: UInt64(delay * 1_000_000_000))
                }
            }
        }

        throw lastError ?? ScrapingError.noData
    }

    func scrapeMultipleWithRetry(_ urls: [URL]) async -> [(url: URL, result: Result<String, Error>)] {
        await withTaskGroup(of: (URL, Result<String, Error>).self) { group in
            var results: [(URL, Result<String, Error>)] = []

            for url in urls {
                group.addTask {
                    do {
                        let content = try await self.scrapeWithRetry(url)
                        return (url, .success(content))
                    } catch {
                        return (url, .failure(error))
                    }
                }
            }

            for await result in group {
                results.append(result)
            }

            return results
        }
    }
}

Data Processing and Storage

Concurrent Data Processing

struct ScrapedData {
    let url: URL
    let title: String
    let content: String
    let timestamp: Date
}

actor DataProcessor {
    private var processedData: [ScrapedData] = []

    func processHTML(_ html: String, from url: URL) async -> ScrapedData? {
        // Simple HTML parsing (consider using a proper HTML parser)
        let title = extractTitle(from: html)
        let content = extractContent(from: html)

        let data = ScrapedData(
            url: url,
            title: title,
            content: content,
            timestamp: Date()
        )

        processedData.append(data)
        return data
    }

    func getAllProcessedData() async -> [ScrapedData] {
        return processedData
    }

    private func extractTitle(from html: String) -> String {
        // Basic title extraction
        let pattern = #"<title[^>]*>(.*?)</title>"#
        let regex = try? NSRegularExpression(pattern: pattern, options: .caseInsensitive)
        let nsString = html as NSString
        let range = NSRange(location: 0, length: nsString.length)

        if let match = regex?.firstMatch(in: html, options: [], range: range) {
            return nsString.substring(with: match.range(at: 1))
        }

        return "No title found"
    }

    private func extractContent(from html: String) -> String {
        // Remove HTML tags (basic implementation)
        return html.replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression)
    }
}
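The regex-based extraction above can be exercised offline. This standalone sketch applies the same title pattern and tag-stripping call to a sample page:

```swift
import Foundation

let html = "<html><head><title>Sample Page</title></head><body><p>Hi</p></body></html>"

// Same pattern as extractTitle(from:)
let pattern = #"<title[^>]*>(.*?)</title>"#
let regex = try! NSRegularExpression(pattern: pattern, options: .caseInsensitive)
let nsString = html as NSString
let range = NSRange(location: 0, length: nsString.length)

var title = "No title found"
if let match = regex.firstMatch(in: html, options: [], range: range) {
    title = nsString.substring(with: match.range(at: 1))
}
print(title) // Sample Page

// Same tag stripping as extractContent(from:)
let text = html.replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression)
print(text) // Sample PageHi
```

As the second line of output shows, naive tag stripping concatenates text nodes with no separator; for production use, a proper HTML parser is the better choice.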

Complete Example: Concurrent Web Scraper

Here's a complete example that demonstrates concurrent web scraping with proper error handling:

import Foundation

class ConcurrentWebScraper {
    private let scraper = WebScraper()
    private let processor = DataProcessor()
    private let rateLimitedScraper = RateLimitedScraper()

    func scrapeAndProcess(urls: [URL]) async -> [ScrapedData] {
        print("Starting concurrent scraping of \(urls.count) URLs...")

        let results = await scraper.scrapeMultipleWithRetry(urls)
        var processedData: [ScrapedData] = []

        await withTaskGroup(of: ScrapedData?.self) { group in
            for (url, result) in results {
                group.addTask {
                    switch result {
                    case .success(let html):
                        return await self.processor.processHTML(html, from: url)
                    case .failure(let error):
                        print("Failed to scrape \(url): \(error)")
                        return nil
                    }
                }
            }

            for await data in group {
                if let data = data {
                    processedData.append(data)
                }
            }
        }

        print("Completed scraping. Processed \(processedData.count) pages successfully.")
        return processedData
    }
}

// Usage example
func runScraper() async {
    let urls = [
        URL(string: "https://example.com")!,
        URL(string: "https://httpbin.org/html")!,
        URL(string: "https://httpbin.org/json")!
    ]

    let scraper = ConcurrentWebScraper()
    let results = await scraper.scrapeAndProcess(urls: urls)

    for data in results {
        print("Title: \(data.title)")
        print("URL: \(data.url)")
        print("Content length: \(data.content.count) characters")
        print("---")
    }
}

// Run the scraper
Task {
    await runScraper()
}

Performance Optimization Tips

1. Connection Pool Configuration

let config = URLSessionConfiguration.default
config.httpMaximumConnectionsPerHost = 6
config.requestCachePolicy = .useProtocolCachePolicy
config.urlCache = URLCache(memoryCapacity: 50 * 1024 * 1024, diskCapacity: 100 * 1024 * 1024)

2. Memory Management

// Capture self weakly in task closures to avoid retain cycles
group.addTask { [weak self] in
    guard let self else { return nil }
    return try? await self.scrapeURL(url)
}

3. Resource Monitoring

actor ResourceMonitor {
    private var activeRequests = 0
    private var totalRequests = 0

    func trackRequest() {
        activeRequests += 1
        totalRequests += 1
    }

    func completeRequest() {
        activeRequests = max(0, activeRequests - 1)
    }

    func getStats() -> (active: Int, total: Int) {
        return (activeRequests, totalRequests)
    }
}
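A sketch of how the monitor above might wrap each request (the actor is repeated so the snippet stands alone, and the scraping call is simulated with a sleep):

```swift
actor ResourceMonitor {
    private var activeRequests = 0
    private var totalRequests = 0

    func trackRequest() {
        activeRequests += 1
        totalRequests += 1
    }

    func completeRequest() {
        activeRequests = max(0, activeRequests - 1)
    }

    func getStats() -> (active: Int, total: Int) {
        (activeRequests, totalRequests)
    }
}

let monitor = ResourceMonitor()

await withTaskGroup(of: Void.self) { group in
    for _ in 1...3 {
        group.addTask {
            await monitor.trackRequest()
            try? await Task.sleep(nanoseconds: 10_000_000) // simulated request
            await monitor.completeRequest()
        }
    }
}

let stats = await monitor.getStats()
print(stats) // all requests completed: active 0, total 3
```

Because the counters live inside an actor, the increments and decrements are serialized even though the three tasks run concurrently.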

Best Practices for Concurrent Web Scraping

  1. Respect Rate Limits: Always implement proper rate limiting to avoid overwhelming target servers
  2. Handle Errors Gracefully: Use proper error handling and retry mechanisms
  3. Monitor Memory Usage: Be aware of memory consumption when processing large amounts of data
  4. Use Appropriate Timeouts: Set reasonable timeout values for network requests
  5. Implement Logging: Add comprehensive logging for debugging and monitoring
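For point 5, a lightweight timing wrapper is often enough to start with. This is a sketch; `timed` is an illustrative name, not a standard API, and the request is simulated:

```swift
import Foundation

// Wrap any async operation and log its label and duration
func timed<T>(_ label: String, _ op: () async throws -> T) async rethrows -> T {
    let start = Date()
    defer {
        let elapsed = Date().timeIntervalSince(start)
        print("[\(label)] finished in \(String(format: "%.3f", elapsed))s")
    }
    return try await op()
}

// Usage with a simulated request in place of a real scrapeURL call
let value = await timed("fetch example.com") { () async -> Int in
    try? await Task.sleep(nanoseconds: 50_000_000)
    return 42
}
print(value) // 42
```

Because `timed` is `rethrows`, it can wrap both throwing calls like `scrapeURL(_:)` (with `try await`) and non-throwing ones, logging in either case via `defer`.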

Similar to how you might handle multiple pages in parallel with Puppeteer, Swift's concurrent model provides excellent performance for web scraping tasks while maintaining clean, readable code.

Conclusion

Concurrent web scraping in Swift leverages the language's modern async/await syntax and actor model to create efficient, safe, and maintainable scraping solutions. By combining URLSession's robust networking capabilities with Swift's concurrency features, you can build scalable web scraping applications that handle multiple requests efficiently while respecting server limitations and handling errors gracefully.

The key to successful concurrent web scraping lies in balancing performance with responsibility – scrape efficiently but respectfully, implement proper error handling, and always monitor your resource usage to ensure optimal performance.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
