How do I make HTTP requests in Swift for web scraping?

Making HTTP requests in Swift for web scraping is primarily done with URLSession, Apple's networking API in the Foundation framework. Swift's modern async/await syntax makes it straightforward to write clean, readable code for web scraping tasks.

Basic HTTP Request with URLSession

The foundation of web scraping in Swift is the URLSession class, which provides a comprehensive API for making HTTP requests:

import Foundation

func fetchWebPage(from urlString: String) async throws -> String {
    guard let url = URL(string: urlString) else {
        throw URLError(.badURL)
    }

    let (data, response) = try await URLSession.shared.data(from: url)

    guard let httpResponse = response as? HTTPURLResponse,
          httpResponse.statusCode == 200 else {
        throw URLError(.badServerResponse)
    }

    return String(data: data, encoding: .utf8) ?? ""
}

// Usage
Task {
    do {
        let html = try await fetchWebPage(from: "https://example.com")
        print(html)
    } catch {
        print("Error: \(error)")
    }
}

Advanced URLSession Configuration

For more sophisticated web scraping scenarios, you'll need to configure URLSession with custom settings:

import Foundation

class WebScraper {
    private let session: URLSession

    init() {
        let config = URLSessionConfiguration.default
        config.timeoutIntervalForRequest = 30
        config.timeoutIntervalForResource = 60
        config.httpMaximumConnectionsPerHost = 5
        config.requestCachePolicy = .reloadIgnoringLocalCacheData

        self.session = URLSession(configuration: config)
    }

    func scrapeURL(_ urlString: String, headers: [String: String] = [:]) async throws -> (Data, HTTPURLResponse) {
        guard let url = URL(string: urlString) else {
            throw URLError(.badURL)
        }

        var request = URLRequest(url: url)
        request.httpMethod = "GET"

        // Add custom headers
        for (key, value) in headers {
            request.setValue(value, forHTTPHeaderField: key)
        }

        // Add user agent
        request.setValue("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36", 
                        forHTTPHeaderField: "User-Agent")

        let (data, response) = try await session.data(for: request)

        guard let httpResponse = response as? HTTPURLResponse else {
            throw URLError(.badServerResponse)
        }

        return (data, httpResponse)
    }
}

Handling Different Response Types

Web scraping often requires handling various content types and response formats:

extension WebScraper {
    func scrapeJSON<T: Decodable>(from urlString: String, as type: T.Type) async throws -> T {
        let (data, response) = try await scrapeURL(urlString)

        guard response.statusCode == 200 else {
            throw URLError(.badServerResponse)
        }

        let decoder = JSONDecoder()
        return try decoder.decode(type, from: data)
    }

    func scrapeHTML(from urlString: String) async throws -> String {
        let (data, response) = try await scrapeURL(urlString)

        guard response.statusCode == 200 else {
            throw URLError(.badServerResponse)
        }

        guard let html = String(data: data, encoding: .utf8) else {
            throw URLError(.cannotDecodeContentData)
        }

        return html
    }

    func downloadImage(from urlString: String) async throws -> Data {
        let (data, response) = try await scrapeURL(urlString)

        guard response.statusCode == 200,
              let contentType = response.value(forHTTPHeaderField: "Content-Type"),
              contentType.hasPrefix("image/") else {
            throw URLError(.badServerResponse)
        }

        return data
    }
}

Error Handling and Retry Logic

Robust web scraping requires comprehensive error handling and retry mechanisms:

extension WebScraper {
    func scrapeWithRetry(
        _ urlString: String,
        maxRetries: Int = 3,
        delay: TimeInterval = 1.0
    ) async throws -> String {
        var lastError: Error?

        for attempt in 0...maxRetries {
            do {
                let html = try await scrapeHTML(from: urlString)
                return html
            } catch {
                lastError = error

                if attempt < maxRetries {
                    print("Attempt \(attempt + 1) failed, retrying in \(delay) seconds...")
                    try await Task.sleep(nanoseconds: UInt64(delay * 1_000_000_000))
                }
            }
        }

        throw lastError ?? URLError(.unknown)
    }
}
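The fixed delay above retries at a constant pace. Production scrapers commonly use exponential backoff with jitter instead, so repeated failures back off quickly and many clients don't retry in lockstep. The helper below sketches that idea; `backoffDelay` is our own name, not a Foundation API:

```swift
import Foundation

/// Exponential backoff with "full jitter": the nominal delay grows as
/// base * 2^attempt, is capped at maxDelay, and a random fraction of it
/// is used so concurrent clients don't all retry at the same instant.
func backoffDelay(attempt: Int, base: TimeInterval = 1.0, maxDelay: TimeInterval = 30.0) -> TimeInterval {
    let exponential = base * pow(2.0, Double(attempt))
    return Double.random(in: 0...min(exponential, maxDelay))
}
```

In `scrapeWithRetry`, the fixed sleep could then become `try await Task.sleep(nanoseconds: UInt64(backoffDelay(attempt: attempt) * 1_000_000_000))`.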

Handling Cookies and Sessions

Many websites require session management and cookie handling:

class SessionAwareScraper {
    private let session: URLSession
    private let cookieStorage: HTTPCookieStorage

    init() {
        // Use an ephemeral configuration, which comes with its own private,
        // in-memory cookie store, rather than constructing HTTPCookieStorage
        // directly (Apple's documentation steers you toward the shared
        // instance or a configuration-provided store)
        let config = URLSessionConfiguration.ephemeral
        config.httpCookieAcceptPolicy = .always

        cookieStorage = config.httpCookieStorage ?? .shared
        session = URLSession(configuration: config)
    }

    func login(username: String, password: String, loginURL: String) async throws {
        guard let url = URL(string: loginURL) else {
            throw URLError(.badURL)
        }

        var request = URLRequest(url: url)
        request.httpMethod = "POST"
        request.setValue("application/x-www-form-urlencoded", forHTTPHeaderField: "Content-Type")

        // Percent-encode the form fields; raw string interpolation breaks
        // when credentials contain characters like "&" or "="
        var form = URLComponents()
        form.queryItems = [
            URLQueryItem(name: "username", value: username),
            URLQueryItem(name: "password", value: password)
        ]
        request.httpBody = form.percentEncodedQuery?.data(using: .utf8)

        let (_, response) = try await session.data(for: request)

        guard let httpResponse = response as? HTTPURLResponse,
              httpResponse.statusCode == 200 else {
            throw URLError(.userAuthenticationRequired)
        }
    }

    func scrapeProtectedPage(_ urlString: String) async throws -> String {
        guard let url = URL(string: urlString) else {
            throw URLError(.badURL)
        }

        // The session automatically sends the cookies captured during login()
        let (data, _) = try await session.data(from: url)
        return String(data: data, encoding: .utf8) ?? ""
    }
}
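Some sites hand you a session token somewhere other than a `Set-Cookie` header (a JSON login response, for instance). In that case you can construct the cookie yourself and add it to the storage with `cookieStorage.setCookie(_:)`. A small sketch; `makeSessionCookie` is a hypothetical helper name:

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // HTTPCookie lives here on Linux
#endif

/// Build a session cookie by hand from a token obtained out-of-band.
/// HTTPCookie(properties:) returns nil if required keys are missing,
/// so the result is optional.
func makeSessionCookie(name: String, value: String, domain: String) -> HTTPCookie? {
    HTTPCookie(properties: [
        .name: name,
        .value: value,
        .domain: domain,
        .path: "/"
    ])
}
```

After `cookieStorage.setCookie(cookie)`, the session attaches it to subsequent requests for that domain.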

Parsing HTML Content

For parsing HTML content, you can use SwiftSoup, a Swift port of the popular Java HTML parser:

import SwiftSoup

extension WebScraper {
    func extractLinks(from html: String) throws -> [String] {
        let doc = try SwiftSoup.parse(html)
        let links = try doc.select("a[href]")

        return try links.compactMap { element in
            try element.attr("href")
        }
    }

    func extractText(from html: String, selector: String) throws -> [String] {
        let doc = try SwiftSoup.parse(html)
        let elements = try doc.select(selector)

        return try elements.map { element in
            try element.text()
        }
    }

    func extractImages(from html: String) throws -> [String] {
        let doc = try SwiftSoup.parse(html)
        let images = try doc.select("img[src]")

        return try images.compactMap { element in
            try element.attr("src")
        }
    }
}
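Note that extracted `href` and `src` values are often relative (`/about`, `img/logo.png`). Before requesting them, resolve each one against the page's own URL; Foundation's `URL(string:relativeTo:)` applies the standard RFC 3986 resolution rules. A minimal sketch (`resolveLinks` is our own helper name):

```swift
import Foundation

/// Resolve possibly-relative hrefs against the URL of the page they came
/// from; entries that fail to parse as URLs are dropped.
func resolveLinks(_ hrefs: [String], against baseURL: URL) -> [String] {
    hrefs.compactMap { href in
        URL(string: href, relativeTo: baseURL)?.absoluteString
    }
}
```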

Concurrent Scraping

For efficient web scraping, you can implement concurrent requests:

actor ScrapingCoordinator {
    private var activeRequests = 0
    private let maxConcurrentRequests: Int

    init(maxConcurrentRequests: Int = 5) {
        self.maxConcurrentRequests = maxConcurrentRequests
    }

    func canMakeRequest() -> Bool {
        return activeRequests < maxConcurrentRequests
    }

    func requestStarted() {
        activeRequests += 1
    }

    func requestCompleted() {
        activeRequests = max(0, activeRequests - 1)
    }
}

extension WebScraper {
    func scrapeMultipleURLs(_ urls: [String]) async throws -> [String] {
        let coordinator = ScrapingCoordinator()

        // Tag each task with its index so results come back in input order
        // rather than completion order
        return try await withThrowingTaskGroup(of: (Int, String).self) { group in
            var results = [String?](repeating: nil, count: urls.count)

            for (index, url) in urls.enumerated() {
                // Wait for an available slot. The check and the increment are
                // separate actor calls, so the cap is a soft limit: under
                // contention it can briefly be exceeded by a request or two.
                while !(await coordinator.canMakeRequest()) {
                    try await Task.sleep(nanoseconds: 100_000_000) // 0.1 seconds
                }

                await coordinator.requestStarted()

                group.addTask {
                    defer {
                        Task {
                            await coordinator.requestCompleted()
                        }
                    }
                    let html = try await self.scrapeHTML(from: url)
                    return (index, html)
                }
            }

            for try await (index, html) in group {
                results[index] = html
            }

            return results.compactMap { $0 }
        }
    }
}

Complete Example: Web Scraper Class

Here's a comprehensive example that combines all the concepts:

import Foundation
import SwiftSoup

class ComprehensiveWebScraper {
    private let session: URLSession
    private let coordinator: ScrapingCoordinator

    init(maxConcurrentRequests: Int = 5) {
        let config = URLSessionConfiguration.default
        config.timeoutIntervalForRequest = 30
        config.httpCookieAcceptPolicy = .always

        session = URLSession(configuration: config)
        coordinator = ScrapingCoordinator(maxConcurrentRequests: maxConcurrentRequests)
    }

    func scrapeWebsite(
        url: String,
        headers: [String: String] = [:],
        maxRetries: Int = 3
    ) async throws -> ScrapingResult {
        let html = try await scrapeWithRetry(url, headers: headers, maxRetries: maxRetries)

        let doc = try SwiftSoup.parse(html)
        let title = try doc.title()
        let links = try extractLinks(from: html)
        let images = try extractImages(from: html)

        return ScrapingResult(
            url: url,
            title: title,
            html: html,
            links: links,
            images: images
        )
    }

    private func scrapeWithRetry(
        _ urlString: String,
        headers: [String: String],
        maxRetries: Int
    ) async throws -> String {
        var lastError: Error?

        for attempt in 0...maxRetries {
            do {
                while await !coordinator.canMakeRequest() {
                    try await Task.sleep(nanoseconds: 100_000_000)
                }

                await coordinator.requestStarted()
                defer {
                    Task {
                        await coordinator.requestCompleted()
                    }
                }

                return try await performRequest(urlString, headers: headers)
            } catch {
                lastError = error
                if attempt < maxRetries {
                    try await Task.sleep(nanoseconds: 1_000_000_000) // 1 second
                }
            }
        }

        throw lastError ?? URLError(.unknown)
    }

    private func performRequest(_ urlString: String, headers: [String: String]) async throws -> String {
        guard let url = URL(string: urlString) else {
            throw URLError(.badURL)
        }

        var request = URLRequest(url: url)
        request.httpMethod = "GET"

        for (key, value) in headers {
            request.setValue(value, forHTTPHeaderField: key)
        }

        request.setValue("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36", 
                        forHTTPHeaderField: "User-Agent")

        let (data, response) = try await session.data(for: request)

        guard let httpResponse = response as? HTTPURLResponse,
              httpResponse.statusCode == 200 else {
            throw URLError(.badServerResponse)
        }

        return String(data: data, encoding: .utf8) ?? ""
    }

    private func extractLinks(from html: String) throws -> [String] {
        let doc = try SwiftSoup.parse(html)
        let links = try doc.select("a[href]")
        return try links.compactMap { try $0.attr("href") }
    }

    private func extractImages(from html: String) throws -> [String] {
        let doc = try SwiftSoup.parse(html)
        let images = try doc.select("img[src]")
        return try images.compactMap { try $0.attr("src") }
    }
}

struct ScrapingResult {
    let url: String
    let title: String
    let html: String
    let links: [String]
    let images: [String]
}

Best Practices for Swift Web Scraping

  1. Respect robots.txt: Always check the website's robots.txt file before scraping
  2. Implement rate limiting: Use delays between requests to avoid overwhelming servers
  3. Handle errors gracefully: Implement comprehensive error handling and retry logic
  4. Use appropriate headers: Set realistic User-Agent strings and other headers
  5. Manage memory: For large-scale scraping, ensure proper memory management
  6. Consider legal implications: Always comply with website terms of service and legal requirements
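The robots.txt advice above can be sketched as a tiny check. This is deliberately simplified, and only handles `User-agent: *` groups and plain `Disallow:` path prefixes; a real implementation should follow the full robots exclusion standard (wildcards, `Allow:` rules, most-specific-match precedence):

```swift
import Foundation

/// Simplified robots.txt check: returns false if any `Disallow:` rule in a
/// `User-agent: *` group is a prefix of the given path.
func isPathAllowed(robotsTxt: String, path: String) -> Bool {
    var inWildcardGroup = false
    for rawLine in robotsTxt.split(separator: "\n") {
        let line = rawLine.trimmingCharacters(in: .whitespaces)
        if line.lowercased().hasPrefix("user-agent:") {
            let agent = line.dropFirst("user-agent:".count)
                .trimmingCharacters(in: .whitespaces)
            inWildcardGroup = (agent == "*")
        } else if inWildcardGroup, line.lowercased().hasPrefix("disallow:") {
            let rule = line.dropFirst("disallow:".count)
                .trimmingCharacters(in: .whitespaces)
            if !rule.isEmpty, path.hasPrefix(rule) {
                return false
            }
        }
    }
    return true
}
```

You would fetch `https://example.com/robots.txt` once, cache the text, and consult this check before each request.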
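The rate-limiting advice can likewise be sketched as a small actor that enforces a minimum interval between requests. `RateLimiter` and `requiredDelay` are illustrative names, not standard library types; the pure helper does the arithmetic and the actor serializes access to the timestamp:

```swift
import Foundation

/// How long to wait, given the time elapsed since the last request.
func requiredDelay(sinceLast elapsed: TimeInterval, minimumInterval: TimeInterval) -> TimeInterval {
    max(0, minimumInterval - elapsed)
}

/// Call waitIfNeeded() before each request to space requests out.
actor RateLimiter {
    private let minimumInterval: TimeInterval
    private var lastRequestTime: Date?

    init(requestsPerSecond: Double) {
        self.minimumInterval = 1.0 / requestsPerSecond
    }

    func waitIfNeeded() async throws {
        if let last = lastRequestTime {
            let delay = requiredDelay(sinceLast: Date().timeIntervalSince(last),
                                      minimumInterval: minimumInterval)
            if delay > 0 {
                try await Task.sleep(nanoseconds: UInt64(delay * 1_000_000_000))
            }
        }
        lastRequestTime = Date()
    }
}
```

A scraper would hold one `RateLimiter` per target host and `try await limiter.waitIfNeeded()` before each request.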

Alternative Approaches

While URLSession is the standard approach, JavaScript-heavy websites render their content in the browser, so a plain HTTP request may return little more than an empty shell. For those sites you'll need a tool that can execute JavaScript. Even when you stay with plain HTTP, monitoring network requests in Puppeteer can reveal how a complex site loads its data (often from a JSON API you can call directly), which can inform your Swift scraping strategy.

For handling dynamic content and complex interactions, you might also want to study how to handle AJAX requests using Puppeteer to understand the patterns of modern web applications.

Swift's URLSession provides a powerful foundation for web scraping, offering excellent performance and native integration with iOS and macOS applications. By combining proper error handling, concurrent execution, and HTML parsing libraries like SwiftSoup, you can build robust and efficient web scrapers that handle most scraping scenarios effectively.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl -G "https://api.webscraping.ai/ai/question" \
  --data-urlencode "url=https://example.com" \
  --data-urlencode "question=What is the main topic?" \
  --data-urlencode "api_key=YOUR_API_KEY"

Extract structured data:

curl -G "https://api.webscraping.ai/ai/fields" \
  --data-urlencode "url=https://example.com" \
  --data-urlencode "fields[title]=Page title" \
  --data-urlencode "fields[price]=Product price" \
  --data-urlencode "api_key=YOUR_API_KEY"
