
How do I handle pagination when scraping websites with Swift?

Handling pagination is a crucial aspect of web scraping when you need to extract data from multiple pages. Swift provides powerful tools like URLSession and async/await that make pagination handling efficient and maintainable. This guide covers various pagination patterns and implementation strategies for Swift web scraping.

Understanding Common Pagination Patterns

Before implementing pagination logic, it's important to understand the different types of pagination you'll encounter:

1. URL-Based Pagination

The most common pattern, where page numbers or offsets appear directly in the URL:

- https://example.com/products?page=1
- https://example.com/api/items?offset=0&limit=20
- https://example.com/search?query=swift&p=2
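URLs like these are best built with URLComponents rather than string interpolation, because URLComponents percent-encodes query values for you. A minimal sketch (example.com is a placeholder host):

import Foundation

// Build a page URL safely; URLComponents percent-encodes query values,
// which plain string interpolation does not.
func pageURL(base: String, page: Int, query: [String: String] = [:]) -> URL? {
    guard var components = URLComponents(string: base) else { return nil }
    var items = query.map { URLQueryItem(name: $0.key, value: $0.value) }
    items.append(URLQueryItem(name: "page", value: String(page)))
    components.queryItems = items
    return components.url
}

// pageURL(base: "https://example.com/search", page: 2, query: ["query": "swift code"])
// → https://example.com/search?query=swift%20code&page=2

The interpolation-based examples later in this guide work because their query values are plain integers; as soon as values can contain spaces or special characters, prefer URLComponents.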

2. Token-Based Pagination

APIs often use opaque tokens or cursors instead of page numbers:

- https://api.example.com/data?next_token=abc123
- https://api.example.com/items?cursor=xyz789

3. Load More Buttons

Pages that use JavaScript to load additional content when a button is clicked or when scrolling reaches the bottom.

Basic Pagination Implementation with URLSession

Here's a foundational approach to handle URL-based pagination in Swift:

import Foundation

class WebScraper {
    private let session = URLSession.shared

    func scrapeAllPages(baseURL: String, maxPages: Int = 10) async throws -> [Data] {
        var allData: [Data] = []
        var currentPage = 1

        while currentPage <= maxPages {
            let url = "\(baseURL)?page=\(currentPage)"

            guard let requestURL = URL(string: url) else {
                throw ScrapingError.invalidURL
            }

            do {
                let (data, response) = try await session.data(from: requestURL)

                // Check if we got a valid response
                guard let httpResponse = response as? HTTPURLResponse,
                      httpResponse.statusCode == 200 else {
                    throw ScrapingError.invalidResponse
                }

                // Stop when the response no longer contains items
                // (assumes a JSON body with an "items" array)
                if let jsonData = try? JSONSerialization.jsonObject(with: data) as? [String: Any],
                   let items = jsonData["items"] as? [[String: Any]],
                   !items.isEmpty {
                    allData.append(data)
                    currentPage += 1
                } else {
                    // No more data, break the loop
                    break
                }

                // Add delay to be respectful to the server
                try await Task.sleep(nanoseconds: 1_000_000_000) // 1 second

            } catch {
                print("Error scraping page \(currentPage): \(error)")
                break
            }
        }

        return allData
    }
}

enum ScrapingError: Error {
    case invalidURL
    case invalidResponse
    case noMoreData
}

Advanced Pagination with JSON Response Parsing

For API-based scraping, you'll often need to parse JSON responses to determine pagination information:

struct PaginatedResponse: Codable {
    let items: [Item]
    let pagination: PaginationInfo
}

struct PaginationInfo: Codable {
    let currentPage: Int
    let totalPages: Int
    let hasNext: Bool
    let nextToken: String?
}

struct Item: Codable {
    let id: String
    let title: String
    let description: String
}

class APIScraper {
    private let session = URLSession.shared
    private let decoder = JSONDecoder()

    func scrapeAllItems(from baseURL: String) async throws -> [Item] {
        var allItems: [Item] = []
        var currentPage = 1
        var hasMorePages = true

        while hasMorePages {
            let url = "\(baseURL)?page=\(currentPage)"

            guard let requestURL = URL(string: url) else {
                throw ScrapingError.invalidURL
            }

            let (data, _) = try await session.data(from: requestURL)
            let response = try decoder.decode(PaginatedResponse.self, from: data)

            allItems.append(contentsOf: response.items)

            hasMorePages = response.pagination.hasNext
            currentPage += 1

            // Respect rate limits
            try await Task.sleep(nanoseconds: 500_000_000) // 0.5 seconds
        }

        return allItems
    }
}
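Before wiring this loop to a live endpoint, it helps to sanity-check the Codable models against a sample payload. The sketch below assumes the API uses snake_case keys (as many do); setting keyDecodingStrategy to .convertFromSnakeCase maps them onto the camelCase properties defined above:

import Foundation

struct PaginationInfo: Codable {
    let currentPage: Int
    let totalPages: Int
    let hasNext: Bool
    let nextToken: String?
}

struct Item: Codable {
    let id: String
    let title: String
    let description: String
}

struct PaginatedResponse: Codable {
    let items: [Item]
    let pagination: PaginationInfo
}

// A hypothetical payload using snake_case keys.
let json = """
{
  "items": [{"id": "1", "title": "First", "description": "Demo"}],
  "pagination": {"current_page": 1, "total_pages": 3, "has_next": true, "next_token": null}
}
""".data(using: .utf8)!

let decoder = JSONDecoder()
decoder.keyDecodingStrategy = .convertFromSnakeCase // current_page -> currentPage, etc.
let page = try! decoder.decode(PaginatedResponse.self, from: json)
// page.pagination.hasNext is true, so the scraper would request the next page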

Token-Based Pagination Implementation

Some APIs use tokens or cursors instead of page numbers. Here's how to handle this pattern:

class TokenBasedScraper {
    private let session = URLSession.shared
    private let decoder = JSONDecoder()

    func scrapeWithTokens(from baseURL: String) async throws -> [Item] {
        var allItems: [Item] = []
        var nextToken: String? = nil

        repeat {
            guard var urlComponents = URLComponents(string: baseURL) else {
                throw ScrapingError.invalidURL
            }

            if let token = nextToken {
                urlComponents.queryItems = [URLQueryItem(name: "next_token", value: token)]
            }

            guard let url = urlComponents.url else {
                throw ScrapingError.invalidURL
            }

            let (data, _) = try await session.data(from: url)
            let response = try decoder.decode(TokenPaginatedResponse.self, from: data)

            allItems.append(contentsOf: response.items)
            nextToken = response.nextToken

            // Add delay between requests
            try await Task.sleep(nanoseconds: 1_000_000_000)

        } while nextToken != nil

        return allItems
    }
}

struct TokenPaginatedResponse: Codable {
    let items: [Item]
    let nextToken: String?
}

Handling Dynamic Pagination with Custom Headers

Some websites require specific headers or authentication tokens. Here's an enhanced version:

class AuthenticatedScraper {
    private let session: URLSession
    private let decoder = JSONDecoder()

    init(apiKey: String) {
        let config = URLSessionConfiguration.default
        config.httpAdditionalHeaders = [
            "Authorization": "Bearer \(apiKey)",
            "User-Agent": "SwiftScraper/1.0"
        ]
        self.session = URLSession(configuration: config)
    }

    func scrapePaginatedData(from endpoint: String, 
                           pageSize: Int = 20) async throws -> [Item] {
        var allItems: [Item] = []
        var offset = 0
        var hasMoreData = true

        while hasMoreData {
            let url = "\(endpoint)?limit=\(pageSize)&offset=\(offset)"

            guard let requestURL = URL(string: url) else {
                throw ScrapingError.invalidURL
            }

            var request = URLRequest(url: requestURL)
            request.httpMethod = "GET"
            request.setValue("application/json", forHTTPHeaderField: "Accept")

            let (data, response) = try await session.data(for: request)

            guard let httpResponse = response as? HTTPURLResponse else {
                throw ScrapingError.invalidResponse
            }

            switch httpResponse.statusCode {
            case 200:
                let paginatedResponse = try decoder.decode(PaginatedResponse.self, from: data)
                allItems.append(contentsOf: paginatedResponse.items)

                // Check if we've reached the end
                hasMoreData = paginatedResponse.items.count == pageSize
                offset += pageSize

            case 429:
                // Rate limited, wait and retry
                print("Rate limited, waiting...")
                try await Task.sleep(nanoseconds: 5_000_000_000) // 5 seconds
                continue

            default:
                throw ScrapingError.httpError(httpResponse.statusCode)
            }

            // Respectful delay
            try await Task.sleep(nanoseconds: 1_000_000_000)
        }

        return allItems
    }
}

extension ScrapingError {
    // Maps non-success HTTP status codes onto a scraping error.
    // All codes collapse to .invalidResponse here; add dedicated cases
    // to ScrapingError if callers need to distinguish them.
    static func httpError(_ code: Int) -> ScrapingError {
        return .invalidResponse
    }
}

Concurrent Pagination for Better Performance

For better performance, you can implement concurrent pagination when the total number of pages is known:

class ConcurrentPaginationScraper {
    private let session = URLSession.shared
    private let maxConcurrentRequests = 3

    func scrapeAllPagesConcurrently(baseURL: String,
                                    totalPages: Int) async throws -> [Item] {
        var allItems: [Item] = []

        await withTaskGroup(of: [Item].self) { group in
            var nextPage = 1

            // Seed the group with a bounded number of in-flight requests
            while nextPage <= min(maxConcurrentRequests, totalPages) {
                let page = nextPage
                nextPage += 1
                group.addTask {
                    await self.scrapePageOrEmpty(baseURL: baseURL, page: page)
                }
            }

            // As each request completes, collect its items and start the next page
            for await pageItems in group {
                allItems.append(contentsOf: pageItems)
                if nextPage <= totalPages {
                    let page = nextPage
                    nextPage += 1
                    group.addTask {
                        await self.scrapePageOrEmpty(baseURL: baseURL, page: page)
                    }
                }
            }
        }

        return allItems.sorted { $0.id < $1.id } // Sort if order matters
    }

    // Wraps scrapeSinglePage so a single failed page is logged
    // instead of aborting the whole run
    private func scrapePageOrEmpty(baseURL: String, page: Int) async -> [Item] {
        do {
            return try await scrapeSinglePage(baseURL: baseURL, page: page)
        } catch {
            print("Failed to scrape page \(page): \(error)")
            return []
        }
    }

    private func scrapeSinglePage(baseURL: String, page: Int) async throws -> [Item] {
        let url = "\(baseURL)?page=\(page)"

        guard let requestURL = URL(string: url) else {
            throw ScrapingError.invalidURL
        }

        let (data, _) = try await session.data(from: requestURL)
        let response = try JSONDecoder().decode(PaginatedResponse.self, from: data)

        return response.items
    }
}

Error Handling and Retry Logic

Robust pagination handling requires proper error handling and retry mechanisms:

extension WebScraper {
    func scrapeWithRetry(url: String, maxRetries: Int = 3) async throws -> Data {
        var lastError: Error?

        for attempt in 1...maxRetries {
            do {
                guard let requestURL = URL(string: url) else {
                    throw ScrapingError.invalidURL
                }

                let (data, response) = try await session.data(from: requestURL)

                guard let httpResponse = response as? HTTPURLResponse else {
                    throw ScrapingError.invalidResponse
                }

                if httpResponse.statusCode == 200 {
                    return data
                } else if httpResponse.statusCode == 429 {
                    // Rate limited, wait longer
                    let delay = UInt64(1 << attempt) * 1_000_000_000 // Exponential backoff: 2, 4, 8 seconds
                    try await Task.sleep(nanoseconds: delay)
                    continue
                } else {
                    throw ScrapingError.httpError(httpResponse.statusCode)
                }

            } catch {
                lastError = error
                print("Attempt \(attempt) failed: \(error)")

                if attempt < maxRetries {
                    let delay = UInt64(attempt * 1_000_000_000) // 1, 2, 3 seconds
                    try await Task.sleep(nanoseconds: delay)
                }
            }
        }

        throw lastError ?? ScrapingError.invalidResponse
    }
}

Practical Usage Example

Here's a complete example that demonstrates pagination handling in a real-world scenario:

import Foundation

@main
struct PaginationExample {
    static func main() async {
        let scraper = WebScraper()

        do {
            print("Starting pagination scraping...")
            let allData = try await scraper.scrapeAllPages(
                baseURL: "https://api.example.com/products",
                maxPages: 50
            )

            print("Successfully scraped \(allData.count) pages")

            // Process the scraped data
            for (index, pageData) in allData.enumerated() {
                print("Processing page \(index + 1)...")
                // Parse and process each page's data
            }

        } catch {
            print("Scraping failed: \(error)")
        }
    }
}

Best Practices for Pagination in Swift

  1. Respect Rate Limits: Always add delays between requests to avoid being blocked
  2. Handle Errors Gracefully: Implement retry logic for transient failures
  3. Use Async/Await: Leverage Swift's modern concurrency features for cleaner code
  4. Monitor Memory Usage: For large datasets, consider processing pages individually rather than storing everything in memory
  5. Implement Logging: Track your scraping progress and any issues that occur
  6. Follow robots.txt: Always check and respect the website's robots.txt file
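The first practice, spacing out requests, can be centralized instead of sprinkling Task.sleep calls through every loop. Below is a minimal throttle sketch (the actor name and interval are illustrative; tune the interval to the target site's published limits):

import Foundation

// A minimal request throttle: callers await their turn so that requests
// are spaced at least `minimumInterval` seconds apart.
actor RequestThrottle {
    private let minimumInterval: TimeInterval
    private var lastRequest: Date?

    init(minimumInterval: TimeInterval = 1.0) {
        self.minimumInterval = minimumInterval
    }

    func waitForTurn() async throws {
        if let last = lastRequest {
            let elapsed = Date().timeIntervalSince(last)
            if elapsed < minimumInterval {
                let remaining = minimumInterval - elapsed
                try await Task.sleep(nanoseconds: UInt64(remaining * 1_000_000_000))
            }
        }
        lastRequest = Date()
    }
}

// Usage inside any pagination loop:
// try await throttle.waitForTurn()
// let (data, _) = try await session.data(from: url)

Because it is an actor, the same throttle instance can safely be shared across the concurrent scraping tasks shown earlier.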

For complex JavaScript-rendered pagination (infinite scroll, "load more" buttons), you may need a browser automation tool that can execute the page's scripts and wait for AJAX-loaded content before extracting it.

When dealing with authentication-protected paginated content, ensure you properly manage sessions and cookies throughout the pagination process to maintain access to protected resources.

Remember to always test your pagination logic thoroughly, as websites can change their pagination structure, and having robust error handling will help your scraper adapt to these changes gracefully.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
