How do I handle pagination when scraping websites with Swift?
Handling pagination is a crucial aspect of web scraping when you need to extract data from multiple pages. Swift provides powerful tools like URLSession and async/await that make pagination handling efficient and maintainable. This guide covers various pagination patterns and implementation strategies for Swift web scraping.
Understanding Common Pagination Patterns
Before implementing pagination logic, it's important to understand the different types of pagination you'll encounter:
1. URL-Based Pagination
The most common pattern where page numbers or offsets are included in the URL:
- https://example.com/products?page=1
- https://example.com/api/items?offset=0&limit=20
- https://example.com/search?query=swift&p=2
2. Token-Based Pagination
APIs often use tokens or cursors for pagination:
- https://api.example.com/data?next_token=abc123
- https://api.example.com/items?cursor=xyz789
3. Load More Buttons
Pages that use JavaScript to load additional content when a button is clicked or when scrolling reaches the bottom.
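Whichever pattern you face, it helps to build page URLs with URLComponents rather than raw string interpolation, so query values are percent-encoded correctly. A minimal sketch (the base URL and the "page" parameter name are illustrative; match them to the target site):

```swift
import Foundation

// Build a page URL safely; values like "swift tutorial" get percent-encoded
// automatically instead of producing an invalid URL string.
func pageURL(base: String, page: Int, extraQuery: [String: String] = [:]) -> URL? {
    guard var components = URLComponents(string: base) else { return nil }
    var items = extraQuery.map { URLQueryItem(name: $0.key, value: $0.value) }
    items.append(URLQueryItem(name: "page", value: String(page)))
    components.queryItems = items
    return components.url
}
```

For example, `pageURL(base: "https://example.com/search", page: 2, extraQuery: ["query": "swift"])` yields a URL like `https://example.com/search?query=swift&page=2`.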
Basic Pagination Implementation with URLSession
Here's a foundational approach to handle URL-based pagination in Swift:
import Foundation
class WebScraper {
private let session = URLSession.shared
func scrapeAllPages(baseURL: String, maxPages: Int = 10) async throws -> [Data] {
var allData: [Data] = []
var currentPage = 1
while currentPage <= maxPages {
let url = "\(baseURL)?page=\(currentPage)"
guard let requestURL = URL(string: url) else {
throw ScrapingError.invalidURL
}
do {
let (data, response) = try await session.data(from: requestURL)
// Check if we got a valid response
guard let httpResponse = response as? HTTPURLResponse,
httpResponse.statusCode == 200 else {
throw ScrapingError.invalidResponse
}
// Check if page has content
if let jsonData = try? JSONSerialization.jsonObject(with: data) as? [String: Any],
let items = jsonData["items"] as? [[String: Any]],
!items.isEmpty {
allData.append(data)
currentPage += 1
} else {
// No more data, break the loop
break
}
// Add delay to be respectful to the server
try await Task.sleep(nanoseconds: 1_000_000_000) // 1 second
} catch {
print("Error scraping page \(currentPage): \(error)")
break
}
}
return allData
}
}
enum ScrapingError: Error {
case invalidURL
case invalidResponse
case noMoreData
}
Advanced Pagination with JSON Response Parsing
For API-based scraping, you'll often need to parse JSON responses to determine pagination information:
struct PaginatedResponse: Codable {
let items: [Item]
let pagination: PaginationInfo
}
struct PaginationInfo: Codable {
let currentPage: Int
let totalPages: Int
let hasNext: Bool
let nextToken: String?
}
struct Item: Codable {
let id: String
let title: String
let description: String
}
class APIScraper {
private let session = URLSession.shared
private let decoder = JSONDecoder()
func scrapeAllItems(from baseURL: String) async throws -> [Item] {
var allItems: [Item] = []
var currentPage = 1
var hasMorePages = true
while hasMorePages {
let url = "\(baseURL)?page=\(currentPage)"
guard let requestURL = URL(string: url) else {
throw ScrapingError.invalidURL
}
let (data, _) = try await session.data(from: requestURL)
let response = try decoder.decode(PaginatedResponse.self, from: data)
allItems.append(contentsOf: response.items)
hasMorePages = response.pagination.hasNext
currentPage += 1
// Respect rate limits
try await Task.sleep(nanoseconds: 500_000_000) // 0.5 seconds
}
return allItems
}
}
Token-Based Pagination Implementation
Some APIs use tokens or cursors instead of page numbers. Here's how to handle this pattern:
class TokenBasedScraper {
private let session = URLSession.shared
private let decoder = JSONDecoder()
func scrapeWithTokens(from baseURL: String) async throws -> [Item] {
var allItems: [Item] = []
var nextToken: String? = nil
repeat {
guard var urlComponents = URLComponents(string: baseURL) else {
    throw ScrapingError.invalidURL
}
if let token = nextToken {
urlComponents.queryItems = [URLQueryItem(name: "next_token", value: token)]
}
guard let url = urlComponents.url else {
throw ScrapingError.invalidURL
}
let (data, _) = try await session.data(from: url)
let response = try decoder.decode(TokenPaginatedResponse.self, from: data)
allItems.append(contentsOf: response.items)
nextToken = response.nextToken
// Add delay between requests
try await Task.sleep(nanoseconds: 1_000_000_000)
} while nextToken != nil
return allItems
}
}
struct TokenPaginatedResponse: Codable {
let items: [Item]
let nextToken: String?
}
Handling Dynamic Pagination with Custom Headers
Some websites require specific headers or authentication tokens. Here's an enhanced version:
class AuthenticatedScraper {
private let session: URLSession
private let decoder = JSONDecoder()
init(apiKey: String) {
let config = URLSessionConfiguration.default
config.httpAdditionalHeaders = [
"Authorization": "Bearer \(apiKey)",
"User-Agent": "SwiftScraper/1.0"
]
self.session = URLSession(configuration: config)
}
func scrapePaginatedData(from endpoint: String,
pageSize: Int = 20) async throws -> [Item] {
var allItems: [Item] = []
var offset = 0
var hasMoreData = true
while hasMoreData {
let url = "\(endpoint)?limit=\(pageSize)&offset=\(offset)"
guard let requestURL = URL(string: url) else {
throw ScrapingError.invalidURL
}
var request = URLRequest(url: requestURL)
request.httpMethod = "GET"
request.setValue("application/json", forHTTPHeaderField: "Accept")
let (data, response) = try await session.data(for: request)
guard let httpResponse = response as? HTTPURLResponse else {
throw ScrapingError.invalidResponse
}
switch httpResponse.statusCode {
case 200:
let paginatedResponse = try decoder.decode(PaginatedResponse.self, from: data)
allItems.append(contentsOf: paginatedResponse.items)
// Check if we've reached the end
hasMoreData = paginatedResponse.items.count == pageSize
offset += pageSize
case 429:
// Rate limited, wait and retry
print("Rate limited, waiting...")
try await Task.sleep(nanoseconds: 5_000_000_000) // 5 seconds
continue
default:
throw ScrapingError.httpError(httpResponse.statusCode)
}
// Respectful delay
try await Task.sleep(nanoseconds: 1_000_000_000)
}
return allItems
}
}
extension ScrapingError {
    static func httpError(_ code: Int) -> ScrapingError {
        // Collapse HTTP failures into invalidResponse for now; add a dedicated
        // case with an associated status code if callers need to distinguish them
        return .invalidResponse
    }
}
Concurrent Pagination for Better Performance
For better performance, you can implement concurrent pagination when the total number of pages is known:
class ConcurrentPaginationScraper {
    private let session = URLSession.shared
    private let maxConcurrentRequests = 3

    func scrapeAllPagesConcurrently(baseURL: String,
                                    totalPages: Int) async throws -> [Item] {
        // Run the pages through a task group, keeping at most
        // maxConcurrentRequests requests in flight at once
        let allItems = await withTaskGroup(of: [Item].self) { group -> [Item] in
            var items: [Item] = []
            var nextPage = 1
            // Seed the group with the first batch of requests
            while nextPage <= min(self.maxConcurrentRequests, totalPages) {
                let page = nextPage
                nextPage += 1
                group.addTask {
                    do {
                        return try await self.scrapeSinglePage(baseURL: baseURL, page: page)
                    } catch {
                        print("Failed to scrape page \(page): \(error)")
                        return []
                    }
                }
            }
            // As each page completes, collect its items and start the next request
            while let pageItems = await group.next() {
                items.append(contentsOf: pageItems)
                if nextPage <= totalPages {
                    let page = nextPage
                    nextPage += 1
                    group.addTask {
                        do {
                            return try await self.scrapeSinglePage(baseURL: baseURL, page: page)
                        } catch {
                            print("Failed to scrape page \(page): \(error)")
                            return []
                        }
                    }
                }
            }
            return items
        }
        return allItems.sorted { $0.id < $1.id } // Sort if order matters
    }
private func scrapeSinglePage(baseURL: String, page: Int) async throws -> [Item] {
let url = "\(baseURL)?page=\(page)"
guard let requestURL = URL(string: url) else {
throw ScrapingError.invalidURL
}
let (data, _) = try await session.data(from: requestURL)
let response = try JSONDecoder().decode(PaginatedResponse.self, from: data)
return response.items
}
}
Error Handling and Retry Logic
Robust pagination handling requires proper error handling and retry mechanisms:
extension WebScraper {
func scrapeWithRetry(url: String, maxRetries: Int = 3) async throws -> Data {
var lastError: Error?
for attempt in 1...maxRetries {
do {
guard let requestURL = URL(string: url) else {
throw ScrapingError.invalidURL
}
let (data, response) = try await session.data(from: requestURL)
guard let httpResponse = response as? HTTPURLResponse else {
throw ScrapingError.invalidResponse
}
if httpResponse.statusCode == 200 {
return data
} else if httpResponse.statusCode == 429 {
// Rate limited, wait longer
let delay = UInt64(1 << attempt) * 1_000_000_000 // Exponential backoff: 2, 4, 8 seconds
try await Task.sleep(nanoseconds: delay)
continue
} else {
throw ScrapingError.httpError(httpResponse.statusCode)
}
} catch {
lastError = error
print("Attempt \(attempt) failed: \(error)")
if attempt < maxRetries {
let delay = UInt64(attempt * 1_000_000_000) // 1, 2, 3 seconds
try await Task.sleep(nanoseconds: delay)
}
}
}
throw lastError ?? ScrapingError.invalidResponse
}
}
Practical Usage Example
Here's a complete example that demonstrates pagination handling in a real-world scenario:
import Foundation
@main
struct PaginationExample {
static func main() async {
let scraper = WebScraper()
do {
print("Starting pagination scraping...")
let allData = try await scraper.scrapeAllPages(
baseURL: "https://api.example.com/products",
maxPages: 50
)
print("Successfully scraped \(allData.count) pages")
// Process the scraped data
for (index, pageData) in allData.enumerated() {
print("Processing page \(index + 1)...")
// Parse and process each page's data
}
} catch {
print("Scraping failed: \(error)")
}
}
}
Best Practices for Pagination in Swift
- Respect Rate Limits: Always add delays between requests to avoid being blocked
- Handle Errors Gracefully: Implement retry logic for transient failures
- Use Async/Await: Leverage Swift's modern concurrency features for cleaner code
- Monitor Memory Usage: For large datasets, consider processing pages individually rather than storing everything in memory
- Implement Logging: Track your scraping progress and any issues that occur
- Follow robots.txt: Always check and respect the website's robots.txt file
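The memory point above can be made concrete: instead of accumulating every page, hand each decoded page to a handler and let it go out of scope. A sketch reusing the PaginatedResponse and ScrapingError types from earlier (the `?page=` endpoint shape is an assumption):

```swift
import Foundation

// Streaming-style pagination: each decoded page is passed to `handle` and then
// discarded, so memory stays flat no matter how many pages are scraped.
func processPages(baseURL: String,
                  maxPages: Int,
                  handle: ([Item]) async throws -> Void) async throws {
    let session = URLSession.shared
    let decoder = JSONDecoder()
    for page in 1...maxPages {
        guard let url = URL(string: "\(baseURL)?page=\(page)") else {
            throw ScrapingError.invalidURL
        }
        let (data, _) = try await session.data(from: url)
        let response = try decoder.decode(PaginatedResponse.self, from: data)
        if response.items.isEmpty { break }   // stop at the first empty page
        try await handle(response.items)      // process, then release the page
        try await Task.sleep(nanoseconds: 1_000_000_000) // respectful delay
    }
}
```

Call it with a closure that writes each batch to disk or a database (a hypothetical `store(items)` helper, say) rather than appending to an ever-growing array.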
For pagination rendered entirely by JavaScript (infinite scroll, "load more" buttons), plain HTTP requests may not be enough: consider browser automation tools that can execute scripts and wait for dynamically loaded content, or look for the underlying JSON endpoint those scripts call and paginate against it directly.
When dealing with authentication-protected paginated content, ensure you properly manage sessions and cookies throughout the pagination process to maintain access to protected resources.
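One way to do that in Swift is to give the session a cookie store up front; URLSession then attaches and updates cookies automatically on every paginated request. A minimal sketch (the login flow mentioned in the comment is an illustrative assumption):

```swift
import Foundation

// A session whose configuration persists cookies, so a login performed once
// keeps authenticating every subsequent page request.
func makeSessionWithCookies() -> URLSession {
    let config = URLSessionConfiguration.default
    config.httpCookieStorage = HTTPCookieStorage.shared // shared cookie jar
    config.httpCookieAcceptPolicy = .always             // honor Set-Cookie headers
    config.httpShouldSetCookies = true                  // attach cookies to requests
    return URLSession(configuration: config)
}

// After an initial authenticated request (e.g. a POST to the site's login
// endpoint), paginated GETs through this same session carry the cookies along.
```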
Remember to always test your pagination logic thoroughly, as websites can change their pagination structure, and having robust error handling will help your scraper adapt to these changes gracefully.