
What Error Handling Patterns Work Best with SwiftSoup?

SwiftSoup is a powerful Swift library for parsing HTML and XML documents, but like any web scraping tool, it requires robust error handling to manage various failure scenarios. Implementing proper error handling patterns ensures your SwiftSoup-based applications are reliable, maintainable, and provide meaningful feedback when issues occur.

Understanding SwiftSoup Error Types

SwiftSoup reports failures through a single enum case, `Exception.Error(type:Message:)`, where the type is an `ExceptionType` value you can branch on:

  • SelectorParseException: Thrown when a CSS selector is malformed
  • IllegalArgumentException: Thrown when an invalid argument is passed (for example, an empty attribute name)
  • IOException: Thrown during I/O-related operations
  • MalformedURLException: Thrown when a URL cannot be resolved

Note that parsing itself is lenient: like jsoup, SwiftSoup repairs malformed HTML rather than throwing a parse error, so most thrown errors come from selection and attribute access.
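
Because everything funnels through that one case, a single helper can translate SwiftSoup errors into readable messages by switching on the type. This is a minimal sketch; the function name and messages are illustrative:

```swift
import SwiftSoup

func describe(_ error: Error) -> String {
    // Unwrap SwiftSoup's single Exception.Error case, if that is what we got
    guard let soupError = error as? Exception,
          case .Error(let type, let message) = soupError else {
        return "Other error: \(error)"
    }
    switch type {
    case .SelectorParseException:
        return "Bad selector: \(message)"
    case .IllegalArgumentException:
        return "Bad argument: \(message)"
    default:
        return "SwiftSoup error: \(message)"
    }
}
```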

1. Basic Do-Catch Error Handling

The most fundamental error handling pattern in SwiftSoup uses Swift's do-catch mechanism:

import SwiftSoup

func parseHTML(from htmlString: String) -> String? {
    do {
        let doc = try SwiftSoup.parse(htmlString)
        let title = try doc.title()
        return title
    } catch Exception.Error(let type, let message) {
        print("SwiftSoup error: \(type) - \(message)")
        return nil
    } catch {
        print("Unexpected error: \(error)")
        return nil
    }
}

// Usage
let html = "<html><head><title>Sample Page</title></head><body></body></html>"
if let title = parseHTML(from: html) {
    print("Page title: \(title)")
} else {
    print("Failed to extract title")
}

2. Specific Error Type Handling

For more granular control, handle specific SwiftSoup error types:

func extractDataWithSpecificErrorHandling(from html: String) -> [String] {
    var results: [String] = []

    do {
        let doc = try SwiftSoup.parse(html)
        let elements = try doc.select("div.content")

        for element in elements {
            let text = try element.text()
            results.append(text)
        }

    } catch Exception.Error(let type, let message) where type == .SelectorParseException {
        print("Invalid CSS selector: \(message)")
        // Fall back to basic parsing
        return fallbackParsing(html)

    } catch Exception.Error(let type, let message) {
        print("SwiftSoup error (\(type)): \(message)")
        // Fall back rather than returning nothing
        return fallbackParsing(html)

    } catch {
        print("Unexpected error during parsing: \(error)")
    }

    return results
}

func fallbackParsing(_ html: String) -> [String] {
    // Simpler strategy: return all visible body text as a single entry
    do {
        let doc = try SwiftSoup.parse(html)
        guard let body = doc.body() else { return [] }   // body() does not throw
        return [try body.text()]
    } catch {
        return ["Failed to parse content"]
    }
}

3. Result Type Pattern

Using Swift's Result type provides a more functional approach to error handling:

enum ScrapingError: Error {
    case parsingFailed(String)
    case selectionFailed(String)
    case networkError(String)
    case emptyContent
}

func scrapeWebContent(url: String) -> Result<[String], ScrapingError> {
    do {
        // fetchHTML(from:) stands in for your networking layer
        guard let htmlContent = fetchHTML(from: url) else {
            return .failure(.networkError("Failed to fetch content from \(url)"))
        }

        let doc = try SwiftSoup.parse(htmlContent)
        let articles = try doc.select("article")

        guard !articles.isEmpty() else {
            return .failure(.emptyContent)
        }

        var content: [String] = []
        for article in articles {
            let text = try article.text()
            content.append(text)
        }

        return .success(content)

    } catch Exception.Error(let type, let message) {
        switch type {
        case .SelectorParseException:
            return .failure(.selectionFailed("CSS selector error: \(message)"))
        default:
            return .failure(.parsingFailed("SwiftSoup error (\(type)): \(message)"))
        }
    } catch {
        return .failure(.parsingFailed("Unexpected error: \(error)"))
    }
}

// Usage with Result pattern
let result = scrapeWebContent(url: "https://example.com")
switch result {
case .success(let articles):
    print("Successfully scraped \(articles.count) articles")
case .failure(let error):
    print("Scraping failed: \(error)")
}

4. Optional Chaining with Error Recovery

Combine optional binding with error handling for graceful degradation:

func extractMetadata(from html: String) -> WebPageMetadata {
    var metadata = WebPageMetadata()

    do {
        let doc = try SwiftSoup.parse(html)

        // Safe extraction with fallbacks
        metadata.title = try? doc.title()
        metadata.description = try? doc.select("meta[name=description]").first()?.attr("content")
        metadata.keywords = try? doc.select("meta[name=keywords]").first()?.attr("content")

        // Extract social media tags with error recovery
        if let ogTitle = try? doc.select("meta[property=og:title]").first()?.attr("content"),
           !ogTitle.isEmpty {
            metadata.socialTitle = ogTitle
        } else {
            metadata.socialTitle = metadata.title // Fallback to regular title
        }

    } catch {
        print("Error parsing metadata: \(error)")
        // Return partially populated metadata instead of failing completely
    }

    return metadata
}

struct WebPageMetadata {
    var title: String?
    var description: String?
    var keywords: String?
    var socialTitle: String?
}
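
A quick sketch of the graceful-degradation behavior: the deliberately unclosed markup below still yields a partially populated struct instead of a failure.

```swift
// Markup is intentionally left unclosed to demonstrate lenient parsing
let partial = extractMetadata(from: "<html><head><title>Home</title>")
print(partial.title ?? "n/a")        // lenient parsing should still recover the title
print(partial.socialTitle ?? "n/a")  // falls back to the regular title when no og:title exists
```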

5. Async Error Handling Patterns

For network-based scraping, combine async/await with proper error handling:

import Foundation

actor WebScraper {
    func scrapeAsync(url: URL) async throws -> ScrapedData {
        do {
            let (data, response) = try await URLSession.shared.data(from: url)

            guard let httpResponse = response as? HTTPURLResponse,
                  200...299 ~= httpResponse.statusCode else {
                let status = (response as? HTTPURLResponse)?.statusCode ?? -1
                throw ScrapingError.networkError("HTTP error, status: \(status)")
            }

            guard let htmlString = String(data: data, encoding: .utf8) else {
                throw ScrapingError.parsingFailed("Response body is not valid UTF-8")
            }
            let doc = try SwiftSoup.parse(htmlString)

            return try await parseDocument(doc)

        } catch let error as ScrapingError {
            throw error
        } catch Exception.Error(let type, let message) {
            throw ScrapingError.parsingFailed("SwiftSoup error: \(type) - \(message)")
        } catch {
            throw ScrapingError.networkError("Network error: \(error)")
        }
    }

    private func parseDocument(_ doc: Document) async throws -> ScrapedData {
        // Implement parsing logic with proper error handling
        let title = try doc.title()
        let content = try doc.body()?.text() ?? ""

        return ScrapedData(title: title, content: content)
    }
}

struct ScrapedData {
    let title: String
    let content: String
}
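
At the call site, everything `scrapeAsync` can throw funnels into one do-catch. A sketch; the URL is illustrative:

```swift
func loadExamplePage() async {
    let scraper = WebScraper()
    do {
        let data = try await scraper.scrapeAsync(url: URL(string: "https://example.com")!)
        print("Title: \(data.title), \(data.content.count) characters of content")
    } catch let error as ScrapingError {
        // Domain errors carry a human-readable reason
        print("Scraping failed: \(error)")
    } catch {
        print("Unexpected error: \(error)")
    }
}
```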

6. Logging and Monitoring Patterns

Implement comprehensive logging for debugging and monitoring:

import os.log

class SwiftSoupLogger {
    static let subsystem = "com.yourapp.webscraping"
    static let category = "swiftsoup"
    static let logger = Logger(subsystem: subsystem, category: category)
}

func parseWithLogging(html: String, selector: String) -> [Element] {
    SwiftSoupLogger.logger.info("Starting HTML parsing with selector: \(selector)")

    do {
        let doc = try SwiftSoup.parse(html)
        let elements = try doc.select(selector)

        SwiftSoupLogger.logger.info("Successfully selected \(elements.size()) elements")
        return elements.array()

    } catch Exception.Error(let type, let message) {
        SwiftSoupLogger.logger.error("SwiftSoup error - Type: \(String(describing: type)), Message: \(message)")

        // Log additional context
        SwiftSoupLogger.logger.debug("HTML length: \(html.count), Selector: \(selector)")

        return []
    } catch {
        SwiftSoupLogger.logger.fault("Unexpected error: \(error.localizedDescription)")
        return []
    }
}

7. Custom Error Types and Extensions

Create domain-specific error types for better error handling:

enum WebScrapingError: Error, LocalizedError {
    case invalidHTML
    case selectorNotFound(String)
    case elementNotFound(String)
    case extractionFailed(String, underlying: Error)

    var errorDescription: String? {
        switch self {
        case .invalidHTML:
            return "The provided HTML is invalid or malformed"
        case .selectorNotFound(let selector):
            return "No elements found for selector: \(selector)"
        case .elementNotFound(let element):
            return "Required element not found: \(element)"
        case .extractionFailed(let operation, let underlying):
            return "Failed to \(operation): \(underlying.localizedDescription)"
        }
    }
}

extension Document {
    func safeSelect(_ cssQuery: String) throws -> Elements {
        do {
            let elements = try select(cssQuery)
            guard !elements.isEmpty() else {
                throw WebScrapingError.selectorNotFound(cssQuery)
            }
            return elements
        } catch Exception.Error(let type, let message) {
            throw WebScrapingError.extractionFailed("select elements", 
                                                   underlying: Exception.Error(type: type, Message: message))
        }
    }
}
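
A call-site sketch for the extension above; the second lookup triggers the custom selectorNotFound error, while the first succeeds:

```swift
do {
    let doc = try SwiftSoup.parse("<div class=\"content\">Hello</div>")
    let found = try doc.safeSelect("div.content")
    print("Found \(found.size()) element(s)")
    _ = try doc.safeSelect("div.missing")   // throws WebScrapingError.selectorNotFound
} catch let error as WebScrapingError {
    print(error.localizedDescription)
} catch {
    print("Unexpected: \(error)")
}
```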

8. Retry Mechanisms with Exponential Backoff

Implement retry logic for transient failures. Retries pay off when each attempt can observe different input (for example, a fresh network fetch); deterministic failures such as selector parse errors are excluded below because they will never succeed on a repeat attempt:

class RetryableParser {
    private let maxRetries: Int
    private let baseDelay: TimeInterval

    init(maxRetries: Int = 3, baseDelay: TimeInterval = 1.0) {
        self.maxRetries = maxRetries
        self.baseDelay = baseDelay
    }

    func parseWithRetry(html: String, selector: String) async throws -> [String] {
        var lastError: Error?

        for attempt in 0..<maxRetries {
            do {
                let doc = try SwiftSoup.parse(html)
                let elements = try doc.select(selector)

                var results: [String] = []
                for element in elements {
                    let text = try element.text()
                    results.append(text)
                }

                return results

            } catch Exception.Error(let type, let message) where type == .SelectorParseException {
                // Don't retry selector parse errors - they won't succeed
                throw Exception.Error(type: type, Message: message)

            } catch {
                lastError = error

                if attempt < maxRetries - 1 {
                    let delay = baseDelay * pow(2.0, Double(attempt))
                    try await Task.sleep(nanoseconds: UInt64(delay * 1_000_000_000))
                }
            }
        }

        throw lastError ?? WebScrapingError.extractionFailed("parse after retries", underlying: NSError(domain: "RetryExhausted", code: -1))
    }
}
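
Calling the retrying parser from an async context might look like this (the selector and `html` value are illustrative):

```swift
let parser = RetryableParser(maxRetries: 3, baseDelay: 0.5)
do {
    let headlines = try await parser.parseWithRetry(html: html, selector: "h2.headline")
    print("Extracted \(headlines.count) headlines")
} catch {
    print("Giving up after retries: \(error)")
}
```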

9. Testing Error Scenarios

Create comprehensive tests for error handling:

import XCTest
@testable import YourModule

class SwiftSoupErrorHandlingTests: XCTestCase {

    func testParsingMalformedHTML() {
        let malformedHTML = "<html><head><title>Test</title><body><p>Unclosed paragraph"

        // SwiftSoup repairs malformed HTML rather than throwing
        XCTAssertNoThrow(try SwiftSoup.parse(malformedHTML))
    }

    func testInvalidSelector() {
        let html = "<html><body><p>Test</p></body></html>"

        do {
            let doc = try SwiftSoup.parse(html)
            _ = try doc.select("invalid[[selector")
            XCTFail("Expected selector parse exception")
        } catch Exception.Error(let type, _) {
            XCTAssertEqual(type, .SelectorParseException)
        } catch {
            XCTFail("Unexpected error type: \(error)")
        }
    }

    func testEmptyResultHandling() {
        let html = "<html><body><p>Test</p></body></html>"

        do {
            let doc = try SwiftSoup.parse(html)
            let elements = try doc.select("nonexistent")
            XCTAssertTrue(elements.isEmpty())
        } catch {
            XCTFail("Should not throw error for empty results")
        }
    }
}

Best Practices Summary

  1. Always use do-catch blocks when working with SwiftSoup operations
  2. Handle specific error types differently based on your recovery strategy
  3. Use Result types for functional error handling in complex scenarios
  4. Implement graceful degradation with optional binding and fallback values
  5. Log errors comprehensively for debugging and monitoring
  6. Create custom error types for domain-specific error handling
  7. Test error scenarios thoroughly in your unit tests
  8. Implement retry mechanisms for transient failures
  9. Use async patterns for network-based scraping with proper error propagation

While SwiftSoup's synchronous parsing calls for different strategies than asynchronous browser automation (compare how to handle timeouts in Puppeteer), the principles of comprehensive error handling and graceful degradation remain consistent across web scraping technologies.

For dynamic content that requires JavaScript execution, consider pairing SwiftSoup with a headless browser solution, applying the same techniques described in how to handle errors in Puppeteer.

By following these error handling patterns, your SwiftSoup-based applications will be more robust, easier to debug, and provide better user experiences when encountering parsing issues or malformed HTML content.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
