How to Handle Parse Errors When Using SwiftSoup

SwiftSoup is an HTML parsing library for Swift that brings the functionality of Java's jsoup to iOS development. Its parser tolerates most malformed markup, but SwiftSoup calls can still throw on invalid selectors or arguments, and the code around them can fail on network issues or unexpected content structures. This guide covers error handling strategies for SwiftSoup that help you build robust iOS applications.

Understanding SwiftSoup Error Types

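SwiftSoup reports failures through a single Swift error type, Exception, whose Error(type:Message:) case carries an ExceptionType value and a message string. The exception types you are most likely to see are:

  • IllegalArgumentException: Invalid arguments or parameters
  • SelectorParseException: CSS selectors that cannot be parsed
  • IOException: Input/output failures
  • MalformedURLException: Malformed URLs

Malformed HTML rarely throws, because the parser repairs it as it goes. SwiftSoup also does not fetch remote content, so network and character-encoding errors come from the code you use to download the HTML (for example, URLSession), not from SwiftSoup itself.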
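
As a minimal sketch of what catching a SwiftSoup error looks like in practice, the snippet below pattern-matches on the Exception.Error(type:Message:) case used in the SwiftSoup README. The function name describeSelectorResult is a hypothetical helper introduced only for this illustration.

```swift
import SwiftSoup

/// Hypothetical helper: parses the HTML, runs the selector, and returns
/// a human-readable description of the outcome instead of throwing.
func describeSelectorResult(html: String, selector: String) -> String {
    do {
        let document = try SwiftSoup.parse(html)
        let matches = try document.select(selector)
        return "matched \(matches.size()) element(s)"
    } catch Exception.Error(let type, let message) {
        // SwiftSoup errors expose an ExceptionType and a message string
        return "SwiftSoup error [\(type)]: \(message)"
    } catch {
        // Anything that is not a SwiftSoup Exception
        return "other error: \(error)"
    }
}
```

Returning a description keeps the example small; in an app you would more likely return a Result or rethrow a domain-specific error.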

Basic Error Handling with Do-Catch

The fundamental approach to handling SwiftSoup errors involves using Swift's do-try-catch pattern:

import SwiftSoup

func parseHTMLSafely(html: String) {
    do {
        let document = try SwiftSoup.parse(html)
        let title = try document.title()
        print("Document title: \(title)")
    } catch Exception.Error(let type, let message) {
        // SwiftSoup errors carry an ExceptionType and a message string
        print("Parse error occurred: \(message)")
        handleParseError(type: type, message: message)
    } catch {
        print("Unexpected error: \(error)")
    }
}

func handleParseError(type: ExceptionType, message: String) {
    // Log the error details
    print("Error type: \(type)")
    print("Error message: \(message)")

    // Implement fallback logic here
    // For example, try alternative parsing approaches
}

Handling Network-Related Errors

SwiftSoup parses HTML but does not download it, so fetch the page with Foundation first and handle network errors separately from parse errors:

import Foundation

func fetchAndParseURL(urlString: String) {
    guard let url = URL(string: urlString) else {
        print("Invalid URL: \(urlString)")
        return
    }

    // Step 1: fetch the HTML. This call is synchronous, so run it off the
    // main thread or use URLSession in production code.
    let html: String
    do {
        html = try String(contentsOf: url, encoding: .utf8)
    } catch {
        print("Network error: \(error.localizedDescription)")
        handleNetworkError(error, url: url)
        return
    }

    // Step 2: parse the downloaded HTML
    do {
        let document = try SwiftSoup.parse(html)
        let elements = try document.select("h1")

        for element in elements {
            let text = try element.text()
            print("H1 text: \(text)")
        }
    } catch Exception.Error(_, let message) {
        print("Parse error: \(message)")
        // Try alternative parsing strategies here
    } catch {
        print("Unexpected error: \(error)")
    }
}

func handleNetworkError(_ error: Error, url: URL) {
    // Implement retry logic
    DispatchQueue.main.asyncAfter(deadline: .now() + 2.0) {
        // Retry after delay
        print("Retrying connection to: \(url)")
        // Implement retry mechanism
    }
}
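
The retry placeholder in handleNetworkError could be filled in along the following lines. This is a sketch using only Foundation, not part of SwiftSoup: the retrying function, its parameter names, and its default values are assumptions chosen for illustration. It retries a throwing operation with exponential backoff and rethrows the last error once the attempts run out.

```swift
import Foundation

/// Hypothetical helper: runs `operation`, retrying with exponential backoff.
/// Blocks the calling thread while waiting, so call it off the main thread.
func retrying<T>(maxAttempts: Int = 3,
                 initialDelay: TimeInterval = 0.5,
                 operation: () throws -> T) rethrows -> T {
    var attempt = 1
    var delay = initialDelay

    while true {
        do {
            return try operation()
        } catch {
            // Give up and surface the last error once attempts are exhausted
            if attempt >= maxAttempts { throw error }

            Thread.sleep(forTimeInterval: delay)
            delay *= 2      // double the wait before the next attempt
            attempt += 1
        }
    }
}
```

A fetch could then be wrapped as: let html = try retrying { try String(contentsOf: url, encoding: .utf8) }. Retrying only makes sense for transient failures such as timeouts; an invalid URL or a parse error will fail the same way every time.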

Robust Selector Error Handling

Invalid CSS selectors can cause SwiftSoup to throw exceptions. Here's how to validate and handle selector errors:

func safeElementSelection(document: Document, selector: String) -> Elements? {
    do {
        // Validate selector before using it
        if isValidSelector(selector) {
            return try document.select(selector)
        } else {
            print("Invalid selector: \(selector)")
            return nil
        }
    } catch Exception.Error(let type, let message) where type == .SelectorParseException {
        print("Selector parse error: \(message)")
        return tryAlternativeSelectors(document: document, originalSelector: selector)
    } catch {
        print("Unexpected selector error: \(error)")
        return nil
    }
}

func isValidSelector(_ selector: String) -> Bool {
    // Basic selector validation
    return !selector.isEmpty && !selector.contains(">>") // Avoid unsupported selectors
}

func tryAlternativeSelectors(document: Document, originalSelector: String) -> Elements? {
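    // generateAlternativeSelectors is an app-specific helper you supply; it
    // should return simpler variants of the selector, in order of preference
    let alternatives = generateAlternativeSelectors(originalSelector)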

    for alternative in alternatives {
        do {
            return try document.select(alternative)
        } catch {
            continue
        }
    }

    return Elements() // Return empty Elements if all alternatives fail
}

Handling Malformed HTML

When dealing with malformed HTML, SwiftSoup's parser is generally forgiving, but you may need additional error handling:

func parseHTML(html: String) -> Document? {
    do {
        // Use SwiftSoup's lenient parser
        let document = try SwiftSoup.parse(html)

        // Validate the parsed document
        if try validateDocument(document) {
            return document
        } else {
            return try repairAndParse(html: html)
        }
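    } catch Exception.Error(_, let message) {
        print("Initial parse failed: \(message)")
        // Last resort: retry once with cleaned input
        return try? repairAndParse(html: html)
    } catch {
        // A final catch keeps the do-catch exhaustive in a non-throwing function
        print("Unexpected error: \(error)")
        return nil
    }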
}

func validateDocument(_ document: Document) throws -> Bool {
    // Check if essential elements exist
    // body() returns an optional Element and does not throw
    return document.body() != nil
}

func repairAndParse(html: String) throws -> Document? {
    // Attempt to clean and repair HTML
    let cleanedHTML = cleanHTML(html)
    return try SwiftSoup.parse(cleanedHTML)
}

func cleanHTML(_ html: String) -> String {
    var cleaned = html

    // Remove problematic characters
    cleaned = cleaned.replacingOccurrences(of: "\0", with: "")

    // Fix common malformed patterns
    cleaned = cleaned.replacingOccurrences(of: "<br>", with: "<br/>")
    cleaned = cleaned.replacingOccurrences(of: "<img", with: "<img ")

    return cleaned
}

Comprehensive Error Handling Class

Here's a complete error handling wrapper for SwiftSoup operations:

class SwiftSoupErrorHandler {
    static let shared = SwiftSoupErrorHandler()

    private init() {}

    func parseWithRetry(html: String, maxRetries: Int = 3) -> Document? {
        var attempts = 0

        while attempts < maxRetries {
            do {
                return try SwiftSoup.parse(html)
            } catch Exception.Error(_, let message) {
                attempts += 1
                print("Parse attempt \(attempts) failed: \(message)")

                if attempts < maxRetries {
                    // Try cleaning HTML before retry
                    let cleanedHTML = preprocessHTML(html)
                    if cleanedHTML != html {
                        return parseWithRetry(html: cleanedHTML, maxRetries: maxRetries - attempts)
                    }
                }
            } catch {
                print("Unexpected error during parse: \(error)")
                break
            }
        }

        return nil
    }

    func safeSelect(document: Document, selector: String) -> Elements {
        do {
            return try document.select(selector)
        } catch Exception.Error(_, let message) {
            print("Selector error: \(message)")
            return handleSelectorError(document: document, selector: selector)
        } catch {
            print("Unexpected selector error: \(error)")
            return Elements()
        }
    }

    private func preprocessHTML(_ html: String) -> String {
        var processed = html

        // Remove null bytes
        processed = processed.replacingOccurrences(of: "\0", with: "")

        // Ensure proper encoding
        if let data = processed.data(using: .utf8) {
            processed = String(data: data, encoding: .utf8) ?? processed
        }

        return processed
    }

    private func handleSelectorError(document: Document, selector: String) -> Elements {
        // Try simplified selectors
        let simplifiedSelectors = [
            selector.components(separatedBy: " ").first ?? "",
            selector.replacingOccurrences(of: ":nth-child(\\d+)", with: "", options: .regularExpression),
            selector.components(separatedBy: ">").first?.trimmingCharacters(in: .whitespaces) ?? ""
        ]

        for simpleSelector in simplifiedSelectors {
            do {
                if !simpleSelector.isEmpty {
                    return try document.select(simpleSelector)
                }
            } catch {
                continue
            }
        }

        return Elements()
    }
}

Logging and Debugging Parse Errors

Effective logging helps diagnose parsing issues in production:

import os.log

extension SwiftSoupErrorHandler {
    func parseWithLogging(html: String) -> Document? {
        let logger = Logger(subsystem: "com.yourapp.parsing", category: "SwiftSoup")

        do {
            logger.info("Starting HTML parse, length: \(html.count)")
            let document = try SwiftSoup.parse(html)
            logger.info("Parse successful")
            return document
        } catch let parseError as Exception {
            // Exception is an enum; its description includes the type and message
            logger.error("Parse error: \(String(describing: parseError))")
            logger.debug("HTML content: \(html.prefix(200))...")

            // Log additional context
            logParseContext(html: html, error: parseError, logger: logger)
            return nil
        } catch {
            logger.error("Unexpected parse error: \(error.localizedDescription)")
            return nil
        }
    }

    private func logParseContext(html: String, error: Exception, logger: Logger) {
        // Log HTML characteristics that might cause issues
        logger.debug("HTML starts with: \(html.prefix(100))")
        logger.debug("HTML ends with: \(html.suffix(100))")
        logger.debug("Contains null bytes: \(html.contains("\0"))")
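        // Compute the value first: Logger interpolations are closures, so
        // calling an instance method inside them would need an explicit self
        let hasEncodingIssues = detectEncodingIssues(html)
        logger.debug("Character encoding issues detected: \(hasEncodingIssues)")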
    }

    private func detectEncodingIssues(_ html: String) -> Bool {
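        // U+FFFD (the replacement character) signals bytes that failed to decode;
        // "&amp;amp;" signals double-escaped entities. A plain "&amp;" is valid HTML.
        return html.contains("\u{FFFD}") || html.contains("&amp;amp;")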
    }
}

Advanced Error Recovery Strategies

For production applications, implement sophisticated error recovery mechanisms:

class AdvancedSwiftSoupHandler {
    private let maxRetryAttempts = 3
    private let retryDelay: TimeInterval = 1.0

    func parseWithAdvancedRecovery(html: String) async -> Document? {
        // First attempt: Standard parsing
        if let document = tryStandardParse(html) {
            return document
        }

        // Second attempt: Cleaned HTML
        let cleanedHTML = performHTMLCleaning(html)
        if let document = tryStandardParse(cleanedHTML) {
            return document
        }

        // Third attempt: Fragment parsing
        return tryFragmentParsing(cleanedHTML)
    }

    private func tryStandardParse(_ html: String) -> Document? {
        do {
            return try SwiftSoup.parse(html)
        } catch {
            return nil
        }
    }

    private func performHTMLCleaning(_ html: String) -> String {
        var cleaned = html

        // Remove problematic Unicode characters
        cleaned = cleaned.replacingOccurrences(of: "\u{FEFF}", with: "") // BOM
        cleaned = cleaned.replacingOccurrences(of: "\0", with: "")      // Null bytes

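        // Fix common HTML issues
        // Add the missing semicolon only where one is not already present
        cleaned = cleaned.replacingOccurrences(of: "&nbsp(?!;)", with: "&nbsp;", options: .regularExpression)
        // Self-close meta tags without doubling an existing trailing slash
        cleaned = cleaned.replacingOccurrences(of: "<meta\\b([^>]*?)/?>", with: "<meta$1/>", options: .regularExpression)

        // Ensure proper structure (matches <html> with or without attributes)
        if cleaned.range(of: "<html", options: .caseInsensitive) == nil {
            cleaned = "<html><head></head><body>\(cleaned)</body></html>"
        }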

        return cleaned
    }

    private func tryFragmentParsing(_ html: String) -> Document? {
        do {
            // Try parsing as body fragment
            let bodyFragment = try SwiftSoup.parseBodyFragment(html)
            return bodyFragment
        } catch {
            return nil
        }
    }
}

Error Handling for Concurrent Parsing

When parsing multiple HTML documents concurrently, proper error handling becomes crucial:

actor ConcurrentSwiftSoupParser {
    private var activeOperations: Set<UUID> = []

    func parseMultipleHTML(_ htmlContents: [String]) async -> [Document?] {
        await withTaskGroup(of: (Int, Document?).self) { group in
            var results: [Document?] = Array(repeating: nil, count: htmlContents.count)

            for (index, html) in htmlContents.enumerated() {
                group.addTask {
                    let document = await self.parseHTMLSafely(html, operationId: UUID())
                    return (index, document)
                }
            }

            for await (index, document) in group {
                results[index] = document
            }

            return results
        }
    }

    private func parseHTMLSafely(_ html: String, operationId: UUID) async -> Document? {
        activeOperations.insert(operationId)
        defer { activeOperations.remove(operationId) }

        do {
            return try SwiftSoup.parse(html)
        } catch Exception.Error(_, let message) {
            print("Concurrent parse error for operation \(operationId): \(message)")

            // Implement fallback parsing strategy
            return await fallbackParse(html)
        } catch {
            print("Unexpected concurrent parse error: \(error)")
            return nil
        }
    }

    private func fallbackParse(_ html: String) async -> Document? {
        // Implement delay to avoid overwhelming the system
        try? await Task.sleep(nanoseconds: 100_000_000) // 100ms

        let cleanedHTML = html.replacingOccurrences(of: "\0", with: "")

        do {
            return try SwiftSoup.parse(cleanedHTML)
        } catch {
            return nil
        }
    }
}

Best Practices for Error Prevention

  1. Validate Input: Always validate HTML content before parsing
  2. Set Timeouts: Use appropriate timeouts for network operations
  3. Implement Fallbacks: Have alternative parsing strategies ready
  4. Monitor Performance: Track parsing success rates and error patterns
  5. Use Logging: Implement comprehensive logging for debugging
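
The first two practices can be sketched with Foundation alone. Everything named here (HTMLInputIssue, validateHTMLInput, makeScrapingSession) is hypothetical and not part of SwiftSoup, and the size limit and timeout values are arbitrary illustrative defaults, not recommendations.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLSession lives here on Linux
#endif

/// Hypothetical reasons to reject input before handing it to the parser.
enum HTMLInputIssue: Equatable {
    case empty
    case tooLarge(bytes: Int)
    case noMarkup
}

/// Returns nil when the input looks parseable, otherwise the first issue found.
func validateHTMLInput(_ html: String, maxBytes: Int = 5_000_000) -> HTMLInputIssue? {
    let trimmed = html.trimmingCharacters(in: .whitespacesAndNewlines)
    if trimmed.isEmpty { return .empty }

    let byteCount = html.utf8.count
    if byteCount > maxBytes { return .tooLarge(bytes: byteCount) }

    // Text without a single "<" cannot contain any tags
    if !trimmed.contains("<") { return .noMarkup }
    return nil
}

/// Builds a URLSession with explicit timeouts for the fetch step.
func makeScrapingSession(requestTimeout: TimeInterval = 15,
                         resourceTimeout: TimeInterval = 60) -> URLSession {
    let configuration = URLSessionConfiguration.ephemeral
    configuration.timeoutIntervalForRequest = requestTimeout    // per-request idle timeout
    configuration.timeoutIntervalForResource = resourceTimeout  // whole-transfer limit
    return URLSession(configuration: configuration)
}
```

Rejecting empty or oversized input up front keeps obviously bad data out of the parser, and explicit timeouts turn a stalled download into an error you can handle.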

When building applications that process web content, similar error handling principles apply to other tools. For instance, when handling timeouts in Puppeteer, you'll encounter similar challenges with network delays and content loading issues, though the specific timeout mechanisms differ between iOS and JavaScript environments.

Testing Error Handling

Create comprehensive unit tests to verify your error handling works correctly:

import XCTest
@testable import YourApp

class SwiftSoupErrorHandlerTests: XCTestCase {

    func testMalformedHTMLHandling() {
        let malformedHTML = "<html><body><div><p>Unclosed tags"
        let handler = SwiftSoupErrorHandler.shared

        let document = handler.parseWithRetry(html: malformedHTML)
        XCTAssertNotNil(document, "Should handle malformed HTML gracefully")
    }

    func testInvalidSelectorHandling() {
        let html = "<html><body><div class='test'>Content</div></body></html>"

        do {
            let document = try SwiftSoup.parse(html)
            let handler = SwiftSoupErrorHandler.shared
            let elements = handler.safeSelect(document: document, selector: "invalid>>>selector")

            XCTAssertTrue(elements.isEmpty(), "Should return empty elements for invalid selector")
        } catch {
            XCTFail("Should not throw exception: \(error)")
        }
    }

    func testNetworkErrorHandling() {
        // SwiftSoup does not fetch URLs, so this exercises the Foundation fetch step.
        // The reserved ".invalid" top-level domain never resolves.
        let unreachableURL = URL(string: "http://nonexistent-domain-12345.invalid")!

        XCTAssertThrowsError(
            try String(contentsOf: unreachableURL, encoding: .utf8),
            "Fetching an unresolvable host should throw"
        )
    }

    func testEncodingErrorHandling() {
        let htmlWithBadEncoding = "<!DOCTYPE html><html><body>Test \0 content</body></html>"
        let handler = SwiftSoupErrorHandler.shared

        let document = handler.parseWithRetry(html: htmlWithBadEncoding)
        XCTAssertNotNil(document, "Should handle encoding issues gracefully")
    }
}

Performance Monitoring and Metrics

Implement monitoring to track parsing performance and error rates:

class SwiftSoupMetrics {
    static let shared = SwiftSoupMetrics()

    private var parseAttempts = 0
    private var parseFailures = 0
    private var totalParseTime: TimeInterval = 0

    private init() {}

    func recordParseAttempt<T>(operation: () throws -> T) rethrows -> T {
        parseAttempts += 1
        let startTime = CFAbsoluteTimeGetCurrent()

        defer {
            let endTime = CFAbsoluteTimeGetCurrent()
            totalParseTime += (endTime - startTime)
        }

        do {
            return try operation()
        } catch {
            parseFailures += 1
            throw error
        }
    }

    func getMetrics() -> (attempts: Int, failures: Int, averageTime: TimeInterval, successRate: Double) {
        let averageTime = parseAttempts > 0 ? totalParseTime / Double(parseAttempts) : 0
        let successRate = parseAttempts > 0 ? Double(parseAttempts - parseFailures) / Double(parseAttempts) : 0

        return (parseAttempts, parseFailures, averageTime, successRate)
    }

    func resetMetrics() {
        parseAttempts = 0
        parseFailures = 0
        totalParseTime = 0
    }
}

By implementing comprehensive error handling strategies, you can build robust iOS applications that gracefully handle the unpredictable nature of web content parsing. Remember to log errors appropriately, implement fallback mechanisms, and test your error handling thoroughly to ensure a smooth user experience even when parsing fails.
