How do I parse HTML fragments instead of complete documents with SwiftSoup?

SwiftSoup provides specialized methods for parsing HTML fragments rather than complete documents. This is particularly useful when working with partial HTML content, user-generated content, or when extracting specific portions of web pages in iOS applications.

Understanding HTML Fragments vs Complete Documents

An HTML fragment is a piece of HTML that lacks the complete document structure (the html, head, and body tags). Examples include:

Content from APIs or databases
User-generated HTML content
Partial HTML snippets
Email templates or content blocks

SwiftSoup parses a fragment as body content: parseBodyFragment() returns a full Document whose html, head, and body elements are created for you, with the parsed fragment placed inside body.
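As a rough illustration of where a fragment lands after parsing, the sketch below inspects the returned Document. The helper name describeFragment and its tuple labels are hypothetical and exist only for this example:

```swift
import SwiftSoup

// Hypothetical helper for illustration: reports where a parsed fragment ends up.
func describeFragment(_ html: String) throws -> (hasHead: Bool, bodyChildren: Int, bodyText: String) {
    let doc = try SwiftSoup.parseBodyFragment(html)
    let body = doc.body()

    // The fragment's top-level elements become children of <body>
    let childCount = body?.children().size() ?? 0
    let text = try body?.text() ?? ""

    // head() is non-nil because the document shell is created automatically
    return (hasHead: doc.head() != nil, bodyChildren: childCount, bodyText: text)
}

// Usage
do {
    let summary = try describeFragment("<p>Hello</p><p>World</p>")
    print(summary)
} catch {
    print("Error parsing fragment: \(error)")
}
```

Even though the input contains only two paragraphs, the result is a regular Document, so the usual select(), head(), and body() calls work on it.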

Basic Fragment Parsing

Using parseBodyFragment()

The primary method for parsing HTML fragments in SwiftSoup is parseBodyFragment():

import SwiftSoup

// Parse a simple HTML fragment
let htmlFragment = "<div class='content'><p>Hello World</p><span>Test</span></div>"

do {
    let doc = try SwiftSoup.parseBodyFragment(htmlFragment)
    let body = doc.body()

    // Extract content
    let content = try body?.select("div.content")
    print(try content?.text() ?? "No content found")
    // Output: Hello World Test

} catch {
    print("Error parsing fragment: \(error)")
}

Parsing with Base URI

When parsing fragments that contain relative URLs, specify a base URI:

let htmlFragment = """
<div>
    <img src="/images/logo.png" alt="Logo">
    <a href="/about">About Us</a>
</div>
"""

let baseUri = "https://example.com"

do {
    let doc = try SwiftSoup.parseBodyFragment(htmlFragment, baseUri)

    // Get absolute URLs
    let images = try doc.select("img")
    for img in images {
        let absoluteSrc = try img.absUrl("src")
        print("Image URL: \(absoluteSrc)")
        // Output: Image URL: https://example.com/images/logo.png
    }

} catch {
    print("Error: \(error)")
}

Advanced Fragment Parsing Techniques

Parsing Multiple Fragments

When working with multiple HTML fragments, you can combine them or process them individually:

let fragments = [
    "<div class='item'>Item 1</div>",
    "<div class='item'>Item 2</div>",
    "<div class='item'>Item 3</div>"
]

var allItems: [Element] = []

for fragment in fragments {
    do {
        let doc = try SwiftSoup.parseBodyFragment(fragment)
        let items = try doc.select("div.item")
        allItems.append(contentsOf: items)
    } catch {
        print("Error parsing fragment: \(error)")
    }
}

print("Total items parsed: \(allItems.count)")

Fragment Parsing with Custom Settings

You can wrap fragment parsing in a helper that also applies specific output settings:

import SwiftSoup

func parseFragmentWithCustomSettings(_ html: String) throws -> Document {
    // Parse as fragment
    let doc = try SwiftSoup.parseBodyFragment(html)

    // Normalize the document (normalise() can throw)
    try doc.normalise()

    // Set output settings (these setters take labeled arguments)
    doc.outputSettings()
        .prettyPrint(pretty: true)
        .indentAmount(indentAmount: 2)

    return doc
}

// Usage
let htmlFragment = "<div><p>Unformatted content</p></div>"

do {
    let doc = try parseFragmentWithCustomSettings(htmlFragment)
    let prettyHtml = try doc.html()
    print(prettyHtml)
} catch {
    print("Error: \(error)")
}

Working with Fragment Content

Extracting Data from Fragments

Here's how to extract specific data from HTML fragments:

let productFragment = """
<div class="product" data-id="123">
    <h3 class="title">iPhone 15</h3>
    <span class="price">$999</span>
    <div class="description">
        <p>Latest iPhone with advanced features</p>
        <ul class="features">
            <li>A17 Pro chip</li>
            <li>48MP camera</li>
            <li>USB-C</li>
        </ul>
    </div>
</div>
"""

do {
    let doc = try SwiftSoup.parseBodyFragment(productFragment)

    // Extract product details
    let productId = try doc.select("div.product").first()?.attr("data-id") ?? ""
    let title = try doc.select("h3.title").text()
    let price = try doc.select("span.price").text()
    let features = try doc.select("ul.features li").map { try $0.text() }

    print("Product ID: \(productId)")
    print("Title: \(title)")
    print("Price: \(price)")
    print("Features: \(features)")

} catch {
    print("Error extracting data: \(error)")
}

Modifying Fragment Content

SwiftSoup allows you to modify parsed fragments before using them:

let htmlFragment = """
<div class="content">
    <p>Original content</p>
    <img src="old-image.jpg" alt="Old Image">
</div>
"""

do {
    let doc = try SwiftSoup.parseBodyFragment(htmlFragment)

    // Modify content
    try doc.select("p").first()?.text("Updated content")
    try doc.select("img").first()?.attr("src", "new-image.jpg")
    try doc.select("img").first()?.attr("alt", "New Image")

    // Add new elements
    let newDiv = try doc.createElement("div")
    try newDiv.attr("class", "footer")
    try newDiv.text("Added footer content")
    try doc.body()?.appendChild(newDiv)

    // Get modified HTML
    let modifiedHtml = try doc.body()?.html() ?? ""
    print(modifiedHtml)

} catch {
    print("Error modifying fragment: \(error)")
}

Best Practices for Fragment Parsing

Handling Malformed Fragments

SwiftSoup automatically corrects malformed HTML, but you should validate your fragments:

func parseAndValidateFragment(_ html: String) -> Document? {
    do {
        let doc = try SwiftSoup.parseBodyFragment(html)

        // Validate structure
        guard let body = doc.body() else {
            print("Warning: Fragment produced empty body")
            return nil
        }

        // Check that the corrected fragment still contains some content
        if body.getChildNodes().isEmpty {
            print("Warning: Fragment contains no nodes after parsing")
        }

        return doc

    } catch {
        print("Failed to parse fragment: \(error)")
        return nil
    }
}

// Test with malformed HTML
let malformedFragment = "<div><p>Unclosed paragraph<span>Nested content</div>"
if parseAndValidateFragment(malformedFragment) != nil {
    print("Successfully parsed and corrected malformed fragment")
}

Performance Considerations

When parsing many fragments, a small wrapper keeps batch parsing in one place. The wrapper does not store a Parser, because the SwiftSoup.parseBodyFragment() call it makes does not take one:

class FragmentParser {
    func parseFragment(_ html: String, baseUri: String = "") throws -> Document {
        return try SwiftSoup.parseBodyFragment(html, baseUri)
    }

    func parseBatch(_ fragments: [String]) -> [Document] {
        return fragments.compactMap { fragment in
            try? parseFragment(fragment)
        }
    }
}

// Usage
let parser = FragmentParser()
let fragments = ["<div>Fragment 1</div>", "<div>Fragment 2</div>"]
let documents = parser.parseBatch(fragments)

Error Handling and Debugging

Comprehensive Error Handling

enum FragmentParsingError: Error {
    case emptyFragment
    case parsingFailed(String)
    case invalidStructure
}

func robustFragmentParser(_ html: String) throws -> Document {
    guard !html.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty else {
        throw FragmentParsingError.emptyFragment
    }

    let doc: Document
    do {
        doc = try SwiftSoup.parseBodyFragment(html)
    } catch Exception.Error(_, let message) {
        // SwiftSoup reports its own failures as Exception.Error
        throw FragmentParsingError.parsingFailed(message)
    } catch {
        throw FragmentParsingError.parsingFailed(error.localizedDescription)
    }

    // Validate outside the do/catch so invalidStructure is not rewrapped
    // as parsingFailed by the catch-all clause
    guard let body = doc.body(), body.children().size() > 0 else {
        throw FragmentParsingError.invalidStructure
    }

    return doc
}

Integration with iOS Applications

Using Fragments in Table Views

import UIKit
import WebKit
import SwiftSoup

class HTMLFragmentTableViewCell: UITableViewCell {
    @IBOutlet weak var webView: WKWebView!

    func configure(with fragment: String) {
        do {
            let doc = try SwiftSoup.parseBodyFragment(fragment)

            // Add CSS styling
            let head = doc.head()
            let style = try doc.createElement("style")
            try style.html("""
                body { font-family: -apple-system; margin: 10px; }
                .content { line-height: 1.4; }
            """)
            try head?.appendChild(style)

            let fullHtml = try doc.outerHtml()
            webView.loadHTMLString(fullHtml, baseURL: nil)

        } catch {
            print("Error configuring cell: \(error)")
        }
    }
}

Processing Fragment Collections

When working with collections of fragments, such as from RSS feeds or API responses:

struct ContentProcessor {
    func processFragmentCollection(_ fragments: [String]) -> [ProcessedContent] {
        return fragments.compactMap { fragment in
            do {
                let doc = try SwiftSoup.parseBodyFragment(fragment)

                // Extract standardized data
                let title = try doc.select("h1, h2, h3").first()?.text() ?? ""
                let text = try doc.select("p").text()
                let images = try doc.select("img").map { try $0.attr("src") }

                return ProcessedContent(title: title, text: text, images: images)

            } catch {
                print("Failed to process fragment: \(error)")
                return nil
            }
        }
    }
}

struct ProcessedContent {
    let title: String
    let text: String
    let images: [String]
}

Security Considerations

Sanitizing User-Generated Fragments

When dealing with user-generated HTML fragments, always sanitize the content:

func sanitizeFragment(_ html: String) -> String? {
    do {
        let doc = try SwiftSoup.parseBodyFragment(html)

        // Remove potentially dangerous tags
        try doc.select("script, iframe, object, embed").remove()

        // Remove JavaScript event handlers (onclick, onload, ...)
        let elements = try doc.select("*")
        for element in elements {
            // getAttributes() is optional; asList() returns a copy,
            // so removing attributes while looping is safe
            for attr in element.getAttributes()?.asList() ?? [] {
                if attr.getKey().lowercased().hasPrefix("on") {
                    try element.removeAttr(attr.getKey())
                }
            }
        }

        // This deny-list is a simplified example. It does not restrict the result
        // to an allow-list of tags (p, div, span, strong, em, ul, ol, li, h1-h6)
        // or attributes (class, id) - consider using a proper HTML sanitizer
        return try doc.body()?.html()

    } catch {
        print("Error sanitizing fragment: \(error)")
        return nil
    }
}
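One candidate for the "proper HTML sanitizer" mentioned in the comment above is SwiftSoup's own allow-list cleaner. The following is a minimal sketch, assuming the SwiftSoup.clean() function and the Whitelist.basic() preset as they appear in the SwiftSoup README; the helper name cleanWithAllowList is hypothetical:

```swift
import SwiftSoup

// Hypothetical helper: runs a fragment through SwiftSoup's allow-list cleaner.
// Whitelist.basic() keeps simple text-formatting tags and drops everything else,
// including script elements and event-handler attributes such as onclick.
func cleanWithAllowList(_ html: String) -> String? {
    do {
        return try SwiftSoup.clean(html, Whitelist.basic())
    } catch {
        print("Error cleaning fragment: \(error)")
        return nil
    }
}

// Usage
let unsafe = "<p onclick=\"steal()\">Hi<script>alert(1)</script></p>"
if let safe = cleanWithAllowList(unsafe) {
    print(safe)
}
```

An allow-list starts from "nothing is permitted" and only lets known-safe tags and attributes through, which tends to be easier to reason about than removing a list of known-dangerous ones.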

Comparison with Other Parsing Methods

Fragment parsing with parseBodyFragment() has several useful properties, some of which it shares with full-document parsing:

Automatic wrapping: Fragments are automatically wrapped in proper HTML structure
Context preservation: Maintains proper DOM relationships
Error correction: Automatically fixes unclosed tags and malformed HTML
Base URI support: Resolves relative URLs when provided
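The error-correction point in the list above can be sketched with a fragment whose paragraphs are never closed. The helper name paragraphTexts is hypothetical and used only for illustration:

```swift
import SwiftSoup

// Hypothetical helper: returns the text of each <p> element the parser produces.
func paragraphTexts(in html: String) throws -> [String] {
    let doc = try SwiftSoup.parseBodyFragment(html)
    return try doc.select("p").array().map { try $0.text() }
}

// Usage: neither <p> is closed, yet the parser yields two sibling paragraphs
do {
    let texts = try paragraphTexts(in: "<p>First<p>Second")
    print(texts)
} catch {
    print("Error parsing fragment: \(error)")
}
```

A new p start tag implicitly closes the previous paragraph, so the missing end tags do not cause nested or lost content.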

Fragment parsing is essential when working with partial HTML content in iOS development. It ensures your content is properly structured and ready for display or further processing.

Common Use Cases

Processing Rich Text Content

func processRichTextFragment(_ html: String) -> NSAttributedString? {
    do {
        let doc = try SwiftSoup.parseBodyFragment(html)

        // Convert to attributed string for display in UITextView.
        // Call this on the main thread: the HTML importer is not safe on background threads.
        let htmlData = try doc.html().data(using: .utf8)

        return try NSAttributedString(
            data: htmlData ?? Data(),
            options: [.documentType: NSAttributedString.DocumentType.html,
                     .characterEncoding: String.Encoding.utf8.rawValue],
            documentAttributes: nil
        )

    } catch {
        print("Error processing rich text: \(error)")
        return nil
    }
}

Fragment-Based Template System

class TemplateProcessor {
    func processTemplate(_ template: String, with data: [String: String]) -> String? {
        do {
            let doc = try SwiftSoup.parseBodyFragment(template)

            // Replace template variables
            for (key, value) in data {
                let selector = "[data-template='\(key)']"
                let elements = try doc.select(selector)
                for element in elements {
                    try element.text(value)
                }
            }

            return try doc.body()?.html()

        } catch {
            print("Error processing template: \(error)")
            return nil
        }
    }
}

// Usage
let template = "<div><span data-template='username'>{{username}}</span></div>"
let processor = TemplateProcessor()
let result = processor.processTemplate(template, with: ["username": "John Doe"])

Understanding the difference between fragment and document parsing matters when building iOS applications that handle user-generated content or API responses containing partial HTML. Parsing such content with parseBodyFragment() gives you a predictable DOM that you can query, modify, and sanitize before display.
