
How do I parse HTML with custom or unknown tags using SwiftSoup?

When scraping or parsing HTML in iOS applications, you'll often encounter documents containing custom tags, XML namespaces, or non-standard HTML elements. SwiftSoup, a Swift port of the Java library jsoup, can handle these cases. This guide shows how to parse HTML with custom or unknown tags using SwiftSoup.

Understanding Custom and Unknown Tags

Custom tags can appear in various forms:

  • Web Components: Custom HTML elements like <my-component>, <user-card>, or <data-widget>
  • XML Namespaces: Elements with prefixes like <fb:like>, <og:image>, or <custom:element>
  • Non-standard HTML: Proprietary tags used by specific platforms or applications
  • Malformed HTML: Tags with unusual structures or naming conventions

SwiftSoup handles these situations gracefully by treating unknown tags as regular elements, making them fully accessible through its parsing API.
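
As a minimal sketch of that behavior, an unknown tag can be selected and read like any built-in element. The `<my-component>` markup and the `describeFirstElement` helper are illustrative assumptions, not part of SwiftSoup:

```swift
import SwiftSoup

/// Hypothetical helper: returns the tag name, one attribute value, and the
/// text of the first element matching `tag`, or nil when nothing matches.
func describeFirstElement(
    in html: String,
    tag: String,
    attribute: String
) throws -> (name: String, attribute: String, text: String)? {
    let document = try SwiftSoup.parse(html)
    guard let element = try document.select(tag).first() else { return nil }
    return try (element.tagName(), element.attr(attribute), element.text())
}

do {
    let html = "<my-component mode=\"compact\">Hello</my-component>"
    if let info = try describeFirstElement(in: html, tag: "my-component", attribute: "mode") {
        print(info.name, info.attribute, info.text)
    }
} catch {
    print("Parsing failed: \(error)")
}
```

No registration step is involved: the custom tag is addressed with the same `select`, `attr`, and `text` calls used for standard elements.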

Basic Setup and Installation

First, ensure you have SwiftSoup installed in your iOS project. Add it to your Package.swift or through Xcode's Package Manager:

dependencies: [
    .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.4.3")
]
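
For context, a complete manifest might look like the sketch below. The package and target name `TagScraper`, the tools version, and the iOS 13 platform setting are assumptions chosen for illustration; only the dependency line comes from the snippet above.

```swift
// swift-tools-version:5.5
import PackageDescription

let package = Package(
    name: "TagScraper", // hypothetical package name
    platforms: [.iOS(.v13)], // assumed minimum platform; adjust to your project
    dependencies: [
        .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.4.3")
    ],
    targets: [
        .target(
            name: "TagScraper", // hypothetical target name
            dependencies: [
                // Link the SwiftSoup library product into this target
                .product(name: "SwiftSoup", package: "SwiftSoup")
            ]
        )
    ]
)
```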

Import SwiftSoup in your Swift file:

import SwiftSoup

Parsing Custom Tags

Simple Custom Tag Parsing

Here's how to parse HTML containing custom tags:

import SwiftSoup

func parseCustomTags() {
    let htmlContent = """
    <html>
    <body>
        <user-profile id="123">
            <user-name>John Doe</user-name>
            <user-email>john@example.com</user-email>
            <custom-data type="preferences">
                <theme>dark</theme>
                <language>en-US</language>
            </custom-data>
        </user-profile>
        <widget-container>
            <data-widget source="api" refresh="5000">
                <widget-title>Live Stats</widget-title>
                <widget-content>Loading...</widget-content>
            </data-widget>
        </widget-container>
    </body>
    </html>
    """

    do {
        let document = try SwiftSoup.parse(htmlContent)

        // Extract data from custom tags
        let userProfile = try document.select("user-profile").first()
        let userName = try userProfile?.select("user-name").text()
        let userEmail = try userProfile?.select("user-email").text()

        print("User Name: \(userName ?? "N/A")")
        print("User Email: \(userEmail ?? "N/A")")

        // Access custom attributes
        let userId = try userProfile?.attr("id")
        print("User ID: \(userId ?? "N/A")")

        // Parse nested custom elements
        let customData = try userProfile?.select("custom-data").first()
        let theme = try customData?.select("theme").text()
        let language = try customData?.select("language").text()

        print("Theme: \(theme ?? "N/A")")
        print("Language: \(language ?? "N/A")")

    } catch {
        print("Error parsing HTML: \(error)")
    }
}
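
If the extracted values feed the rest of an app, it may be convenient to map the custom tags onto a typed value instead of printing them. The following is a sketch under that assumption; the `UserProfile` struct and `parseUserProfile` function are hypothetical names introduced here.

```swift
import SwiftSoup

/// Hypothetical model for the <user-profile> markup shown above.
struct UserProfile: Equatable {
    let id: String
    let name: String
    let email: String
}

/// Returns the first <user-profile> in the document, or nil if none exists.
func parseUserProfile(from html: String) throws -> UserProfile? {
    let document = try SwiftSoup.parse(html)
    guard let profile = try document.select("user-profile").first() else {
        return nil
    }
    // attr(_:) and text() return an empty string when the attribute or
    // child element is missing, so the fields are never nil.
    return UserProfile(
        id: try profile.attr("id"),
        name: try profile.select("user-name").text(),
        email: try profile.select("user-email").text()
    )
}
```

Returning an optional keeps "no profile in the document" distinct from a parsing error, which is still reported through `throws`.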

Handling XML Namespaces

SwiftSoup can also select namespace-prefixed tags such as <og:title> in HTML documents; the prefix is kept as part of the tag name:

func parseNamespacedTags() {
    let htmlWithNamespaces = """
    <html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
    <head>
        <og:title>Custom Page Title</og:title>
        <og:description>Page description for social sharing</og:description>
        <og:image>https://example.com/image.jpg</og:image>
        <fb:app_id>123456789</fb:app_id>
    </head>
    <body>
        <fb:like href="https://example.com" width="300" layout="standard"></fb:like>
        <custom:widget type="analytics">
            <custom:metric name="views">1234</custom:metric>
            <custom:metric name="clicks">56</custom:metric>
        </custom:widget>
    </body>
    </html>
    """

    do {
        let document = try SwiftSoup.parse(htmlWithNamespaces)

        // Parse Open Graph tags. In selectors, a namespaced tag is written
        // as prefix|name, so <og:title> is matched by "og|title".
        let ogTitle = try document.select("og|title").text()
        let ogDescription = try document.select("og|description").text()
        let ogImage = try document.select("og|image").text()

        print("OG Title: \(ogTitle)")
        print("OG Description: \(ogDescription)")
        print("OG Image: \(ogImage)")

        // Parse Facebook tags
        let fbAppId = try document.select("fb|app_id").text()
        let fbLike = try document.select("fb|like").first()
        let likeUrl = try fbLike?.attr("href")

        print("FB App ID: \(fbAppId)")
        print("Like URL: \(likeUrl ?? "N/A")")

        // Parse custom namespaced elements
        let metrics = try document.select("custom|metric")
        for metric in metrics {
            let name = try metric.attr("name")
            let value = try metric.text()
            print("Metric \(name): \(value)")
        }

    } catch {
        print("Error parsing namespaced HTML: \(error)")
    }
}
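
jsoup-style selectors, which SwiftSoup follows, write a namespaced tag as `prefix|name` rather than `prefix:name`. When tag names arrive as data (for example from a configuration file), a small conversion helper can build the selector. The `namespacedSelector` function below is a hypothetical convenience, not a SwiftSoup API:

```swift
import Foundation

/// Hypothetical helper: turns a qualified tag name such as "og:title"
/// into the selector form "og|title". Names without a prefix are
/// returned unchanged.
func namespacedSelector(for qualifiedName: String) -> String {
    qualifiedName.replacingOccurrences(of: ":", with: "|")
}

let selectors = ["og:title", "fb:like", "div"].map(namespacedSelector(for:))
print(selectors)
```

The resulting strings can be passed directly to `document.select(_:)`.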

Advanced Custom Tag Handling

Dynamic Tag Discovery

Sometimes you need to discover all custom tags in a document without knowing their names beforehand:

func discoverCustomTags() {
    let htmlContent = """
    <div>
        <standard-tag>Regular content</standard-tag>
        <unknown-element data-type="mystery">Mystery content</unknown-element>
        <xyz-component>Component content</xyz-component>
        <legacy-widget status="active">Legacy content</legacy-widget>
    </div>
    """

    do {
        let document = try SwiftSoup.parse(htmlContent)
        let allElements = try document.select("*")

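        var customTags: Set<String> = []
        // Partial list of standard tags; extend it to suit your documents
        let standardTags: Set<String> = ["html", "head", "body", "div", "span", "p", "a", "img", "h1", "h2", "h3", "h4", "h5", "h6"]

        for element in allElements {
            let tagName = element.tagName().lowercased()

            // Skip pseudo-nodes such as the document root, whose name starts with "#"
            if tagName.hasPrefix("#") { continue }

            // Identify custom tags (containing hyphens or not in the standard list)
            if tagName.contains("-") || !standardTags.contains(tagName) {
                customTags.insert(tagName)
            }
        }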

        print("Discovered custom tags: \(Array(customTags).sorted())")

        // Process each custom tag type
        for tagName in customTags {
            let elements = try document.select(tagName)
            print("\nFound \(elements.count) \(tagName) element(s):")

            for element in elements {
                let content = try element.text()
                // getAttributes() is optional; render it as an attribute string
                let attributes = try element.getAttributes()?.html() ?? ""
                print("  Content: \(content)")
                print("  Attributes: \(attributes)")
            }
        }

    } catch {
        print("Error discovering custom tags: \(error)")
    }
}
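
The hyphen check above can be tightened. Per the HTML Standard, a valid custom element name starts with a lowercase ASCII letter, contains a hyphen, and is not one of a few reserved hyphenated names inherited from SVG and MathML. The sketch below applies those three rules; `isLikelyCustomElement` is a hypothetical helper, and it lowercases its input on the assumption that the HTML parser has already normalized tag names that way.

```swift
/// Hyphenated names reserved by the HTML Standard; these are not custom elements.
let reservedHyphenatedNames: Set<String> = [
    "annotation-xml", "color-profile", "font-face", "font-face-src",
    "font-face-uri", "font-face-format", "font-face-name", "missing-glyph"
]

/// Hypothetical helper: a simplified check based on the custom element
/// naming rules (ASCII letter first, at least one hyphen, not reserved).
func isLikelyCustomElement(_ tagName: String) -> Bool {
    let name = tagName.lowercased()
    guard let first = name.first, first.isASCII, first.isLetter else {
        return false
    }
    return name.contains("-") && !reservedHyphenatedNames.contains(name)
}

let names = ["user-card", "div", "font-face", "xyz-component"]
print(names.filter(isLikelyCustomElement))
```

This is stricter than "not in a list of standard tags", so proprietary tags without a hyphen still need the list-based check.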

Handling Malformed Custom Tags

SwiftSoup is quite forgiving with malformed HTML, but you might need special handling for edge cases:

func handleMalformedTags() {
    let malformedHtml = """
    <div>
        <unclosed-tag>Content without closing tag
        <self-closing-custom />
        <123-invalid-start>Numeric start</123-invalid-start>
        <valid-tag attribute-without-value>Valid content</valid-tag>
        <UPPERCASE-TAG>Mixed case content</UPPERCASE-TAG>
    </div>
    """

    do {
        let document = try SwiftSoup.parse(malformedHtml)

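        // SwiftSoup closes unclosed tags automatically. The elements that follow
        // become children of <unclosed-tag>, so its text includes their content too.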
        let unclosedTag = try document.select("unclosed-tag").first()
        if let tag = unclosedTag {
            print("Unclosed tag content: \(try tag.text())")
        }

        // Handle self-closing custom tags
        let selfClosing = try document.select("self-closing-custom")
        print("Self-closing tags found: \(selfClosing.count)")

        // Case-insensitive selection
        let uppercaseTag = try document.select("uppercase-tag").first()
        if let tag = uppercaseTag {
            print("Uppercase tag content: \(try tag.text())")
        }

        // Extract attributes even from malformed tags
        let validTag = try document.select("valid-tag").first()
        if let tag = validTag {
            let hasAttribute = tag.hasAttr("attribute-without-value")
            print("Has attribute without value: \(hasAttribute)")
        }

    } catch {
        print("Error handling malformed tags: \(error)")
    }
}
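
When input quality is uncertain, it can help to check what survived parsing before extracting data. The sketch below assumes the default HTML parser settings, under which tag names are lowercased, so selectors should use the lowercase form. `containsElement` is a hypothetical helper name.

```swift
import SwiftSoup

/// Hypothetical helper: true when at least one element with the given
/// tag name exists in the parsed document.
func containsElement(named tag: String, in html: String) throws -> Bool {
    let document = try SwiftSoup.parse(html)
    return try document.select(tag).first() != nil
}

do {
    let html = "<div><UPPERCASE-TAG>Mixed case</UPPERCASE-TAG><self-closing-custom /></div>"
    print(try containsElement(named: "uppercase-tag", in: html))
    print(try containsElement(named: "self-closing-custom", in: html))
} catch {
    print("Parsing failed: \(error)")
}
```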

Integration with Web Scraping Workflows

When scraping modern web applications, custom tags often contain valuable data. Here's how to integrate custom tag parsing into a comprehensive scraping workflow:

class CustomTagScraper {
    private let document: Document

    init(html: String) throws {
        self.document = try SwiftSoup.parse(html)
    }

    func extractWebComponents() throws -> [String: Any] {
        var results: [String: Any] = [:]

        // Extract React/Vue component data
        let reactComponents = try document.select("[data-reactroot] *")
        var componentData: [[String: String]] = []

        for component in reactComponents {
            if component.tagName().contains("-") {
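                // getAttributes() is optional; flatten it into "key=value" pairs
                let attributeSummary = (component.getAttributes()?.asList() ?? [])
                    .map { "\($0.getKey())=\($0.getValue())" }
                    .joined(separator: ", ")
                let data: [String: String] = [
                    "tagName": component.tagName(),
                    "content": try component.text(),
                    "attributes": attributeSummary
                ]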
                componentData.append(data)
            }
        }
        results["webComponents"] = componentData

        // Extract microdata
        let microdataItems = try document.select("[itemscope]")
        var microdata: [[String: String]] = []

        for item in microdataItems {
            let itemType = try item.attr("itemtype")
            let properties = try item.select("[itemprop]")

            var itemData: [String: String] = ["itemtype": itemType]
            for property in properties {
                let propName = try property.attr("itemprop")
                let propValue = try property.text()
                itemData[propName] = propValue
            }
            microdata.append(itemData)
        }
        results["microdata"] = microdata

        return results
    }

    func extractCustomAttributes() throws -> [String: [String]] {
        var customAttributes: [String: [String]] = [:]

        // "[^data-]" matches elements with an attribute name starting with "data-"
        let elementsWithDataAttrs = try document.select("[^data-]")

        for element in elementsWithDataAttrs {
            // getAttributes() is optional, so skip elements without attributes
            guard let attributes = element.getAttributes() else { continue }

            for attribute in attributes.asList() {
                let key = attribute.getKey()
                if key.hasPrefix("data-") {
                    customAttributes[key, default: []].append(attribute.getValue())
                }
            }
        }

        return customAttributes
    }
}

// Usage example
func scrapeWithCustomTags() {
    let html = """
    <div data-reactroot="">
        <user-card data-user-id="123" data-premium="true">
            <h2 itemprop="name">Jane Smith</h2>
            <span itemprop="jobTitle">Software Engineer</span>
        </user-card>
        <stats-widget data-source="analytics" data-refresh-rate="30">
            <metric-display type="views">15,234</metric-display>
            <metric-display type="conversions">1,234</metric-display>
        </stats-widget>
    </div>
    """

    do {
        let scraper = try CustomTagScraper(html: html)

        let webComponents = try scraper.extractWebComponents()
        print("Web Components: \(webComponents)")

        let customAttributes = try scraper.extractCustomAttributes()
        print("Custom Attributes: \(customAttributes)")

    } catch {
        print("Scraping error: \(error)")
    }
}

Best Practices and Tips

Error Handling and Validation

Always implement robust error handling when working with custom tags:

extension Document {
    func safeSelect(_ selector: String) -> Elements? {
        do {
            return try self.select(selector)
        } catch {
            print("Invalid selector '\(selector)': \(error)")
            return nil
        }
    }
}

func safeCustomTagParsing() {
    let html = "<custom:tag>Content</custom:tag>"

    do {
        let document = try SwiftSoup.parse(html)

        // Safe selection with error handling (namespaced tags use prefix|name)
        if let elements = document.safeSelect("custom|tag") {
            for element in elements {
                let content = try? element.text()
                print("Content: \(content ?? "Unable to extract")")
            }
        }

    } catch {
        print("Parsing error: \(error)")
    }
}

Performance Considerations

When dealing with large documents containing many custom tags:

  1. Use specific selectors: Instead of select("*"), use targeted selectors
  2. Cache commonly used elements: Store frequently accessed elements in variables
  3. Process in batches: For large datasets, process elements in smaller batches
  4. Consider streaming: For very large documents, consider streaming parsing approaches
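
The batching tip can be sketched with a small generic helper. The `batches(of:size:)` function is a hypothetical name, and the batch size is something to tune for your workload rather than a recommended value.

```swift
/// Hypothetical helper: splits `items` into consecutive arrays of at
/// most `size` elements, preserving order.
func batches<T>(of items: [T], size: Int) -> [[T]] {
    precondition(size > 0, "Batch size must be positive")
    return stride(from: 0, to: items.count, by: size).map { start in
        Array(items[start..<min(start + size, items.count)])
    }
}

// Example with plain values; the last batch holds the remainder.
for batch in batches(of: Array(1...5), size: 2) {
    print(batch)
}
```

With SwiftSoup, the input could be the array form of a selection, for example `try document.select("data-widget").array()`, so each batch can be processed and released before the next one starts.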

Common Use Cases

SwiftSoup's custom tag parsing capabilities are particularly useful when working with:

  • Single Page Applications (SPAs): Modern frameworks often use custom elements
  • XML-based APIs: Many APIs return XML with custom namespaces
  • Legacy HTML: Older websites may use proprietary tags
  • Web Components: Modern web development increasingly uses custom elements
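
For the XML case, SwiftSoup can be told to use its XML parser instead of the HTML parser, which avoids HTML-specific restructuring such as wrapping content in html and body elements. The sketch below assumes a small XML payload with a `custom` prefix; the `metricValues` function and the sample markup are illustrative.

```swift
import SwiftSoup

/// Hypothetical helper: collects <custom:metric name="..."> values from an
/// XML string into a dictionary keyed by the name attribute.
func metricValues(fromXML xml: String) throws -> [String: String] {
    // Parser.xmlParser() parses the input as XML rather than HTML
    let document = try SwiftSoup.parse(xml, "", Parser.xmlParser())
    var values: [String: String] = [:]
    for metric in try document.select("custom|metric") {
        let name = try metric.attr("name")
        values[name] = try metric.text()
    }
    return values
}

do {
    let xml = """
    <custom:widget xmlns:custom="urn:example">
        <custom:metric name="views">1234</custom:metric>
        <custom:metric name="clicks">56</custom:metric>
    </custom:widget>
    """
    print(try metricValues(fromXML: xml))
} catch {
    print("XML parsing failed: \(error)")
}
```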

For complex JavaScript-heavy applications that require dynamic content loading, you might also want to explore how to handle browser sessions in Puppeteer for more advanced scraping scenarios.

Conclusion

SwiftSoup provides excellent support for parsing HTML with custom or unknown tags, making it an ideal choice for iOS developers working on web scraping projects. Its flexible parsing engine handles various edge cases gracefully while providing a clean, Swift-friendly API for extracting data from complex HTML structures.

Whether you're working with modern web components, XML namespaces, or legacy HTML with proprietary tags, SwiftSoup's robust parsing capabilities ensure your iOS applications can effectively extract the data they need. Remember to always implement proper error handling and consider performance implications when working with large documents containing numerous custom elements.

For additional web scraping challenges involving dynamic content, consider exploring how to handle AJAX requests using Puppeteer for scenarios where client-side rendering is involved.
