How do I parse HTML with custom or unknown tags using SwiftSoup?
When scraping or parsing HTML in iOS applications, you'll often encounter documents containing custom tags, namespace-prefixed tags, or other non-standard elements. SwiftSoup, a Swift port of the Java library jsoup, keeps such elements in the parsed tree instead of discarding them. This guide shows how to parse HTML with custom or unknown tags using SwiftSoup.
Understanding Custom and Unknown Tags
Custom tags can appear in various forms:
- Web Components: Custom HTML elements like <my-component>, <user-card>, or <data-widget>
- XML Namespaces: Elements with prefixes like <fb:like>, <og:image>, or <custom:element>
- Non-standard HTML: Proprietary tags used by specific platforms or applications
- Malformed HTML: Tags with unusual structures or naming conventions
SwiftSoup handles these situations gracefully by treating unknown tags as regular elements, making them fully accessible through its parsing API.
Basic Setup and Installation
First, ensure you have SwiftSoup installed in your iOS project. Add it to your Package.swift or through Xcode's Package Manager:
dependencies: [
    .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.4.3")
]
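For context, the dependency line above sits inside a full package manifest. The following is a minimal sketch, assuming a hypothetical package and target named MyScraper and a tools version of 5.5; adjust both to your project.

```swift
// swift-tools-version:5.5
import PackageDescription

// "MyScraper" is a placeholder name used only for illustration.
let package = Package(
    name: "MyScraper",
    dependencies: [
        // Same dependency declaration as shown in the article
        .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.4.3")
    ],
    targets: [
        .target(
            name: "MyScraper",
            // Link the SwiftSoup library product into the target
            dependencies: [.product(name: "SwiftSoup", package: "SwiftSoup")]
        )
    ]
)
```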
Import SwiftSoup in your Swift file:
import SwiftSoup
Parsing Custom Tags
Simple Custom Tag Parsing
Here's how to parse HTML containing custom tags:
import SwiftSoup

func parseCustomTags() {
    let htmlContent = """
    <html>
    <body>
        <user-profile id="123">
            <user-name>John Doe</user-name>
            <user-email>john@example.com</user-email>
            <custom-data type="preferences">
                <theme>dark</theme>
                <language>en-US</language>
            </custom-data>
        </user-profile>
        <widget-container>
            <data-widget source="api" refresh="5000">
                <widget-title>Live Stats</widget-title>
                <widget-content>Loading...</widget-content>
            </data-widget>
        </widget-container>
    </body>
    </html>
    """

    do {
        let document = try SwiftSoup.parse(htmlContent)

        // Extract data from custom tags
        let userProfile = try document.select("user-profile").first()
        // select(_:) returns a non-optional Elements, so only userProfile needs "?"
        let userName = try userProfile?.select("user-name").text()
        let userEmail = try userProfile?.select("user-email").text()
        print("User Name: \(userName ?? "N/A")")
        print("User Email: \(userEmail ?? "N/A")")

        // Access custom attributes
        let userId = try userProfile?.attr("id")
        print("User ID: \(userId ?? "N/A")")

        // Parse nested custom elements
        let customData = try userProfile?.select("custom-data").first()
        let theme = try customData?.select("theme").text()
        let language = try customData?.select("language").text()
        print("Theme: \(theme ?? "N/A")")
        print("Language: \(language ?? "N/A")")
    } catch {
        print("Error parsing HTML: \(error)")
    }
}
Handling XML Namespaces
SwiftSoup's HTML parser does not resolve XML namespaces; it keeps the prefix as part of the tag name (for example og:title). The documented selector form for such tags is ns|tag, so <og:title> is selected with og|title:
func parseNamespacedTags() {
    let htmlWithNamespaces = """
    <html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
    <head>
        <og:title>Custom Page Title</og:title>
        <og:description>Page description for social sharing</og:description>
        <og:image>https://example.com/image.jpg</og:image>
        <fb:app_id>123456789</fb:app_id>
    </head>
    <body>
        <fb:like href="https://example.com" width="300" layout="standard"></fb:like>
        <custom:widget type="analytics">
            <custom:metric name="views">1234</custom:metric>
            <custom:metric name="clicks">56</custom:metric>
        </custom:widget>
    </body>
    </html>
    """

    do {
        let document = try SwiftSoup.parse(htmlWithNamespaces)

        // Parse Open Graph tags ("og|title" matches <og:title>)
        let ogTitle = try document.select("og|title").text()
        let ogDescription = try document.select("og|description").text()
        let ogImage = try document.select("og|image").text()
        print("OG Title: \(ogTitle)")
        print("OG Description: \(ogDescription)")
        print("OG Image: \(ogImage)")

        // Parse Facebook tags
        let fbAppId = try document.select("fb|app_id").text()
        let fbLike = try document.select("fb|like").first()
        let likeUrl = try fbLike?.attr("href")
        print("FB App ID: \(fbAppId)")
        print("Like URL: \(likeUrl ?? "N/A")")

        // Parse custom namespaced elements
        let metrics = try document.select("custom|metric")
        for metric in metrics {
            let name = try metric.attr("name")
            let value = try metric.text()
            print("Metric \(name): \(value)")
        }
    } catch {
        print("Error parsing namespaced HTML: \(error)")
    }
}
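When prefixed tag names are discovered at run time rather than hard-coded, it can help to build the selector string programmatically. SwiftSoup's selector documentation lists the ns|tag form for prefixed tags (fb|name finds <fb:name>). The sketch below is a hypothetical helper, not part of SwiftSoup, that converts a raw tag name into that form; it assumes at most one prefix per name.

```swift
/// Converts a prefixed tag name such as "og:title" into the "ns|tag"
/// selector form ("og|title"). Names without a prefix are returned
/// lowercased and otherwise unchanged.
/// Hypothetical helper for illustration; not a SwiftSoup API.
func namespacedSelector(forTag tagName: String) -> String {
    let name = tagName.lowercased()
    // Split at the first colon only; everything after it is the local name
    guard let colon = name.firstIndex(of: ":") else { return name }
    let prefix = name[name.startIndex..<colon]
    let local = name[name.index(after: colon)...]
    return "\(prefix)|\(local)"
}

for tag in ["og:title", "FB:like", "user-card"] {
    print("\(tag) -> \(namespacedSelector(forTag: tag))")
}
```

The resulting string can then be passed to select(_:) in place of a hand-written selector.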
Advanced Custom Tag Handling
Dynamic Tag Discovery
Sometimes you need to discover all custom tags in a document without knowing their names beforehand:
func discoverCustomTags() {
    let htmlContent = """
    <div>
        <standard-tag>Regular content</standard-tag>
        <unknown-element data-type="mystery">Mystery content</unknown-element>
        <xyz-component>Component content</xyz-component>
        <legacy-widget status="active">Legacy content</legacy-widget>
    </div>
    """

    do {
        let document = try SwiftSoup.parse(htmlContent)
        let allElements = try document.select("*")
        var customTags: Set<String> = []
        // Deliberately short list; extend it with the standard tags your pages use
        let standardTags: Set<String> = ["html", "head", "body", "div", "span", "p", "a", "img", "h1", "h2", "h3", "h4", "h5", "h6"]

        for element in allElements {
            let tagName = element.tagName().lowercased()
            // Skip parser-internal nodes such as the "#root" document node, if present
            guard !tagName.hasPrefix("#") else { continue }
            // Identify custom tags (containing hyphens or not in the standard list)
            if tagName.contains("-") || !standardTags.contains(tagName) {
                customTags.insert(tagName)
            }
        }
        print("Discovered custom tags: \(customTags.sorted())")

        // Process each custom tag type
        for tagName in customTags.sorted() {
            // getElementsByTag matches the literal tag name, so no selector escaping is needed
            let elements = try document.getElementsByTag(tagName)
            print("\nFound \(elements.size()) \(tagName) element(s):")
            for element in elements {
                let content = try element.text()
                // getAttributes() is optional; fall back to an empty list
                let attributes = (element.getAttributes()?.asList() ?? [])
                    .map { "\($0.getKey())=\($0.getValue())" }
                print("  Content: \(content)")
                print("  Attributes: \(attributes)")
            }
        }
    } catch {
        print("Error discovering custom tags: \(error)")
    }
}
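The hyphen-or-not-in-list check used in the discovery loop can be factored into a pure function, which makes the rule easy to unit test without parsing any HTML. This is a minimal sketch; the TagKind enum and classifyTag function are hypothetical names, and the rule itself is only a heuristic (an element missing from your list of standard tags is not necessarily custom).

```swift
/// Rough categories for a tag name. Hypothetical type for illustration.
enum TagKind {
    case standard       // listed in the caller's set of known HTML tags
    case webComponent   // contains a hyphen, e.g. "user-card"
    case namespaced     // contains a prefix, e.g. "og:title"
    case unknown        // anything else, e.g. "theme"
}

/// Classifies a tag name using the same heuristic as the discovery loop:
/// prefixes and hyphens mark custom tags; everything else is checked
/// against the supplied set of standard tag names.
func classifyTag(_ rawName: String, standardTags: Set<String>) -> TagKind {
    let name = rawName.lowercased()
    if name.contains(":") { return .namespaced }
    if name.contains("-") { return .webComponent }
    return standardTags.contains(name) ? .standard : .unknown
}

let knownTags: Set<String> = ["html", "head", "body", "div", "span", "p", "a", "img"]
for name in ["div", "user-card", "og:title", "theme"] {
    print("\(name): \(classifyTag(name, standardTags: knownTags))")
}
```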
Handling Malformed Custom Tags
SwiftSoup is quite forgiving with malformed HTML, but you might need special handling for edge cases:
func handleMalformedTags() {
    // Note: <123-invalid-start> does not begin with a letter, so an HTML
    // parser treats it as text rather than as a tag.
    let malformedHtml = """
    <div>
        <unclosed-tag>Content without closing tag
        <self-closing-custom />
        <123-invalid-start>Numeric start</123-invalid-start>
        <valid-tag attribute-without-value>Valid content</valid-tag>
        <UPPERCASE-TAG>Mixed case content</UPPERCASE-TAG>
    </div>
    """

    do {
        let document = try SwiftSoup.parse(malformedHtml)

        // SwiftSoup automatically handles unclosed tags
        if let tag = try document.select("unclosed-tag").first() {
            let content = try tag.text()
            print("Unclosed tag content: \(content)")
        }

        // Handle self-closing custom tags
        let selfClosing = try document.select("self-closing-custom")
        print("Self-closing tags found: \(selfClosing.size())")

        // The HTML parser lowercases tag names, so select with the lowercase form
        if let tag = try document.select("uppercase-tag").first() {
            let content = try tag.text()
            print("Uppercase tag content: \(content)")
        }

        // Extract attributes even from malformed tags
        if let tag = try document.select("valid-tag").first() {
            let hasAttribute = tag.hasAttr("attribute-without-value")
            print("Has attribute without value: \(hasAttribute)")
        }
    } catch {
        print("Error handling malformed tags: \(error)")
    }
}
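The <123-invalid-start> line in the sample is worth a closer look: under HTML tokenization rules a tag only begins when "<" is followed by an ASCII letter, so that line ends up as text instead of an element. A quick pre-check like the one below can flag such names before you try to select them. It is a rough sketch with a hypothetical function name, and it tests only the first character, not the whole name.

```swift
/// Returns true when the name could begin an HTML tag, i.e. its first
/// character is an ASCII letter. Hypothetical helper for illustration.
func canStartTag(_ name: String) -> Bool {
    guard let first = name.first else { return false }
    return first.isASCII && first.isLetter
}

for name in ["user-card", "UPPERCASE-TAG", "123-invalid-start"] {
    print("\(name): \(canStartTag(name))")
}
```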
Integration with Web Scraping Workflows
When scraping modern web applications, custom tags often contain valuable data. Here's how to integrate custom tag parsing into a comprehensive scraping workflow:
class CustomTagScraper {
    private let document: Document

    init(html: String) throws {
        self.document = try SwiftSoup.parse(html)
    }

    func extractWebComponents() throws -> [String: Any] {
        var results: [String: Any] = [:]

        // Collect hyphenated custom elements rendered under a React root
        let reactComponents = try document.select("[data-reactroot] *")
        var componentData: [[String: String]] = []
        for component in reactComponents {
            if component.tagName().contains("-") {
                let content = try component.text()
                // getAttributes() is optional; fall back to an empty list
                let attributeList = component.getAttributes()?.asList() ?? []
                let data: [String: String] = [
                    "tagName": component.tagName(),
                    "content": content,
                    "attributes": attributeList
                        .map { "\($0.getKey())=\($0.getValue())" }
                        .joined(separator: ", ")
                ]
                componentData.append(data)
            }
        }
        results["webComponents"] = componentData

        // Extract microdata
        let microdataItems = try document.select("[itemscope]")
        var microdata: [[String: String]] = []
        for item in microdataItems {
            let itemType = try item.attr("itemtype")
            let properties = try item.select("[itemprop]")
            var itemData: [String: String] = ["itemtype": itemType]
            for property in properties {
                let propName = try property.attr("itemprop")
                let propValue = try property.text()
                itemData[propName] = propValue
            }
            microdata.append(itemData)
        }
        results["microdata"] = microdata

        return results
    }

    func extractCustomAttributes() throws -> [String: [String]] {
        var customAttributes: [String: [String]] = [:]

        // [^data-] matches elements with an attribute whose name starts with "data-"
        let elementsWithDataAttrs = try document.select("[^data-]")
        for element in elementsWithDataAttrs {
            guard let attributes = element.getAttributes() else { continue }
            for attribute in attributes.asList() {
                let key = attribute.getKey()
                if key.hasPrefix("data-") {
                    customAttributes[key, default: []].append(attribute.getValue())
                }
            }
        }
        return customAttributes
    }
}

// Usage example
func scrapeWithCustomTags() {
    // itemscope/itemtype are set on <user-card> so the microdata pass has an item to find
    let html = """
    <div data-reactroot="">
        <user-card itemscope itemtype="https://schema.org/Person" data-user-id="123" data-premium="true">
            <h2 itemprop="name">Jane Smith</h2>
            <span itemprop="jobTitle">Software Engineer</span>
        </user-card>
        <stats-widget data-source="analytics" data-refresh-rate="30">
            <metric-display type="views">15,234</metric-display>
            <metric-display type="conversions">1,234</metric-display>
        </stats-widget>
    </div>
    """

    do {
        let scraper = try CustomTagScraper(html: html)
        let webComponents = try scraper.extractWebComponents()
        print("Web Components: \(webComponents)")
        let customAttributes = try scraper.extractCustomAttributes()
        print("Custom Attributes: \(customAttributes)")
    } catch {
        print("Scraping error: \(error)")
    }
}
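Once data-* attributes are collected, their raw names (data-user-id, data-refresh-rate) are awkward to use as keys in Swift models. One option is to map them to camelCase keys, following the convention browsers use for the DOM dataset property. The sketch below is a hypothetical helper written in plain Swift; it is not part of SwiftSoup and handles only simple lowercase, hyphen-separated names.

```swift
/// Maps an attribute name like "data-user-id" to a camelCase key ("userId").
/// Returns nil when the name is not a data-* attribute.
/// Hypothetical helper for illustration; not a SwiftSoup API.
func datasetKey(fromAttribute name: String) -> String? {
    let prefix = "data-"
    guard name.hasPrefix(prefix), name.count > prefix.count else { return nil }
    // Drop the prefix and split the remainder on hyphens
    let parts = name.dropFirst(prefix.count).split(separator: "-")
    guard let first = parts.first else { return nil }
    // Capitalize the first letter of every part after the first
    let rest = parts.dropFirst().map { part in
        part.prefix(1).uppercased() + String(part.dropFirst())
    }
    return String(first) + rest.joined()
}

for attribute in ["data-user-id", "data-refresh-rate", "data-premium", "id"] {
    print("\(attribute) -> \(datasetKey(fromAttribute: attribute) ?? "nil")")
}
```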
Best Practices and Tips
Error Handling and Validation
Always implement robust error handling when working with custom tags:
extension Document {
    // Returns nil instead of throwing when the selector cannot be parsed
    func safeSelect(_ selector: String) -> Elements? {
        do {
            return try self.select(selector)
        } catch {
            print("Invalid selector '\(selector)': \(error)")
            return nil
        }
    }
}

func safeCustomTagParsing() {
    let html = "<custom:tag>Content</custom:tag>"
    do {
        let document = try SwiftSoup.parse(html)
        // Safe selection with error handling ("custom|tag" matches <custom:tag>)
        if let elements = document.safeSelect("custom|tag") {
            for element in elements {
                let content = try? element.text()
                print("Content: \(content ?? "Unable to extract")")
            }
        }
    } catch {
        print("Parsing error: \(error)")
    }
}
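The same "catch, log, fall back" pattern applies to any throwing extraction step, not only to selectors. A small generic wrapper keeps it in one place. The following is a minimal sketch using only the standard library; attempt, ParseFailure, and parseCount are hypothetical names chosen for this example.

```swift
/// Hypothetical error type used only by this example.
struct ParseFailure: Error {
    let input: String
}

/// Runs a throwing closure and returns its result, or logs the error
/// and returns the fallback value if the closure throws.
func attempt<T>(_ label: String, default fallback: T, _ body: () throws -> T) -> T {
    do {
        return try body()
    } catch {
        print("\(label) failed: \(error)")
        return fallback
    }
}

/// Parses a count such as "15,234" by removing thousands separators.
func parseCount(_ text: String) throws -> Int {
    let digits = text.filter { $0 != "," }
    guard let value = Int(digits) else { throw ParseFailure(input: text) }
    return value
}

let views = attempt("views", default: 0) { try parseCount("15,234") }
let broken = attempt("conversions", default: 0) { try parseCount("n/a") }
print("views = \(views), conversions = \(broken)")
```

Choosing an explicit fallback per call site keeps a single bad value from aborting the whole scrape, at the cost of hiding the failure from callers, so log enough detail to trace it later.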
Performance Considerations
When dealing with large documents containing many custom tags:
- Use specific selectors: Instead of select("*"), use targeted selectors
- Cache commonly used elements: Store frequently accessed elements in variables
- Process in batches: For large datasets, process elements in smaller batches
- Consider streaming: SwiftSoup builds the whole document tree in memory, so for very large documents consider an event-based parser instead
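For the batching tip, a generic chunking helper is usually enough. The sketch below is plain Swift with a hypothetical method name, chunked(into:); with SwiftSoup you would typically call it on the array obtained from an Elements collection (for example via array()), which is an assumption about how you hold the results rather than something the helper requires.

```swift
extension Array {
    /// Splits the array into consecutive slices of at most `size` elements.
    /// Hypothetical helper for illustration.
    func chunked(into size: Int) -> [[Element]] {
        precondition(size > 0, "chunk size must be positive")
        return stride(from: 0, to: count, by: size).map { start in
            // Swift.min avoids resolving to Array's own min() method
            Array(self[start..<Swift.min(start + size, count)])
        }
    }
}

let tagNames = ["user-card", "stats-widget", "metric-display", "data-widget", "user-profile"]
for (index, batch) in tagNames.chunked(into: 2).enumerated() {
    print("Batch \(index): \(batch)")
}
```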
Common Use Cases
SwiftSoup's custom tag parsing capabilities are particularly useful when working with:
- Single Page Applications (SPAs): Modern frameworks often use custom elements
- XML-based APIs: Many APIs return XML with custom namespaces
- Legacy HTML: Older websites may use proprietary tags
- Web Components: Modern web development increasingly uses custom elements
For complex JavaScript-heavy applications that require dynamic content loading, you might also want to explore how to handle browser sessions in Puppeteer for more advanced scraping scenarios.
Conclusion
SwiftSoup provides excellent support for parsing HTML with custom or unknown tags, making it an ideal choice for iOS developers working on web scraping projects. Its flexible parsing engine handles various edge cases gracefully while providing a clean, Swift-friendly API for extracting data from complex HTML structures.
Whether you're working with modern web components, XML namespaces, or legacy HTML with proprietary tags, SwiftSoup's robust parsing capabilities ensure your iOS applications can effectively extract the data they need. Remember to always implement proper error handling and consider performance implications when working with large documents containing numerous custom elements.
For additional web scraping challenges involving dynamic content, consider exploring how to handle AJAX requests using Puppeteer for scenarios where client-side rendering is involved.