SwiftSoup is a powerful Swift library for parsing, manipulating, and cleaning HTML content. When you need to remove unwanted or potentially dangerous tags from HTML documents, SwiftSoup provides flexible selection methods to target and remove specific elements.
Installation
Add SwiftSoup to your project using your preferred package manager:
CocoaPods
pod 'SwiftSoup'
Swift Package Manager
dependencies: [
    .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.6.0")
]
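The dependencies entry alone does not link the library to a target. As a rough sketch, a complete manifest might look like the following; the package and target name HTMLCleaner and the tools version are assumptions chosen for illustration, not something SwiftSoup requires.

```swift
// swift-tools-version:5.5
import PackageDescription

let package = Package(
    name: "HTMLCleaner", // hypothetical package name
    dependencies: [
        .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.6.0")
    ],
    targets: [
        .executableTarget(
            name: "HTMLCleaner", // hypothetical target name
            dependencies: [
                // Link the SwiftSoup product into this target
                .product(name: "SwiftSoup", package: "SwiftSoup")
            ]
        )
    ]
)
```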
Basic HTML Cleaning
Here's a basic example that cleans a full HTML document by removing common unwanted tags:
import SwiftSoup
func cleanHTML(_ html: String) -> String? {
    do {
        let doc: Document = try SwiftSoup.parse(html)
        // Remove script and style tags (security and formatting)
        try doc.select("script, style").remove()
        // Remove potentially dangerous embedding tags
        try doc.select("iframe, frame, embed, object, applet").remove()
        // Remove form elements if not needed
        try doc.select("form, input, button, textarea, select").remove()
        // Serialize the whole document, including <html>, <head>, and <body>
        return try doc.html()
    } catch {
        // Print the error itself; localizedDescription is generic for SwiftSoup errors
        print("Error cleaning HTML: \(error)")
        return nil
    }
}
let originalHTML = """
<!DOCTYPE html>
<html>
<head>
<title>Sample Page</title>
<style>body { font-family: Arial; }</style>
<script>alert('popup');</script>
</head>
<body>
<h1>Article Title</h1>
<p>This is clean content.</p>
<iframe src="https://example.com"></iframe>
<form><input type="text"></form>
</body>
</html>
"""
if let cleanedHTML = cleanHTML(originalHTML) {
    print(cleanedHTML)
}
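The function above returns a complete document, including the html, head, and body wrappers. When the input is a fragment (for example, output from a rich text editor), a variant along these lines may be more convenient. It is a minimal sketch: cleanFragment is a hypothetical name, and it relies on SwiftSoup's parseBodyFragment and body() calls.

```swift
import SwiftSoup

// Hypothetical helper: cleans an HTML fragment and returns only the body markup.
func cleanFragment(_ fragment: String) -> String? {
    do {
        // Parse the input as body content rather than as a full page
        let doc = try SwiftSoup.parseBodyFragment(fragment)
        try doc.select("script, style, iframe").remove()
        // Return the inner HTML of <body>, without the document wrapper
        return try doc.body()?.html()
    } catch {
        print("Error cleaning fragment: \(error)")
        return nil
    }
}
```

Returning the body's inner HTML keeps the result suitable for embedding in an existing page.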
Advanced Cleaning Techniques
Remove Elements by Attributes
func cleanHTMLByAttributes(_ html: String) -> String? {
    do {
        let doc: Document = try SwiftSoup.parse(html)
        // Remove elements with specific classes
        try doc.select(".advertisement, .popup, .tracking").remove()
        // Strip inline style attributes (the elements themselves are kept)
        try doc.select("[style]").removeAttr("style")
        // Remove whole elements that carry these inline event handlers
        try doc.select("[onclick], [onload], [onerror]").remove()
        return try doc.html()
    } catch {
        print("Error: \(error)")
        return nil
    }
}
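The event-handler selector above drops entire elements and covers only three handler names. If the content should survive and only the handlers should go, one possible approach is to strip every attribute whose name begins with "on". This is a sketch, not part of SwiftSoup: stripEventHandlers is a hypothetical helper, and it assumes getAttributes() returns an optional Attributes value with an asList() method, as in SwiftSoup 2.x.

```swift
import SwiftSoup

// Hypothetical helper: removes every attribute whose name starts with "on".
func stripEventHandlers(_ html: String) -> String? {
    do {
        let doc = try SwiftSoup.parseBodyFragment(html)
        for element in try doc.select("*").array() {
            // asList() yields an array copy, so removing while looping is safe
            for attribute in element.getAttributes()?.asList() ?? [] {
                let name = attribute.getKey()
                if name.lowercased().hasPrefix("on") {
                    try element.removeAttr(name)
                }
            }
        }
        return try doc.body()?.html()
    } catch {
        print("Error: \(error)")
        return nil
    }
}
```

Matching on the "on" prefix is deliberately broad; it would also strip a harmless custom attribute such as "online", which is usually an acceptable trade-off when sanitizing.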
Whitelist Approach - Keep Only Safe Tags
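func keepSafeTags(_ html: String) -> String? {
    do {
        // Parse the input as body content
        let doc: Document = try SwiftSoup.parseBodyFragment(html)
        // Define allowed tags
        let safeTags: Set<String> = ["p", "h1", "h2", "h3", "h4", "h5", "h6",
                                     "strong", "em", "ul", "ol", "li", "a", "img"]
        // Visit only descendants of <body>. select("*") would also match the
        // document root, <html>, <head>, and <body>, none of which are in the
        // safe list, so the original loop could never return cleaned markup.
        for element in try doc.select("body *").array() {
            // Removing an element also removes everything nested inside it
            if !safeTags.contains(element.tagName()) {
                try element.remove()
            }
        }
        // Attributes are not filtered here, only tags
        return try doc.body()?.html()
    } catch {
        print("Error: \(error)")
        return nil
    }
}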
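As an alternative to a hand-written loop, SwiftSoup ships a cleaner inherited from jsoup. The sketch below assumes the SwiftSoup.clean function and the Whitelist.basic() preset as they appear in SwiftSoup 2.x; cleanWithBuiltInWhitelist is a hypothetical wrapper name.

```swift
import SwiftSoup

// Hypothetical wrapper around SwiftSoup's built-in cleaner.
func cleanWithBuiltInWhitelist(_ html: String) -> String? {
    do {
        // Whitelist.basic() permits simple text formatting tags and links;
        // anything outside the preset is dropped from the output
        return try SwiftSoup.clean(html, Whitelist.basic())
    } catch {
        print("Error: \(error)")
        return nil
    }
}
```

The built-in cleaner also filters attributes and link protocols according to the preset, which the manual tag loop does not do.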
Text-Only Extraction
func extractCleanText(_ html: String) -> String? {
    do {
        let doc: Document = try SwiftSoup.parse(html)
        // Remove unwanted elements first
        try doc.select("script, style, nav, footer, aside").remove()
        // Extract only text content
        return try doc.text()
    } catch {
        print("Error: \(error)")
        return nil
    }
}
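Because text() on the document joins everything into a single string, paragraph boundaries are lost. If they matter, one option is to collect the text of each p element separately. This is a minimal sketch; extractParagraphs is a hypothetical helper name.

```swift
import SwiftSoup

// Hypothetical helper: returns the text of each <p> as a separate string.
func extractParagraphs(_ html: String) -> [String] {
    do {
        let doc = try SwiftSoup.parse(html)
        // Drop non-content regions before collecting text
        try doc.select("script, style, nav, footer, aside").remove()
        return try doc.select("p").array()
            .map { try $0.text() }
            .filter { !$0.isEmpty } // skip empty paragraphs
    } catch {
        print("Error: \(error)")
        return []
    }
}
```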
Comprehensive HTML Sanitizer
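Here's a more complete sanitizer that combines a tag allowlist with per-tag attribute allowlists. It does not validate URL schemes in href or src values, so add that check before relying on it in production: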
struct HTMLSanitizer {
    private let allowedTags: Set<String>
    private let allowedAttributes: [String: Set<String>]

    init() {
        self.allowedTags = ["p", "h1", "h2", "h3", "h4", "h5", "h6",
                            "strong", "em", "b", "i", "u", "br",
                            "ul", "ol", "li", "a", "img", "blockquote"]
        self.allowedAttributes = [
            "a": ["href", "title"],
            "img": ["src", "alt", "width", "height"]
        ]
    }

    func sanitize(_ html: String) -> String? {
        do {
            let doc: Document = try SwiftSoup.parse(html)
            // Remove dangerous tags
            try doc.select("script, style, iframe, frame, object, embed, applet").remove()
            // Visit only descendants of <body>; select("*") would also match the
            // document root, <html>, and <body>, which are not in allowedTags
            for element in try doc.select("body *").array() {
                let tagName = element.tagName()
                // Remove the tag (and its contents) if it is not allowed
                guard allowedTags.contains(tagName) else {
                    try element.remove()
                    continue
                }
                // Remove attributes that are not allowed for this tag
                let allowedAttrs = allowedAttributes[tagName] ?? []
                // getAttributes() is optional; asList() gives a copy that is
                // safe to loop over while attributes are being removed
                for attribute in element.getAttributes()?.asList() ?? [] {
                    let attrName = attribute.getKey()
                    if !allowedAttrs.contains(attrName) {
                        try element.removeAttr(attrName)
                    }
                }
            }
            return try doc.body()?.html() ?? ""
        } catch {
            print("Sanitization error: \(error)")
            return nil
        }
    }
}
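// Usage (maliciousHTML is sample input for illustration)
let sanitizer = HTMLSanitizer()
let maliciousHTML = "<p onclick=\"steal()\">Hello</p><script>alert('x')</script>"
if let sanitizedHTML = sanitizer.sanitize(maliciousHTML) {
    print(sanitizedHTML)
}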
Best Practices
- Always handle errors - Most SwiftSoup calls are throwing functions, so wrap them in do/catch or propagate them with try
- Combine selectors - A single call such as select("script, style") matches both tags in one pass
- Consider performance - For large documents, minimize repeated DOM traversals
- Validate URLs - When keeping links or images, check href and src values against a list of allowed schemes
- Test thoroughly - Cover malformed markup, nested disallowed tags, and other edge cases
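For the URL validation point, a scheme check can be written with Foundation alone. The sketch below is one possible policy, not a SwiftSoup feature: isSafeLink is a hypothetical helper, and the allowed scheme list (http, https, mailto) is an assumption to adjust for your application.

```swift
import Foundation

// Hypothetical helper: accepts only absolute URLs with an allowed scheme.
func isSafeLink(_ href: String) -> Bool {
    // Assumed policy: web links and mail links only
    let allowedSchemes: Set<String> = ["http", "https", "mailto"]
    let trimmed = href.trimmingCharacters(in: .whitespacesAndNewlines)
    guard let components = URLComponents(string: trimmed),
          let scheme = components.scheme?.lowercased() else {
        // Unparseable values and relative URLs (no scheme) are rejected
        return false
    }
    return allowedSchemes.contains(scheme)
}
```

A sanitizer could call this for each a[href] element and remove the href attribute when it returns false. Note that this policy also rejects relative links; resolve them against a base URL first if they need to be kept.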
Common Use Cases
- Web scraping cleanup - Remove navigation, ads, and scripts
- User-generated content - Sanitize HTML from rich text editors
- Email HTML - Clean HTML for email templates
- Content extraction - Extract article content from web pages
SwiftSoup's flexible selection API makes it easy to target exactly the content you want to remove or preserve, ensuring your HTML is clean and safe for your application's needs.