How do I clean up HTML with SwiftSoup to remove unwanted tags?

SwiftSoup is a powerful Swift library for parsing, manipulating, and cleaning HTML content. When you need to remove unwanted or potentially dangerous tags from HTML documents, SwiftSoup provides flexible selection methods to target and remove specific elements.

Installation

Add SwiftSoup to your project using your preferred package manager:

CocoaPods

pod 'SwiftSoup'

Swift Package Manager

dependencies: [
    .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.6.0")
]
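The dependencies array above only declares the package; the target that imports SwiftSoup also has to list the library product. A minimal sketch of a full manifest, assuming an executable package where the package name `MyApp`, the target name, and the tools version are placeholders rather than anything from the original:

```swift
// swift-tools-version:5.5
import PackageDescription

let package = Package(
    name: "MyApp", // hypothetical package name
    dependencies: [
        // Same declaration as in the snippet above
        .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.6.0")
    ],
    targets: [
        .executableTarget(
            name: "MyApp", // hypothetical target name
            dependencies: [
                // Make `import SwiftSoup` available inside this target
                .product(name: "SwiftSoup", package: "SwiftSoup")
            ]
        )
    ]
)
```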

Basic HTML Cleaning

Here's a basic example showing how to clean HTML by removing common unwanted tags:

import SwiftSoup

func cleanHTML(_ html: String) -> String? {
    do {
        let doc: Document = try SwiftSoup.parse(html)

        // Remove script and style tags (security and formatting)
        try doc.select("script, style").remove()

        // Remove potentially dangerous tags
        try doc.select("iframe, frame, embed, object, applet").remove()

        // Remove form elements if not needed
        try doc.select("form, input, button, textarea, select").remove()

        return try doc.html()
    } catch {
        print("Error cleaning HTML: \(error.localizedDescription)")
        return nil
    }
}

let originalHTML = """
<!DOCTYPE html>
<html>
<head>
    <title>Sample Page</title>
    <style>body { font-family: Arial; }</style>
    <script>alert('popup');</script>
</head>
<body>
    <h1>Article Title</h1>
    <p>This is clean content.</p>
    <iframe src="https://example.com"></iframe>
    <form><input type="text"></form>
</body>
</html>
"""

if let cleanedHTML = cleanHTML(originalHTML) {
    print(cleanedHTML)
}

Advanced Cleaning Techniques

Remove Elements by Attributes

func cleanHTMLByAttributes(_ html: String) -> String? {
    do {
        let doc: Document = try SwiftSoup.parse(html)

        // Remove elements with specific classes
        try doc.select(".advertisement, .popup, .tracking").remove()

        // Strip inline style attributes (the elements themselves are kept)
        try doc.select("[style]").removeAttr("style")

        // Remove elements that carry inline event handlers
        try doc.select("[onclick], [onload], [onerror]").remove()

        return try doc.html()
    } catch {
        print("Error: \(error)")
        return nil
    }
}
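Listing handlers one by one misses attributes such as onmouseover or onfocus. One possible variation, sketched here under the assumption that you want to keep the elements and drop only the handlers, walks every element and strips any attribute whose name starts with "on". The function name `stripEventHandlers` is illustrative, not part of SwiftSoup:

```swift
import SwiftSoup

/// Strips every attribute whose name starts with "on" (onclick, onmouseover, ...)
/// while keeping the elements themselves. Hypothetical helper for illustration.
func stripEventHandlers(_ html: String) throws -> String {
    let doc = try SwiftSoup.parse(html)

    for element in try doc.getAllElements() {
        // Copy the attributes into an array so removal does not disturb iteration
        let attributes = element.getAttributes()?.asList() ?? []
        for attribute in attributes {
            let key = attribute.getKey()
            if key.lowercased().hasPrefix("on") {
                try element.removeAttr(key)
            }
        }
    }

    // Return only the body markup, without the <html>/<head> wrapper
    return try doc.body()?.html() ?? ""
}
```

The prefix check is deliberately broad: it would also strip a harmless custom attribute named, say, `online`, which is usually an acceptable trade-off for untrusted input.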

Whitelist Approach - Keep Only Safe Tags

func keepSafeTags(_ html: String) -> String? {
    do {
        let doc: Document = try SwiftSoup.parse(html)

        // Define allowed tags
        let safeTags = ["p", "h1", "h2", "h3", "h4", "h5", "h6", 
                       "strong", "em", "ul", "ol", "li", "a", "img"]

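        // Remove every element under <body> that is not in the safe list.
        // Selecting "body *" skips <html>, <head>, and <body> themselves;
        // selecting "*" would match <html> and delete the whole document.
        // Note: remove() also drops the removed element's children.
        let bodyElements = try doc.select("body *")
        for element in bodyElements {
            if !safeTags.contains(element.tagName()) {
                try element.remove()
            }
        }

        // Return only the body markup so nothing from <head> leaks through
        return try doc.body()?.html()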
    } catch {
        print("Error: \(error)")
        return nil
    }
}
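SwiftSoup also ships a `Whitelist` type and a top-level `clean` function, ported from jsoup, that apply an allowlist without a manual loop. The sketch below assumes the SwiftSoup 2.x signatures (`Whitelist.basic()`, `addTags`, `addAttributes`, `addProtocols`, and `SwiftSoup.clean(_:_:)`); the function name `cleanWithWhitelist` and the chosen tags are illustrative:

```swift
import SwiftSoup

/// Cleans an HTML body fragment against an allowlist. Hypothetical helper.
func cleanWithWhitelist(_ html: String) -> String? {
    do {
        // Start from the built-in "basic" profile (text formatting, lists, links)
        // and additionally allow headings and images with http(s) sources.
        let whitelist = try Whitelist.basic()
            .addTags("h1", "h2", "h3", "img")
            .addAttributes("img", "src", "alt")
            .addProtocols("img", "src", "http", "https")

        // clean(_:_:) parses the input as a body fragment and returns only
        // the markup that the whitelist permits.
        return try SwiftSoup.clean(html, whitelist)
    } catch {
        print("Error: \(error)")
        return nil
    }
}
```

Unlike removing disallowed elements outright, the cleaner keeps the text inside tags it does not recognize, so wrapping a paragraph in a `div` does not cost you the paragraph.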

Text-Only Extraction

func extractCleanText(_ html: String) -> String? {
    do {
        let doc: Document = try SwiftSoup.parse(html)

        // Remove unwanted elements first
        try doc.select("script, style, nav, footer, aside").remove()

        // Extract only text content
        return try doc.text()
    } catch {
        print("Error: \(error)")
        return nil
    }
}
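Calling `text()` on the whole document collapses everything into one run of text, which loses paragraph boundaries. If those matter, one option is to collect the text of each block element separately. This is a sketch: the function name `extractParagraphs` and the list of block selectors are assumptions you would adjust to your content:

```swift
import Foundation
import SwiftSoup

/// Extracts text block by block, separated by blank lines. Hypothetical helper.
func extractParagraphs(_ html: String) throws -> String {
    let doc = try SwiftSoup.parse(html)

    // Remove unwanted elements first, as in extractCleanText
    try doc.select("script, style, nav, footer, aside").remove()

    // Collect the text of each heading, paragraph, and list item in document order
    let blocks = try doc.select("h1, h2, h3, h4, h5, h6, p, li")
    var lines: [String] = []
    for block in blocks {
        let text = try block.text().trimmingCharacters(in: .whitespacesAndNewlines)
        if !text.isEmpty {
            lines.append(text)
        }
    }

    return lines.joined(separator: "\n\n")
}
```

Note that nested blocks (for example a `p` inside an `li`) would be emitted twice with this selector, so narrow it if your input nests them.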

Comprehensive HTML Sanitizer

Here's a more robust sanitizer that combines a tag allowlist with a per-tag attribute allowlist:

struct HTMLSanitizer {
    private let allowedTags: Set<String>
    private let allowedAttributes: [String: Set<String>]

    init() {
        self.allowedTags = ["p", "h1", "h2", "h3", "h4", "h5", "h6",
                           "strong", "em", "b", "i", "u", "br",
                           "ul", "ol", "li", "a", "img", "blockquote"]

        self.allowedAttributes = [
            "a": ["href", "title"],
            "img": ["src", "alt", "width", "height"]
        ]
    }

    func sanitize(_ html: String) -> String? {
        do {
            let doc: Document = try SwiftSoup.parse(html)

            // Remove dangerous tags
            try doc.select("script, style, iframe, frame, object, embed, applet").remove()

            // Walk only the descendants of <body>; selecting "*" would also
            // match <html>, <head>, and <body>, which are not in allowedTags
            let allElements = try doc.select("body *")
            for element in allElements {
                let tagName = element.tagName()

                // Remove tag if not allowed (this also drops its children)
                guard allowedTags.contains(tagName) else {
                    try element.remove()
                    continue
                }

                // Remove attributes that are not allowed for this tag;
                // asList() gives a copy that is safe to iterate while removing
                let allowedAttrs = allowedAttributes[tagName] ?? []
                for attribute in element.getAttributes()?.asList() ?? [] {
                    let attrName = attribute.getKey()
                    if !allowedAttrs.contains(attrName) {
                        try element.removeAttr(attrName)
                    }
                }
            }

            return try doc.body()?.html() ?? ""
        } catch {
            print("Sanitization error: \(error)")
            return nil
        }
    }
}

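// Usage (the input string is an illustrative sample)
let untrustedHTML = "<p onclick=\"steal()\">Hello <script>alert('xss')</script><a href=\"https://example.com\">link</a></p>"
let sanitizer = HTMLSanitizer()
let sanitizedHTML = sanitizer.sanitize(untrustedHTML)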

Best Practices

  1. Always handle errors - Most SwiftSoup methods are marked throws, so call them with try inside do/catch
  2. Use CSS selectors effectively - Combine multiple selectors for efficiency
  3. Consider performance - For large documents, minimize DOM traversals
  4. Validate URLs - When keeping links, validate href attributes
  5. Test thoroughly - Test with various HTML structures and edge cases
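The URL validation point can be sketched with Foundation alone. The helper below is an assumption-laden example, not a SwiftSoup API: the name `isSafeHref` and the default scheme list are illustrative, and it treats scheme-less values (including protocol-relative `//host` links) as relative links:

```swift
import Foundation

/// Returns true when `href` is a relative link or uses an allowed scheme.
/// Hypothetical helper; adjust `allowedSchemes` to your needs.
func isSafeHref(_ href: String,
                allowedSchemes: Set<String> = ["http", "https", "mailto"]) -> Bool {
    let trimmed = href.trimmingCharacters(in: .whitespacesAndNewlines)
    guard !trimmed.isEmpty else { return false }

    // Reject embedded control characters (tabs, newlines), which browsers
    // may strip before interpreting the scheme
    let hasControlCharacter = trimmed.unicodeScalars.contains {
        CharacterSet.controlCharacters.contains($0)
    }
    guard !hasControlCharacter else { return false }

    // Values that cannot be parsed at all are treated as unsafe
    guard let components = URLComponents(string: trimmed) else { return false }

    // No scheme means a relative link, which is allowed here
    guard let scheme = components.scheme?.lowercased() else { return true }

    return allowedSchemes.contains(scheme)
}
```

Inside a sanitizer loop you could read the value with `try element.attr("href")` and remove the attribute when the check fails.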

Common Use Cases

  • Web scraping cleanup - Remove navigation, ads, and scripts
  • User-generated content - Sanitize HTML from rich text editors
  • Email HTML - Clean HTML for email templates
  • Content extraction - Extract article content from web pages

SwiftSoup's flexible selection API makes it easy to target exactly the content you want to remove or preserve, ensuring your HTML is clean and safe for your application's needs.
