How do I clean up HTML with SwiftSoup to remove unwanted tags?

SwiftSoup is a Swift library for parsing, manipulating, and cleaning HTML. To remove unwanted tags from an HTML document, parse it, select the matching elements with a CSS selector, and remove them from the parsed document.

Here's an example of how you might use SwiftSoup to clean up HTML by removing specific unwanted tags. In this example, we'll remove all <script> and <style> tags from an HTML string.

First, ensure you have SwiftSoup installed in your project. If you're using CocoaPods, add the following line to your Podfile:

pod 'SwiftSoup'

Then run pod install.
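
If you use Swift Package Manager instead of CocoaPods, a manifest along the following lines should work. This is a minimal sketch: the package and target name MyScraper is hypothetical, and the version requirement is an assumption that you should check against the current SwiftSoup release.

```swift
// swift-tools-version:5.5
import PackageDescription

let package = Package(
    name: "MyScraper", // hypothetical package name
    dependencies: [
        // Version requirement is an assumption; pick the release you actually want.
        .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.6.0")
    ],
    targets: [
        .executableTarget(
            name: "MyScraper", // hypothetical target; sources live in Sources/MyScraper
            dependencies: ["SwiftSoup"]
        )
    ]
)
```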

Next, you can use SwiftSoup in your Swift code like this:

import SwiftSoup

func cleanHTML(_ html: String) -> String? {
    do {
        // Parse the HTML string.
        let doc: Document = try SwiftSoup.parse(html)

        // Select and remove all script and style tags.
        try doc.select("script, style").remove()

        // You can also remove other unwanted tags, for example:
        // try doc.select("iframe, frame, embed").remove()

        // Return the cleaned HTML string.
        return try doc.html()
    } catch {
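        // Print the error value itself: for error types that do not provide
        // a localized description, localizedDescription is usually generic.
        print("Error cleaning HTML: \(error)")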
        return nil
    }
}

let originalHTML = """
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
<style>
    body {font-family: Arial, sans-serif;}
</style>
<script>
    console.log('This is a script tag');
</script>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
</body>
</html>
"""

if let cleanedHTML = cleanHTML(originalHTML) {
    print(cleanedHTML)
} else {
    print("Failed to clean the HTML.")
}

In the code above:

  • We define a function cleanHTML that takes an HTML string as input.
  • We parse the HTML into a Document object using SwiftSoup's parse method.
  • We use the select method to find all <script> and <style> elements within the document.
  • We call remove on the selected elements to remove them from the document.
  • Finally, we return the cleaned HTML as a string using the html method.

You can customize the selector passed to select to target different tags, or specific elements with particular attributes or classes. Note that select only finds the elements; you still need to call remove on the result. For instance, try doc.select(".unwanted-class").remove() removes all elements with the class unwanted-class.
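
The selector-based targeting described above can be sketched as a small helper that takes a list of CSS selectors and removes everything that matches any of them. The function name removeElements(matching:from:), the sample markup, and the choice to return only the body's inner HTML are assumptions made for illustration, not part of SwiftSoup.

```swift
import SwiftSoup

// Hypothetical helper: removes every element that matches any of the given CSS selectors.
func removeElements(matching selectors: [String], from html: String) throws -> String {
    let doc = try SwiftSoup.parse(html)
    // A comma-separated selector list matches elements that satisfy any one of the selectors.
    try doc.select(selectors.joined(separator: ", ")).remove()
    // Return the body's inner HTML rather than the whole document (a choice made for this sketch).
    return try doc.body()?.html() ?? ""
}

let sample = """
<div class="unwanted-class">Ad</div>
<p>Keep me</p>
<a href="#" onclick="track()">Tracked link</a>
"""

do {
    // Remove elements by class and by the presence of an onclick attribute.
    let cleaned = try removeElements(matching: [".unwanted-class", "[onclick]"], from: sample)
    print(cleaned)
} catch {
    print("Error cleaning HTML: \(error)")
}
```

Passing the selectors as an array keeps the list of unwanted elements easy to extend, for example from a configuration file.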

Remember to handle the errors appropriately in your actual application. The above example prints the error message, but in a production environment, you might want to log the error or present an error message to the user.
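
One way to do that, sketched here under the assumption that the caller is better placed to decide how to react, is to let the function throw instead of returning nil. The name cleanHTMLOrThrow is hypothetical; the Exception.Error pattern is the error type SwiftSoup uses for its own failures.

```swift
import SwiftSoup

// Hypothetical throwing variant: errors reach the caller instead of being printed and swallowed.
func cleanHTMLOrThrow(_ html: String) throws -> String {
    let doc = try SwiftSoup.parse(html)
    try doc.select("script, style").remove()
    return try doc.html()
}

do {
    let cleaned = try cleanHTMLOrThrow("<p>Hello</p><script>alert(1)</script>")
    print(cleaned)
} catch Exception.Error(let type, let message) {
    // Failures raised by SwiftSoup carry a type and a message.
    print("SwiftSoup error (\(type)): \(message)")
} catch {
    // Anything else: log it or surface it to the user as appropriate.
    print("Unexpected error: \(error)")
}
```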
