How do I parse HTML fragments instead of complete documents with SwiftSoup?
SwiftSoup provides specialized methods for parsing HTML fragments rather than complete documents. This is particularly useful when working with partial HTML content, user-generated content, or when extracting specific portions of web pages in iOS applications.
Understanding HTML Fragments vs Complete Documents
HTML fragments are pieces of HTML that lack the complete document structure (html, head, and body tags). Examples include:
- Content from APIs or databases
- User-generated HTML content
- Partial HTML snippets
- Email templates or content blocks
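SwiftSoup handles fragments differently from complete documents: parseBodyFragment() treats the input as the contents of a body element and returns a Document in which an html/head/body shell is wrapped around the parsed nodes.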
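To make the wrapping behavior concrete, here is a minimal sketch. It assumes the SwiftSoup package is available to the target; the sample markup is invented for illustration.

```swift
import SwiftSoup

// A fragment with no <html>, <head>, or <body> tags (sample markup, not from a real page).
let fragment = "<p>Hello <b>fragment</b></p>"

do {
    let doc = try SwiftSoup.parseBodyFragment(fragment)

    // The returned Document is a full shell, so head and body both exist.
    print("Has head: \(doc.head() != nil)")
    print("Has body: \(doc.body() != nil)")

    // The fragment's nodes become children of <body>.
    let firstTag = doc.body()?.children().first()?.tagName() ?? "none"
    print("First element in body: \(firstTag)")

    // body.html() gives the fragment markup back without the wrapper.
    print(try doc.body()?.html() ?? "")
} catch {
    print("Error parsing fragment: \(error)")
}
```

Reading from doc.body() rather than from doc itself is usually what you want when the markup will be re-embedded elsewhere, because it leaves the generated shell out.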
Basic Fragment Parsing
Using parseBodyFragment()
The primary method for parsing HTML fragments in SwiftSoup is parseBodyFragment():
import SwiftSoup
// Parse a simple HTML fragment
let htmlFragment = "<div class='content'><p>Hello World</p><span>Test</span></div>"
do {
    let doc = try SwiftSoup.parseBodyFragment(htmlFragment)
    let body = doc.body()
    // Extract content
    let content = try body?.select("div.content")
    print(try content?.text() ?? "No content found")
    // Output: Hello World Test
} catch {
    print("Error parsing fragment: \(error)")
}
Parsing with Base URI
When parsing fragments that contain relative URLs, specify a base URI:
let htmlFragment = """
<div>
    <img src="/images/logo.png" alt="Logo">
    <a href="/about">About Us</a>
</div>
"""
let baseUri = "https://example.com"
do {
    let doc = try SwiftSoup.parseBodyFragment(htmlFragment, baseUri)
    // Get absolute URLs
    let images = try doc.select("img")
    for img in images {
        let absoluteSrc = try img.absUrl("src")
        print("Image URL: \(absoluteSrc)")
        // Output: Image URL: https://example.com/images/logo.png
    }
} catch {
    print("Error: \(error)")
}
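The same resolution applies to links. The sketch below contrasts the raw attribute value with the resolved one; `linkTargets` is a hypothetical helper name introduced for illustration, not part of SwiftSoup.

```swift
import SwiftSoup

/// Returns (raw, resolved) href pairs for every link in a fragment.
/// `linkTargets` is a hypothetical helper, not a SwiftSoup API.
func linkTargets(in fragment: String, baseUri: String) throws -> [(raw: String, resolved: String)] {
    let doc = try SwiftSoup.parseBodyFragment(fragment, baseUri)
    return try doc.select("a[href]").map { link in
        // attr() returns the value as written; absUrl() resolves it against the base URI.
        (raw: try link.attr("href"), resolved: try link.absUrl("href"))
    }
}

do {
    let targets = try linkTargets(
        in: "<a href=\"/about\">About Us</a>",
        baseUri: "https://example.com"
    )
    for target in targets {
        print("\(target.raw) -> \(target.resolved)")
    }
} catch {
    print("Error: \(error)")
}
```

Keeping both values can help when you want to store the original markup but fetch or display the absolute URL.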
Advanced Fragment Parsing Techniques
Parsing Multiple Fragments
When working with multiple HTML fragments, you can combine them or process them individually:
let fragments = [
    "<div class='item'>Item 1</div>",
    "<div class='item'>Item 2</div>",
    "<div class='item'>Item 3</div>"
]
var allItems: [Element] = []
for fragment in fragments {
    do {
        let doc = try SwiftSoup.parseBodyFragment(fragment)
        let items = try doc.select("div.item")
        allItems.append(contentsOf: items)
    } catch {
        print("Error parsing fragment: \(error)")
    }
}
print("Total items parsed: \(allItems.count)")
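The loop above processes each fragment individually. As a sketch of the "combine them" alternative, the fragments can be joined and parsed once, so all elements share a single Document. `countItems` is a hypothetical helper name.

```swift
import SwiftSoup

/// Joins the fragments, parses them in one pass, and counts the matching elements.
/// `countItems` is a hypothetical helper, not a SwiftSoup API.
func countItems(in fragments: [String]) throws -> Int {
    let combined = fragments.joined(separator: "\n")
    let doc = try SwiftSoup.parseBodyFragment(combined)
    return try doc.select("div.item").size()
}

do {
    let total = try countItems(in: [
        "<div class='item'>Item 1</div>",
        "<div class='item'>Item 2</div>",
        "<div class='item'>Item 3</div>"
    ])
    print("Total items parsed: \(total)")
} catch {
    print("Error parsing fragments: \(error)")
}
```

Combining is convenient when the fragments are well formed. If one fragment may contain unclosed tags that could swallow its neighbours, parsing them separately keeps the damage contained.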
Fragment Parsing with Custom Settings
You can wrap fragment parsing in a helper that normalizes the document and applies specific output settings:
import SwiftSoup
func parseFragmentWithCustomSettings(_ html: String) throws -> Document {
    // Parse as fragment
    let doc = try SwiftSoup.parseBodyFragment(html)
    // Normalize the document (normalise() can throw)
    try doc.normalise()
    // Set output settings; these setters take labeled arguments and do not throw
    doc.outputSettings()
        .prettyPrint(pretty: true)
        .indentAmount(indentAmount: 2)
    return doc
}
// Usage
let htmlFragment = "<div><p>Unformatted content</p></div>"
do {
    let doc = try parseFragmentWithCustomSettings(htmlFragment)
    let prettyHtml = try doc.html()
    print(prettyHtml)
} catch {
    print("Error: \(error)")
}
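The opposite setting can be useful when the markup is stored or compared rather than read. This is a small sketch that turns pretty-printing off; `compactHTML` is a hypothetical helper name.

```swift
import SwiftSoup

/// Returns the fragment's markup without pretty-print line breaks or indentation.
/// `compactHTML` is a hypothetical helper, not a SwiftSoup API.
func compactHTML(_ html: String) throws -> String {
    let doc = try SwiftSoup.parseBodyFragment(html)
    // Output settings belong to the document and apply to its elements' html() output.
    doc.outputSettings().prettyPrint(pretty: false)
    return try doc.body()?.html() ?? ""
}

do {
    print(try compactHTML("<div><p>Unformatted content</p></div>"))
} catch {
    print("Error: \(error)")
}
```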
Working with Fragment Content
Extracting Data from Fragments
Here's how to extract specific data from HTML fragments:
let productFragment = """
<div class="product" data-id="123">
    <h3 class="title">iPhone 15</h3>
    <span class="price">$999</span>
    <div class="description">
        <p>Latest iPhone with advanced features</p>
        <ul class="features">
            <li>A17 Pro chip</li>
            <li>48MP camera</li>
            <li>USB-C</li>
        </ul>
    </div>
</div>
"""
do {
    let doc = try SwiftSoup.parseBodyFragment(productFragment)
    // Extract product details
    let productId = try doc.select("div.product").first()?.attr("data-id") ?? ""
    let title = try doc.select("h3.title").text()
    let price = try doc.select("span.price").text()
    let features = try doc.select("ul.features li").map { try $0.text() }
    print("Product ID: \(productId)")
    print("Title: \(title)")
    print("Price: \(price)")
    print("Features: \(features)")
} catch {
    print("Error extracting data: \(error)")
}
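When the extracted values travel further through an app, it can help to collect them in a typed value. The sketch below assumes the same product markup as above; `Product` and `parseProduct(from:)` are hypothetical names introduced for illustration.

```swift
import SwiftSoup

/// Hypothetical model type for the product markup used in this article.
struct Product {
    let id: String
    let title: String
    let price: String
    let features: [String]
}

/// Returns nil when the fragment has no div.product element.
/// `parseProduct(from:)` is a hypothetical helper, not a SwiftSoup API.
func parseProduct(from fragment: String) throws -> Product? {
    let doc = try SwiftSoup.parseBodyFragment(fragment)
    guard let root = try doc.select("div.product").first() else { return nil }
    // Selecting from `root` scopes every query to this one product.
    return Product(
        id: try root.attr("data-id"),
        title: try root.select("h3.title").text(),
        price: try root.select("span.price").text(),
        features: try root.select("ul.features li").map { try $0.text() }
    )
}

do {
    let sample = "<div class=\"product\" data-id=\"123\"><h3 class=\"title\">iPhone 15</h3><span class=\"price\">$999</span><ul class=\"features\"><li>USB-C</li></ul></div>"
    if let product = try parseProduct(from: sample) {
        print("\(product.id): \(product.title), \(product.price), \(product.features)")
    }
} catch {
    print("Error extracting data: \(error)")
}
```

Scoping the selectors to the product element avoids mixing values when one fragment contains several products.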
Modifying Fragment Content
SwiftSoup allows you to modify parsed fragments before using them:
let htmlFragment = """
<div class="content">
    <p>Original content</p>
    <img src="old-image.jpg" alt="Old Image">
</div>
"""
do {
    let doc = try SwiftSoup.parseBodyFragment(htmlFragment)
    // Modify content
    try doc.select("p").first()?.text("Updated content")
    try doc.select("img").first()?.attr("src", "new-image.jpg")
    try doc.select("img").first()?.attr("alt", "New Image")
    // Add new elements
    let newDiv = try doc.createElement("div")
    try newDiv.attr("class", "footer")
    try newDiv.text("Added footer content")
    try doc.body()?.appendChild(newDiv)
    // Get modified HTML
    let modifiedHtml = try doc.body()?.html() ?? ""
    print(modifiedHtml)
} catch {
    print("Error modifying fragment: \(error)")
}
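Modifications can also be applied to every matched element at once, because setting an attribute on a selection updates all of its elements. A minimal sketch, with `hardenLinks` as a hypothetical helper name:

```swift
import SwiftSoup

/// Adds target and rel attributes to every link in the fragment.
/// `hardenLinks` is a hypothetical helper, not a SwiftSoup API.
func hardenLinks(in fragment: String) throws -> String {
    let doc = try SwiftSoup.parseBodyFragment(fragment)
    let links = try doc.select("a[href]")
    // Setting an attribute on Elements applies it to each matched element.
    try links.attr("target", "_blank")
    try links.attr("rel", "noopener noreferrer")
    return try doc.body()?.html() ?? ""
}

do {
    print(try hardenLinks(in: "<p><a href=\"/a\">A</a> and <a href=\"/b\">B</a></p>"))
} catch {
    print("Error modifying fragment: \(error)")
}
```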
Best Practices for Fragment Parsing
Handling Malformed Fragments
SwiftSoup automatically corrects malformed HTML, but you should validate your fragments:
func parseAndValidateFragment(_ html: String) -> Document? {
    do {
        let doc = try SwiftSoup.parseBodyFragment(html)
        // Validate structure
        guard let body = doc.body() else {
            print("Warning: Fragment produced no body")
            return nil
        }
        // The parser repairs markup silently, so check that usable content survived
        if body.children().isEmpty() && !body.hasText() {
            print("Warning: Fragment contains no elements or text")
        }
        return doc
    } catch {
        print("Failed to parse fragment: \(error)")
        return nil
    }
}
// Test with malformed HTML
let malformedFragment = "<div><p>Unclosed paragraph<span>Nested content</div>"
if let doc = parseAndValidateFragment(malformedFragment) {
    print("Successfully parsed and corrected malformed fragment")
}
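To see what the correction produces, you can inspect the repaired tree. The sketch below reuses the malformed sample from above and checks where the unclosed elements ended up.

```swift
import SwiftSoup

// Same malformed sample as above: the <p> and <span> are never closed.
let malformed = "<div><p>Unclosed paragraph<span>Nested content</div>"

do {
    let doc = try SwiftSoup.parseBodyFragment(malformed)

    // The parser closes the open elements when it reaches </div>,
    // so the span stays nested inside the paragraph.
    let nestedSpans = try doc.select("div > p > span")
    print("Spans nested inside the paragraph: \(nestedSpans.size())")

    // Print the repaired markup to compare it with the input.
    print(try doc.body()?.html() ?? "")
} catch {
    print("Failed to parse fragment: \(error)")
}
```

Comparing the repaired markup with the input is a cheap way to spot fragments whose structure changed in a way you did not intend.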
Performance Considerations
When parsing many fragments, route them through a single helper so that a failure in one fragment does not abort the whole batch:
class FragmentParser {
    func parseFragment(_ html: String, baseUri: String = "") throws -> Document {
        return try SwiftSoup.parseBodyFragment(html, baseUri)
    }
    func parseBatch(_ fragments: [String]) -> [Document] {
        // try? drops fragments that fail to parse instead of stopping the batch
        return fragments.compactMap { fragment in
            try? parseFragment(fragment)
        }
    }
}
// Usage
let parser = FragmentParser()
let fragments = ["<div>Fragment 1</div>", "<div>Fragment 2</div>"]
let documents = parser.parseBatch(fragments)
Error Handling and Debugging
Comprehensive Error Handling
enum FragmentParsingError: Error {
    case emptyFragment
    case parsingFailed(String)
    case invalidStructure
}
func robustFragmentParser(_ html: String) throws -> Document {
    guard !html.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty else {
        throw FragmentParsingError.emptyFragment
    }
    let doc: Document
    do {
        doc = try SwiftSoup.parseBodyFragment(html)
    } catch Exception.Error(_, let message) {
        // SwiftSoup reports its own failures through the Exception enum
        throw FragmentParsingError.parsingFailed(message)
    } catch {
        throw FragmentParsingError.parsingFailed(error.localizedDescription)
    }
    // Verify we have valid content; this check sits outside the do block so
    // invalidStructure is not caught and rewrapped as parsingFailed
    guard let body = doc.body(), !body.children().isEmpty() else {
        throw FragmentParsingError.invalidStructure
    }
    return doc
}
Integration with iOS Applications
Using Fragments in Table Views
import UIKit
import WebKit
import SwiftSoup
class HTMLFragmentTableViewCell: UITableViewCell {
    @IBOutlet weak var webView: WKWebView!
    func configure(with fragment: String) {
        do {
            let doc = try SwiftSoup.parseBodyFragment(fragment)
            // Add CSS styling
            let head = doc.head()
            let style = try doc.createElement("style")
            try style.html("""
            body { font-family: -apple-system; margin: 10px; }
            .content { line-height: 1.4; }
            """)
            try head?.appendChild(style)
            let fullHtml = try doc.outerHtml()
            webView.loadHTMLString(fullHtml, baseURL: nil)
        } catch {
            print("Error configuring cell: \(error)")
        }
    }
}
Processing Fragment Collections
When working with collections of fragments, such as from RSS feeds or API responses:
struct ContentProcessor {
    func processFragmentCollection(_ fragments: [String]) -> [ProcessedContent] {
        return fragments.compactMap { fragment in
            do {
                let doc = try SwiftSoup.parseBodyFragment(fragment)
                // Extract standardized data
                let title = try doc.select("h1, h2, h3").first()?.text() ?? ""
                let text = try doc.select("p").text()
                let images = try doc.select("img").map { try $0.attr("src") }
                return ProcessedContent(title: title, text: text, images: images)
            } catch {
                print("Failed to process fragment: \(error)")
                return nil
            }
        }
    }
}
struct ProcessedContent {
    let title: String
    let text: String
    let images: [String]
}
Security Considerations
Sanitizing User-Generated Fragments
When dealing with user-generated HTML fragments, always sanitize the content:
func sanitizeFragment(_ html: String) -> String? {
    do {
        let doc = try SwiftSoup.parseBodyFragment(html)
        // Remove potentially dangerous tags
        try doc.select("script, iframe, object, embed").remove()
        // Remove JavaScript event handlers (onclick, onload, ...)
        for element in try doc.select("*") {
            // Copy the attribute list first so removal does not disturb iteration
            let attributes = element.getAttributes()?.asList() ?? []
            for attr in attributes where attr.getKey().lowercased().hasPrefix("on") {
                try element.removeAttr(attr.getKey())
            }
        }
        // This is a simplified example: it does not check tags or attributes
        // against an allowlist, so consider using a proper HTML sanitizer
        return try doc.body()?.html()
    } catch {
        print("Error sanitizing fragment: \(error)")
        return nil
    }
}
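For an allowlist-based approach, SwiftSoup ships a cleaner modeled on jsoup's. The sketch below assumes your SwiftSoup version exposes SwiftSoup.clean and Whitelist.basic(); `cleanFragment` is a hypothetical helper name, and the sample markup is invented.

```swift
import SwiftSoup

/// Keeps only the tags and attributes permitted by the basic whitelist.
/// `cleanFragment` is a hypothetical helper; check that the chosen whitelist
/// matches what your app actually needs to allow.
func cleanFragment(_ html: String) throws -> String {
    // clean() returns an optional String, so fall back to an empty string.
    return try SwiftSoup.clean(html, Whitelist.basic()) ?? ""
}

do {
    let unsafe = "<p onclick=\"steal()\">Hi <script>alert(1)</script><b>there</b></p>"
    print(try cleanFragment(unsafe))
} catch {
    print("Error sanitizing fragment: \(error)")
}
```

An allowlist tends to be safer than removing known-bad tags, because markup that was not anticipated is dropped by default rather than passed through.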
Comparison with Other Parsing Methods
Compared with handling raw strings or treating partial HTML as a complete document, fragment parsing with parseBodyFragment() offers several advantages:
- Automatic wrapping: Fragments are automatically wrapped in proper HTML structure
- Context preservation: Maintains proper DOM relationships
- Error correction: Automatically fixes unclosed tags and malformed HTML
- Base URI support: Resolves relative URLs when provided
Fragment parsing is essential when working with partial HTML content in iOS development. It ensures your content is properly structured and ready for display or further processing. For web-based applications dealing with dynamic content, you might also want to understand how to handle AJAX requests using Puppeteer for similar challenges in different contexts.
Common Use Cases
Processing Rich Text Content
func processRichTextFragment(_ html: String) -> NSAttributedString? {
    do {
        let doc = try SwiftSoup.parseBodyFragment(html)
        // Convert to attributed string for display in UITextView
        let htmlData = try doc.html().data(using: .utf8)
        return try NSAttributedString(
            data: htmlData ?? Data(),
            options: [.documentType: NSAttributedString.DocumentType.html,
                      .characterEncoding: String.Encoding.utf8.rawValue],
            documentAttributes: nil
        )
    } catch {
        print("Error processing rich text: \(error)")
        return nil
    }
}
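When only the words are needed, for example for a preview line or a search index, the text can be taken from the parsed fragment directly without building an attributed string. A minimal sketch, with `plainText(from:)` as a hypothetical helper name:

```swift
import SwiftSoup

/// Returns the fragment's text content with all tags removed.
/// `plainText(from:)` is a hypothetical helper, not a SwiftSoup API.
func plainText(from fragment: String) throws -> String {
    return try SwiftSoup.parseBodyFragment(fragment).text()
}

do {
    print(try plainText(from: "<p>Hello <b>World</b></p>"))
} catch {
    print("Error extracting text: \(error)")
}
```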
Fragment-Based Template System
class TemplateProcessor {
    func processTemplate(_ template: String, with data: [String: String]) -> String? {
        do {
            let doc = try SwiftSoup.parseBodyFragment(template)
            // Replace template variables
            for (key, value) in data {
                let selector = "[data-template='\(key)']"
                let elements = try doc.select(selector)
                for element in elements {
                    try element.text(value)
                }
            }
            return try doc.body()?.html()
        } catch {
            print("Error processing template: \(error)")
            return nil
        }
    }
}
// Usage
let template = "<div><span data-template='username'>{{username}}</span></div>"
let processor = TemplateProcessor()
let result = processor.processTemplate(template, with: ["username": "John Doe"])
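One property of this approach is worth checking: text(_:) inserts the value as text, so markup inside a value is escaped rather than parsed. The sketch below illustrates that under the same data-template convention; `fill(template:key:value:)` is a hypothetical helper name.

```swift
import SwiftSoup

/// Fills one placeholder and returns the resulting body markup.
/// `fill(template:key:value:)` is a hypothetical helper, not a SwiftSoup API.
func fill(template: String, key: String, value: String) throws -> String {
    let doc = try SwiftSoup.parseBodyFragment(template)
    for element in try doc.select("[data-template=\(key)]") {
        // text(_:) stores the value as a text node, escaping any markup in it.
        try element.text(value)
    }
    return try doc.body()?.html() ?? ""
}

do {
    let output = try fill(
        template: "<div><span data-template='username'></span></div>",
        key: "username",
        value: "<b>John</b>"
    )
    // The <b> tag from the value appears escaped, not as a real element.
    print(output)
} catch {
    print("Error processing template: \(error)")
}
```

This makes text(_:) a reasonable default for values that come from users. If a value is meant to contain trusted markup, html(_:) parses it instead.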
Understanding the differences between fragment and document parsing helps you build iOS applications that handle HTML content reliably, especially when dealing with user-generated content or API responses that return partial HTML structures. Parsing fragments explicitly, validating the result, and sanitizing untrusted input keeps SwiftSoup-based code predictable and easier to maintain.