How do I parse HTML from a string using SwiftSoup?
SwiftSoup is a pure Swift HTML parser that provides a convenient way to parse, extract, and manipulate HTML content from strings. It's inspired by the popular Java library jsoup and offers similar functionality for iOS and macOS developers. This guide walks through the main ways to parse HTML strings with SwiftSoup, from basic parsing to fragments, tables, and error handling.
Installation and Setup
Before parsing HTML strings, you need to add SwiftSoup to your project. Add it to your Package.swift file:
dependencies: [
    .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.6.0")
]
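The dependency entry on its own is not enough for the package to build: a target also has to list SwiftSoup as a dependency. A minimal sketch of a complete manifest might look like the following. The package and target name `HTMLParsingDemo` and the tools version are assumptions made for illustration, not something SwiftSoup requires.

```swift
// swift-tools-version:5.5
import PackageDescription

let package = Package(
    // Hypothetical package name, chosen for this example only
    name: "HTMLParsingDemo",
    dependencies: [
        .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.6.0")
    ],
    targets: [
        .executableTarget(
            name: "HTMLParsingDemo",
            dependencies: [
                // Link the SwiftSoup library product into this target
                .product(name: "SwiftSoup", package: "SwiftSoup")
            ]
        )
    ]
)
```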
Or if using Xcode, add the package through File → Add Package Dependencies.
Import SwiftSoup in your Swift file:
import SwiftSoup
Basic HTML String Parsing
The fundamental method for parsing HTML from a string is SwiftSoup.parse(). Here's the basic syntax:
import SwiftSoup
let htmlString = """
<html>
<head>
<title>Sample Page</title>
</head>
<body>
<h1>Welcome to SwiftSoup</h1>
<p class="intro">This is a sample paragraph.</p>
<div id="content">
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
</div>
</body>
</html>
"""
do {
    let doc: Document = try SwiftSoup.parse(htmlString)
    print("Document parsed successfully")
    print("Title: \(try doc.title())")
} catch Exception.Error(let type, let message) {
    print("Error: \(type) - \(message)")
} catch {
    print("Unexpected error: \(error)")
}
Extracting Specific Elements
Once you have a parsed document, you can extract specific elements using CSS selectors or element traversal methods:
Using CSS Selectors
do {
    let doc = try SwiftSoup.parse(htmlString)

    // Select by tag name
    let headings = try doc.select("h1")
    for heading in headings {
        print("Heading: \(try heading.text())")
    }

    // Select by class
    let introElements = try doc.select(".intro")
    for element in introElements {
        print("Intro text: \(try element.text())")
    }

    // Select by ID
    let contentDiv = try doc.select("#content").first()
    if let content = contentDiv {
        print("Content HTML: \(try content.html())")
    }

    // Complex selectors
    let listItems = try doc.select("div#content ul li")
    for item in listItems {
        print("List item: \(try item.text())")
    }
} catch {
    print("Parsing error: \(error)")
}
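Beyond tag, class, and ID selectors, combinators and attribute selectors are often useful. The sketch below wraps `select` in a small helper; the function name `texts(matching:in:)` and the sample markup are hypothetical and exist only for this example.

```swift
import SwiftSoup

/// Collects the text of every element that matches `selector`.
/// (Hypothetical helper, not part of SwiftSoup.)
func texts(matching selector: String, in html: String) throws -> [String] {
    let doc = try SwiftSoup.parse(html)
    return try doc.select(selector).array().map { try $0.text() }
}

// Sample markup assumed for this example
let listHTML = "<div id=\"content\"><ul><li>Item 1</li><li>Item 2</li></ul><p class=\"intro\">Hello</p></div>"

do {
    // Child combinator: <li> elements directly inside a <ul>
    print(try texts(matching: "ul > li", in: listHTML))
    // Tag combined with a class
    print(try texts(matching: "p.intro", in: listHTML))
    // Any element that has a class attribute
    print(try texts(matching: "[class]", in: listHTML))
} catch {
    print("Selector error: \(error)")
}
```

Returning plain `[String]` values keeps the SwiftSoup types out of the calling code, which can be convenient when only the text is needed.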
Traversing Elements
do {
    let doc = try SwiftSoup.parse(htmlString)

    // Get all paragraphs
    let paragraphs = try doc.getElementsByTag("p")

    // Get first paragraph
    if let firstParagraph = paragraphs.first() {
        print("First paragraph: \(try firstParagraph.text())")

        // Get attributes
        let className = try firstParagraph.attr("class")
        print("Class attribute: \(className)")
    }

    // Get elements by attribute value
    let elementsWithClass = try doc.getElementsByAttributeValue("class", "intro")
    print("Elements with class \"intro\": \(elementsWithClass.size())")
} catch {
    print("Error traversing elements: \(error)")
}
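Traversal can also move up and down the tree from a given element, using `children()` and `parent()`. The following is a minimal sketch; the helper names `childTagNames(ofFirst:in:)` and `parentTagName(ofFirst:in:)` and the sample markup are assumptions made for illustration.

```swift
import SwiftSoup

/// Tag names of the direct children of the first element matching `selector`.
/// (Hypothetical helper.)
func childTagNames(ofFirst selector: String, in html: String) throws -> [String] {
    let doc = try SwiftSoup.parse(html)
    guard let element = try doc.select(selector).first() else { return [] }
    return element.children().array().map { $0.tagName() }
}

/// Tag name of the parent of the first element matching `selector`, if any.
/// (Hypothetical helper.)
func parentTagName(ofFirst selector: String, in html: String) throws -> String? {
    let doc = try SwiftSoup.parse(html)
    return try doc.select(selector).first()?.parent()?.tagName()
}

// Sample markup assumed for this example
let treeHTML = "<div id=\"content\"><ul><li>Item 1</li><li>Item 2</li></ul><p>Footer</p></div>"

do {
    print(try childTagNames(ofFirst: "#content", in: treeHTML))
    print(try parentTagName(ofFirst: "li", in: treeHTML) ?? "none")
} catch {
    print("Traversal error: \(error)")
}
```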
Working with Malformed HTML
SwiftSoup is forgiving with malformed HTML and will attempt to create a valid document structure:
let malformedHTML = """
<div>
<p>Unclosed paragraph
<span>Nested span</div>
<div>Another div
"""
do {
    let doc = try SwiftSoup.parse(malformedHTML)

    // SwiftSoup automatically closes unclosed tags
    print("Cleaned HTML:")
    print(try doc.html())

    // Extract text content
    let textContent = try doc.text()
    print("Text content: \(textContent)")
} catch {
    print("Error parsing malformed HTML: \(error)")
}
Extracting Data from Tables
When dealing with structured data like tables, SwiftSoup provides efficient methods to extract information:
let tableHTML = """
<table id="data-table">
<thead>
<tr>
<th>Name</th>
<th>Age</th>
<th>City</th>
</tr>
</thead>
<tbody>
<tr>
<td>John Doe</td>
<td>30</td>
<td>New York</td>
</tr>
<tr>
<td>Jane Smith</td>
<td>25</td>
<td>Los Angeles</td>
</tr>
</tbody>
</table>
"""
do {
    let doc = try SwiftSoup.parse(tableHTML)

    // Extract table headers
    let headers = try doc.select("table#data-table thead th")
    let headerTexts = try headers.map { try $0.text() }
    print("Headers: \(headerTexts)")

    // Extract table rows
    let rows = try doc.select("table#data-table tbody tr")
    for row in rows {
        let cells = try row.select("td")
        let cellTexts = try cells.map { try $0.text() }
        print("Row data: \(cellTexts)")
    }
} catch {
    print("Error parsing table: \(error)")
}
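Row arrays are often easier to work with once each cell is keyed by its column header. One possible way to do that is sketched below. The function name `tableRecords(from:tableSelector:)` and the sample table are hypothetical, and the sketch assumes the table has a `thead` and a `tbody`, as in the example above.

```swift
import SwiftSoup

/// Converts each body row into a dictionary keyed by the header text.
/// (Hypothetical helper; assumes <thead>/<tbody> markup.)
func tableRecords(from html: String, tableSelector: String) throws -> [[String: String]] {
    let doc = try SwiftSoup.parse(html)
    let headers = try doc.select("\(tableSelector) thead th").array().map { try $0.text() }
    let rows = try doc.select("\(tableSelector) tbody tr").array()

    return try rows.map { row -> [String: String] in
        let cells = try row.select("td").array().map { try $0.text() }
        // zip stops at the shorter sequence, so short rows simply omit trailing keys;
        // if two headers share a name, the first column wins
        return Dictionary(zip(headers, cells), uniquingKeysWith: { first, _ in first })
    }
}

// Sample table assumed for this example
let sampleTable = """
<table id="people">
<thead><tr><th>Name</th><th>Age</th></tr></thead>
<tbody>
<tr><td>John Doe</td><td>30</td></tr>
<tr><td>Jane Smith</td><td>25</td></tr>
</tbody>
</table>
"""

do {
    let records = try tableRecords(from: sampleTable, tableSelector: "table#people")
    for record in records {
        print(record["Name"] ?? "", record["Age"] ?? "")
    }
} catch {
    print("Error building records: \(error)")
}
```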
Advanced Parsing Techniques
Parsing HTML Fragments
For parsing HTML fragments (not complete documents), use parseBodyFragment():
let htmlFragment = """
<div class="product">
<h3>Product Name</h3>
<p class="price">$29.99</p>
<button onclick="addToCart()">Add to Cart</button>
</div>
"""
do {
    let doc = try SwiftSoup.parseBodyFragment(htmlFragment)

    // body() returns an optional, so unwrap it rather than force-unwrapping
    if let body = doc.body() {
        let productName = try body.select("h3").first()?.text() ?? ""
        let price = try body.select(".price").first()?.text() ?? ""
        print("Product: \(productName), Price: \(price)")
    }
} catch {
    print("Error parsing fragment: \(error)")
}
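When a fragment maps onto a model type, it can help to return a value instead of printing. The sketch below is one way to do this; the `Product` struct and the `parseProduct(from:)` function are hypothetical names introduced for this example, and the selectors assume markup shaped like the fragment above.

```swift
import SwiftSoup

// Hypothetical model type for this example
struct Product: Equatable {
    let name: String
    let price: String
}

/// Returns a Product if the fragment contains both a name and a price, otherwise nil.
func parseProduct(from fragment: String) throws -> Product? {
    let doc = try SwiftSoup.parseBodyFragment(fragment)
    guard let body = doc.body(),
          let name = try body.select(".product h3").first()?.text(),
          let price = try body.select(".product .price").first()?.text()
    else {
        return nil
    }
    return Product(name: name, price: price)
}

// Sample fragment assumed for this example
let sampleFragment = "<div class=\"product\"><h3>Widget</h3><p class=\"price\">$9.50</p></div>"

do {
    if let product = try parseProduct(from: sampleFragment) {
        print("Product: \(product.name), Price: \(product.price)")
    } else {
        print("Fragment did not contain a complete product")
    }
} catch {
    print("Error parsing product: \(error)")
}
```

Returning `nil` for incomplete markup, rather than empty strings, lets the caller tell a missing field apart from an empty one.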
Custom Base URI
When parsing HTML that contains relative URLs, you can specify a base URI:
let htmlWithLinks = """
<div>
<a href="/page1">Page 1</a>
<img src="images/photo.jpg" alt="Photo">
</div>
"""
do {
    let baseUri = "https://example.com"
    let doc = try SwiftSoup.parse(htmlWithLinks, baseUri)

    // Get absolute URLs
    let links = try doc.select("a[href]")
    for link in links {
        let absoluteUrl = try link.attr("abs:href")
        print("Absolute URL: \(absoluteUrl)")
    }

    let images = try doc.select("img[src]")
    for img in images {
        let absoluteSrc = try img.attr("abs:src")
        print("Absolute image URL: \(absoluteSrc)")
    }
} catch {
    print("Error parsing with base URI: \(error)")
}
Error Handling Best Practices
Always wrap SwiftSoup operations in do-catch blocks and handle specific error types:
func parseHTMLSafely(_ htmlString: String) -> Document? {
    do {
        let doc = try SwiftSoup.parse(htmlString)
        return doc
    } catch Exception.Error(let type, let message) {
        print("SwiftSoup Error - Type: \(type), Message: \(message)")
        return nil
    } catch {
        print("Unexpected error: \(error.localizedDescription)")
        return nil
    }
}

// Usage
if let document = parseHTMLSafely(htmlString) {
    // Safely work with the document
    do {
        let title = try document.title()
        print("Document title: \(title)")
    } catch {
        print("Error extracting title: \(error)")
    }
}
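When the details of a failure do not matter and a missing value is acceptable, `try?` can replace a nested do-catch. The helper below is a sketch of that pattern; the name `firstText(matching:in:)` is hypothetical. Note that it deliberately discards the error, so it is a poor fit where the cause of a failure needs to be logged.

```swift
import SwiftSoup

/// Returns the text of the first element matching `selector`,
/// or nil if nothing matches or the lookup throws.
/// (Hypothetical helper; errors are intentionally discarded.)
func firstText(matching selector: String, in document: Document) -> String? {
    guard let element = try? document.select(selector).first() else { return nil }
    return try? element.text()
}

// Usage with sample markup assumed for this example
if let document = try? SwiftSoup.parse("<h1>Welcome to SwiftSoup</h1>") {
    print(firstText(matching: "h1", in: document) ?? "no heading")
    print(firstText(matching: "h2", in: document) ?? "no subheading")
}
```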
Performance Considerations
When parsing large HTML strings or processing multiple documents:
- Reuse selectors: Cache frequently used CSS selectors
- Use specific selectors: More specific selectors perform better than broad ones
- Parse fragments when possible: Use parseBodyFragment() for partial HTML
- Handle memory efficiently: Process large documents in chunks when possible
class HTMLParser {
    private let titleSelector = "title"
    private let metaSelector = "meta[name=description]"

    func extractMetadata(from htmlString: String) -> (title: String, description: String) {
        do {
            let doc = try SwiftSoup.parse(htmlString)
            let title = try doc.select(titleSelector).first()?.text() ?? ""
            let description = try doc.select(metaSelector).first()?.attr("content") ?? ""
            return (title: title, description: description)
        } catch {
            print("Error extracting metadata: \(error)")
            return (title: "", description: "")
        }
    }
}
Integration with Web Scraping Workflows
SwiftSoup works well in web scraping workflows where you need to parse HTML content retrieved from web requests. SwiftSoup only parses the markup it is given and does not execute JavaScript, so for JavaScript-heavy sites you might need additional tools, similar to how Puppeteer handles dynamic content in web applications.
For comprehensive web scraping projects, consider combining SwiftSoup with networking libraries like URLSession or Alamofire to fetch HTML content, then parse it with SwiftSoup for data extraction.
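A minimal sketch of the fetch-then-parse flow with URLSession is shown below. The function names `pageTitle(from:)` and `fetchPageTitle(from:completion:)` are hypothetical, the example URL is a placeholder, and the sketch assumes the response body is UTF-8; a real scraper would likely inspect the response's declared encoding and status code.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLSession lives here on Linux
#endif
import SwiftSoup

/// Parses downloaded bytes and returns the document title.
/// Assumes UTF-8; invalid bytes are replaced rather than rejected.
func pageTitle(from data: Data) throws -> String {
    let html = String(decoding: data, as: UTF8.self)
    return try SwiftSoup.parse(html).title()
}

/// Fetches a page and reports its title, or the first error encountered.
/// (Hypothetical helper.)
func fetchPageTitle(
    from url: URL,
    completion: @escaping @Sendable (Result<String, Error>) -> Void
) {
    let task = URLSession.shared.dataTask(with: url) { data, _, error in
        if let error = error {
            completion(.failure(error))
            return
        }
        do {
            completion(.success(try pageTitle(from: data ?? Data())))
        } catch {
            completion(.failure(error))
        }
    }
    task.resume()
}

// Usage (placeholder URL; the callback runs on a background queue)
if let url = URL(string: "https://example.com") {
    fetchPageTitle(from: url) { result in
        switch result {
        case .success(let title): print("Title: \(title)")
        case .failure(let error): print("Request failed: \(error)")
        }
    }
}
```

Keeping the parsing step in its own function separates it from the network call, so it can be exercised with fixed input without touching the network.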
Conclusion
SwiftSoup provides a robust and Swift-native solution for parsing HTML from strings. Its jQuery-like selector syntax makes it familiar to web developers, while its lenient parser copes with malformed HTML and its throwing API lets your apps handle parsing errors gracefully. Whether you're building a simple HTML parser or a complex web scraping solution, SwiftSoup offers the tools you need to extract and manipulate HTML content effectively.
Remember to always handle parsing errors appropriately and consider performance implications when working with large HTML documents. With proper implementation, SwiftSoup can be a powerful tool in your iOS or macOS development toolkit.