How do I parse HTML from a string using SwiftSoup?

SwiftSoup is a pure Swift HTML parser that provides a convenient way to parse, extract, and manipulate HTML content from strings. It is inspired by the popular Java library jsoup and offers similar functionality for iOS and macOS developers. This guide covers everything you need to know about parsing HTML strings with SwiftSoup.

Installation and Setup

Before parsing HTML strings, you need to add SwiftSoup to your project. Add it to your Package.swift file:

dependencies: [
    .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.6.0")
]
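
If you declare your own targets in Package.swift, also add SwiftSoup to the dependencies of the target that will import it (the target name below is a placeholder):

targets: [
    .target(
        name: "MyApp",  // placeholder: use your target's name
        dependencies: ["SwiftSoup"]
    )
]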

Or if using Xcode, add the package through File → Add Package Dependencies.

Import SwiftSoup in your Swift file:

import SwiftSoup

Basic HTML String Parsing

The fundamental entry point for parsing HTML from a string is SwiftSoup.parse(). Here's the basic syntax:

import SwiftSoup

let htmlString = """
<html>
<head>
    <title>Sample Page</title>
</head>
<body>
    <h1>Welcome to SwiftSoup</h1>
    <p class="intro">This is a sample paragraph.</p>
    <div id="content">
        <ul>
            <li>Item 1</li>
            <li>Item 2</li>
            <li>Item 3</li>
        </ul>
    </div>
</body>
</html>
"""

do {
    let doc: Document = try SwiftSoup.parse(htmlString)
    print("Document parsed successfully")
    print("Title: \(try doc.title())")
} catch Exception.Error(let type, let message) {
    print("Error: \(type) - \(message)")
} catch {
    print("Unexpected error: \(error)")
}

Extracting Specific Elements

Once you have a parsed document, you can extract specific elements using CSS selectors or element traversal methods:

Using CSS Selectors

do {
    let doc = try SwiftSoup.parse(htmlString)

    // Select by tag name
    let headings = try doc.select("h1")
    for heading in headings {
        print("Heading: \(try heading.text())")
    }

    // Select by class
    let introElements = try doc.select(".intro")
    for element in introElements {
        print("Intro text: \(try element.text())")
    }

    // Select by ID
    let contentDiv = try doc.select("#content").first()
    if let content = contentDiv {
        print("Content HTML: \(try content.html())")
    }

    // Complex selectors
    let listItems = try doc.select("div#content ul li")
    for item in listItems {
        print("List item: \(try item.text())")
    }

} catch {
    print("Parsing error: \(error)")
}

Traversing Elements

do {
    let doc = try SwiftSoup.parse(htmlString)

    // Get all paragraphs
    let paragraphs = try doc.getElementsByTag("p")

    // Get first paragraph
    if let firstParagraph = paragraphs.first() {
        print("First paragraph: \(try firstParagraph.text())")

        // Get attributes
        let className = try firstParagraph.attr("class")
        print("Class attribute: \(className)")
    }

    // Get elements by attribute value
    let elementsWithClass = try doc.getElementsByAttributeValue("class", "intro")
    print("Elements with class 'intro': \(elementsWithClass.size())")

} catch {
    print("Error traversing elements: \(error)")
}
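
Beyond tag lookups, you can also walk the document tree directly with methods such as children() and parent(). Here's a minimal sketch, reusing the htmlString sample from above:

do {
    let doc = try SwiftSoup.parse(htmlString)

    if let contentDiv = try doc.select("#content").first() {
        // Direct child elements of the div (here: the <ul>)
        for child in contentDiv.children() {
            print("Child tag: \(child.tagName())")
        }

        // Walk up to the enclosing element (<body>)
        if let parent = contentDiv.parent() {
            print("Parent tag: \(parent.tagName())")
        }
    }
} catch {
    print("Error navigating elements: \(error)")
}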

Working with Malformed HTML

SwiftSoup is forgiving with malformed HTML and will attempt to create a valid document structure:

let malformedHTML = """
<div>
    <p>Unclosed paragraph
    <span>Nested span</div>
<div>Another div
"""

do {
    let doc = try SwiftSoup.parse(malformedHTML)

    // SwiftSoup automatically closes unclosed tags
    print("Cleaned HTML:")
    print(try doc.html())

    // Extract text content
    let textContent = try doc.text()
    print("Text content: \(textContent)")

} catch {
    print("Error parsing malformed HTML: \(error)")
}

Extracting Data from Tables

When dealing with structured data like tables, SwiftSoup provides efficient methods to extract information:

let tableHTML = """
<table id="data-table">
    <thead>
        <tr>
            <th>Name</th>
            <th>Age</th>
            <th>City</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>John Doe</td>
            <td>30</td>
            <td>New York</td>
        </tr>
        <tr>
            <td>Jane Smith</td>
            <td>25</td>
            <td>Los Angeles</td>
        </tr>
    </tbody>
</table>
"""

do {
    let doc = try SwiftSoup.parse(tableHTML)

    // Extract table headers
    let headers = try doc.select("table#data-table thead th")
    let headerTexts = try headers.map { try $0.text() }
    print("Headers: \(headerTexts)")

    // Extract table rows
    let rows = try doc.select("table#data-table tbody tr")

    for row in rows {
        let cells = try row.select("td")
        let cellTexts = try cells.map { try $0.text() }
        print("Row data: \(cellTexts)")
    }

} catch {
    print("Error parsing table: \(error)")
}
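
If you prefer each row keyed by its column header rather than positional arrays, the same selectors can feed a small post-processing step. Here's a sketch building on the tableHTML sample above:

do {
    let doc = try SwiftSoup.parse(tableHTML)

    let headerTexts = try doc.select("table#data-table thead th").map { try $0.text() }
    let rows = try doc.select("table#data-table tbody tr")

    var records: [[String: String]] = []
    for row in rows {
        let cellTexts = try row.select("td").map { try $0.text() }
        // Pair each header with the cell in the same column
        records.append(Dictionary(uniqueKeysWithValues: zip(headerTexts, cellTexts)))
    }

    print(records)
    // e.g. [["Name": "John Doe", "Age": "30", "City": "New York"], ...]
} catch {
    print("Error building table records: \(error)")
}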

Advanced Parsing Techniques

Parsing HTML Fragments

For parsing HTML fragments (not complete documents), use parseBodyFragment():

let htmlFragment = """
<div class="product">
    <h3>Product Name</h3>
    <p class="price">$29.99</p>
    <button onclick="addToCart()">Add to Cart</button>
</div>
"""

do {
    let doc = try SwiftSoup.parseBodyFragment(htmlFragment)
    // parseBodyFragment wraps the fragment in a synthetic document,
    // so the parsed elements live under its body
    if let body = doc.body() {
        let productName = try body.select("h3").first()?.text() ?? ""
        let price = try body.select(".price").first()?.text() ?? ""

        print("Product: \(productName), Price: \(price)")
    }

} catch {
    print("Error parsing fragment: \(error)")
}

Custom Base URI

When parsing HTML that contains relative URLs, you can specify a base URI:

let htmlWithLinks = """
<div>
    <a href="/page1">Page 1</a>
    <img src="images/photo.jpg" alt="Photo">
</div>
"""

do {
    let baseUri = "https://example.com"
    let doc = try SwiftSoup.parse(htmlWithLinks, baseUri)

    // Get absolute URLs
    let links = try doc.select("a[href]")
    for link in links {
        let absoluteUrl = try link.attr("abs:href")
        print("Absolute URL: \(absoluteUrl)")
    }

    let images = try doc.select("img[src]")
    for img in images {
        let absoluteSrc = try img.attr("abs:src")
        print("Absolute image URL: \(absoluteSrc)")
    }

} catch {
    print("Error parsing with base URI: \(error)")
}

Error Handling Best Practices

Always wrap SwiftSoup operations in do-catch blocks and handle specific error types:

func parseHTMLSafely(_ htmlString: String) -> Document? {
    do {
        let doc = try SwiftSoup.parse(htmlString)
        return doc
    } catch Exception.Error(let type, let message) {
        print("SwiftSoup Error - Type: \(type), Message: \(message)")
        return nil
    } catch {
        print("Unexpected error: \(error.localizedDescription)")
        return nil
    }
}

// Usage
if let document = parseHTMLSafely(htmlString) {
    // Safely work with the document
    do {
        let title = try document.title()
        print("Document title: \(title)")
    } catch {
        print("Error extracting title: \(error)")
    }
}

Performance Considerations

When parsing large HTML strings or processing multiple documents:

  1. Reuse selectors: Cache frequently used CSS selectors
  2. Use specific selectors: More specific selectors perform better than broad ones
  3. Parse fragments when possible: Use parseBodyFragment() for partial HTML
  4. Handle memory efficiently: Process large documents in chunks when possible

For example, a parser class can hold its selectors as stored constants and reuse them across documents:

class HTMLParser {
    private let titleSelector = "title"
    private let metaSelector = "meta[name=description]"

    func extractMetadata(from htmlString: String) -> (title: String, description: String) {
        do {
            let doc = try SwiftSoup.parse(htmlString)

            let title = try doc.select(titleSelector).first()?.text() ?? ""
            let description = try doc.select(metaSelector).first()?.attr("content") ?? ""

            return (title: title, description: description)
        } catch {
            print("Error extracting metadata: \(error)")
            return (title: "", description: "")
        }
    }
}
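
Usage is then a single call per document:

let parser = HTMLParser()
let metadata = parser.extractMetadata(from: htmlString)
print("Title: \(metadata.title)")
print("Description: \(metadata.description)")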

Integration with Web Scraping Workflows

SwiftSoup works well in web scraping workflows where you need to parse HTML content retrieved from web requests. SwiftSoup only handles the HTML parsing, so for JavaScript-heavy sites you may need additional tooling, similar to how Puppeteer handles dynamic content in web applications.

For comprehensive web scraping projects, consider combining SwiftSoup with networking libraries like URLSession or Alamofire to fetch HTML content, then parse it with SwiftSoup for data extraction.
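
As a rough sketch of that workflow, here's how fetched HTML might be handed to SwiftSoup using URLSession (the URL and selector are placeholders):

import Foundation
import SwiftSoup

// Placeholder URL: swap in the page you actually want to scrape
let url = URL(string: "https://example.com")!

let task = URLSession.shared.dataTask(with: url) { data, _, error in
    guard let data = data,
          let html = String(data: data, encoding: .utf8) else {
        print("Failed to fetch HTML: \(error?.localizedDescription ?? "unknown error")")
        return
    }

    do {
        // Pass the page URL as the base URI so abs: attributes resolve
        let doc = try SwiftSoup.parse(html, url.absoluteString)
        let title = try doc.title()
        let links = try doc.select("a[href]").map { try $0.attr("abs:href") }
        print("Title: \(title), found \(links.count) links")
    } catch {
        print("Parsing error: \(error)")
    }
}
task.resume()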

Conclusion

SwiftSoup provides a robust and Swift-native solution for parsing HTML from strings. Its jQuery-like selector syntax makes it familiar to web developers, while its error-handling capabilities ensure your apps can gracefully handle malformed HTML. Whether you're building a simple HTML parser or a complex web scraping solution, SwiftSoup offers the tools you need to extract and manipulate HTML content effectively.

Remember to always handle parsing errors appropriately and consider performance implications when working with large HTML documents. With proper implementation, SwiftSoup can be a powerful tool in your iOS or macOS development toolkit.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
