How do I parse CSS selectors for HTML content extraction in Swift?

Parsing CSS selectors for HTML content extraction in Swift requires a third-party library, since Swift's standard library has no HTML parser comparable to a browser's DOM. This guide covers the most effective approaches using SwiftSoup, Foundation networking, and custom parsing helpers.

Understanding CSS Selectors in Swift Context

CSS selectors are patterns used to select HTML elements for styling or data extraction. In Swift, you'll need third-party libraries to interpret these selectors and extract content from HTML documents. The most popular and reliable option is SwiftSoup, which provides jQuery-like syntax for HTML parsing.

Setting Up SwiftSoup

SwiftSoup is the most comprehensive HTML parsing library for Swift, offering full CSS selector support.

Installation via Swift Package Manager

Add SwiftSoup to your Package.swift file:

dependencies: [
    .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.6.0")
]

Or add it through Xcode: File → Add Package Dependencies → https://github.com/scinfu/SwiftSoup.git
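With Swift Package Manager, the package also needs to be listed in the dependencies of the target that uses it — a sketch assuming a target named MyApp (the target name is a placeholder):

// Package.swift (sketch — "MyApp" is a placeholder target name)
targets: [
    .target(
        name: "MyApp",
        dependencies: [
            .product(name: "SwiftSoup", package: "SwiftSoup")
        ]
    )
]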

Installation via CocoaPods

Add to your Podfile:

pod 'SwiftSoup', '~> 2.6.0'

Basic CSS Selector Parsing with SwiftSoup

Here's how to parse HTML and extract content using CSS selectors:

import SwiftSoup

func parseHTMLWithCSSSelectors() {
    let html = """
    <html>
        <body>
            <div class="container">
                <h1 id="title">Welcome to SwiftSoup</h1>
                <p class="description">This is a paragraph with class description.</p>
                <ul class="list">
                    <li data-id="1">Item 1</li>
                    <li data-id="2">Item 2</li>
                    <li data-id="3">Item 3</li>
                </ul>
                <a href="https://example.com" class="external-link">External Link</a>
            </div>
        </body>
    </html>
    """

    do {
        let doc = try SwiftSoup.parse(html)

        // Parse by ID selector
        let title = try doc.select("#title").first()?.text() ?? ""
        print("Title: \(title)")

        // Parse by class selector
        let description = try doc.select(".description").first()?.text() ?? ""
        print("Description: \(description)")

        // Parse by tag selector
        let listItems = try doc.select("li")
        for item in listItems {
            let text = try item.text()
            let dataId = try item.attr("data-id")
            print("Item: \(text), ID: \(dataId)")
        }

        // Parse by attribute selector
        let externalLink = try doc.select("a[href^=https]").first()
        if let link = externalLink {
            let href = try link.attr("href")
            let linkText = try link.text()
            print("External link: \(linkText) -> \(href)")
        }

    } catch {
        print("Error parsing HTML: \(error)")
    }
}

Advanced CSS Selector Techniques

Combining Multiple Selectors

func advancedCSSSelectors() {
    let html = """
    <div class="article">
        <header>
            <h2 class="title">Article Title</h2>
            <span class="author">John Doe</span>
            <time datetime="2024-01-15">January 15, 2024</time>
        </header>
        <section class="content">
            <p class="intro">Introduction paragraph</p>
            <p>Regular paragraph</p>
            <p class="highlight">Important information</p>
        </section>
    </div>
    """

    do {
        let doc = try SwiftSoup.parse(html)

        // Descendant selector
        let articleTitle = try doc.select(".article .title").first()?.text() ?? ""
        print("Article title: \(articleTitle)")

        // Child selector
        let directChildren = try doc.select(".content > p")
        print("Direct paragraph children: \(directChildren.count)")

        // Adjacent sibling selector
        let authorAfterTitle = try doc.select(".title + .author").first()?.text() ?? ""
        print("Author: \(authorAfterTitle)")

        // Attribute contains selector
        let timeElement = try doc.select("time[datetime*=2024]").first()
        if let time = timeElement {
            let datetime = try time.attr("datetime")
            let text = try time.text()
            print("Time: \(text) (\(datetime))")
        }

        // Pseudo-selector equivalents
        let firstParagraph = try doc.select(".content p").first()?.text() ?? ""
        let lastParagraph = try doc.select(".content p").last()?.text() ?? ""
        print("First paragraph: \(firstParagraph)")
        print("Last paragraph: \(lastParagraph)")

    } catch {
        print("Error: \(error)")
    }
}

Web Scraping with CSS Selectors

For real-world web scraping scenarios, you'll need to fetch HTML from URLs and then parse it:

import Foundation

class WebScraper {
    func scrapeWebpage(url: String, completion: @escaping (Result<[String: Any], Error>) -> Void) {
        guard let url = URL(string: url) else {
            completion(.failure(ScrapingError.invalidURL))
            return
        }

        let task = URLSession.shared.dataTask(with: url) { data, response, error in
            if let error = error {
                completion(.failure(error))
                return
            }

            guard let data = data,
                  let htmlString = String(data: data, encoding: .utf8) else {
                completion(.failure(ScrapingError.invalidData))
                return
            }

            do {
                let extractedData = try self.parseHTMLContent(htmlString)
                completion(.success(extractedData))
            } catch {
                completion(.failure(error))
            }
        }

        task.resume()
    }

    private func parseHTMLContent(_ html: String) throws -> [String: Any] {
        let doc = try SwiftSoup.parse(html)
        var result: [String: Any] = [:]

        // Extract page title
        result["title"] = try doc.select("title").first()?.text() ?? ""

        // Extract meta description
        result["description"] = try doc.select("meta[name=description]").first()?.attr("content") ?? ""

        // Extract all links
        let links = try doc.select("a[href]").map { element -> [String: String] in
            return [
                "text": try element.text(),
                "url": try element.attr("href")
            ]
        }
        result["links"] = links

        // Extract all images
        let images = try doc.select("img").map { element -> [String: String] in
            return [
                "alt": try element.attr("alt"),
                "src": try element.attr("src")
            ]
        }
        result["images"] = images

        // Extract specific content by class or ID
        result["main_content"] = try doc.select(".content, #content, main").first()?.text() ?? ""

        return result
    }
}

enum ScrapingError: Error {
    case invalidURL
    case invalidData
    case parsingFailed
}

Handling Complex Selector Scenarios

Working with Tables

func parseHTMLTable() {
    let tableHTML = """
    <table class="data-table">
        <thead>
            <tr>
                <th>Name</th>
                <th>Age</th>
                <th>City</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>Alice</td>
                <td>25</td>
                <td>New York</td>
            </tr>
            <tr>
                <td>Bob</td>
                <td>30</td>
                <td>Los Angeles</td>
            </tr>
        </tbody>
    </table>
    """

    do {
        let doc = try SwiftSoup.parse(tableHTML)

        // Extract table headers
        let headers = try doc.select("thead th").map { try $0.text() }
        print("Headers: \(headers)")

        // Extract table rows
        let rows = try doc.select("tbody tr")
        var tableData: [[String]] = []

        for row in rows {
            let cells = try row.select("td").map { try $0.text() }
            tableData.append(cells)
        }

        print("Table data: \(tableData)")

    } catch {
        print("Error parsing table: \(error)")
    }
}
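The headers and rows extracted above pair naturally into one dictionary per row, so cells can be looked up by column name. A minimal sketch in plain Swift — the zipping itself doesn't depend on SwiftSoup, and the sample data mirrors the table above:

```swift
import Foundation

// Combine table headers and row cells into one dictionary per row.
// zip truncates rows that are shorter or longer than the header list.
func rowsToDictionaries(headers: [String], rows: [[String]]) -> [[String: String]] {
    rows.map { cells in
        Dictionary(uniqueKeysWithValues: zip(headers, cells))
    }
}

let headers = ["Name", "Age", "City"]
let rows = [["Alice", "25", "New York"], ["Bob", "30", "Los Angeles"]]
let records = rowsToDictionaries(headers: headers, rows: rows)
// Each record is now addressable by column name, e.g. records[0]["Name"]
```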

Working with Forms

func parseHTMLForm() {
    let formHTML = """
    <form id="contact-form" action="/submit" method="post">
        <input type="text" name="name" placeholder="Your name" required>
        <input type="email" name="email" placeholder="Your email" required>
        <select name="country">
            <option value="us">United States</option>
            <option value="ca">Canada</option>
            <option value="uk">United Kingdom</option>
        </select>
        <textarea name="message" placeholder="Your message"></textarea>
        <button type="submit">Send Message</button>
    </form>
    """

    do {
        let doc = try SwiftSoup.parse(formHTML)

        // Extract form attributes
        guard let form = try doc.select("#contact-form").first() else {
            print("Form not found")
            return
        }
        let action = try form.attr("action")
        let method = try form.attr("method")
        print("Form action: \(action), method: \(method)")

        // Extract input fields
        let inputs = try doc.select("input")
        for input in inputs {
            let type = try input.attr("type")
            let name = try input.attr("name")
            let placeholder = try input.attr("placeholder")
            let required = input.hasAttr("required")
            print("Input - Type: \(type), Name: \(name), Placeholder: \(placeholder), Required: \(required)")
        }

        // Extract select options
        let options = try doc.select("select[name=country] option")
        for option in options {
            let value = try option.attr("value")
            let text = try option.text()
            print("Option - Value: \(value), Text: \(text)")
        }

    } catch {
        print("Error parsing form: \(error)")
    }
}
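Once field names are extracted, they can be assembled into a URL-encoded submission body. A minimal sketch using Foundation's URLComponents — the sample values are placeholders, not taken from the form above:

```swift
import Foundation

// Build an application/x-www-form-urlencoded body from name/value pairs,
// such as the field names extracted from a parsed form.
func urlEncodedBody(from fields: [(name: String, value: String)]) -> String {
    var components = URLComponents()
    components.queryItems = fields.map { URLQueryItem(name: $0.name, value: $0.value) }
    return components.percentEncodedQuery ?? ""
}

let body = urlEncodedBody(from: [("name", "Jane Doe"), ("country", "ca")])
// Spaces and reserved characters are percent-encoded by URLComponents
```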

Error Handling and Best Practices

Robust Error Handling

class HTMLParser {
    enum ParseError: Error {
        case invalidHTML
        case selectorNotFound
        case extractionFailed
    }

    func safelyExtractContent(from html: String, selector: String) throws -> [String] {
        do {
            let doc = try SwiftSoup.parse(html)
            let elements = try doc.select(selector)

            guard !elements.isEmpty() else {
                throw ParseError.selectorNotFound
            }

            return try elements.map { try $0.text() }

        } catch let error as ParseError {
            throw error
        } catch {
            throw ParseError.extractionFailed
        }
    }

    func extractWithFallback(from html: String, selectors: [String]) -> String? {
        for selector in selectors {
            do {
                let results = try safelyExtractContent(from: html, selector: selector)
                if let first = results.first, !first.isEmpty {
                    return first
                }
            } catch {
                continue
            }
        }
        return nil
    }
}

// Usage example
let parser = HTMLParser()
let htmlContent = "<article><h1>Page Title</h1></article>"
let fallbackSelectors = ["h1.title", ".title", "h1", "title"]
if let title = parser.extractWithFallback(from: htmlContent, selectors: fallbackSelectors) {
    print("Extracted title: \(title)")
}

Performance Optimization

For large-scale parsing operations, consider these optimization techniques:

class OptimizedHTMLParser {
    private let parseQueue = DispatchQueue(label: "html.parsing", qos: .userInitiated)

    func parseMultipleDocuments(_ htmlStrings: [String], 
                               selector: String,
                               completion: @escaping ([String]) -> Void) {
        parseQueue.async {
            let results = htmlStrings.compactMap { html -> String? in
                do {
                    let doc = try SwiftSoup.parse(html)
                    return try doc.select(selector).first()?.text()
                } catch {
                    return nil
                }
            }

            DispatchQueue.main.async {
                completion(results)
            }
        }
    }

    func streamParse(html: String, 
                    selectors: [String: String],
                    completion: @escaping ([String: String]) -> Void) {
        parseQueue.async {
            var results: [String: String] = [:]

            do {
                let doc = try SwiftSoup.parse(html)

                for (key, selector) in selectors {
                    results[key] = try doc.select(selector).first()?.text() ?? ""
                }

            } catch {
                print("Parsing error: \(error)")
            }

            DispatchQueue.main.async {
                completion(results)
            }
        }
    }
}
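Independent documents can also be parsed in parallel with DispatchQueue.concurrentPerform. The sketch below uses a simple regular-expression title extractor as a stand-in for a SwiftSoup call so it stays self-contained:

```swift
import Foundation

// Stand-in for a real SwiftSoup extraction: pull the <title> text with a regex.
func extractTitle(from html: String) -> String? {
    guard let range = html.range(of: "<title>.*?</title>", options: .regularExpression) else {
        return nil
    }
    return String(html[range])
        .replacingOccurrences(of: "<title>", with: "")
        .replacingOccurrences(of: "</title>", with: "")
}

// Parse each document on a concurrent queue; results keep their input order.
func parseInParallel(_ documents: [String]) -> [String?] {
    var results = [String?](repeating: nil, count: documents.count)
    let lock = NSLock()
    DispatchQueue.concurrentPerform(iterations: documents.count) { index in
        let title = extractTitle(from: documents[index])
        lock.lock()
        results[index] = title
        lock.unlock()
    }
    return results
}

let titles = parseInParallel([
    "<html><head><title>First</title></head></html>",
    "<html><head><title>Second</title></head></html>"
])
```

The lock guards the shared results array; for heavier per-document work, the synchronization cost is negligible next to the parsing itself.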

Integration with Web Scraping APIs

When working with dynamic content that requires JavaScript execution, you may need to integrate with a web scraping API that renders the page in a headless browser before returning the HTML:

struct WebScrapingAPIClient {
    private let apiKey: String
    private let baseURL = "https://api.webscraping.ai/html"

    init(apiKey: String) {
        self.apiKey = apiKey
    }

    func scrapeWithRendering(url: String, 
                           waitFor: String? = nil,
                           completion: @escaping (Result<String, Error>) -> Void) {
        var urlComponents = URLComponents(string: baseURL)!
        urlComponents.queryItems = [
            URLQueryItem(name: "api_key", value: apiKey),
            URLQueryItem(name: "url", value: url),
            URLQueryItem(name: "js", value: "true")
        ]

        if let waitFor = waitFor {
            urlComponents.queryItems?.append(URLQueryItem(name: "wait_for", value: waitFor))
        }

        guard let requestURL = urlComponents.url else {
            completion(.failure(ScrapingError.invalidURL))
            return
        }

        URLSession.shared.dataTask(with: requestURL) { data, response, error in
            if let error = error {
                completion(.failure(error))
                return
            }

            guard let data = data,
                  let html = String(data: data, encoding: .utf8) else {
                completion(.failure(ScrapingError.invalidData))
                return
            }

            completion(.success(html))
        }.resume()
    }
}

Testing CSS Selector Parsing

import XCTest

class CSSParsingTests: XCTestCase {
    func testBasicSelectorParsing() {
        let html = "<div class='test'><p id='content'>Hello World</p></div>"

        do {
            let doc = try SwiftSoup.parse(html)
            let content = try doc.select("#content").first()?.text()
            XCTAssertEqual(content, "Hello World")
        } catch {
            XCTFail("Parsing failed: \(error)")
        }
    }

    func testComplexSelectorParsing() {
        let html = """
        <article>
            <header class="article-header">
                <h1>Test Article</h1>
            </header>
            <div class="content">
                <p class="intro">Introduction</p>
            </div>
        </article>
        """

        do {
            let doc = try SwiftSoup.parse(html)
            let title = try doc.select("article header h1").first()?.text()
            let intro = try doc.select(".content .intro").first()?.text()

            XCTAssertEqual(title, "Test Article")
            XCTAssertEqual(intro, "Introduction")
        } catch {
            XCTFail("Complex parsing failed: \(error)")
        }
    }
}

Conclusion

Parsing CSS selectors for HTML content extraction in Swift is efficiently accomplished using SwiftSoup, which provides comprehensive CSS selector support similar to jQuery. The key to successful implementation lies in proper error handling, understanding CSS selector syntax, and optimizing for performance when processing large amounts of data.

For dynamic content that requires JavaScript execution, consider integrating with specialized web scraping APIs that can render JavaScript before returning the HTML content. This approach ensures you can extract data from modern web applications that rely heavily on client-side rendering.

Remember to always respect robots.txt files and website terms of service when implementing web scraping solutions, and consider implementing proper rate limiting and retry mechanisms for production applications.
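As a starting point for such rate limiting, the delay schedule of an exponential backoff can be computed separately from the networking code. A minimal sketch — the base delay and cap are arbitrary example values:

```swift
import Foundation

// Delay (in seconds) before retry attempt `attempt` (0-based),
// doubling each time and clamping at `cap`.
func backoffDelay(attempt: Int, base: TimeInterval = 0.5, cap: TimeInterval = 30) -> TimeInterval {
    min(base * pow(2, Double(attempt)), cap)
}

let schedule = (0..<5).map { backoffDelay(attempt: $0) }
// A production version would also add random jitter to avoid synchronized retries
```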

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
