How do I parse CSS selectors for HTML content extraction in Swift?
Parsing CSS selectors for HTML content extraction in Swift requires a third-party library, since the Swift standard library has no built-in HTML parser. This guide covers the most effective approaches using SwiftSoup, Foundation networking, and custom parsing helpers.
Understanding CSS Selectors in Swift Context
CSS selectors are patterns used to select HTML elements for styling or data extraction. In Swift, you'll need third-party libraries to interpret these selectors and extract content from HTML documents. The most popular and reliable option is SwiftSoup, which provides jQuery-like syntax for HTML parsing.
Setting Up SwiftSoup
SwiftSoup is the most comprehensive HTML parsing library for Swift, offering full CSS selector support.
Installation via Swift Package Manager
Add SwiftSoup to your Package.swift file:
dependencies: [
    .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.6.0")
]
Or add it through Xcode: File → Add Package Dependencies → https://github.com/scinfu/SwiftSoup.git
Installation via CocoaPods
Add to your Podfile:
pod 'SwiftSoup', '~> 2.6.0'
Basic CSS Selector Parsing with SwiftSoup
Here's how to parse HTML and extract content using CSS selectors:
import SwiftSoup

func parseHTMLWithCSSSelectors() {
    let html = """
    <html>
      <body>
        <div class="container">
          <h1 id="title">Welcome to SwiftSoup</h1>
          <p class="description">This is a paragraph with class description.</p>
          <ul class="list">
            <li data-id="1">Item 1</li>
            <li data-id="2">Item 2</li>
            <li data-id="3">Item 3</li>
          </ul>
          <a href="https://example.com" class="external-link">External Link</a>
        </div>
      </body>
    </html>
    """

    do {
        let doc = try SwiftSoup.parse(html)

        // Parse by ID selector
        let title = try doc.select("#title").first()?.text() ?? ""
        print("Title: \(title)")

        // Parse by class selector
        let description = try doc.select(".description").first()?.text() ?? ""
        print("Description: \(description)")

        // Parse by tag selector
        let listItems = try doc.select("li")
        for item in listItems {
            let text = try item.text()
            let dataId = try item.attr("data-id")
            print("Item: \(text), ID: \(dataId)")
        }

        // Parse by attribute selector
        let externalLink = try doc.select("a[href^=https]").first()
        if let link = externalLink {
            let href = try link.attr("href")
            let linkText = try link.text()
            print("External link: \(linkText) -> \(href)")
        }
    } catch {
        print("Error parsing HTML: \(error)")
    }
}
Advanced CSS Selector Techniques
Combining Multiple Selectors
func advancedCSSSelectors() {
    let html = """
    <div class="article">
      <header>
        <h2 class="title">Article Title</h2>
        <span class="author">John Doe</span>
        <time datetime="2024-01-15">January 15, 2024</time>
      </header>
      <section class="content">
        <p class="intro">Introduction paragraph</p>
        <p>Regular paragraph</p>
        <p class="highlight">Important information</p>
      </section>
    </div>
    """

    do {
        let doc = try SwiftSoup.parse(html)

        // Descendant selector
        let articleTitle = try doc.select(".article .title").first()?.text() ?? ""
        print("Article title: \(articleTitle)")

        // Child selector
        let directChildren = try doc.select(".content > p")
        print("Direct paragraph children: \(directChildren.count)")

        // Adjacent sibling selector
        let authorAfterTitle = try doc.select(".title + .author").first()?.text() ?? ""
        print("Author: \(authorAfterTitle)")

        // Attribute contains selector
        let timeElement = try doc.select("time[datetime*='2024']").first()
        if let time = timeElement {
            let datetime = try time.attr("datetime")
            let text = try time.text()
            print("Time: \(text) (\(datetime))")
        }

        // Pseudo-selector equivalents
        let firstParagraph = try doc.select(".content p").first()?.text() ?? ""
        let lastParagraph = try doc.select(".content p").last()?.text() ?? ""
        print("First paragraph: \(firstParagraph)")
        print("Last paragraph: \(lastParagraph)")
    } catch {
        print("Error: \(error)")
    }
}
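Because SwiftSoup is a port of jsoup, it also inherits many of jsoup's structural pseudo-selectors, so some of the first()/last() patterns above can be written directly in the selector string. A brief sketch, with the caveat that exact pseudo-selector support may vary by SwiftSoup version, so verify against the release you depend on:

```swift
import SwiftSoup

func pseudoSelectorExamples() {
    let html = """
    <section class="content">
      <p class="intro">Introduction paragraph</p>
      <p>Regular paragraph</p>
      <p class="highlight">Important information</p>
    </section>
    """
    do {
        let doc = try SwiftSoup.parse(html)
        // Structural pseudo-selectors from jsoup's selector grammar
        let first = try doc.select(".content p:first-child").text()
        let last = try doc.select(".content p:last-child").text()
        // :contains(text) matches elements whose text includes the given string
        let important = try doc.select("p:contains(Important)").text()
        // :not(selector) excludes matches
        let plain = try doc.select(".content p:not(.intro):not(.highlight)").text()
        print(first, last, important, plain)
    } catch {
        print("Error: \(error)")
    }
}
```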
Web Scraping with CSS Selectors
For real-world web scraping scenarios, you'll need to fetch HTML from URLs and then parse it:
import Foundation
import SwiftSoup

class WebScraper {
    func scrapeWebpage(url: String, completion: @escaping (Result<[String: Any], Error>) -> Void) {
        guard let url = URL(string: url) else {
            completion(.failure(ScrapingError.invalidURL))
            return
        }

        let task = URLSession.shared.dataTask(with: url) { data, response, error in
            if let error = error {
                completion(.failure(error))
                return
            }
            guard let data = data,
                  let htmlString = String(data: data, encoding: .utf8) else {
                completion(.failure(ScrapingError.invalidData))
                return
            }
            do {
                let extractedData = try self.parseHTMLContent(htmlString)
                completion(.success(extractedData))
            } catch {
                completion(.failure(error))
            }
        }
        task.resume()
    }

    private func parseHTMLContent(_ html: String) throws -> [String: Any] {
        let doc = try SwiftSoup.parse(html)
        var result: [String: Any] = [:]

        // Extract page title
        result["title"] = try doc.select("title").first()?.text() ?? ""

        // Extract meta description
        result["description"] = try doc.select("meta[name=description]").first()?.attr("content") ?? ""

        // Extract all links
        let links = try doc.select("a[href]").map { element -> [String: String] in
            return [
                "text": try element.text(),
                "url": try element.attr("href")
            ]
        }
        result["links"] = links

        // Extract all images
        let images = try doc.select("img").map { element -> [String: String] in
            return [
                "alt": try element.attr("alt"),
                "src": try element.attr("src")
            ]
        }
        result["images"] = images

        // Extract specific content by class or ID
        result["main_content"] = try doc.select(".content, #content, main").first()?.text() ?? ""

        return result
    }
}

enum ScrapingError: Error {
    case invalidURL
    case invalidData
    case parsingFailed
}
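On Swift 5.5+ targets, the completion-handler flow above can be condensed with async/await. This is a sketch, not a drop-in replacement: it reuses the ScrapingError enum and shows only a single extraction (the page title) for brevity:

```swift
import Foundation
import SwiftSoup

struct AsyncScraper {
    // Fetches a page and returns its <title> text,
    // throwing on network, encoding, or parse failure.
    func pageTitle(from urlString: String) async throws -> String {
        guard let url = URL(string: urlString) else {
            throw ScrapingError.invalidURL
        }
        let (data, _) = try await URLSession.shared.data(from: url)
        guard let html = String(data: data, encoding: .utf8) else {
            throw ScrapingError.invalidData
        }
        let doc = try SwiftSoup.parse(html)
        return try doc.select("title").first()?.text() ?? ""
    }
}
```

Callers invoke it from an async context, e.g. `let title = try await AsyncScraper().pageTitle(from: "https://example.com")` inside a Task.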
Handling Complex Selector Scenarios
Working with Tables
func parseHTMLTable() {
    let tableHTML = """
    <table class="data-table">
      <thead>
        <tr>
          <th>Name</th>
          <th>Age</th>
          <th>City</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>Alice</td>
          <td>25</td>
          <td>New York</td>
        </tr>
        <tr>
          <td>Bob</td>
          <td>30</td>
          <td>Los Angeles</td>
        </tr>
      </tbody>
    </table>
    """

    do {
        let doc = try SwiftSoup.parse(tableHTML)

        // Extract table headers
        let headers = try doc.select("thead th").map { try $0.text() }
        print("Headers: \(headers)")

        // Extract table rows
        let rows = try doc.select("tbody tr")
        var tableData: [[String]] = []
        for row in rows {
            let cells = try row.select("td").map { try $0.text() }
            tableData.append(cells)
        }
        print("Table data: \(tableData)")
    } catch {
        print("Error parsing table: \(error)")
    }
}
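Once headers and rows are extracted, zipping them together yields one dictionary per row, keyed by column name. A small helper built on the same selectors (it assumes each body row has at most as many cells as there are headers, and that header names are unique):

```swift
import SwiftSoup

// Converts a simple <thead>/<tbody> table into an array of
// [header: cell] records, one per body row.
func tableRecords(from tableHTML: String) throws -> [[String: String]] {
    let doc = try SwiftSoup.parse(tableHTML)
    let headers = try doc.select("thead th").map { try $0.text() }
    return try doc.select("tbody tr").map { row in
        let cells = try row.select("td").map { try $0.text() }
        // zip pairs each header with the cell in the same column position
        return Dictionary(uniqueKeysWithValues: zip(headers, cells))
    }
}
```

Applied to the table above, this would produce records like ["Name": "Alice", "Age": "25", "City": "New York"].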
Working with Forms
func parseHTMLForm() {
    let formHTML = """
    <form id="contact-form" action="/submit" method="post">
      <input type="text" name="name" placeholder="Your name" required>
      <input type="email" name="email" placeholder="Your email" required>
      <select name="country">
        <option value="us">United States</option>
        <option value="ca">Canada</option>
        <option value="uk">United Kingdom</option>
      </select>
      <textarea name="message" placeholder="Your message"></textarea>
      <button type="submit">Send Message</button>
    </form>
    """

    do {
        let doc = try SwiftSoup.parse(formHTML)

        // Extract form attributes (guard instead of force-unwrapping,
        // in case the selector matches nothing)
        guard let form = try doc.select("#contact-form").first() else {
            print("Form not found")
            return
        }
        let action = try form.attr("action")
        let method = try form.attr("method")
        print("Form action: \(action), method: \(method)")

        // Extract input fields
        let inputs = try doc.select("input")
        for input in inputs {
            let type = try input.attr("type")
            let name = try input.attr("name")
            let placeholder = try input.attr("placeholder")
            let required = input.hasAttr("required") // hasAttr doesn't throw
            print("Input - Type: \(type), Name: \(name), Placeholder: \(placeholder), Required: \(required)")
        }

        // Extract select options
        let options = try doc.select("select[name=country] option")
        for option in options {
            let value = try option.attr("value")
            let text = try option.text()
            print("Option - Value: \(value), Text: \(text)")
        }
    } catch {
        print("Error parsing form: \(error)")
    }
}
Error Handling and Best Practices
Robust Error Handling
class HTMLParser {
    enum ParseError: Error {
        case invalidHTML
        case selectorNotFound
        case extractionFailed
    }

    func safelyExtractContent(from html: String, selector: String) throws -> [String] {
        do {
            let doc = try SwiftSoup.parse(html)
            let elements = try doc.select(selector)
            guard !elements.isEmpty() else {
                throw ParseError.selectorNotFound
            }
            return try elements.map { try $0.text() }
        } catch let error as ParseError {
            throw error
        } catch {
            throw ParseError.extractionFailed
        }
    }

    func extractWithFallback(from html: String, selectors: [String]) -> String? {
        for selector in selectors {
            do {
                let results = try safelyExtractContent(from: html, selector: selector)
                if let first = results.first, !first.isEmpty {
                    return first
                }
            } catch {
                continue
            }
        }
        return nil
    }
}

// Usage example — htmlContent stands in for HTML you have already fetched
let htmlContent = "<h1 class=\"title\">Sample Title</h1>"
let parser = HTMLParser()
let fallbackSelectors = ["h1.title", ".title", "h1", "title"]
if let title = parser.extractWithFallback(from: htmlContent, selectors: fallbackSelectors) {
    print("Extracted title: \(title)")
}
Performance Optimization
For large-scale parsing operations, consider these optimization techniques:
import Foundation
import SwiftSoup

class OptimizedHTMLParser {
    private let parseQueue = DispatchQueue(label: "html.parsing", qos: .userInitiated)

    func parseMultipleDocuments(_ htmlStrings: [String],
                                selector: String,
                                completion: @escaping ([String]) -> Void) {
        parseQueue.async {
            let results = htmlStrings.compactMap { html -> String? in
                do {
                    let doc = try SwiftSoup.parse(html)
                    return try doc.select(selector).first()?.text()
                } catch {
                    return nil
                }
            }
            DispatchQueue.main.async {
                completion(results)
            }
        }
    }

    func streamParse(html: String,
                     selectors: [String: String],
                     completion: @escaping ([String: String]) -> Void) {
        parseQueue.async {
            var results: [String: String] = [:]
            do {
                let doc = try SwiftSoup.parse(html)
                for (key, selector) in selectors {
                    results[key] = try doc.select(selector).first()?.text() ?? ""
                }
            } catch {
                print("Parsing error: \(error)")
            }
            DispatchQueue.main.async {
                completion(results)
            }
        }
    }
}
Integration with Web Scraping APIs
When a page builds its content with JavaScript, the raw HTML returned by URLSession won't contain the rendered data, so you might need to integrate with a web scraping API that executes JavaScript before returning the HTML:
struct WebScrapingAPIClient {
    private let apiKey: String
    private let baseURL = "https://api.webscraping.ai/html"

    init(apiKey: String) {
        self.apiKey = apiKey
    }

    func scrapeWithRendering(url: String,
                             waitFor: String? = nil,
                             completion: @escaping (Result<String, Error>) -> Void) {
        var urlComponents = URLComponents(string: baseURL)!
        urlComponents.queryItems = [
            URLQueryItem(name: "api_key", value: apiKey),
            URLQueryItem(name: "url", value: url),
            URLQueryItem(name: "js", value: "true")
        ]
        if let waitFor = waitFor {
            urlComponents.queryItems?.append(URLQueryItem(name: "wait_for", value: waitFor))
        }
        guard let requestURL = urlComponents.url else {
            completion(.failure(ScrapingError.invalidURL))
            return
        }

        URLSession.shared.dataTask(with: requestURL) { data, response, error in
            if let error = error {
                completion(.failure(error))
                return
            }
            guard let data = data,
                  let html = String(data: data, encoding: .utf8) else {
                completion(.failure(ScrapingError.invalidData))
                return
            }
            completion(.success(html))
        }.resume()
    }
}
Testing CSS Selector Parsing
import XCTest
import SwiftSoup

class CSSParsingTests: XCTestCase {
    func testBasicSelectorParsing() {
        let html = "<div class='test'><p id='content'>Hello World</p></div>"
        do {
            let doc = try SwiftSoup.parse(html)
            let content = try doc.select("#content").first()?.text()
            XCTAssertEqual(content, "Hello World")
        } catch {
            XCTFail("Parsing failed: \(error)")
        }
    }

    func testComplexSelectorParsing() {
        let html = """
        <article>
          <header class="article-header">
            <h1>Test Article</h1>
          </header>
          <div class="content">
            <p class="intro">Introduction</p>
          </div>
        </article>
        """
        do {
            let doc = try SwiftSoup.parse(html)
            let title = try doc.select("article header h1").first()?.text()
            let intro = try doc.select(".content .intro").first()?.text()
            XCTAssertEqual(title, "Test Article")
            XCTAssertEqual(intro, "Introduction")
        } catch {
            XCTFail("Complex parsing failed: \(error)")
        }
    }
}
Conclusion
Parsing CSS selectors for HTML content extraction in Swift is efficiently accomplished using SwiftSoup, which provides comprehensive CSS selector support similar to jQuery. The key to successful implementation lies in proper error handling, understanding CSS selector syntax, and optimizing for performance when processing large amounts of data.
For dynamic content that requires JavaScript execution, consider integrating with specialized web scraping APIs that can render JavaScript before returning the HTML content. This approach ensures you can extract data from modern web applications that rely heavily on client-side rendering.
Remember to always respect robots.txt files and website terms of service when implementing web scraping solutions, and consider implementing proper rate limiting and retry mechanisms for production applications.
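As a starting point for the rate-limiting and retry advice above, here is a minimal sketch of exponential backoff around an async operation. The base delay and attempt count are arbitrary assumptions, not tuned values, and production code would typically add jitter and distinguish retryable from fatal errors:

```swift
import Foundation

// Computes the delay before a given retry attempt: base, 2*base, 4*base, ...
func backoffDelay(attempt: Int, base: TimeInterval = 0.5) -> TimeInterval {
    base * pow(2.0, Double(attempt))
}

// Retries a throwing async operation with exponential backoff between attempts.
func withRetries<T>(maxAttempts: Int = 3,
                    operation: () async throws -> T) async throws -> T {
    var lastError: Error?
    for attempt in 0..<maxAttempts {
        do {
            return try await operation()
        } catch {
            lastError = error
            // Sleep before the next attempt; Task.sleep takes nanoseconds.
            let delay = backoffDelay(attempt: attempt)
            try await Task.sleep(nanoseconds: UInt64(delay * 1_000_000_000))
        }
    }
    throw lastError!
}
```

A caller might wrap a network fetch as `try await withRetries { try await AsyncScraper().pageTitle(from: url) }`, spacing attempts 0.5 s, 1 s, and 2 s apart.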