How to Select Elements That Contain Specific Text in SwiftSoup
SwiftSoup is a Swift library for parsing and manipulating HTML documents, modeled on Java's Jsoup library. A common task when working with HTML is selecting elements based on their text content. This guide shows several ways to select elements that contain specific text using SwiftSoup.
Understanding Text-Based Element Selection
SwiftSoup offers several approaches to select elements based on their text content. The main ones are CSS selectors with the :contains() pseudo-class and manual filtering with SwiftSoup's text accessors such as text() and ownText().
Basic Text Selection with :contains()
The most straightforward way to select elements containing specific text is using the CSS :contains() pseudo-selector:
import SwiftSoup
do {
let html = """
<html>
<body>
<div>Welcome to our website</div>
<p>This paragraph contains important information</p>
<div>Another div with different content</div>
<span>Welcome message here</span>
</body>
</html>
"""
let doc = try SwiftSoup.parse(html)
// Select all elements containing "Welcome"
let welcomeElements = try doc.select(":contains(Welcome)")
for element in welcomeElements {
print("Found: \(try element.text())")
print("Tag: \(element.tagName())")
}
} catch Exception.Error(let type, let message) {
print("Error: \(type) - \(message)")
} catch {
print("Unknown error occurred")
}
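Note that :contains() tests an element's combined text, descendants included, so every ancestor of a matching element matches as well. The sketch below compares it with :containsOwn(), which SwiftSoup's selector syntax inherits from Jsoup and which only looks at an element's direct text. The helper names ownTextMatchCount and containsMatchCount are hypothetical, introduced here for illustration.

```swift
import SwiftSoup

// Hypothetical helper: counts elements whose own (direct) text contains `text`.
func ownTextMatchCount(in html: String, text: String) throws -> Int {
    let doc = try SwiftSoup.parse(html)
    return try doc.select(":containsOwn(\(text))").size()
}

// Hypothetical helper: counts elements matched by :contains(), ancestors included.
func containsMatchCount(in html: String, text: String) throws -> Int {
    let doc = try SwiftSoup.parse(html)
    return try doc.select(":contains(\(text))").size()
}

let sample = "<div>Welcome to our website</div><span>Welcome message here</span>"
do {
    print("Own-text matches: \(try ownTextMatchCount(in: sample, text: "Welcome"))")
    print("All matches: \(try containsMatchCount(in: sample, text: "Welcome"))")
} catch {
    print("Error: \(error)")
}
```

If you only want the innermost elements that actually hold the text, :containsOwn() is usually the better fit; prefixing a tag name (for example div:contains(Welcome)) is another way to keep ancestors out of the result.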
Case-Sensitive vs Case-Insensitive Matching
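SwiftSoup's :contains() selector follows Jsoup's semantics, where the search is case-insensitive, so :contains(hello) also matches "HELLO". If you want the comparison under your own control (for example, removing the lowercased() calls below makes it case-sensitive), you can filter the elements manually: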
import SwiftSoup
func selectElementsContainingTextIgnoreCase(_ doc: Document, _ text: String) throws -> Elements {
let allElements = try doc.select("*")
var matchingElements = Elements()
for element in allElements {
let elementText = try element.text().lowercased()
if elementText.contains(text.lowercased()) {
try matchingElements.add(element)
}
}
return matchingElements
}
// Usage example
do {
let html = "<div>HELLO World</div><p>hello there</p><span>Hi HELLO</span>"
let doc = try SwiftSoup.parse(html)
let elements = try selectElementsContainingTextIgnoreCase(doc, "hello")
for element in elements {
print("Found: \(try element.text())")
}
} catch {
print("Error: \(error)")
}
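If you need a comparison that is guaranteed to respect case, regardless of how the selector engine treats it, you can compare strings directly. A minimal sketch, assuming matching on an element's own text is what you want; the function name elementsContainingExactCase is hypothetical.

```swift
import SwiftSoup
import Foundation

// Hypothetical helper: returns elements whose own text contains `text`,
// compared with String.contains, which is case-sensitive.
func elementsContainingExactCase(_ doc: Document, _ text: String) throws -> [Element] {
    return try doc.select("*").filter { element in
        try element.ownText().contains(text)
    }
}

do {
    let doc = try SwiftSoup.parse("<div>HELLO World</div><p>hello there</p>")
    let matches = try elementsContainingExactCase(doc, "hello")
    for element in matches {
        print("Case-sensitive match: \(element.tagName())")
    }
} catch {
    print("Error: \(error)")
}
```

Using ownText() rather than text() keeps ancestors such as body out of the result, since their own text is empty here.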
Advanced Text Selection Techniques
Selecting Elements with Exact Text Matches
Sometimes you need elements that contain exactly the specified text, not just as a substring:
import SwiftSoup
func selectElementsWithExactText(_ doc: Document, _ exactText: String) throws -> Elements {
let allElements = try doc.select("*")
var matchingElements = Elements()
for element in allElements {
let elementText = try element.text().trimmingCharacters(in: .whitespacesAndNewlines)
if elementText == exactText {
try matchingElements.add(element)
}
}
return matchingElements
}
// Example usage
do {
let html = """
<div>Contact Us</div>
<p>Please Contact Us for more information</p>
<button>Contact Us</button>
"""
let doc = try SwiftSoup.parse(html)
let exactMatches = try selectElementsWithExactText(doc, "Contact Us")
for element in exactMatches {
print("Exact match found: \(element.tagName()) - \(try element.text())")
}
} catch {
print("Error: \(error)")
}
Combining Text Selection with Other Selectors
You can combine text-based selection with other CSS selectors for more precise targeting:
import SwiftSoup
do {
let html = """
<div class="content">
<h1>Important Announcement</h1>
<p>This is an important message</p>
<div class="sidebar">
<h2>Important Links</h2>
<p>Some sidebar content</p>
</div>
</div>
"""
let doc = try SwiftSoup.parse(html)
// Select paragraphs containing "important" (case-insensitive)
let importantParagraphs = try doc.select("p:contains(important)")
// Select headings in sidebar containing "Important"
let sidebarHeadings = try doc.select(".sidebar h2:contains(Important)")
// Select any element with class "content" containing "Announcement"
let contentWithAnnouncement = try doc.select(".content:contains(Announcement)")
print("Important paragraphs: \(importantParagraphs.size())")
print("Sidebar headings: \(sidebarHeadings.size())")
print("Content with announcement: \(contentWithAnnouncement.size())")
} catch {
print("Error: \(error)")
}
Working with Own Text vs. All Text
SwiftSoup distinguishes between an element's own text and all text (including child elements):
import SwiftSoup
do {
let html = """
<div>
Parent text
<span>Child text</span>
More parent text
</div>
"""
let doc = try SwiftSoup.parse(html)
let divElement = try doc.select("div").first()!
// Get all text (including children)
let allText = try divElement.text()
print("All text: \(allText)")
// Get only direct text (own text)
let ownText = try divElement.ownText()
print("Own text: \(ownText)")
// Select based on own text only
let elementsWithOwnText = try doc.select("*").filter { element in
let ownText = try element.ownText().trimmingCharacters(in: .whitespacesAndNewlines)
return ownText.contains("Parent")
}
for element in elementsWithOwnText {
print("Element with own text: \(element.tagName())")
}
} catch {
print("Error: \(error)")
}
Pattern Matching and Regular Expressions
For more complex text matching scenarios, you can implement pattern-based selection:
import SwiftSoup
import Foundation
func selectElementsMatchingPattern(_ doc: Document, _ pattern: String) throws -> Elements {
let regex = try NSRegularExpression(pattern: pattern, options: .caseInsensitive)
let allElements = try doc.select("*")
var matchingElements = Elements()
for element in allElements {
let elementText = try element.text()
let range = NSRange(location: 0, length: elementText.utf16.count)
if regex.firstMatch(in: elementText, options: [], range: range) != nil {
try matchingElements.add(element)
}
}
return matchingElements
}
// Example: Select elements containing email addresses
do {
let html = """
<div>Contact us at support@example.com</div>
<p>Email john.doe@company.org for details</p>
<span>No email here</span>
<div>Another email: admin@site.net</div>
"""
let doc = try SwiftSoup.parse(html)
let emailPattern = "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"
let elementsWithEmails = try selectElementsMatchingPattern(doc, emailPattern)
for element in elementsWithEmails {
print("Element with email: \(try element.text())")
}
} catch {
print("Error: \(error)")
}
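For simpler patterns, the selector syntax itself may be enough: SwiftSoup's selector overview, inherited from Jsoup, lists :matches(regex) and :matchesOwn(regex). The sketch below assumes those selectors behave as they do in Jsoup (a partial match anywhere in the text counts); the helper name tagsMatchingOwnText is hypothetical.

```swift
import SwiftSoup

// Hypothetical helper: tag names of elements whose own text matches `regex`,
// using the selector engine instead of a manual loop.
func tagsMatchingOwnText(in html: String, regex: String) throws -> [String] {
    let doc = try SwiftSoup.parse(html)
    return try doc.select(":matchesOwn(\(regex))").map { $0.tagName() }
}

do {
    let html = "<p>Call 555-1234</p><span>No number here</span>"
    // Matches a simple NNN-NNNN phone pattern in an element's direct text
    let tags = try tagsMatchingOwnText(in: html, regex: "\\d{3}-\\d{4}")
    print("Tags with a phone number: \(tags)")
} catch {
    print("Error: \(error)")
}
```

The manual NSRegularExpression approach remains the safer choice when the pattern contains parentheses or needs options such as case-insensitivity, since the pattern does not have to pass through the selector parser.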
Performance Considerations and Best Practices
Optimizing Text-Based Selections
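When working with large HTML documents, text-based selections can be expensive, because each call to text() walks the element's subtree. One strategy is to narrow the search scope with a selector before comparing text: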
import SwiftSoup
func optimizedTextSelection(_ doc: Document, _ searchText: String, _ tagFilter: String? = nil) throws -> Elements {
// First, narrow down the search space if possible
let searchScope = try doc.select(tagFilter ?? "*")
var results = Elements()
// Compare each candidate's text case-insensitively
for element in searchScope {
let text = try element.text()
if text.localizedCaseInsensitiveContains(searchText) {
try results.add(element)
}
}
return results
}
// Usage example
do {
let html = """
<html>
<body>
<div class="content">
<p>This is important information</p>
<p>Regular paragraph</p>
<p>Another important note</p>
</div>
<footer>
<p>Footer content</p>
</footer>
</body>
</html>
"""
let doc = try SwiftSoup.parse(html)
// Optimize by searching only within content div paragraphs
let importantParagraphs = try optimizedTextSelection(doc, "important", ".content p")
for element in importantParagraphs {
print("Found: \(try element.text())")
}
} catch {
print("Error: \(error)")
}
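When only the first match is needed, returning as soon as it is found avoids extracting text from the remaining candidates. A minimal sketch of that idea; the function name firstElementContaining and its scope parameter are hypothetical, and the initial select call still traverses the document to collect candidates.

```swift
import SwiftSoup
import Foundation

// Hypothetical helper: returns the first element within `scope` whose text
// contains `searchText` (case-insensitively), or nil if none does.
func firstElementContaining(_ doc: Document, _ searchText: String,
                            scope: String = "*") throws -> Element? {
    for element in try doc.select(scope) {
        if try element.text().localizedCaseInsensitiveContains(searchText) {
            return element // early exit: later candidates' text is never extracted
        }
    }
    return nil
}

do {
    let html = """
    <div class="content"><p>Regular</p><p>An important note</p><p>Also important</p></div>
    """
    let doc = try SwiftSoup.parse(html)
    if let first = try firstElementContaining(doc, "important", scope: ".content p") {
        print("First match: \(try first.text())")
    }
} catch {
    print("Error: \(error)")
}
```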
Error Handling and Edge Cases
When selecting elements by text content, it's crucial to handle various edge cases:
import SwiftSoup
func robustTextSelection(_ htmlString: String, _ searchText: String) -> [String] {
var results: [String] = []
do {
guard !htmlString.isEmpty && !searchText.isEmpty else {
print("Warning: Empty HTML or search text provided")
return results
}
let doc = try SwiftSoup.parse(htmlString)
let elements = try doc.select(":contains(\(searchText))")
for element in elements {
let text = try element.text().trimmingCharacters(in: .whitespacesAndNewlines)
if !text.isEmpty {
results.append(text)
}
}
} catch Exception.Error(let type, let message) {
print("SwiftSoup Error: \(type) - \(message)")
} catch {
print("Unexpected error: \(error.localizedDescription)")
}
return results
}
// Test with various edge cases
let testCases = [
"<div></div>", // Empty elements
"<p> </p>", // Whitespace only
"<span>Normal text</span>", // Normal case
"", // Empty HTML
"<div>Test&amp;Example</div>" // HTML entities
]
for (index, testHtml) in testCases.enumerated() {
let results = robustTextSelection(testHtml, "Test")
print("Test case \(index + 1): \(results)")
}
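One edge case the tests above do not cover is search text that contains selector syntax, such as an unbalanced parenthesis: interpolated into :contains(...), it may fail to parse or match something unintended. The sketch below sidesteps the selector parser by comparing text directly. The function name ownTextsContaining is hypothetical, and matching on own text only is an assumption made to keep ancestors out of the result.

```swift
import SwiftSoup
import Foundation

// Hypothetical helper: finds elements whose own text contains `searchText`
// without embedding the search text in a selector string.
func ownTextsContaining(_ searchText: String, in html: String) -> [String] {
    guard !searchText.isEmpty,
          let doc = try? SwiftSoup.parse(html),
          let all = try? doc.select("*") else {
        return []
    }
    return all.compactMap { element -> String? in
        guard let own = try? element.ownText() else { return nil }
        let trimmed = own.trimmingCharacters(in: .whitespacesAndNewlines)
        return trimmed.localizedCaseInsensitiveContains(searchText) ? trimmed : nil
    }
}

// The search text contains a ")" that would be ambiguous inside :contains(...)
let smileResults = ownTextsContaining("smile :)", in: "<p>Please smile :) today</p><p>Nothing</p>")
print("Matches: \(smileResults)")
```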
Integration with Modern iOS Development
When building iOS applications that require web scraping or HTML parsing, SwiftSoup integrates well with modern Swift patterns:
import SwiftSoup
import Combine
class HTMLTextExtractor {
func findElementsContaining(_ text: String, in html: String) -> AnyPublisher<[String], Error> {
return Future { promise in
DispatchQueue.global(qos: .background).async {
do {
let doc = try SwiftSoup.parse(html)
let elements = try doc.select(":contains(\(text))")
let texts = try elements.compactMap { element in
try element.text().trimmingCharacters(in: .whitespacesAndNewlines)
}.filter { !$0.isEmpty }
DispatchQueue.main.async {
promise(.success(texts))
}
} catch {
DispatchQueue.main.async {
promise(.failure(error))
}
}
}
}
.eraseToAnyPublisher()
}
}
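// Usage in a SwiftUI view or view controller
// Keep the subscription alive: if the AnyCancellable returned by sink is
// discarded, the subscription is cancelled before any value arrives.
var cancellables = Set<AnyCancellable>()
let htmlContent = "<p>An important update</p>" // placeholder HTML for illustration
let extractor = HTMLTextExtractor()
extractor.findElementsContaining("important", in: htmlContent)
    .sink(
        receiveCompletion: { completion in
            switch completion {
            case .finished:
                print("Extraction completed")
            case .failure(let error):
                print("Error: \(error)")
            }
        },
        receiveValue: { texts in
            print("Found texts: \(texts)")
        }
    )
    .store(in: &cancellables)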
Working with Dynamic Content
When dealing with content that might be loaded dynamically, it's important to understand the limitations of HTML parsing libraries like SwiftSoup. Unlike browser-based solutions that can execute JavaScript, SwiftSoup only works with static HTML content. For cases where you need to handle dynamic content that loads after page load, you might need to combine SwiftSoup with other techniques or use JavaScript-based solutions.
Advanced SwiftSoup Text Selection Patterns
Selecting Elements by Text Length
Sometimes you need to select elements based on the length of their text content:
import SwiftSoup
func selectElementsByTextLength(_ doc: Document, minLength: Int, maxLength: Int? = nil) throws -> Elements {
let allElements = try doc.select("*")
var matchingElements = Elements()
for element in allElements {
let text = try element.ownText().trimmingCharacters(in: .whitespacesAndNewlines)
let length = text.count
if length >= minLength {
if let maxLength = maxLength {
if length <= maxLength {
try matchingElements.add(element)
}
} else {
try matchingElements.add(element)
}
}
}
return matchingElements
}
// Example: Find elements with text between 10 and 50 characters
do {
let html = """
<div>Short</div>
<p>A medium paragraph that should be selected.</p>
<span>This is a very long text content that exceeds the maximum character limit we've set for our selection criteria.</span>
"""
let doc = try SwiftSoup.parse(html)
let mediumTextElements = try selectElementsByTextLength(doc, minLength: 10, maxLength: 50)
for element in mediumTextElements {
print("Medium text: \(try element.text())")
}
} catch {
print("Error: \(error)")
}
Combining Multiple Text Criteria
You can create more sophisticated selection logic by combining multiple text-based criteria:
import SwiftSoup
struct TextSelectionCriteria {
let containsText: String?
let startsWithText: String?
let endsWithText: String?
let minLength: Int?
let maxLength: Int?
let caseInsensitive: Bool
init(contains: String? = nil, startsWith: String? = nil, endsWith: String? = nil,
minLength: Int? = nil, maxLength: Int? = nil, caseInsensitive: Bool = true) {
self.containsText = contains
self.startsWithText = startsWith
self.endsWithText = endsWith
self.minLength = minLength
self.maxLength = maxLength
self.caseInsensitive = caseInsensitive
}
}
func selectElementsByCriteria(_ doc: Document, criteria: TextSelectionCriteria) throws -> Elements {
let allElements = try doc.select("*")
var matchingElements = Elements()
for element in allElements {
var text = try element.text().trimmingCharacters(in: .whitespacesAndNewlines)
if criteria.caseInsensitive {
text = text.lowercased()
}
var matches = true
// Check contains criteria
if let containsText = criteria.containsText {
let searchText = criteria.caseInsensitive ? containsText.lowercased() : containsText
if !text.contains(searchText) {
matches = false
}
}
// Check starts with criteria
if let startsWithText = criteria.startsWithText {
let searchText = criteria.caseInsensitive ? startsWithText.lowercased() : startsWithText
if !text.hasPrefix(searchText) {
matches = false
}
}
// Check ends with criteria
if let endsWithText = criteria.endsWithText {
let searchText = criteria.caseInsensitive ? endsWithText.lowercased() : endsWithText
if !text.hasSuffix(searchText) {
matches = false
}
}
// Check length criteria
if let minLength = criteria.minLength, text.count < minLength {
matches = false
}
if let maxLength = criteria.maxLength, text.count > maxLength {
matches = false
}
if matches {
try matchingElements.add(element)
}
}
return matchingElements
}
For more complex scraping scenarios involving JavaScript-heavy websites, you might want to explore how to handle AJAX requests using Puppeteer when building comprehensive web scraping solutions.
Conclusion
Selecting elements by text content in SwiftSoup is a powerful technique for HTML parsing and web scraping in iOS applications. Whether you need simple text matching with the :contains() selector or more complex pattern-based selection, SwiftSoup provides the tools necessary for effective HTML manipulation.
Remember to consider performance implications when working with large documents, handle edge cases properly, and leverage Swift's modern language features for cleaner, more maintainable code. By combining SwiftSoup's text selection capabilities with proper error handling and optimization techniques, you can build robust HTML parsing solutions for your iOS applications.
The key to successful text-based element selection lies in understanding your specific use case and choosing the appropriate method—whether it's simple substring matching, exact text matching, pattern-based selection, or complex multi-criteria filtering. With these techniques in your toolkit, you'll be well-equipped to extract the precise data you need from HTML documents in your Swift applications.