How do I extract links from HTML using SwiftSoup?
SwiftSoup is a powerful Swift library that provides HTML parsing capabilities similar to JSoup for Java. Extracting links from HTML documents is one of the most common web scraping tasks, and SwiftSoup makes this process straightforward with its CSS selector support and DOM traversal methods.
What is SwiftSoup?
SwiftSoup is a pure Swift HTML parser that allows you to parse, traverse, and manipulate HTML documents. It provides a familiar API for developers who have worked with JSoup or other HTML parsing libraries, making it easy to extract specific elements like links from web pages.
Basic Link Extraction
Installing SwiftSoup
First, add SwiftSoup to your project using Swift Package Manager:
dependencies: [
    .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.6.0")
]
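If you're working in a Package.swift manifest, also list SwiftSoup in your target's dependencies ("MyApp" below is a placeholder target name):
.target(
    name: "MyApp",
    dependencies: ["SwiftSoup"]
)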
Simple Link Extraction
Here's how to extract all links from an HTML document:
import SwiftSoup

do {
    let html = """
    <html>
      <body>
        <a href="https://example.com">Example Link</a>
        <a href="/relative-link">Relative Link</a>
        <a href="mailto:test@example.com">Email Link</a>
      </body>
    </html>
    """

    let doc = try SwiftSoup.parse(html)
    let links = try doc.select("a[href]")

    for link in links {
        let url = try link.attr("href")
        let text = try link.text()
        print("URL: \(url), Text: \(text)")
    }
} catch {
    print("Error parsing HTML: \(error)")
}
This code will output:
URL: https://example.com, Text: Example Link
URL: /relative-link, Text: Relative Link
URL: mailto:test@example.com, Text: Email Link
Advanced Link Extraction Techniques
Extracting Specific Link Types
You can filter links based on their attributes or content:
// Extract only external links (HTTP/HTTPS)
let externalLinks = try doc.select("a[href^=http]")
// Extract root-relative (/...) and dot-relative (./...) links
let internalLinks = try doc.select("a[href^=/], a[href^=./]")
// Extract email links
let emailLinks = try doc.select("a[href^=mailto:]")
// Extract links with specific CSS classes
let specialLinks = try doc.select("a.special-link[href]")
Extracting Link Attributes
Beyond the href attribute, you might need other link properties:
for link in links {
    let href = try link.attr("href")
    let title = try link.attr("title")   // attr(_:) returns "" if the attribute is absent
    let target = try link.attr("target")
    let rel = try link.attr("rel")
    let text = try link.text()

    print("Link: \(href)")
    print("Title: \(title)")
    print("Target: \(target)")
    print("Rel: \(rel)")
    print("Text: \(text)")
    print("---")
}
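One caveat: because attr(_:) returns an empty string when an attribute is missing, an empty title and an absent title look the same. If the distinction matters, check presence with hasAttr(_:) first, as in this short sketch:
// Only report links that actually declare a title attribute
for link in links where link.hasAttr("title") {
    let title = try link.attr("title")
    print("Titled link: \(title)")
}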
Building Absolute URLs
When dealing with relative links, you'll often need to convert them to absolute URLs:
func extractLinksWithBaseURL(html: String, baseURL: String) throws -> [(url: String, text: String)] {
    let doc = try SwiftSoup.parse(html)
    try doc.setBaseUri(baseURL)

    let links = try doc.select("a[href]")
    var extractedLinks: [(url: String, text: String)] = []

    for link in links {
        // The "abs:" prefix resolves the href against the document's base URI
        let absoluteURL = try link.attr("abs:href")
        let text = try link.text()
        extractedLinks.append((url: absoluteURL, text: text))
    }
    return extractedLinks
}
// Usage
let html = "<a href='/page1'>Page 1</a><a href='../page2'>Page 2</a>"
let links = try extractLinksWithBaseURL(html: html, baseURL: "https://example.com/folder/")
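With that base URL, abs:href resolves /page1 to https://example.com/page1 and ../page2 (resolved against the /folder/ path) to https://example.com/page2.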
Working with Complex HTML Structures
Extracting Links from Specific Sections
You can target links within specific HTML sections:
// Extract links from navigation
let navLinks = try doc.select("nav a[href]")
// Extract links from the main content area
let contentLinks = try doc.select("main a[href], .content a[href]")
// Extract links from footer
let footerLinks = try doc.select("footer a[href]")
// Extract links from a specific div
let sidebarLinks = try doc.select("div.sidebar a[href]")
Handling Link Collections and Menus
For structured link collections like menus or lists:
struct LinkInfo {
    let url: String
    let text: String
    let isExternal: Bool
    let hasTitle: Bool
}

func extractStructuredLinks(from html: String) throws -> [LinkInfo] {
    let doc = try SwiftSoup.parse(html)
    let links = try doc.select("a[href]")

    return try links.compactMap { link -> LinkInfo? in
        let href = try link.attr("href")
        let text = try link.text().trimmingCharacters(in: .whitespacesAndNewlines)
        let title = try link.attr("title")

        guard !href.isEmpty && !text.isEmpty else { return nil }

        let isExternal = href.starts(with: "http://") || href.starts(with: "https://")
        let hasTitle = !title.isEmpty

        return LinkInfo(url: href, text: text, isExternal: isExternal, hasTitle: hasTitle)
    }
}
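A quick usage sketch with inline HTML (the menu markup here is purely illustrative):
let menuHTML = "<nav><a href='https://swift.org' title='Swift'>Swift</a><a href='/docs'>Docs</a></nav>"
let infos = try extractStructuredLinks(from: menuHTML)
for info in infos {
    print("\(info.text) -> \(info.url) (external: \(info.isExternal))")
}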
Error Handling and Validation
Robust Link Extraction with Error Handling
func safeExtractLinks(from html: String) -> [(url: String, text: String)] {
    var extractedLinks: [(url: String, text: String)] = []

    do {
        let doc = try SwiftSoup.parse(html)
        let links = try doc.select("a[href]")

        for link in links {
            do {
                let href = try link.attr("href")
                let text = try link.text()

                // Keep the link only if its URL passes validation
                if isValidURL(href) {
                    extractedLinks.append((url: href, text: text))
                }
            } catch {
                print("Error extracting individual link: \(error)")
                continue
            }
        }
    } catch {
        print("Error parsing HTML: \(error)")
    }
    return extractedLinks
}

func isValidURL(_ string: String) -> Bool {
    guard let url = URL(string: string) else { return false }
    return url.scheme != nil || string.starts(with: "/") || string.starts(with: "./")
}
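Because SwiftSoup parses leniently, safeExtractLinks also copes with malformed markup. For example (the broken HTML below is purely illustrative):
let messyHTML = "<a href='https://example.com'>Unclosed link<p>stray paragraph"
let cleaned = safeExtractLinks(from: messyHTML)
print("Recovered \(cleaned.count) valid link(s)")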
Real-World Example: Web Scraping with Link Extraction
Here's a complete example that fetches a web page and extracts its links:
import Foundation
import SwiftSoup

func scrapeLinksFromURL(_ urlString: String) async throws -> [LinkInfo] {
    guard let url = URL(string: urlString) else {
        throw URLError(.badURL)
    }

    let (data, _) = try await URLSession.shared.data(from: url)
    let html = String(data: data, encoding: .utf8) ?? ""

    // Parse with the page URL as base URI so abs:href can resolve relative links
    let doc = try SwiftSoup.parse(html, urlString)
    let links = try doc.select("a[href]")

    var extractedLinks: [LinkInfo] = []
    for link in links {
        let href = try link.attr("abs:href")
        let text = try link.text().trimmingCharacters(in: .whitespacesAndNewlines)
        guard !href.isEmpty && !text.isEmpty else { continue }

        // Simple prefix heuristic: anything not under the start URL counts as external
        let isExternal = !href.starts(with: url.absoluteString)
        let title = try link.attr("title")
        extractedLinks.append(LinkInfo(url: href, text: text, isExternal: isExternal, hasTitle: !title.isEmpty))
    }
    return extractedLinks
}
// Usage
Task {
    do {
        let links = try await scrapeLinksFromURL("https://example.com")
        for link in links {
            print("\(link.text): \(link.url)")
        }
    } catch {
        print("Scraping failed: \(error)")
    }
}
CSS Selectors for Link Extraction
SwiftSoup supports powerful CSS selectors for precise link targeting:
// Links with specific attributes
let downloadLinks = try doc.select("a[download]")
let externalLinks = try doc.select("a[href^='http']:not([href*='yourdomain.com'])")
// Links in specific positions
let firstLink = try doc.select("a:first-child")
let lastLink = try doc.select("a:last-child")
let evenLinks = try doc.select("a:nth-child(even)")
// Links containing specific text
let contactLinks = try doc.select("a:contains(Contact)")
let aboutLinks = try doc.select("a[href*='about']")
Handling Different Link Types
JavaScript Links
// Extract JavaScript onclick handlers
let jsLinks = try doc.select("a[onclick]")
for link in jsLinks {
    let onclick = try link.attr("onclick")
    print("JavaScript: \(onclick)")
}
Image Links
// Extract links that contain images
let imageLinks = try doc.select("a:has(img)")
for link in imageLinks {
    let href = try link.attr("href")
    let imgSrc = try link.select("img").attr("src")
    print("Image link: \(href), Image: \(imgSrc)")
}
Performance Optimization
Efficient Link Processing
For large HTML documents, consider these optimization techniques:
func efficientLinkExtraction(html: String, maxLinks: Int = 100) throws -> [(url: String, text: String)] {
    let doc = try SwiftSoup.parse(html)
    let links = try doc.select("a[href]")

    var extractedLinks: [(url: String, text: String)] = []
    extractedLinks.reserveCapacity(min(links.size(), maxLinks))

    for (index, link) in links.enumerated() {
        if index >= maxLinks { break }   // Stop early once we have enough links

        let href = try link.attr("href")
        let text = try link.text()
        if !href.isEmpty {
            extractedLinks.append((url: href, text: text))
        }
    }
    return extractedLinks
}
Integration with Networking Libraries
Using URLSession with SwiftSoup
extension URLSession {
    func extractLinksFromURL(_ url: URL) async throws -> [LinkInfo] {
        let (data, _) = try await data(from: url)
        let html = String(data: data, encoding: .utf8) ?? ""
        return try extractStructuredLinks(from: html)
    }
}
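Usage is then a one-liner from any async context:
Task {
    do {
        let url = URL(string: "https://example.com")!
        let links = try await URLSession.shared.extractLinksFromURL(url)
        print("Found \(links.count) links")
    } catch {
        print("Extraction failed: \(error)")
    }
}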
Alamofire Integration
If you're using Alamofire for networking, you can combine it with SwiftSoup:
import Alamofire
AF.request("https://example.com")
    .responseString { response in
        switch response.result {
        case .success(let html):
            do {
                let links = try extractStructuredLinks(from: html)
                print("Extracted \(links.count) links")
            } catch {
                print("Parsing error: \(error)")
            }
        case .failure(let error):
            print("Network error: \(error)")
        }
    }
Best Practices and Tips
1. Always Handle Errors
SwiftSoup methods can throw, so wrap calls in do-catch blocks or propagate errors with throws.
2. Use Appropriate Selectors
Choose the most specific CSS selectors to avoid extracting unwanted elements.
3. Validate URLs
Always validate extracted URLs before using them, especially when dealing with user-generated content.
4. Consider Base URLs
When working with relative URLs, always set a base URL for proper resolution.
5. Memory Management
For large documents, process links in batches to avoid memory issues.
6. Rate Limiting
When scraping multiple pages, implement proper rate limiting to avoid being blocked, as in the sketch after this list.
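As a minimal sketch of tip 6, here is a fixed-delay rate limiter built on Task.sleep. It reuses the scrapeLinksFromURL function from earlier, and the one-second interval is an arbitrary choice to tune per site:
func politeScrape(_ urlStrings: [String]) async throws -> [LinkInfo] {
    var allLinks: [LinkInfo] = []
    for urlString in urlStrings {
        allLinks += try await scrapeLinksFromURL(urlString)
        // Pause between requests to avoid hammering the server
        try await Task.sleep(nanoseconds: 1_000_000_000)
    }
    return allLinks
}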
Common Challenges and Solutions
Handling Empty or Invalid Links
func cleanLinks(_ links: [(url: String, text: String)]) -> [(url: String, text: String)] {
    return links.filter { link in
        !link.url.isEmpty &&
        !link.url.hasPrefix("#") &&
        !link.url.hasPrefix("javascript:")
    }
}
Dealing with Encoded URLs
func decodeURL(_ urlString: String) -> String {
    return urlString.removingPercentEncoding ?? urlString
}
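For example:
print(decodeURL("https://example.com/caf%C3%A9"))
// Prints: https://example.com/café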
Integration with Web Scraping APIs
While SwiftSoup is excellent for client-side HTML parsing, for production web scraping applications, you might want to combine it with robust web scraping services. Modern scraping APIs can handle JavaScript-rendered content and anti-bot protection, which SwiftSoup alone cannot manage since it only parses static HTML.
For comprehensive web scraping solutions that handle dynamic content and avoid detection mechanisms, consider using specialized web scraping APIs alongside SwiftSoup for local HTML processing tasks.
Advanced Use Cases
Building a Link Crawler
class LinkCrawler {
    private var visitedURLs = Set<String>()
    private(set) var foundLinks: [LinkInfo] = []

    func crawl(startingURL: String, maxDepth: Int = 2) async throws {
        try await crawlRecursive(url: startingURL, depth: 0, maxDepth: maxDepth)
    }

    private func crawlRecursive(url: String, depth: Int, maxDepth: Int) async throws {
        // Skip already-visited pages and stop at the depth limit
        guard depth <= maxDepth, !visitedURLs.contains(url) else { return }
        visitedURLs.insert(url)

        let links = try await scrapeLinksFromURL(url)
        foundLinks.append(contentsOf: links)

        // Follow only internal links, one level deeper
        for link in links where !link.isExternal && depth < maxDepth {
            try await crawlRecursive(url: link.url, depth: depth + 1, maxDepth: maxDepth)
        }
    }
}
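A usage sketch, assuming the scrapeLinksFromURL function defined earlier (foundLinks is readable from outside thanks to private(set)):
Task {
    let crawler = LinkCrawler()
    do {
        try await crawler.crawl(startingURL: "https://example.com", maxDepth: 1)
        print("Discovered \(crawler.foundLinks.count) links")
    } catch {
        print("Crawl failed: \(error)")
    }
}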
Conclusion
SwiftSoup provides a powerful and flexible way to extract links from HTML documents in Swift applications. Whether you're building a simple link checker or a complex web crawler, SwiftSoup's CSS selector support and DOM traversal methods make link extraction straightforward and efficient.
Remember to handle errors appropriately, validate extracted URLs, and consider using absolute URLs when working with relative links. With these techniques, you can build robust link extraction functionality for your Swift applications.
The combination of SwiftSoup's parsing capabilities with proper error handling and validation creates a solid foundation for any link extraction task, from simple one-off scripts to production-grade web scraping applications.