How do I extract meta tag content using SwiftSoup?

SwiftSoup is an HTML parsing library for Swift that lets you query and manipulate HTML with CSS selectors. A common use case is extracting meta tag content for SEO analysis, social media previews, or general metadata processing. This guide walks through examples and practices for extracting meta tags with SwiftSoup.

Understanding Meta Tags

Meta tags are HTML elements that provide metadata about a web page. They're typically found in the <head> section and contain information like page descriptions, keywords, author details, and social media sharing data. Common meta tags include:

  • <meta name="description" content="Page description">
  • <meta name="keywords" content="keyword1, keyword2">
  • <meta property="og:title" content="Open Graph title">
  • <meta name="viewport" content="width=device-width, initial-scale=1">

Installing SwiftSoup

Before extracting meta tags, ensure SwiftSoup is properly installed in your project:

Using Swift Package Manager

dependencies: [
    .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.6.0")
]
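
The `dependencies` entry on its own does not make SwiftSoup importable: the product also has to be listed on the target that uses it. A minimal `Package.swift` sketch, assuming a hypothetical executable package named `MetaTagDemo` and Swift tools version 5.5 (both are placeholders chosen for illustration), might look like this:

```swift
// swift-tools-version:5.5
import PackageDescription

let package = Package(
    name: "MetaTagDemo", // hypothetical package name
    dependencies: [
        .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.6.0")
    ],
    targets: [
        .executableTarget(
            name: "MetaTagDemo", // hypothetical target name
            dependencies: [
                // Link the SwiftSoup library product into this target
                .product(name: "SwiftSoup", package: "SwiftSoup")
            ]
        )
    ]
)
```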

Using CocoaPods

pod 'SwiftSoup', '~> 2.6.0'

Basic Meta Tag Extraction

Here's how to extract meta tag content using SwiftSoup's selector methods:

import SwiftSoup

func extractMetaTags(from html: String) throws {
    let doc = try SwiftSoup.parse(html)

    // Extract meta description
    if let descriptionElement = try doc.select("meta[name=description]").first() {
        let description = try descriptionElement.attr("content")
        print("Description: \(description)")
    }

    // Extract meta keywords
    if let keywordsElement = try doc.select("meta[name=keywords]").first() {
        let keywords = try keywordsElement.attr("content")
        print("Keywords: \(keywords)")
    }

    // Extract viewport meta tag
    if let viewportElement = try doc.select("meta[name=viewport]").first() {
        let viewport = try viewportElement.attr("content")
        print("Viewport: \(viewport)")
    }
}
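
The three lookups above repeat the same select-then-read pattern, so it can be factored into one helper. The sketch below is one possible shape, not part of SwiftSoup itself: the function name `metaContent(in:attribute:value:)` is hypothetical, and treating an empty `content` as missing is a design assumption.

```swift
import SwiftSoup

/// Returns the `content` of the first <meta> tag whose `attribute` equals `value`.
/// Returns nil when the tag is absent or its content is empty (assumed behavior).
func metaContent(in doc: Document, attribute: String = "name", value: String) throws -> String? {
    guard let element = try doc.select("meta[\(attribute)='\(value)']").first() else {
        return nil
    }
    let content = try element.attr("content")
    return content.isEmpty ? nil : content
}

// Usage: look up one tag that exists and one that does not
let sampleDoc = try SwiftSoup.parse("<head><meta name='description' content='Demo page'></head>")
let sampleDescription = try metaContent(in: sampleDoc, value: "description")
let sampleKeywords = try metaContent(in: sampleDoc, value: "keywords")
print(sampleDescription ?? "no description")
print(sampleKeywords ?? "no keywords")
```

Passing `attribute: "property"` lets the same helper read Open Graph tags such as `og:title`.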

Advanced Meta Tag Extraction Techniques

Extracting Open Graph Meta Tags

Open Graph meta tags are essential for social media sharing. Here's how to extract them:

func extractOpenGraphTags(from html: String) throws -> [String: String] {
    let doc = try SwiftSoup.parse(html)
    var ogTags: [String: String] = [:]

    let ogElements = try doc.select("meta[property^=og:]")

    for element in ogElements {
        let property = try element.attr("property")
        let content = try element.attr("content")
        ogTags[property] = content
    }

    return ogTags
}

// Usage example
let html = """
<!DOCTYPE html>
<html>
<head>
    <meta property="og:title" content="Amazing Swift Tutorial">
    <meta property="og:description" content="Learn SwiftSoup with examples">
    <meta property="og:image" content="https://example.com/image.jpg">
    <meta property="og:url" content="https://example.com/tutorial">
</head>
<body></body>
</html>
"""

do {
    let ogTags = try extractOpenGraphTags(from: html)
    for (property, content) in ogTags {
        print("\(property): \(content)")
    }
} catch {
    print("Error: \(error)")
}
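
A dictionary keyed by property keeps only the last value when a property repeats, and the Open Graph protocol allows repeats (several `og:image` tags, for example). A minimal sketch that collects every value in document order follows; the function name `extractOpenGraphValues` is hypothetical.

```swift
import SwiftSoup

/// Collects all Open Graph values per property, preserving document order.
func extractOpenGraphValues(from html: String) throws -> [String: [String]] {
    let doc = try SwiftSoup.parse(html)
    var values: [String: [String]] = [:]

    for element in try doc.select("meta[property^=og:]") {
        let property = try element.attr("property")
        let content = try element.attr("content")
        // Append instead of overwrite so repeated properties are kept
        values[property, default: []].append(content)
    }
    return values
}

// Usage: a page that declares two images
let galleryHTML = "<head>"
    + "<meta property='og:title' content='Gallery'>"
    + "<meta property='og:image' content='https://example.com/a.jpg'>"
    + "<meta property='og:image' content='https://example.com/b.jpg'>"
    + "</head>"
let galleryTags = try extractOpenGraphValues(from: galleryHTML)
print(galleryTags["og:image"] ?? [])
```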

Extracting Twitter Card Meta Tags

Twitter Card meta tags require similar handling:

func extractTwitterCardTags(from html: String) throws -> [String: String] {
    let doc = try SwiftSoup.parse(html)
    var twitterTags: [String: String] = [:]

    let twitterElements = try doc.select("meta[name^=twitter:]")

    for element in twitterElements {
        let name = try element.attr("name")
        let content = try element.attr("content")
        twitterTags[name] = content
    }

    return twitterTags
}

Comprehensive Meta Tag Extractor Class

Here's a robust class for extracting various types of meta tags:

import SwiftSoup

class MetaTagExtractor {

    struct MetaData {
        let title: String?
        let description: String?
        let keywords: String?
        let author: String?
        let viewport: String?
        let robots: String?
        let openGraph: [String: String]
        let twitterCard: [String: String]
        let customMeta: [String: String]
    }

    static func extractMetaData(from html: String) throws -> MetaData {
        let doc = try SwiftSoup.parse(html)

        // Extract standard meta tags
        let title = try? doc.select("title").first()?.text()
        let description = try? doc.select("meta[name=description]").first()?.attr("content")
        let keywords = try? doc.select("meta[name=keywords]").first()?.attr("content")
        let author = try? doc.select("meta[name=author]").first()?.attr("content")
        let viewport = try? doc.select("meta[name=viewport]").first()?.attr("content")
        let robots = try? doc.select("meta[name=robots]").first()?.attr("content")

        // Extract Open Graph tags
        var openGraph: [String: String] = [:]
        let ogElements = try doc.select("meta[property^=og:]")
        for element in ogElements {
            let property = try element.attr("property")
            let content = try element.attr("content")
            openGraph[property] = content
        }

        // Extract Twitter Card tags
        var twitterCard: [String: String] = [:]
        let twitterElements = try doc.select("meta[name^=twitter:]")
        for element in twitterElements {
            let name = try element.attr("name")
            let content = try element.attr("content")
            twitterCard[name] = content
        }

        // Extract custom meta tags
        var customMeta: [String: String] = [:]
        let allMetaElements = try doc.select("meta[name]")
        for element in allMetaElements {
            let name = try element.attr("name")
            let content = try element.attr("content")

            // Skip standard meta tags
            if !["description", "keywords", "author", "viewport", "robots"].contains(name) &&
               !name.hasPrefix("twitter:") {
                customMeta[name] = content
            }
        }

        return MetaData(
            title: title,
            description: description,
            keywords: keywords,
            author: author,
            viewport: viewport,
            robots: robots,
            openGraph: openGraph,
            twitterCard: twitterCard,
            customMeta: customMeta
        )
    }
}

Error Handling and Best Practices

When extracting meta tags, it's important to handle potential errors gracefully:

func safeMetaExtraction(from html: String) {
    do {
        let metaData = try MetaTagExtractor.extractMetaData(from: html)

        // Safely access optional values
        if let description = metaData.description, !description.isEmpty {
            print("Page Description: \(description)")
        } else {
            print("No description meta tag found")
        }

        // Process Open Graph data
        if !metaData.openGraph.isEmpty {
            print("Open Graph tags found:")
            metaData.openGraph.forEach { key, value in
                print("  \(key): \(value)")
            }
        }

    } catch Exception.Error(let type, let message) {
        // SwiftSoup reports parsing and selector failures through its Exception enum
        print("SwiftSoup Error - Type: \(type), Message: \(message)")
    } catch {
        print("Unexpected error: \(error)")
    }
}

Working with Remote HTML Content

When scraping web pages, you'll often need to fetch HTML content from URLs. Here's how to combine URLSession with SwiftSoup:

import Foundation

func extractMetaFromURL(_ urlString: String, completion: @escaping (MetaTagExtractor.MetaData?) -> Void) {
    guard let url = URL(string: urlString) else {
        completion(nil)
        return
    }

    URLSession.shared.dataTask(with: url) { data, response, error in
        guard let data = data,
              let html = String(data: data, encoding: .utf8) else {
            completion(nil)
            return
        }

        do {
            let metaData = try MetaTagExtractor.extractMetaData(from: html)
            completion(metaData)
        } catch {
            print("Error extracting meta data: \(error)")
            completion(nil)
        }
    }.resume()
}
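
The function above gives up when the response body is not valid UTF-8, which some older pages are not. One way to soften that is a decoding helper with a fallback. The sketch below assumes a simple UTF-8 then ISO Latin-1 order purely for illustration; the name `decodeHTML` is hypothetical, and a more careful version would consult the `Content-Type` header or a `<meta charset>` declaration.

```swift
import Foundation

/// Decodes downloaded bytes as UTF-8 first, then falls back to ISO Latin-1.
/// The fallback order is an assumption for illustration only.
func decodeHTML(_ data: Data) -> String {
    if let utf8 = String(data: data, encoding: .utf8) {
        return utf8
    }
    // ISO Latin-1 maps every byte to a character, so this decode is expected to succeed;
    // the lossy UTF-8 decode is a last resort.
    return String(data: data, encoding: .isoLatin1) ?? String(decoding: data, as: UTF8.self)
}

// Usage: bytes that are valid Latin-1 but not valid UTF-8
let latin1Bytes = Data([0x63, 0x61, 0x66, 0xE9])
print(decodeHTML(latin1Bytes))
```

In `extractMetaFromURL`, this helper could replace the `String(data: data, encoding: .utf8)` call inside the `guard`.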

Performance Considerations

For large-scale meta tag extraction, consider these optimization strategies:

  1. Selective Parsing: Only parse the <head> section when possible
  2. Caching: Cache frequently accessed meta data
  3. Asynchronous Processing: Use background queues for multiple extractions

func extractMetaFromHead(html: String) throws -> MetaTagExtractor.MetaData {
    // Extract only the head section for faster parsing
    if let headStart = html.range(of: "<head>", options: .caseInsensitive),
       let headEnd = html.range(of: "</head>", options: .caseInsensitive) {
        let headContent = String(html[headStart.lowerBound..<headEnd.upperBound])
        return try MetaTagExtractor.extractMetaData(from: headContent)
    }

    // Fall back to full document parsing
    return try MetaTagExtractor.extractMetaData(from: html)
}
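
The caching strategy from the list above can be sketched as a small in-memory cache with a time-to-live. Everything here is hypothetical (the `MetadataCache` type, its method names, and the 300-second default) and is meant as a starting point rather than a tuned implementation.

```swift
import Foundation

/// A small thread-safe cache keyed by URL string, with a time-to-live per entry.
final class MetadataCache<Value> {
    private var storage: [String: (value: Value, storedAt: Date)] = [:]
    private let lock = NSLock()
    private let timeToLive: TimeInterval

    init(timeToLive: TimeInterval = 300) {
        self.timeToLive = timeToLive
    }

    /// Returns the cached value if it is still fresh, otherwise runs `compute` and stores the result.
    func value(for key: String, now: Date = Date(), compute: () throws -> Value) rethrows -> Value {
        lock.lock()
        let cached = storage[key]
        lock.unlock()

        if let cached = cached, now.timeIntervalSince(cached.storedAt) < timeToLive {
            return cached.value
        }

        let fresh = try compute()
        lock.lock()
        storage[key] = (value: fresh, storedAt: now)
        lock.unlock()
        return fresh
    }
}

// Usage: the second lookup is served from the cache, so the closure runs once
var extractionCount = 0
let descriptionCache = MetadataCache<String>(timeToLive: 60)
let firstLookup = descriptionCache.value(for: "https://example.com") {
    extractionCount += 1
    return "Example description"
}
let secondLookup = descriptionCache.value(for: "https://example.com") {
    extractionCount += 1
    return "Example description"
}
print(firstLookup, secondLookup, extractionCount)
```

In practice the closure would call something like `MetaTagExtractor.extractMetaData(from:)`, with `Value` set to the metadata type.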

Integration with Web Scraping Workflows

Meta tag extraction is often one step in a larger scraping pipeline. SwiftSoup parses static HTML only and does not execute JavaScript, so meta tags that a page injects at runtime will not appear in the markup you fetch with URLSession. For JavaScript-heavy sites, render the page first with a headless browser or a rendering API, then pass the resulting HTML to SwiftSoup.

Pages behind a login work the same way: use a browser automation tool such as Puppeteer to handle authentication and session state, and use SwiftSoup to parse the HTML once you have it.

Common Pitfalls and Solutions

  1. Missing Meta Tags: Always check if elements exist before accessing attributes
  2. Encoding Issues: Ensure proper character encoding when fetching remote content
  3. Malformed HTML: SwiftSoup is forgiving, but validate critical meta data
  4. Case Sensitivity: Meta tag names and attributes can vary in case

// Robust meta tag extraction with fallbacks
func extractDescriptionWithFallback(from doc: Document) throws -> String? {
    // Try standard description
    if let desc = try doc.select("meta[name=description]").first()?.attr("content"),
       !desc.isEmpty {
        return desc
    }

    // Try Open Graph description
    if let ogDesc = try doc.select("meta[property='og:description']").first()?.attr("content"),
       !ogDesc.isEmpty {
        return ogDesc
    }

    // Try Twitter description
    if let twitterDesc = try doc.select("meta[name='twitter:description']").first()?.attr("content"),
       !twitterDesc.isEmpty {
        return twitterDesc
    }

    return nil
}

Advanced Techniques for Specific Meta Tags

Extracting Structured Data (JSON-LD)

Many modern websites include structured data in JSON-LD format within script tags:

func extractJSONLD(from html: String) throws -> [String: Any]? {
    let doc = try SwiftSoup.parse(html)

    let scriptElements = try doc.select("script[type='application/ld+json']")

    for scriptElement in scriptElements {
        let jsonString = try scriptElement.html()

        // Use try? so a malformed JSON-LD block is skipped instead of aborting the search
        if let jsonData = jsonString.data(using: .utf8),
           let jsonObject = try? JSONSerialization.jsonObject(with: jsonData, options: []) as? [String: Any] {
            return jsonObject
        }
    }

    return nil
}
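
JSON-LD does not always arrive as a single top-level object: a script block may hold an array of objects, or one object whose `@graph` key holds the array. A minimal sketch that normalizes these shapes into a list of dictionaries follows; the helper name `jsonLDObjects` is hypothetical, and it works on the raw script text so it needs only Foundation.

```swift
import Foundation

/// Normalizes a JSON-LD string into a list of objects.
/// Handles a single object, a top-level array, and an object with an "@graph" array.
func jsonLDObjects(from jsonString: String) -> [[String: Any]] {
    guard let data = jsonString.data(using: .utf8),
          let parsed = try? JSONSerialization.jsonObject(with: data, options: []) else {
        return [] // Not valid JSON
    }
    if let array = parsed as? [[String: Any]] {
        return array
    }
    if let object = parsed as? [String: Any] {
        if let graph = object["@graph"] as? [[String: Any]] {
            return graph
        }
        return [object]
    }
    return []
}

// Usage: three shapes a page might contain
let arrayShape = jsonLDObjects(from: "[{\"@type\":\"Article\"},{\"@type\":\"Person\"}]")
let graphShape = jsonLDObjects(from: "{\"@graph\":[{\"@type\":\"WebPage\"}]}")
let invalidShape = jsonLDObjects(from: "not json")
print(arrayShape.count, graphShape.count, invalidShape.count)
```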

Extracting Canonical URLs

Canonical URLs are important for SEO and content management:

func extractCanonicalURL(from html: String) throws -> String? {
    let doc = try SwiftSoup.parse(html)

    // Check for link rel="canonical"
    if let canonicalElement = try doc.select("link[rel=canonical]").first() {
        return try canonicalElement.attr("href")
    }

    // Fallback to Open Graph URL
    if let ogUrlElement = try doc.select("meta[property='og:url']").first() {
        return try ogUrlElement.attr("content")
    }

    return nil
}
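
The `href` of a canonical link can be relative (for example `/tutorial`), in which case it needs to be resolved against the URL the page was fetched from before it is stored or compared. A minimal sketch using Foundation's `URL(string:relativeTo:)` follows; the helper name `resolveCanonical` and the sample URLs are hypothetical.

```swift
import Foundation

/// Resolves a possibly relative canonical href against the page's own URL.
/// Returns nil when either string cannot be parsed as a URL.
func resolveCanonical(_ href: String, relativeTo pageURL: String) -> String? {
    let trimmed = href.trimmingCharacters(in: .whitespacesAndNewlines)
    guard let base = URL(string: pageURL),
          let resolved = URL(string: trimmed, relativeTo: base) else {
        return nil
    }
    return resolved.absoluteString
}

// Usage: a relative href and an already absolute one
let relativeResult = resolveCanonical("/tutorial", relativeTo: "https://example.com/blog/post?x=1")
let absoluteResult = resolveCanonical("https://example.org/a", relativeTo: "https://example.com/")
print(relativeResult ?? "nil", absoluteResult ?? "nil")
```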

Testing Your Meta Tag Extraction

It's important to test your meta tag extraction with various HTML samples:

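import XCTest
// Also import the module that defines MetaTagExtractor,
// e.g. `@testable import YourModule` (placeholder name)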

class MetaTagExtractionTests: XCTestCase {

    func testBasicMetaTagExtraction() throws {
        let html = """
        <!DOCTYPE html>
        <html>
        <head>
            <title>Test Page</title>
            <meta name="description" content="Test description">
            <meta name="keywords" content="swift, swiftsoup, testing">
            <meta property="og:title" content="OG Title">
            <meta name="twitter:card" content="summary">
        </head>
        <body></body>
        </html>
        """

        let metaData = try MetaTagExtractor.extractMetaData(from: html)

        XCTAssertEqual(metaData.title, "Test Page")
        XCTAssertEqual(metaData.description, "Test description")
        XCTAssertEqual(metaData.keywords, "swift, swiftsoup, testing")
        XCTAssertEqual(metaData.openGraph["og:title"], "OG Title")
        XCTAssertEqual(metaData.twitterCard["twitter:card"], "summary")
    }

    func testMissingMetaTags() throws {
        let html = """
        <!DOCTYPE html>
        <html>
        <head>
            <title>Minimal Page</title>
        </head>
        <body></body>
        </html>
        """

        let metaData = try MetaTagExtractor.extractMetaData(from: html)

        XCTAssertEqual(metaData.title, "Minimal Page")
        XCTAssertNil(metaData.description)
        XCTAssertTrue(metaData.openGraph.isEmpty)
        XCTAssertTrue(metaData.twitterCard.isEmpty)
    }
}

Conclusion

SwiftSoup provides a powerful and flexible way to extract meta tag content from HTML documents. Whether you're building an SEO analyzer, social media preview generator, or content management system, the techniques covered in this guide will help you efficiently extract and process meta tag information. Remember to handle errors gracefully, validate extracted data, and consider performance implications when processing large volumes of content.

The key to successful meta tag extraction is understanding the structure of the HTML you're parsing and using appropriate CSS selectors to target the specific meta tags you need. With SwiftSoup's intuitive API and the examples provided here, you'll be able to build robust meta tag extraction functionality for your Swift applications.
