Table of contents

How do I extract form data from HTML using SwiftSoup?

SwiftSoup is a powerful HTML parsing library for Swift that allows iOS developers to extract and manipulate HTML content efficiently. When it comes to extracting form data, SwiftSoup provides robust methods to locate form elements, input fields, and their associated values. This comprehensive guide will show you how to extract various types of form data using SwiftSoup.

Understanding HTML Forms

HTML forms contain various input elements such as text fields, checkboxes, radio buttons, select dropdowns, and textareas. Each form element has attributes like name, value, type, and id that SwiftSoup can target for data extraction.

Basic Form Data Extraction

Setting Up SwiftSoup

First, ensure you have SwiftSoup installed in your iOS project:

import SwiftSoup

Extracting All Forms

To extract all forms from an HTML document:

import SwiftSoup

func extractAllForms(from html: String) throws -> [Element] {
    let doc = try SwiftSoup.parse(html)
    let forms = try doc.select("form")
    return forms
}

// Usage example
let htmlContent = """
<html>
<body>
    <form id="loginForm" action="/login" method="post">
        <input type="text" name="username" value="john_doe">
        <input type="password" name="password" value="">
        <input type="submit" value="Login">
    </form>
    <form id="signupForm" action="/signup" method="post">
        <input type="email" name="email" value="">
        <input type="text" name="fullName" value="">
    </form>
</body>
</html>
"""

do {
    let forms = try extractAllForms(from: htmlContent)
    print("Found \(forms.count) forms")

    for form in forms {
        let formId = try form.attr("id")
        let action = try form.attr("action")
        let method = try form.attr("method")
        print("Form ID: \(formId), Action: \(action), Method: \(method)")
    }
} catch {
    print("Error: \(error)")
}

Extracting Input Fields

Text Inputs and Basic Fields

func extractInputFields(from html: String) throws -> [(name: String, value: String, type: String)] {
    let doc = try SwiftSoup.parse(html)
    let inputs = try doc.select("input[name]")

    var inputData: [(name: String, value: String, type: String)] = []

    for input in inputs {
        let name = try input.attr("name")
        let value = try input.attr("value")
        let type = try input.attr("type")

        inputData.append((name: name, value: value, type: type))
    }

    return inputData
}

// Example usage
let formHTML = """
<form>
    <input type="text" name="firstName" value="John">
    <input type="text" name="lastName" value="Doe">
    <input type="email" name="email" value="john@example.com">
    <input type="hidden" name="userId" value="12345">
    <input type="number" name="age" value="30">
</form>
"""

do {
    let inputs = try extractInputFields(from: formHTML)
    for input in inputs {
        print("Name: \(input.name), Value: \(input.value), Type: \(input.type)")
    }
} catch {
    print("Error extracting inputs: \(error)")
}

Handling Checkboxes and Radio Buttons

func extractCheckboxesAndRadios(from html: String) throws -> [String: Any] {
    let doc = try SwiftSoup.parse(html)
    var formData: [String: Any] = [:]

    // Extract checked checkboxes
    let checkedBoxes = try doc.select("input[type=checkbox]:checked")
    var selectedCheckboxes: [String] = []

    for checkbox in checkedBoxes {
        let name = try checkbox.attr("name")
        let value = try checkbox.attr("value")
        selectedCheckboxes.append("\(name): \(value)")
    }
    formData["checkboxes"] = selectedCheckboxes

    // Extract selected radio buttons
    let selectedRadios = try doc.select("input[type=radio]:checked")
    for radio in selectedRadios {
        let name = try radio.attr("name")
        let value = try radio.attr("value")
        formData[name] = value
    }

    return formData
}

// Example HTML with checkboxes and radio buttons
let checkboxRadioHTML = """
<form>
    <input type="checkbox" name="interests" value="sports" checked>
    <input type="checkbox" name="interests" value="music" checked>
    <input type="checkbox" name="interests" value="reading">

    <input type="radio" name="gender" value="male" checked>
    <input type="radio" name="gender" value="female">
    <input type="radio" name="gender" value="other">
</form>
"""

do {
    let data = try extractCheckboxesAndRadios(from: checkboxRadioHTML)
    print("Form data: \(data)")
} catch {
    print("Error: \(error)")
}

Working with Select Elements

Extracting Dropdown Values

func extractSelectOptions(from html: String) throws -> [String: [String]] {
    let doc = try SwiftSoup.parse(html)
    let selects = try doc.select("select")
    var selectData: [String: [String]] = [:]

    for select in selects {
        let name = try select.attr("name")
        let options = try select.select("option")

        var optionValues: [String] = []
        var selectedValue: String?

        for option in options {
            let value = try option.attr("value")
            let text = try option.text()
            let isSelected = try option.hasAttr("selected")

            optionValues.append("\(text) (\(value))")

            if isSelected {
                selectedValue = value
            }
        }

        selectData[name] = optionValues
        if let selected = selectedValue {
            selectData["\(name)_selected"] = [selected]
        }
    }

    return selectData
}

let selectHTML = """
<form>
    <select name="country">
        <option value="us">United States</option>
        <option value="uk" selected>United Kingdom</option>
        <option value="ca">Canada</option>
    </select>

    <select name="language" multiple>
        <option value="en" selected>English</option>
        <option value="es" selected>Spanish</option>
        <option value="fr">French</option>
    </select>
</form>
"""

do {
    let selectData = try extractSelectOptions(from: selectHTML)
    for (key, values) in selectData {
        print("\(key): \(values)")
    }
} catch {
    print("Error: \(error)")
}

Extracting Textarea Content

func extractTextareas(from html: String) throws -> [String: String] {
    let doc = try SwiftSoup.parse(html)
    let textareas = try doc.select("textarea")
    var textareaData: [String: String] = [:]

    for textarea in textareas {
        let name = try textarea.attr("name")
        let content = try textarea.text()
        textareaData[name] = content
    }

    return textareaData
}

let textareaHTML = """
<form>
    <textarea name="comments">This is a sample comment.</textarea>
    <textarea name="description">Product description goes here.</textarea>
</form>
"""

do {
    let textareas = try extractTextareas(from: textareaHTML)
    for (name, content) in textareas {
        print("Textarea \(name): \(content)")
    }
} catch {
    print("Error: \(error)")
}

Comprehensive Form Data Extraction

Here's a complete function that extracts all types of form data:

struct FormField {
    let name: String
    let value: String
    let type: String
    let isChecked: Bool?
    let isSelected: Bool?
}

struct FormData {
    let action: String
    let method: String
    let fields: [FormField]
}

func extractCompleteFormData(from html: String, formId: String? = nil) throws -> [FormData] {
    let doc = try SwiftSoup.parse(html)
    let forms = formId != nil ? try doc.select("form#\(formId!)") : try doc.select("form")

    var formsData: [FormData] = []

    for form in forms {
        let action = try form.attr("action")
        let method = try form.attr("method")
        var fields: [FormField] = []

        // Extract input fields
        let inputs = try form.select("input")
        for input in inputs {
            let name = try input.attr("name")
            let value = try input.attr("value")
            let type = try input.attr("type")
            let isChecked = try input.hasAttr("checked") ? true : nil

            if !name.isEmpty {
                fields.append(FormField(name: name, value: value, type: type, isChecked: isChecked, isSelected: nil))
            }
        }

        // Extract select elements
        let selects = try form.select("select")
        for select in selects {
            let name = try select.attr("name")
            let options = try select.select("option")

            for option in options {
                let value = try option.attr("value")
                let text = try option.text()
                let isSelected = try option.hasAttr("selected") ? true : nil

                fields.append(FormField(name: name, value: value, type: "option", isChecked: nil, isSelected: isSelected))
            }
        }

        // Extract textarea elements
        let textareas = try form.select("textarea")
        for textarea in textareas {
            let name = try textarea.attr("name")
            let content = try textarea.text()

            fields.append(FormField(name: name, value: content, type: "textarea", isChecked: nil, isSelected: nil))
        }

        formsData.append(FormData(action: action, method: method, fields: fields))
    }

    return formsData
}

Advanced Form Extraction Techniques

Extracting Form Data with Validation

func extractFormDataWithValidation(from html: String) throws -> [String: Any] {
    let doc = try SwiftSoup.parse(html)
    var formData: [String: Any] = [:]
    var validationErrors: [String] = []

    // Extract required fields
    let requiredFields = try doc.select("input[required], select[required], textarea[required]")

    for field in requiredFields {
        let name = try field.attr("name")
        let value = try field.attr("value")
        let tagName = field.tagName()

        if tagName == "textarea" {
            let textContent = try field.text()
            if textContent.isEmpty {
                validationErrors.append("Required field '\(name)' is empty")
            }
        } else if value.isEmpty && tagName != "select" {
            validationErrors.append("Required field '\(name)' is empty")
        }

        formData[name] = value
    }

    formData["validation_errors"] = validationErrors
    return formData
}

Extracting Form Labels and Context

func extractFormFieldsWithLabels(from html: String) throws -> [(field: String, label: String, value: String)] {
    let doc = try SwiftSoup.parse(html)
    var fieldsWithLabels: [(field: String, label: String, value: String)] = []

    let inputs = try doc.select("input[name], select[name], textarea[name]")

    for input in inputs {
        let name = try input.attr("name")
        let id = try input.attr("id")
        var value = try input.attr("value")
        let tagName = input.tagName()

        // For textarea, get text content
        if tagName == "textarea" {
            value = try input.text()
        }

        // For select, get selected option text
        if tagName == "select" {
            let selectedOption = try input.select("option[selected]").first()
            if let option = selectedOption {
                value = try option.text()
            }
        }

        // Try to find associated label
        var labelText = ""

        // Look for label by 'for' attribute
        if !id.isEmpty {
            let labels = try doc.select("label[for=\(id)]")
            if let label = labels.first() {
                labelText = try label.text()
            }
        }

        // Look for parent label
        if labelText.isEmpty {
            let parentLabel = try input.parent()?.closest("label")
            if let label = parentLabel {
                labelText = try label.text()
            }
        }

        fieldsWithLabels.append((field: name, label: labelText, value: value))
    }

    return fieldsWithLabels
}

JavaScript and Dynamic Forms

For forms with dynamic content generated by JavaScript, SwiftSoup can only parse static HTML. Consider combining SwiftSoup with how to handle authentication in Puppeteer for scenarios requiring JavaScript execution.

Error Handling and Best Practices

Robust Error Handling

enum FormExtractionError: Error {
    case invalidHTML
    case formNotFound
    case missingRequiredAttribute
}

func safeExtractFormData(from html: String) -> Result<[FormData], FormExtractionError> {
    do {
        let formData = try extractCompleteFormData(from: html)
        if formData.isEmpty {
            return .failure(.formNotFound)
        }
        return .success(formData)
    } catch {
        if error is SwiftSoup.Exception {
            return .failure(.invalidHTML)
        }
        return .failure(.missingRequiredAttribute)
    }
}

// Usage with proper error handling
let result = safeExtractFormData(from: htmlContent)

switch result {
case .success(let forms):
    print("Successfully extracted \(forms.count) forms")
    for form in forms {
        print("Form action: \(form.action)")
        print("Fields count: \(form.fields.count)")
    }
case .failure(let error):
    print("Failed to extract form data: \(error)")
}

Form Submission Preparation

func prepareFormForSubmission(formData: FormData) -> [String: String] {
    var submissionData: [String: String] = [:]

    for field in formData.fields {
        switch field.type {
        case "checkbox":
            // Only include checked checkboxes
            if field.isChecked == true {
                submissionData[field.name] = field.value
            }
        case "radio":
            // Only include selected radio button
            if field.isChecked == true {
                submissionData[field.name] = field.value
            }
        case "option":
            // Only include selected options
            if field.isSelected == true {
                submissionData[field.name] = field.value
            }
        case "submit", "button":
            // Skip submit buttons unless specifically needed
            break
        default:
            // Include all other field types
            submissionData[field.name] = field.value
        }
    }

    return submissionData
}

Performance Considerations

When working with large HTML documents or multiple forms:

  1. Use specific selectors: Instead of select("*"), use targeted selectors like select("form input[name]")
  2. Limit parsing scope: If you know the form's location, parse only that section
  3. Cache parsed documents: For repeated operations on the same HTML, parse once and reuse
class FormExtractor {
    private var cachedDocument: Document?

    func extractForms(from html: String, useCache: Bool = true) throws -> [FormData] {
        let doc: Document

        if useCache && cachedDocument != nil {
            doc = cachedDocument!
        } else {
            doc = try SwiftSoup.parse(html)
            if useCache {
                cachedDocument = doc
            }
        }

        return try extractCompleteFormData(from: html)
    }

    func clearCache() {
        cachedDocument = nil
    }
}

Integration with Web Scraping Workflows

SwiftSoup works excellently for parsing static HTML content. For scenarios involving complex form interactions or how to inject JavaScript into a page using Puppeteer, you might need to combine SwiftSoup with HTTP client libraries or web automation tools.

Real-World Example: Contact Form Extraction

func extractContactForm(from html: String) throws -> ContactFormData? {
    let doc = try SwiftSoup.parse(html)

    // Look for contact or feedback forms
    let contactForms = try doc.select("form[action*='contact'], form[action*='feedback'], form[id*='contact']")

    guard let form = contactForms.first() else {
        return nil
    }

    var formFields: [String: String] = [:]

    // Extract common contact form fields
    let nameField = try form.select("input[name*='name'], input[id*='name']").first()
    let emailField = try form.select("input[type='email'], input[name*='email']").first()
    let messageField = try form.select("textarea[name*='message'], textarea[name*='comment']").first()

    if let name = nameField {
        formFields["name"] = try name.attr("value")
    }

    if let email = emailField {
        formFields["email"] = try email.attr("value")
    }

    if let message = messageField {
        formFields["message"] = try message.text()
    }

    return ContactFormData(
        action: try form.attr("action"),
        method: try form.attr("method"),
        fields: formFields
    )
}

struct ContactFormData {
    let action: String
    let method: String
    let fields: [String: String]
}

Conclusion

SwiftSoup provides powerful capabilities for extracting form data from HTML documents in iOS applications. Whether you're dealing with simple contact forms or complex multi-step forms, SwiftSoup's CSS selector-based approach makes form data extraction straightforward and reliable. The key to success is understanding the HTML structure, using appropriate selectors, and implementing proper error handling for robust applications.

Remember to always validate and sanitize extracted form data before using it in your applications, especially when dealing with user-generated content or external HTML sources. For dynamic forms that require JavaScript execution, consider integrating SwiftSoup with web scraping APIs or automation tools that can handle rendered content.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"

Try in request builder

Related Questions

Get Started Now

WebScraping.AI provides rotating proxies, Chromium rendering and built-in HTML parser for web scraping
Icon