SwiftSoup is a pure Swift library for working with real-world HTML, inspired by the popular Java library Jsoup. It provides a convenient API for extracting and manipulating data from HTML documents. SwiftSoup deals with HTML as it is found in the wild: imperfect, dirty, and sometimes broken. Pages in the wild also arrive in many character encodings, so it is worth being clear about which part of that SwiftSoup handles and which part is up to you.
SwiftSoup's String-based parse functions take a Swift String, and a String is already decoded Unicode text. The character encoding therefore matters one step earlier, when the raw bytes of the page are turned into a String. The encoding is typically declared in either a meta tag within the <head> section of the HTML, or in the Content-Type header of the HTTP response. If the encoding is declared, use it to decode the bytes so that the characters in the document are interpreted correctly.
Here's an example of how a meta tag in an HTML document might specify the character encoding:
<meta charset="UTF-8">
If the character encoding isn't declared, or you can't determine it, fall back to UTF-8. It is the most common encoding on the web and covers the full Unicode range.
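The lookup described above can be sketched with Foundation alone. This is a minimal sketch, not SwiftSoup API: declaredCharset(in:) is a hypothetical helper, the 1024-byte window is an arbitrary choice, and it assumes an ASCII-compatible encoding (it will not find a declaration in a UTF-16 page).

```swift
import Foundation

// Hypothetical helper: find the charset label declared in a meta tag near the
// top of an HTML page, returning "utf-8" when nothing is declared.
func declaredCharset(in data: Data) -> String {
    let fallback = "utf-8"
    // ISO-8859-1 maps every byte to a character, so this decode always succeeds
    // and leaves the ASCII markup readable whatever the real encoding is.
    guard let head = String(data: data.prefix(1024), encoding: .isoLatin1),
          let regex = try? NSRegularExpression(
              pattern: "<meta[^>]*charset\\s*=\\s*[\"']?\\s*([A-Za-z0-9_.:-]+)",
              options: [.caseInsensitive])
    else { return fallback }

    let searchRange = NSRange(head.startIndex..., in: head)
    guard let match = regex.firstMatch(in: head, options: [], range: searchRange),
          let labelRange = Range(match.range(at: 1), in: head)
    else { return fallback }

    // Normalize to lowercase so later comparisons are case-insensitive.
    return head[labelRange].lowercased()
}
```

The same pattern covers both the short form (meta charset=...) and the older http-equiv form, since each contains charset= inside a meta tag.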
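Here is a simple example of parsing an HTML string that carries an encoding declaration:
import SwiftSoup

// The string literal is already Unicode text; the meta tag is just part of the markup.
let html: String = "<html><head><meta charset='ISO-8859-1'></head><body>...</body></html>"
do {
    let doc: Document = try SwiftSoup.parse(html)
    // Process the document...
} catch {
    // Handle error
    print("Parsing failed: \(error)")
}
In the code above, html is already a Swift String, so its characters were decoded before SwiftSoup ever saw them. The ISO-8859-1 declaration in the meta tag is kept as part of the parsed document, but it does not cause the string to be decoded again.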
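To see why the decoding step has to be right before parsing, here is a small Foundation-only illustration of what happens when bytes are read with the wrong encoding. The byte values are picked for the example; nothing here is SwiftSoup-specific.

```swift
import Foundation

// "café" encoded as ISO-8859-1: the é is the single byte 0xE9.
let latin1Bytes = Data([0x63, 0x61, 0x66, 0xE9])

// Decoding with the right encoding recovers the text.
let correct = String(data: latin1Bytes, encoding: .isoLatin1)

// A lone 0xE9 is not valid UTF-8, so a strict UTF-8 decode returns nil.
let strictUTF8 = String(data: latin1Bytes, encoding: .utf8)

// A lossy UTF-8 decode succeeds but substitutes U+FFFD for the bad byte.
let lossyUTF8 = String(decoding: latin1Bytes, as: UTF8.self)

// The reverse mistake: UTF-8 bytes read as ISO-8859-1 turn é into two characters.
let utf8Bytes = Data("caf\u{E9}".utf8)
let mojibake = String(data: utf8Bytes, encoding: .isoLatin1)
```

Once text has been decoded wrongly like this, the parser has no way to repair it, which is why the encoding needs to be settled before the string reaches SwiftSoup.parse.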
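It's also possible to explicitly set the character encoding if you know what it is and it's not correctly declared in the document. Decode the raw bytes yourself with that encoding, then hand the resulting string to SwiftSoup:
import Foundation
import SwiftSoup

// Stand-in for raw bytes read from a file or a network response
let rawBytes: Data = Data("<html>...</html>".utf8)
// The encoding you know the document uses (ISO-8859-1)
let encoding: String.Encoding = .isoLatin1
// String(data:encoding:) returns nil if the bytes are not valid in that encoding
if let html = String(data: rawBytes, encoding: encoding) {
    do {
        let doc: Document = try SwiftSoup.parse(html)
        // Process the document...
    } catch {
        // Handle error
        print("Parsing failed: \(error)")
    }
} else {
    print("The bytes could not be decoded with the chosen encoding")
}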
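If the encoding reaches you as a label such as "ISO-8859-1" rather than as a String.Encoding value, it has to be mapped first. The sketch below assumes a small hand-written table; stringEncoding(forCharset:) and decodeHTML(_:charset:) are hypothetical helper names, and the table covers only a few common labels.

```swift
import Foundation

// Hypothetical helper: map a few common charset labels to String.Encoding.
// The table is deliberately small; extend it for the encodings you expect.
func stringEncoding(forCharset label: String) -> String.Encoding? {
    switch label.trimmingCharacters(in: .whitespaces).lowercased() {
    case "utf-8", "utf8": return .utf8
    case "iso-8859-1", "latin1": return .isoLatin1
    case "iso-8859-2", "latin2": return .isoLatin2
    case "windows-1250", "cp1250": return .windowsCP1250
    case "windows-1251", "cp1251": return .windowsCP1251
    case "windows-1252", "cp1252": return .windowsCP1252
    case "us-ascii", "ascii": return .ascii
    case "shift_jis": return .shiftJIS
    case "euc-jp": return .japaneseEUC
    default: return nil
    }
}

// Hypothetical helper: decode raw HTML bytes using a charset label, falling
// back to a lossy UTF-8 decode if the label is unknown or decoding fails.
func decodeHTML(_ data: Data, charset label: String) -> String {
    if let encoding = stringEncoding(forCharset: label),
       let text = String(data: data, encoding: encoding) {
        return text
    }
    // Lossy fallback: invalid byte sequences become U+FFFD.
    return String(decoding: data, as: UTF8.self)
}
```

The string returned by decodeHTML can then be passed to SwiftSoup.parse as in the examples above.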
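If you're loading HTML from a URL, note that SwiftSoup does not fetch pages itself, so honoring the HTTP Content-Type header is your job: read the charset from the response (URLResponse exposes it as textEncodingName) and decode the body with it before parsing. If you're reading from a file or a stream, check the encoding declared in the document, or set the encoding manually when you decode the bytes if the declaration is missing or wrong.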
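When all you have is the raw header value, the charset parameter can be pulled out with plain string handling. This is a rough sketch rather than a full header parser; charset(fromContentType:) is a hypothetical helper name.

```swift
import Foundation

// Hypothetical helper: extract the charset parameter from a Content-Type header
// value such as "text/html; charset=ISO-8859-1". Returns nil when none is given.
func charset(fromContentType header: String) -> String? {
    // Parameters follow the media type, separated by semicolons.
    for parameter in header.split(separator: ";").dropFirst() {
        let parts = parameter.split(separator: "=", maxSplits: 1)
        guard parts.count == 2,
              parts[0].trimmingCharacters(in: .whitespaces).lowercased() == "charset"
        else { continue }
        // The value may be quoted; strip quotes and surrounding whitespace.
        let trimSet = CharacterSet(charactersIn: "\"'").union(.whitespaces)
        let value = parts[1].trimmingCharacters(in: trimSet)
        return value.isEmpty ? nil : value.lowercased()
    }
    return nil
}
```

A common order of preference is the HTTP header first, then a declaration inside the document, then UTF-8 as the fallback.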
Remember that SwiftSoup is a Swift library, and Swift strings are Unicode throughout. As long as the bytes are decoded with the right encoding before they reach SwiftSoup, whether that encoding comes from the document, the HTTP header, or your own knowledge of the source, you should be able to work with HTML documents in various encodings without much trouble.