swift - Allowed characters are being percent-encoded
Problem description
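The documentation for the String method addingPercentEncoding(withAllowedCharacters:) says:
Returns a new string made from the receiver by replacing all characters not in the specified set with percent-encoded characters.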
The documentation for the predefined set CharacterSet.alphanumerics says:
Returns a character set containing the characters in Unicode General Categories L*, M*, and N*.
The L (Letter) category consists of 5 subcategories: Ll, Lm, Lt, Lu, and Lo. So I assume L* means all subcategories of L.
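I'll look at the Ll subcategory ( https://www.compart.com/en/unicode/category/Ll#UNC_SCRIPTS ) and pick the character "æ" (U+00E6). I can confirm that the alphanumerics character set does contain this character. But when I add percent encoding to a string containing this character, it gets percent-encoded anyway.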
"\u{E6}" // "æ"
CharacterSet.alphanumerics.contains("\u{E6}") // true
"æ".addingPercentEncoding(withAllowedCharacters: .alphanumerics) // "%C3%A6"
// Let's try with "a"
"\u{61}" // "a"
CharacterSet.alphanumerics.contains("\u{61}") // true
"a".addingPercentEncoding(withAllowedCharacters: .alphanumerics) // "a"
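Why is that? The character is in the allowed set I passed in, so it shouldn't be replaced, right?
I have a feeling this is related to the fact that "a" (U+0061) is also 0x61 in UTF-8, whereas "æ" (U+00E6) is [0xC3, 0xA6] in UTF-8, not 0xE6. Or is it that it takes up more than 1 byte?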
String(data: Data([0x61]), encoding: .utf8)! // "a"
String(data: Data([0xC3, 0xA6]), encoding: .utf8)! // "æ"
String(data: Data([0xE6]), encoding: .utf8)! // crashes
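The difference between code point and UTF-8 bytes can be inspected directly. The following is a minimal sketch; the constant names are only illustrative, and it uses optional values rather than force-unwrapping so the invalid case prints a message instead of crashing.

```swift
import Foundation

// "æ" (U+00E6) is stored as two UTF-8 bytes; "a" (U+0061) as one.
let bytesOfAE = Array("æ".utf8)
let bytesOfA = Array("a".utf8)

// The failable initializer returns nil for invalid UTF-8;
// it is the force unwrap (!) that turns this into a crash.
let valid = String(data: Data([0xC3, 0xA6]), encoding: .utf8)
let invalid = String(data: Data([0xE6]), encoding: .utf8)

print(bytesOfAE.map { String(format: "0x%02X", $0) }) // ["0xC3", "0xA6"]
print(bytesOfA.map { String(format: "0x%02X", $0) })  // ["0x61"]
print(valid ?? "not valid UTF-8")                     // æ
print(invalid ?? "not valid UTF-8")                   // not valid UTF-8
```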
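Update
Is it because the percent-encoding algorithm converts the string to Data and walks through it 1 byte at a time? It would look at 0xC3, which is not an allowed character, so that byte gets percent-encoded. Then it would look at 0xA6, which is also not an allowed character, so that byte gets percent-encoded too. So technically, an allowed character has to be a single byte?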
Solution
A truly allowed character must be in the allowed character set and also be an ASCII character. Thanks to @alobaili for pointing this out.
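To make the rule concrete, here is a rough model of the observed behaviour. It is a sketch only: manualPercentEncode is a hypothetical helper written for illustration, not Foundation's actual implementation. It walks the UTF-8 bytes and keeps a byte only when it is ASCII and its scalar is in the allowed set. The ASCII check matters, because the byte 0xC3 read as a scalar is "Ã" (U+00C3), which is itself a member of .alphanumerics.

```swift
import Foundation

// Hypothetical helper (not part of Foundation): models the observed rule
// "kept only if ASCII and in the allowed set", one UTF-8 byte at a time.
func manualPercentEncode(_ input: String, allowed: CharacterSet) -> String {
    var result = ""
    for byte in input.utf8 {
        let scalar = Unicode.Scalar(byte)
        if byte < 0x80 && allowed.contains(scalar) {
            // ASCII byte that the set allows: keep it as-is.
            result.append(Character(scalar))
        } else {
            // Everything else becomes %XX for this single byte.
            result += String(format: "%%%02X", byte)
        }
    }
    return result
}

print(manualPercentEncode("æ", allowed: .alphanumerics)) // %C3%A6
print(manualPercentEncode("a", allowed: .alphanumerics)) // a
```

For these two inputs the model agrees with addingPercentEncoding(withAllowedCharacters:), which is consistent with non-ASCII members of the set being ignored.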
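If you're curious, the predefined set CharacterSet.alphanumerics contains 129172 characters in total, but only 62 of them are truly allowed when the set is passed to the String method addingPercentEncoding(withAllowedCharacters:).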
You can quickly check all the truly allowed characters in a particular CharacterSet like this:
func inspect(charSet: CharacterSet) {
    var characters: [String] = []
    for char: UInt8 in 0..<128 { // ASCII range
        let u = UnicodeScalar(char)
        if charSet.contains(u) {
            characters.append(String(u))
        }
    }
    print("Characters:", characters.count)
    print(characters)
}
inspect(charSet: .alphanumerics) // [a-z, A-Z, 0-9]
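This is handy because you can't simply iterate over a CharacterSet, and it is useful to know exactly which elements are allowed. For example, the documentation for the predefined CharacterSet.urlQueryAllowed only says:
Returns the character set for characters allowed in a query URL component.
With the function above, we can find out what those allowed characters actually are: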
inspect(charSet: .urlQueryAllowed)
// Characters: 81
// ["!", "$", "&", "\'", "(", ")", "*", "+", ",", "-", ".", "/", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", ":", ";", "=", "?", "@", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", "_", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "~"]
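One practical consequence of the list above: "&", "=" and "+" are all in .urlQueryAllowed, so they pass through unescaped. The following sketch, in which queryValueAllowed is a hypothetical name and the choice of characters to remove is an assumption for illustration, shows how the inspected result can guide a narrower set for a single query value.

```swift
import Foundation

// Hypothetical set for one query value: start from .urlQueryAllowed and
// remove delimiters that would otherwise be left unescaped.
var queryValueAllowed = CharacterSet.urlQueryAllowed
queryValueAllowed.remove(charactersIn: "&=+")

let raw = "a&b=c d"
let loose = raw.addingPercentEncoding(withAllowedCharacters: .urlQueryAllowed)
let strict = raw.addingPercentEncoding(withAllowedCharacters: queryValueAllowed)

print(loose ?? "nil")  // a&b=c%20d
print(strict ?? "nil") // a%26b%3Dc%20d
```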
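Just for fun
There is another (long but reliable) way. It looks at every character in the set (not just the ASCII ones) and compares the string made of the character itself with the result of percent-encoding that string using an allowed set that contains only that character. When the two are equal, you know the character is truly allowed. The code is adapted from this helpful article.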
func inspect(charSet: CharacterSet) {
    var characters: [String] = []
    var allowed: [String] = []
    var asciiCount = 0
    for plane: UInt8 in 0..<17 {
        if charSet.hasMember(inPlane: plane) {
            let planeStart = UInt32(plane) << 16
            let nextPlaneStart = (UInt32(plane) + 1) << 16
            for char: UTF32Char in planeStart..<nextPlaneStart {
                if let u = UnicodeScalar(char), charSet.contains(u) {
                    let s = String(u)
                    characters.append(s)
                    if s.addingPercentEncoding(withAllowedCharacters: CharacterSet([u])) == s {
                        allowed.append(s)
                    }
                    if u.isASCII {
                        asciiCount += 1
                    }
                }
            }
        }
    }
    print("Characters:", characters.count)
    print("Allowed:", allowed.count)
    print("ASCII:", asciiCount)
}
inspect(charSet: .alphanumerics)
// Characters: 129172
// Allowed: 62
// ASCII: 62