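python - Debugging invalid UTF-8 characters in Python 3
Question
I'm trying to debug why some strings in my Python 3 script contain non-UTF-8 characters. I found this script, which is supposed to identify such characters:
The site provides Python code for it: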
regex = r"""
(?:
[\xC0-\xC1] # Invalid UTF-8 Bytes
| [\xF5-\xFF] # Invalid UTF-8 Bytes
| \xE0[\x80-\x9F] # Overlong encoding of prior code point
| \xF0[\x80-\x8F] # Overlong encoding of prior code point
| [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
| [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
| [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
| (?<=[\x0-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
| (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF] # Overlong Sequence
| (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte sequence
| (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte sequence
| (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte sequence (2)
)
"""
def stripNonUtf8(str):
    matches = re.search(regex, str, re.VERBOSE)
    if matches:
        print("Match was found at {start}-{end}: {match}".format(start = matches.start(), end = matches.end(), match = matches.group()))
But I get the following error:
Traceback (most recent call last):
File "log2db.py", line 330, in <module>
main()
File "log2db.py", line 325, in main
stripNonUtf8("aaa")
File "log2db.py", line 38, in stripNonUtf8
matches = re.search(regex, str, re.VERBOSE)
File "C:\ProgramData\Anaconda3\lib\re.py", line 183, in search
return _compile(pattern, flags).search(string)
File "C:\ProgramData\Anaconda3\lib\re.py", line 286, in _compile
p = sre_compile.compile(pattern, flags)
File "C:\ProgramData\Anaconda3\lib\sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
File "C:\ProgramData\Anaconda3\lib\sre_parse.py", line 930, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
File "C:\ProgramData\Anaconda3\lib\sre_parse.py", line 426, in _parse_sub
not nested and not items))
File "C:\ProgramData\Anaconda3\lib\sre_parse.py", line 816, in _parse
p = _parse_sub(source, state, sub_verbose, nested + 1)
File "C:\ProgramData\Anaconda3\lib\sre_parse.py", line 426, in _parse_sub
not nested and not items))
File "C:\ProgramData\Anaconda3\lib\sre_parse.py", line 736, in _parse
p = _parse_sub(source, state, verbose, nested + 1)
File "C:\ProgramData\Anaconda3\lib\sre_parse.py", line 426, in _parse_sub
not nested and not items))
File "C:\ProgramData\Anaconda3\lib\sre_parse.py", line 536, in _parse
code1 = _class_escape(source, this)
File "C:\ProgramData\Anaconda3\lib\sre_parse.py", line 309, in _class_escape
raise source.error("incomplete escape %s" % escape, len(escape))
re.error: incomplete escape \x0 at position 411 (line 10, column 11)
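What is going on?
Solution
Unlike C, Python requires a character given by its hexadecimal value to be written with exactly two hex digits, so \x0 is an incomplete escape and must be written \x00.
See the documentation on String and Bytes literals, which notes:
Unlike in Standard C, exactly two hex digits are required.
So the offending line should be changed to: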
| (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
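The two-digit rule can be checked in isolation. The following is a minimal sketch, not part of the original script; the helper name `compiles` is hypothetical. It only tries to compile a one-digit and a two-digit version of the character class with the standard `re` module.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if the standard re module accepts `pattern` (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error as exc:
        # Print the parser's complaint so the failing escape is visible.
        print(f"{pattern!r} rejected: {exc}")
        return False


# Raw strings pass the backslash escapes through to the regex parser unchanged.
one_digit_ok = compiles(r"[\x0-\x7F]")   # \x0 has only one hex digit
two_digit_ok = compiles(r"[\x00-\x7F]")  # \x00 has the required two hex digits

print(one_digit_ok, two_digit_ok)
```

Because the pattern in the question is a raw string, the escape is parsed by `re` rather than by the Python compiler, which is why the failure surfaces as `re.error` at run time.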
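Also, Python's standard re module is fairly limited. For example, it only supports fixed-width look-behind, and the "Overlong Sequence" alternative in the pattern above uses look-behind branches of different widths. You can install the regex module (pip install regex) and use import regex as re to work around these limitations.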
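One such limitation is look-behind whose branches have different widths. The sketch below uses a shortened version of the "Overlong Sequence" alternative (the shortening is an assumption made for illustration) and records whether the standard `re` module accepts it. The third-party `regex` module is tried only if it happens to be installed.

```python
import re

# Shortened look-behind: one branch is 1 character wide, the other is 2.
pattern = r"(?<![\xC2-\xDF]|[\xE0-\xEF][\x80-\xBF])[\x80-\xBF]"

try:
    re.compile(pattern)
    re_error = None
except re.error as exc:
    # The standard module rejects look-behind that is not fixed-width.
    re_error = str(exc)

print("re:", re_error)

# Optional: the third-party regex module, if available, accepts
# variable-width look-behind.
try:
    import regex
except ImportError:
    regex = None

if regex is not None:
    regex.compile(pattern)
    print("regex: compiled")
```

So even after the \x00 fix, the full pattern from the question is likely to need the `regex` module, or a rewrite of that look-behind, before it compiles.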