python - Regex - Negative lookahead to match strings containing any non-Chinese UTF characters
Problem description
Solution
First of all, \u20000 doesn't mean what you think it does. Because \u sequences must be exactly 4 hex digits long, it refers to U+2000 followed by the digit 0. For characters above 0xFFFF, Python provides \U, which must be followed by exactly 8 hex digits (e.g. \U00020000).
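To see the difference for yourself, here is a minimal sketch (the variable names are just for illustration) that inspects what each escape actually produces:

```python
# "\u" consumes exactly four hex digits, so the trailing 0 is a separate character.
short_escape = "\u20000"
# "\U" consumes exactly eight hex digits and can reach code points above 0xFFFF.
long_escape = "\U00020000"

print(len(short_escape), [hex(ord(c)) for c in short_escape])  # 2 ['0x2000', '0x30']
print(len(long_escape), [hex(ord(c)) for c in long_escape])    # 1 ['0x20000']
```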
Secondly,
[A-B]|[C-D]|...
is best written as
[A-BC-D...]
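A quick way to convince yourself the two forms are equivalent is to run both over the same input. This sketch uses small Latin ranges as stand-ins for A-B and C-D; the sample string is made up:

```python
import re

# Alternation of single-range classes versus one merged class.
alternation = re.compile(r"[a-c]|[x-z]")
merged = re.compile(r"[a-cx-z]")

sample = "abcdwxyz"
alt_hits = alternation.findall(sample)
merged_hits = merged.findall(sample)

# Both patterns pick out exactly the same characters.
print(alt_hits == merged_hits)
```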
With the above fix and the above simplification, we have this:
[\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF\U00020000-\U0002A6DF\U0002A700-\U0002B73F\U0002B740-\U0002B81F\U0002B820-\U0002CEAF\U0002F800-\U0002FA1F]
There are two ways of approaching the problem:
Does the string contain only characters from that class?
is_just_han = re.search("^[...]*$", str) # or regex.search
Does the string contain a character from outside of that class?
is_just_han = not re.search("[^...]", str) # or regex.search
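Putting the two approaches together with the full class, a runnable sketch might look like the following. The function and constant names are my own, and I've assumed the first range should end at U+4DBF, the last code point of the CJK Extension A block. I also use fullmatch instead of ^...$, because $ would accept a trailing newline.

```python
import re

# Ranges for the character class; re understands \uXXXX and \UXXXXXXXX itself.
HAN_RANGES = (
    r"\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF"
    r"\U00020000-\U0002A6DF\U0002A700-\U0002B73F"
    r"\U0002B740-\U0002B81F\U0002B820-\U0002CEAF"
    r"\U0002F800-\U0002FA1F"
)

ONLY_HAN = re.compile("[" + HAN_RANGES + "]*")
NON_HAN = re.compile("[^" + HAN_RANGES + "]")


def is_just_han_positive(text):
    """Approach 1: the whole string consists of characters from the class."""
    return ONLY_HAN.fullmatch(text) is not None


def is_just_han_negative(text):
    """Approach 2: no character from outside the class can be found."""
    return NON_HAN.search(text) is None


for candidate in ["中文", "中文abc", "\U00020000", ""]:
    print(repr(candidate), is_just_han_positive(candidate), is_just_han_negative(candidate))
```

Note that both functions treat the empty string as "just Han"; add an explicit length check if that's not what you want.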
If you use the regex module instead of the re module, you gain access to \p{Han} (short for \p{Script=Han}) and its negation \P{Han} (short for \P{Script=Han}). This Unicode property is a close match for the characters you are trying to match. I'll let you determine if it's right for you or not.
is_just_han = regex.search(r"^\p{Han}*$", str)
is_just_han = not regex.search(r"\P{Han}", str)
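For completeness, here is a hedged sketch of the same two checks as small functions. It assumes the third-party regex module is installed (pip install regex) and simply skips the demo when it isn't; the function names are hypothetical.

```python
try:
    import regex  # third-party module, not part of the standard library
except ImportError:
    regex = None  # the demo below is skipped when the module is missing


def is_just_han(text):
    """True if every character of text has Script=Han (empty string included)."""
    return regex.fullmatch(r"\p{Han}*", text) is not None


def has_non_han(text):
    """True if text contains at least one character whose script is not Han."""
    return regex.search(r"\P{Han}", text) is not None


if regex is not None:
    for candidate in ["中文", "中文abc"]:
        print(repr(candidate), is_just_han(candidate), has_non_han(candidate))
```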