Regex - Negative Lookahead to match strings with any non-Chinese UTF character

Problem description

Tags: python, python-3.x, regex, regex-lookarounds

Solution


First of all, \u20000 doesn't mean what you think it does. Because \u sequences must be exactly 4 hex digits long, it refers to U+2000 followed by the digit 0. For characters above U+FFFF, Python provides \U, which must be followed by exactly 8 hex digits (e.g. \U00020000).
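The difference between the two escapes can be checked directly in a Python 3 interpreter. The following is a small sketch; the variable names are only illustrative.

```python
# "\u20000" is parsed as the 4-digit escape \u2000 plus a literal "0".
short_escape = "\u20000"

# "\U00020000" is a single 8-digit escape for the code point U+20000.
long_escape = "\U00020000"

# Show the length and the code points of each string.
print(len(short_escape), [hex(ord(c)) for c in short_escape])
print(len(long_escape), [hex(ord(c)) for c in long_escape])
```

The first string has two characters and the second has one, which is why a range written as \u20000-\u2A6DF does not cover what it appears to cover.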


Secondly,

[A-B]|[C-D]|...

is best written as

[A-BC-D...]
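As a quick illustration of that equivalence, here is a sketch with two small ASCII ranges standing in for A-B and C-D (the ranges and the sample string are made up for the example):

```python
import re

# Alternation of single-range classes.
alternation = re.compile(r"(?:[a-c]|[x-z])")

# The same set of characters merged into one class.
merged = re.compile(r"[a-cx-z]")

sample = "abcdwxyz"
alt_hits = alternation.findall(sample)
merged_hits = merged.findall(sample)

# Both patterns pick out exactly the same characters.
print(alt_hits == merged_hits)
```

The merged form is shorter and can also be negated as a whole with [^...], which the alternation cannot.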

With the above fix and the above simplification, we have this:

[\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF\U00020000-\U0002A6DF\U0002A700-\U0002B73F\U0002B740-\U0002B81F\U0002B820-\U0002CEAF\U0002F800-\U0002FA1F]

There are two ways of approaching the problem:

  1. Does the string contain only characters from that class?

    is_just_han = re.search("^[...]*$", s)     # or regex.search; s is the string to test
    
  2. Does the string contain a character from outside of that class?

    is_just_han = not re.search("[^...]", s)   # or regex.search; s is the string to test
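Both approaches can be sketched with the full class filled in. The helper names below are hypothetical, and the first range is taken to end at U+4DBF, the last code point of the CJK Extension A block. The sketch uses fullmatch in place of the ^...$ anchors, which is a small substitution rather than the exact form shown above.

```python
import re

# Ranges of the character class; U+3400-U+4DBF is CJK Extension A.
HAN = (
    "\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF"
    "\U00020000-\U0002A6DF\U0002A700-\U0002B73F\U0002B740-\U0002B81F"
    "\U0002B820-\U0002CEAF\U0002F800-\U0002FA1F"
)

ONLY_HAN = re.compile(f"[{HAN}]*")   # approach 1: every character is in the class
NON_HAN = re.compile(f"[^{HAN}]")    # approach 2: look for one character outside it


def is_just_han_v1(text):
    """Approach 1: the whole string consists of characters from the class."""
    return ONLY_HAN.fullmatch(text) is not None


def is_just_han_v2(text):
    """Approach 2: no character from outside the class is found."""
    return NON_HAN.search(text) is None


for candidate in ["漢字", "漢字abc", "\U00020000", ""]:
    print(repr(candidate), is_just_han_v1(candidate), is_just_han_v2(candidate))
```

One reason to prefer fullmatch here: in the re module, $ also matches just before a trailing newline, so "^[...]*$" would accept a string that ends in "\n", while fullmatch and the negated-class search both reject it.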
    

If you use the regex module instead of the re module, you gain access to \p{Han} (short for \p{Script=Han}) and its negation \P{Han} (short for \P{Script=Han}). This Unicode property is a close match for the characters you are trying to match. I'll let you determine if it's right for you or not.

is_just_han = regex.search(r"^\p{Han}*$", s)

is_just_han = not regex.search(r"\P{Han}", s)
