Use re to duplicate \p{P} from regex?


Is there a way to duplicate the Unicode category matching capability of regex using just re? I have an re pattern that identifies words (r'\b[^\W\d_]+\b'), and I would like to amend it so that punctuation attached to the word (i.e. with no non-punctuation characters between the word and the punctuation) is included in the match. Using regex I would write r'\b[^\W\d_]+\b\p{P}*', but I cannot be sure that regex will be installed on every system the final script is deployed to, so I would like to rework the pattern to be entirely re compatible. Is that possible, and if so, how would I do it?

Tags: python, regex, python-3.x



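import re
import sys
from unicodedata import category

# Collect every code point whose Unicode general category starts with 'P'
# (punctuation), escaped so it is safe inside a character class.
p_class = re.escape(''.join(
    c for c in map(chr, range(sys.maxunicode + 1))
    if category(c).startswith('P')))

# A word, followed by any run of punctuation directly attached to it.
pattern = re.compile(rf'\b[^\W\d_]+\b[{p_class}]*')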


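You may want to do some statistical analysis of the kinds of characters you actually need to match, rather than using all punctuation, to narrow that set. Alternatively, express it as "not a word character or whitespace" with [^\w\s]*, which is broader (it also matches symbols such as $ and +) but faster to match.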
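As a rough sketch of those two alternatives, the snippet below builds a narrow punctuation class from the characters that actually occur in a sample, and compares it with the broader [^\w\s]* form. The `sample` string and the names `narrow` and `broad` are made up for illustration; in practice the sample would be a representative slice of your real input.

```python
import re
from collections import Counter
from unicodedata import category

# Hypothetical sample text, assumed to be representative of the real data.
sample = 'Hello, world! (C++ aside)...'

# Option 1: count the punctuation characters that actually occur and build
# a character class from only those.
counts = Counter(c for c in sample if category(c).startswith('P'))
narrow_class = re.escape(''.join(sorted(counts)))
narrow = re.compile(rf'\b[^\W\d_]+\b[{narrow_class}]*')

# Option 2: anything that is neither a word character nor whitespace.
# This also picks up symbols such as '+', which are not punctuation.
broad = re.compile(r'\b[^\W\d_]+\b[^\w\s]*')

print(narrow.findall(sample))  # ['Hello,', 'world!', 'C', 'aside)...']
print(broad.findall(sample))   # ['Hello,', 'world!', 'C++', 'aside)...']
```

The difference shows up on `C++`: `+` is a math symbol rather than punctuation, so only the broad pattern keeps it attached to the word. Note that the narrow class must not be empty, since `[]*` is not a valid pattern.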
