Use re to duplicate \p{P} from regex?

Problem description

Is there a way to duplicate the unicode category matching capability of regex using just re? I have an re match string which identifies words (r'\b[^\W\d_]+\b') which I would like to amend so that punctuation which is attached to the word (i.e. has no non-punctuation characters intervening between the word and the character) is included in the match. Using regex I would do r'\b[^\W\d_]+\b\p{P}*' but I cannot be sure that regex will be installed on all the systems to which the final script will be deployed and thus would like to rework the match condition to be entirely re compatible. Is that possible, and if so how would I do it?

Tags: python, regex, python-3.x

Solution


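To duplicate the \p{P} functionality with re alone, you have to build the character set yourself using the unicodedata module, which means filtering every code point by hand: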

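import re
import sys
from unicodedata import category

# Collect every code point whose Unicode general category starts with 'P'
# (Pc, Pd, Ps, Pe, Pi, Pf, Po), escaped for safe use inside a character class.
p_class = re.escape(''.join(
    c for c in map(chr, range(sys.maxunicode + 1))
    if category(c).startswith('P')))

# A word (letters only), followed by any punctuation attached directly to it.
pattern = re.compile(rf'\b[^\W\d_]+\b[{p_class}]*')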

Personally, I'd just install regex at this point rather than try to build giant character sets by hand.

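You probably want to do some statistical analysis of which characters you actually need to match, rather than all punctuation, so that you can narrow the set down. Alternatively, express it as "not a word character or whitespace" with [^\w\s]*, which is broader (it also matches symbols such as + or $) but matches faster.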
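Both alternatives can be sketched roughly as follows. This is only an illustration: the sample string stands in for whatever representative text you would actually analyse, and the names counts, narrow and broad are made up for the example. It assumes the sample contains at least one punctuation character, since an empty character class would not compile.

```python
import re
from collections import Counter
from unicodedata import category

# Hypothetical stand-in for a representative corpus of your real input.
sample = 'Hello, world!! It costs $5 (really).'

# Count the punctuation (general category P*) that actually occurs in the sample.
counts = Counter(c for c in sample if category(c).startswith('P'))

# Narrowed class: only the punctuation seen in the sample, escaped for a character class.
narrow = re.compile(rf"\b[^\W\d_]+\b[{re.escape(''.join(sorted(counts)))}]*")

# Broad class: anything that is neither a word character nor whitespace.
broad = re.compile(r'\b[^\W\d_]+\b[^\w\s]*')

print(counts.most_common())
print(narrow.findall('a+b, ok!'))  # ['a', 'b,', 'ok!']
print(broad.findall('a+b, ok!'))   # ['a+', 'b,', 'ok!']
```

The two results differ on the + sign: it is a symbol (category Sm), not punctuation, so only the broad pattern attaches it to the preceding word.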

