python - Use re to duplicate \p{P} from regex?
Problem description
Is there a way to duplicate the Unicode category matching capability of regex using just re? I have an re match string which identifies words (r'\b[^\W\d_]+\b') which I would like to amend so that punctuation which is attached to the word (i.e. has no non-punctuation characters intervening between the word and the character) is included in the match. Using regex I would do r'\b[^\W\d_]+\b\p{P}*', but I cannot be sure that regex will be installed on all the systems to which the final script will be deployed, and thus would like to rework the match condition to be entirely re compatible. Is that possible, and if so, how would I do it?
Solution
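To duplicate the \p{P} functionality with re, you have to build the character set yourself using the unicodedata module, which means filtering every code point manually: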
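import re
import sys
from unicodedata import category

# Collect every code point whose Unicode general category starts with 'P'
# (Pc, Pd, Ps, Pe, Pi, Pf, Po), escaped for use inside a character class.
p_class = re.escape(''.join(
    c for c in map(chr, range(sys.maxunicode + 1))
    if category(c).startswith('P')))

# A word, followed by any punctuation attached directly to it.
pattern = re.compile(rf'\b[^\W\d_]+\b[{p_class}]*')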
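As a quick sanity check, the sketch below rebuilds the same character class and runs the pattern over a made-up sample string. The helper name build_punct_class and the sample text are illustrative assumptions, not part of the original answer.

```python
import re
import sys
from unicodedata import category


def build_punct_class():
    """Return an escaped string of all code points in a Unicode P* category.

    Hypothetical helper. It scans the whole code point range once
    (about 1.1 million characters), so build it once and reuse the result.
    """
    return re.escape(''.join(
        c for c in map(chr, range(sys.maxunicode + 1))
        if category(c).startswith('P')))


word_with_punct = re.compile(rf'\b[^\W\d_]+\b[{build_punct_class()}]*')

sample = 'Hello, world! (test)... cost$'
matches = word_with_punct.findall(sample)
print(matches)  # ['Hello,', 'world!', 'test)...', 'cost']
```

The trailing $ is not captured because its general category is Sc (currency symbol), not one of the P categories.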
Personally, I would just install regex at this point rather than trying to build a huge character set by hand.
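If regex can be treated as an optional dependency, one middle ground is to use it when it is importable and fall back to the hand-built class otherwise. This is only a minimal sketch under that assumption; the name WORD_WITH_PUNCT is hypothetical.

```python
import sys
from unicodedata import category

try:
    import regex  # third-party; used only if it happens to be installed
except ImportError:
    regex = None

if regex is not None:
    # regex understands Unicode property classes such as \p{P} directly.
    WORD_WITH_PUNCT = regex.compile(r'\b[^\W\d_]+\b\p{P}*')
else:
    import re

    # Fallback: enumerate all punctuation code points for the stdlib engine.
    p_class = re.escape(''.join(
        c for c in map(chr, range(sys.maxunicode + 1))
        if category(c).startswith('P')))
    WORD_WITH_PUNCT = re.compile(rf'\b[^\W\d_]+\b[{p_class}]*')

print(WORD_WITH_PUNCT.findall('Wait... what?!'))  # ['Wait...', 'what?!']
```

The two branches may rely on different versions of the Unicode database, so they could disagree on recently added characters; for common punctuation they should behave the same.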
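You may want to do some statistical analysis of which characters you actually need to match, rather than all punctuation, to narrow the set down. Alternatively, express it as "not a word character or whitespace" with [^\w\s]*, which is broader but faster to match.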
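To show what "broader" means in practice, here is a small sketch of the [^\w\s]* variant applied to an invented sample string; the variable names are illustrative.

```python
import re

# A word, then any run of characters that are neither word characters
# nor whitespace. This covers punctuation and also symbols.
broad = re.compile(r'\b[^\W\d_]+\b[^\w\s]*')

sample = 'Hello, world! cost$ x+y'
print(broad.findall(sample))  # ['Hello,', 'world!', 'cost$', 'x+', 'y']
```

Symbols such as $ and + are swept in here even though they are not in a P category. Conversely, the underscore is never matched by [^\w\s] because it counts as a word character, although its Unicode category is Pc.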