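python - Matching only unquoted words with a regular expression in Python
Problem description
While processing some code, I needed to find every place where variables from a certain list are used. The problem is that the code is obfuscated, and those variable names can also appear inside strings, for example, which I do not want to match.
However, I have not been able to find a regular expression that matches only unquoted words and works in Python...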
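Solution
"[^\\\\]((\")|('))(?(2)([^\"]|\\\")*|([^']|\\')*)[^\\\\]\\1|(\\w+)"
This should capture any unquoted word in the last group (group 6, i.e. index 5 with 0-based indexing). A small modification is needed to handle quoted strings at the very start of the input, because the pattern expects one non-backslash character before the opening quote.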
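One possible form of that modification, offered only as a sketch: replace the leading [^\\\\] with a non-capturing alternative that also accepts the start of the input. The names ORIGINAL, ANCHORED and words are illustrative and are not part of the original answer.

```python
import re

# Original pattern: one non-backslash character must precede the opening quote.
ORIGINAL = "[^\\\\]((\")|('))(?(2)([^\"]|\\\")*|([^']|\\')*)[^\\\\]\\1|(\\w+)"
# Hypothetical variant: (?:^|[^\\\\]) also accepts the start of the input.
# The added group is non-capturing, so words are still captured in group 6.
ANCHORED = "(?:^|[^\\\\])((\")|('))(?(2)([^\"]|\\\")*|([^']|\\')*)[^\\\\]\\1|(\\w+)"


def words(pattern, text):
    # Index 5 of each findall tuple is group 6 (the unquoted word, or '').
    return [groups[5] for groups in re.findall(pattern, text) if groups[5]]


text = "\"foo\" bar"
print(words(ORIGINAL, text))  # ['foo', 'bar']
print(words(ANCHORED, text))  # ['bar']
```

With the original pattern the leading "foo" is reported as a word, because nothing precedes its opening quote. The anchored variant treats it as a string.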
Explanation:
[^\\\\] Match any character but an escape character. Escaped quotes do not start a string.
((\")|(')) Immediately after the non-escaped character, match either " or ', which starts a string. This is group 1, which contains groups 2 (\") and 3 (')
(?(2) if we matched group 2 (a double-quote)
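([^\"]|\\\")*| match anything but double quotes, or match escaped double quotes. Otherwise:
([^']|\\')*) match anything but a single quote or match an escaped single quote.
Strictly speaking, \\\" in the Python literal reaches the regex engine as \", which matches a bare double quote (likewise \\' matches a bare single quote), so each of these groups can match any character. The end of the string is actually found by backtracking until [^\\\\]\\1 matches, which selects the last matching quote in the input that is not preceded by a backslash.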
If you wish to retrieve the string inside the quotes, you will have to add another group: (([^\"]|\\\")*) will allow you to retrieve the whole consumed string, rather than just the last matched character.
Note that the last character of a quoted string will actually be consumed by the last [^\\\\]. To retrieve it, you have to turn it into a group: ([^\\\\]). Additionally, the first character before the quote will also be consumed by [^\\\\], which might be meaningful in cases such as r"Raw\text".
[^\\\\]\\1 will match any non-escape character followed by whatever the first group matched. That is, if ((\")|(')) matched a double quote, we require a double quote to end the string. Otherwise, it matched a single quote, which is what we require to end the string.
|(\w+) (written \\w+ inside a non-raw Python string literal) will match any word. This alternative only matches outside quoted strings, because quoted strings are consumed by the first alternative.
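The two notes above about retrieving the quoted text can be combined into one sketch. It assumes a variant of the pattern in which a single new group wraps both the string body and the final [^\\\\]. WITH_CONTENT and split_tokens are hypothetical names, and the group numbers differ from the original pattern.

```python
import re

# Hypothetical variant of the pattern: group 4 wraps the string body together
# with the final [^\\\\], so it holds the full text between the quotes.
# The inner repetition groups are non-capturing; unquoted words land in group 5.
WITH_CONTENT = "[^\\\\]((\")|('))((?(2)(?:[^\"]|\\\")*|(?:[^']|\\')*)[^\\\\])\\1|(\\w+)"


def split_tokens(text):
    """Return (quoted string bodies, unquoted words) found in text."""
    strings, words = [], []
    for match in re.finditer(WITH_CONTENT, text):
        if match.group(5) is not None:
            words.append(match.group(5))    # the word alternative matched
        else:
            strings.append(match.group(4))  # the quoted-string alternative matched
    return strings, words


strings, words = split_tokens("say \"hello world\" now 'bye' end")
print(strings)  # ['hello world', 'bye']
print(words)    # ['say', 'now', 'end']
```

The character in front of the opening quote is still consumed by the leading [^\\\\] and is not captured here.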
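For example:
import re
# The pattern is a non-raw string literal, so every regex backslash is doubled (\\w rather than \w).
non_quoted_words = "[^\\\\]((\")|('))(?(2)([^\"]|\\\")*|([^']|\\')*)[^\\\\]\\1|(\\w+)"
quote = "This \"is an example ' \\\" of \" some 'text \\\" like wtf' \\\" is what I said."
print(quote)
# Each tuple holds groups 1-6; unquoted words appear in the last element.
print(re.findall(non_quoted_words, quote))
This will return: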
This "is an example ' \" of " some 'text \" like wtf' \" is what I said.
[('', '', '', '', '', 'This'), ('"', '"', '', 'f', '', ''), ('', '', '', '', '', 'some'), ("'", '', "'", '', 't', ''), ('', '', '', '', '', 'is'), ('', '', '', '', '', 'what'), ('', '', '', '', '', 'I'), ('', '', '', '', '', 'said')]
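To reduce that output to just the unquoted words, the tuples can be filtered on their last element. This is a minimal sketch; the helper name unquoted_words is an assumption and not part of the original answer.

```python
import re

# Same pattern as above; \\w is written with a doubled backslash so the
# non-raw string literal contains no invalid escape sequence.
NON_QUOTED_WORDS = "[^\\\\]((\")|('))(?(2)([^\"]|\\\")*|([^']|\\')*)[^\\\\]\\1|(\\w+)"


def unquoted_words(text):
    """Return only the words captured by the last group (group 6)."""
    # findall yields one 6-tuple per match; index 5 is group 6, and it is
    # the empty string whenever the match was a quoted string instead.
    return [groups[5] for groups in re.findall(NON_QUOTED_WORDS, text) if groups[5]]


quote = "This \"is an example ' \\\" of \" some 'text \\\" like wtf' \\\" is what I said."
print(unquoted_words(quote))  # ['This', 'some', 'is', 'what', 'I', 'said']
```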