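python - RegEx for capturing scientific citations
Problem description
I'm trying to capture parenthesized text that contains at least one digit (think citations). This is my current regex, and it works fine on regex101: https://regex101.com/r/oOHPvO/5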
\((?=.*\d).+?\)
So I want it to capture (Author 2000) and (2000), but not (Author).
I'm trying to capture all of these parentheses with Python, but in Python it also captures parenthesized text that contains no digits.
import re

with open('text.txt') as f:
    text = f.read()

# Raw string so the backslashes reach the regex engine unchanged
s = r"\((?=.*\d).*?\)"
citations = re.findall(s, text)
citations = list(set(citations))
for c in citations:
    print(c)
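Any idea what I'm doing wrong?
Solution
Probably the most reliable way to handle this is to add boundaries to the expression, since it is likely to grow over time. For example, we can spell out character lists for each part of the data we want to collect: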
(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\)).
Demo
Test
# coding=utf8
# The tag above declares the file encoding and only matters for Python 2.x compatibility
import re

regex = r"(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\))."
test_str = "some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author, 2000) some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author; 2000)"

# IGNORECASE lets [a-z] match "Author"; MULTILINE has no effect here (the pattern has no ^ or $)
matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)

for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    # Capturing groups are numbered from 1; group 0 is the whole match
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
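To tie this back to the question's goal of collecting unique citations, here is a minimal sketch that reuses the same pattern. The `sample` string is made up for illustration. Because the pattern contains capturing groups, `re.findall` would return tuples of groups rather than the full text, so the sketch uses `re.finditer` with `group(0)` instead.

```python
import re

# The pattern from the answer, unchanged
pattern = re.compile(r"(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\)).", re.IGNORECASE)

# Hypothetical sample text, not taken from the question's file
sample = "see (Author) and (Author 2000), also (Author, 2000), (2000) and (Author 2000)"

# group(0) is the whole citation including both parentheses;
# the set comprehension drops repeated citations
unique = sorted({m.group(0) for m in pattern.finditer(sample)})

for citation in unique:
    print(citation)
```

Note that with this pattern a bare year such as (2000) is not matched, because `([a-z]+)` requires at least one letter. If bare years should be captured too, the letter and separator groups would need to be made optional.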
Demo
const regex = /(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\))./mgi;
const str = `some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author, 2000) some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author; 2000)`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }

    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
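As for why the original expression also returns parentheses without digits in Python: a likely cause is that the lookahead `(?=.*\d)` is not confined to the parentheses. `.*` can run past the closing `)` to any digit later on the same line, so `(Author)` passes whenever a digit follows it somewhere on that line. The sketch below reproduces this on a made-up one-line string and compares it with a bounded variant, `\([^()]*\d[^()]*\)`. That variant is an alternative we are suggesting here, not the pattern from the answer above, and it assumes citations contain no nested parentheses.

```python
import re

# Hypothetical sample: everything on one line, as in a file with long paragraphs
text = "First (Author) then (Author 2000), later (2000) and (Other; 1999)."

# Original pattern: the lookahead may find its digit beyond the closing parenthesis
leaky = re.findall(r"\((?=.*\d).+?\)", text)

# Bounded variant: [^()]* cannot cross a parenthesis, so the digit must sit inside
bounded = re.findall(r"\([^()]*\d[^()]*\)", text)

print(leaky)    # includes (Author), which has no digit
print(bounded)  # only parentheses that contain a digit
```

On regex101 each test string usually sits on its own line, so the lookahead has no later digit to find, which would explain why the same pattern appears to work there.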
RegEx Circuit
jex.im visualizes regular expressions.