Python Regex - Extracting multiple sentences that contain the same keyword

Problem description

import re

regex = r"[^.?!-]*(?<=[.?\s!-])\b(pfs)\b(?=[\s.?!-])[^.?!-]*[.?!-]"

test_str = "pfs alert conf . it is unlikely that we will sign it - pfs of $ 950 filed to driver - we are gathering information"

subst = ""

result = re.sub(regex, subst, test_str, count=0, flags=re.IGNORECASE | re.MULTILINE)

if result:
    print(result)

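As we can see, test_str has two sentences containing the keyword "pfs". However, the Python code above only matches the second one, 'pfs of $ 950 filed to driver'. How can I modify it so that it also picks up 'pfs alert conf'?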

Tags: python, regex

Solution


Consider using nltk instead; IMO it is really a better fit here:

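# sent_tokenize needs the Punkt tokenizer data, e.g. nltk.download("punkt")
# (newer NLTK releases ask for "punkt_tab" instead)
from nltk import sent_tokenize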

test_str = "pfs alert conf . it is unlikely that we will sign it - pfs of $ 950 filed to driver - we are gathering information. some junky words thereafter"
sentences = [sent for sent in sent_tokenize(test_str) if "pfs" in sent]
print(sentences)

This yields (note that the last sentence is left out because it does not contain pfs):

['pfs alert conf .', 
 'it is unlikely that we will sign it - pfs of $ 950 filed to driver - we are gathering information.']
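If you would rather stay with re, the reason the original pattern skips the first sentence is the lookbehind (?<=[.?\s!-]): it needs a character in front of "pfs", and at the very start of the string there is none. Also, re.sub removes the matches rather than returning them. Below is a minimal sketch of one possible fix, assuming sentences are delimited by . ? ! and - as in the question's pattern. The function name sentences_with_keyword is made up for this example, and the negative lookbehind is just one way to also accept the start of the string.

```python
import re

def sentences_with_keyword(text, keyword="pfs"):
    """Return the delimiter-separated sentences that contain `keyword` as a whole word."""
    pattern = re.compile(
        r"[^.?!-]*"            # text before the keyword, inside the same sentence
        r"(?<![^\s.?!-])"      # nothing but a delimiter/space before it (also true at string start)
        r"\b" + re.escape(keyword) + r"\b"
        r"(?=[\s.?!-]|$)"      # followed by a delimiter, whitespace, or end of string
        r"[^.?!-]*"            # rest of the sentence
        r"(?:[.?!-]|$)",       # closing delimiter, or end of string
        flags=re.IGNORECASE,
    )
    # finditer returns the matches themselves; strip the delimiters around each one
    return [m.group(0).strip(" .?!-") for m in pattern.finditer(text)]

test_str = "pfs alert conf . it is unlikely that we will sign it - pfs of $ 950 filed to driver - we are gathering information"
print(sentences_with_keyword(test_str))
# ['pfs alert conf', 'pfs of $ 950 filed to driver']
```

Unlike the nltk result above, this treats " - " as a sentence boundary, so the second match is only the clause around "pfs".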
