
Problem description

Suppose

txt='Daniel Johnson and Ana Hickman are friends. They know each other for a long time. Daniel Johnson is a professor and Ana Hickman is writer.'

is a large block of text, and I want to remove a long list of strings, for example

removalLists=['Daniel Johnson','Ana Hickman']

from it. In other words, I want to replace every element of the list with

' '

I know I can do this easily with a loop, for example

for string in removalLists:
    txt=re.sub(string,' ',txt)

I would like to know whether there is a faster way to do this.

Tags: regex, python-3.x

Solution


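One approach is to build a single regular expression that is an alternation of all the terms to be replaced, so the text is scanned once instead of once per term. For the example above, the pattern would be: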

\bDaniel Johnson\b|\bAna Hickman\b

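To generate it, first wrap each term in word boundaries (\b). Then join the list into a single string, using | as the separator. Finally, use re.sub to replace every occurrence of any term with a single space.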

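import re

txt = 'Daniel Johnson and Ana Hickman are friends. They know each other for a long time. Daniel Johnson is a professor and Ana Hickman is writer.'
removalLists = ['Daniel Johnson','Ana Hickman']

# Wrap each term in word boundaries and join the terms with | into one alternation.
regex = '|'.join([r'\b' + s + r'\b' for s in removalLists])
# A single pass over the text replaces every matched term with one space.
output = re.sub(regex, " ", txt)

print(output)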

  and   are friends. They know each other for a long time.   is a professor and   is writer.
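
The pattern above is built from the raw terms, so a term containing a regex metacharacter such as . or ( would change what the pattern matches. The following is a minimal sketch of a more defensive variant, not part of the original answer: the function name remove_terms and its replacement parameter are hypothetical, and it assumes the terms begin and end with word characters, since \b behaves differently next to punctuation. It escapes each term with re.escape, tries longer terms first so that a short term cannot pre-empt a longer one that starts with it, and compiles the pattern once.

```python
import re

def remove_terms(text, terms, replacement=" "):
    """Replace every whole-word occurrence of any term with `replacement`.

    Hypothetical helper for illustration; assumes each term starts and
    ends with a word character so that \\b anchors behave as expected.
    """
    if not terms:
        # An empty alternation would match the empty string everywhere.
        return text
    # Longest terms first: alternation picks the first branch that matches.
    ordered = sorted(terms, key=len, reverse=True)
    # re.escape makes characters such as '.' match literally.
    pattern = re.compile("|".join(r"\b" + re.escape(t) + r"\b" for t in ordered))
    return pattern.sub(replacement, text)

txt = ('Daniel Johnson and Ana Hickman are friends. They know each other '
       'for a long time. Daniel Johnson is a professor and Ana Hickman is writer.')
print(remove_terms(txt, ['Daniel Johnson', 'Ana Hickman']))
```

If the same list of terms is applied to many texts, it may be worth building the compiled pattern once outside the function and reusing it. Whether this is actually faster than the loop for a given list and text is best checked by measuring, for example with timeit.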
