首页 > 解决方案 > 如何在标点符号之间找到包含特定搜索词的句段

问题描述

我需要做一些自然语言处理,这需要我找到包含某个搜索词的所有句段。为此,我想获取包含搜索词的任何标点符号之间的所有单词。例如,我可以使用下面的代码轻松获取搜索词前后的单词。我也可以编写更复杂的逻辑来分解它,但我想弄清楚我是否可以使用一行正则表达式来做到这一点。

我尝试了一堆不同的正则表达式前瞻和后瞻模式组合,结果各不相同,但没有一个能达到我想要的结果。我可以得到两个相同标点符号之间的所有内容,例如两个句点之间的所有内容(即 starlist = re.findall(r'([^.] ?Star Trek[^.] .)',s) 问题似乎是当我尝试使用组时,例如 [.;:,]?有人知道如何解决这个问题吗?

s = 'Star Trek is an American media franchise, based on the science fiction television series; created by Star Trek legend Gene Roddenberry. The first television series, simply called Star Trek, and now referred to as "The Original Series", debuted in 1966 and aired for three seasons on NBC. It followed the Star Trek adventures of Captain James T. Kirk (William Shatner), and his crew aboard the starship USS Enterprise, a star exploration vessel built by the United Federation of Planets in the 23rd century. The Star Trek canon includes The Original Series: five Star Trek spin-off television series; an animated series; the Star Trek film franchise; and further adaptations in several media.'

starlist = re.findall('\w+ Star Trek \w+',s) #Successfully finds the word before and after

for x in starlist:
    print(x)

如果我使用上面的代码,我会得到以下结果:

由星际迷航传奇

星际迷航历险记

星际迷航佳能

五次星际迷航旋转

星际迷航电影

但是,我想得到以下结果:

星际迷航是美国媒体特许经营权

由星际迷航传奇人物 Gene Roddenberry 创建

简称星际迷航

它跟随詹姆斯·T·柯克船长(威廉·夏特纳)的星际迷航冒险

星际迷航佳能包括原始系列

五部星际迷航衍生电视剧

星际迷航电影专营权

标签: python-3.xregex-lookarounds

解决方案


繁荣,请破坏我的完美声誉得分。

import re

s = 'Star Trek is an American media franchise, based on the science fiction television series; created by Star Trek legend Gene Roddenberry. The first television series, simply called Star Trek, and now referred to as "The Original Series", debuted in 1966 and aired for three seasons on NBC. It followed the Star Trek adventures of Captain James T. Kirk (William Shatner), and his crew aboard the starship USS Enterprise, a star exploration vessel built by the United Federation of Planets in the 23rd century. The Star Trek canon includes The Original Series: five Star Trek spin-off television series; an animated series; the Star Trek film franchise; and further adaptations in several media.'

matches = re.findall('[.,]*[\w\s\']*Star Trek[\w\s\'-]*[,.]*',s)

for i,j in enumerate(matches):
    print(i,j)

推荐阅读