python-3.x - How to find the segments between punctuation marks that contain a specific search term
Question
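I need to do some natural language processing that requires finding every sentence segment containing a given search term. To do this, I want to capture all the words between any two punctuation marks that surround the search term. For example, I can easily get the word before and the word after the search term with the code below. I could also write more elaborate logic to break the text apart, but I would like to know whether this can be done with a single regular expression.
I have tried a number of lookahead and lookbehind combinations with varying results, but none of them gives me what I want. I can get everything between two identical punctuation marks, for example everything between two periods (i.e. starlist = re.findall(r'([^.]*?Star Trek[^.]*\.)',s)). The problem seems to arise when I try to use a set of characters such as [.;:,]. Does anyone know how to solve this?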
s = 'Star Trek is an American media franchise, based on the science fiction television series; created by Star Trek legend Gene Roddenberry. The first television series, simply called Star Trek, and now referred to as "The Original Series", debuted in 1966 and aired for three seasons on NBC. It followed the Star Trek adventures of Captain James T. Kirk (William Shatner), and his crew aboard the starship USS Enterprise, a star exploration vessel built by the United Federation of Planets in the 23rd century. The Star Trek canon includes The Original Series: five Star Trek spin-off television series; an animated series; the Star Trek film franchise; and further adaptations in several media.'
import re
starlist = re.findall(r'\w+ Star Trek \w+', s)  # successfully finds the word before and after
for x in starlist:
    print(x)
If I use the code above, I get the following results:
by Star Trek legend
the Star Trek adventures
The Star Trek canon
five Star Trek spin
the Star Trek film
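However, I would like to get the following results:
Star Trek is an American media franchise
created by Star Trek legend Gene Roddenberry
simply called Star Trek
It followed the Star Trek adventures of Captain James T. Kirk (William Shatner)
The Star Trek canon includes The Original Series
five Star Trek spin-off television series
the Star Trek film franchise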
Answer
Boom. Go ahead and ruin my perfect reputation score.
import re
s = 'Star Trek is an American media franchise, based on the science fiction television series; created by Star Trek legend Gene Roddenberry. The first television series, simply called Star Trek, and now referred to as "The Original Series", debuted in 1966 and aired for three seasons on NBC. It followed the Star Trek adventures of Captain James T. Kirk (William Shatner), and his crew aboard the starship USS Enterprise, a star exploration vessel built by the United Federation of Planets in the 23rd century. The Star Trek canon includes The Original Series: five Star Trek spin-off television series; an animated series; the Star Trek film franchise; and further adaptations in several media.'
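# Optional leading punctuation, then words/spaces/apostrophes around the term,
# then optional trailing punctuation (a raw string avoids invalid-escape warnings)
matches = re.findall(r"[.,]*[\w\s']*Star Trek[\w\s'-]*[,.]*", s)
for i, j in enumerate(matches):
    print(i, j)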
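The pattern above keeps the surrounding punctuation in each match and stops at the period in "James T.", so it does not quite reproduce the list in the question. One possible alternative, sketched here under stated assumptions rather than as the answerer's method, is to split the text on the delimiters and keep only the pieces that contain the term. The names `BOUNDARY` and `segments_containing` are hypothetical, and the rule that a period after a single capital letter is an initial rather than a boundary is a heuristic chosen for this sample text.

```python
import re

# Hypothetical helper, not part of the original answer.
# A boundary is one of , ; : or a period, but only when it is followed by
# whitespace or the end of the text. A period directly after a lone capital
# letter (an initial such as "T.") is not treated as a boundary (heuristic).
BOUNDARY = re.compile(r"(?:[,;:]|(?<!\b[A-Z])\.)(?=\s|$)")

def segments_containing(text, term):
    """Return the punctuation-delimited segments of text that contain term."""
    return [part.strip() for part in BOUNDARY.split(text) if term in part]

sample = ('Star Trek is an American media franchise, based on the science '
          'fiction television series; created by Star Trek legend Gene '
          'Roddenberry. It followed the Star Trek adventures of Captain '
          'James T. Kirk (William Shatner), and his crew.')

for segment in segments_containing(sample, 'Star Trek'):
    print(segment)
```

If a single `findall` is preferred, a negated character class such as `r'[^.;:,]*Star Trek[^.;:,]*'` also handles the mixed delimiters, but on this text it would cut the fourth segment at the period after the initial "T".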