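regex - python regex: How to get the smallest substring from a certain word to the end of the text?

Problem description

I am analyzing a text and want to extract the smallest substring that runs from an occurrence of a certain word to the end of the text. My particular problem is that this word can appear in several places in the text.

I tried the following: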

pattern = re.compile('(word)(.*?)$', re.DOTALL)
result = re.search(pattern, MY_TEXT).group()

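My problem is that this does not return the smallest possible string but the largest one found in the text (that is, from the first occurrence of word to the end of the text, rather than from the last occurrence). I was sure that adding the ? character after .* inside the second pair of parentheses would solve the problem, but it did not.

Example input: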

text = "Pokémon is a media franchise managed by The Pokémon Company, a Japanese consortium between Nintendo, Game Freak, and Creatures.\nThe franchise began as Pokémon Red and Green (later released outside of Japan as Pokémon Red and Blue)."
word = 'Pokémon'

I want my result to be the string Pokémon Red and Blue)., but right now I get the whole text.

How can I get what I expect? Thanks in advance.

Tags: regex, python-3.x

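Your current pattern (Pokémon)(.*?)$ has 2 capture groups. It will only match the first occurrence of the word, because the regex engine starts the match at the leftmost position where it can succeed, and the second group then matches until the end of the string.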
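A quick way to see this behaviour is to run the original pattern on a shortened text. The sketch below uses a made-up sample string (not the one from the question) and suggests that the lazy quantifier only limits how far the match extends to the right; it does not move the starting point.

```python
import re

# Hypothetical short sample, chosen only for illustration.
sample = "Pokémon Red\nPokémon Blue"

# The original attempt: a lazy quantifier followed by an end anchor.
lazy = re.compile(r"(Pokémon)(.*?)$", re.DOTALL)
match = lazy.search(sample)

# The scan begins at index 0, where "Pokémon" first occurs. The lazy group
# is then forced to expand until "$" can match at the end of the text.
print(match.start())        # 0
print(repr(match.group()))  # the whole sample, not just the last part
```

The lazy `.*?` would only make a difference if something other than the end of the text could terminate the match earlier.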

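To get to the last occurrence of the word, you can use .*Pokémon, as .* will first match until the end of the string and then backtrack until it can fit Pokémon.

The rest of the string is then matched by the following .*, and the value is in the first capture group.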

^.*(Pokémon .*)$

Regex demo | Python demo
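Applied to the input from the question, the pattern can be checked as follows. This is a small sketch; it assumes the re.DOTALL flag is set, since the text contains a newline that . would not otherwise cross.

```python
import re

text = "Pokémon is a media franchise managed by The Pokémon Company, a Japanese consortium between Nintendo, Game Freak, and Creatures.\nThe franchise began as Pokémon Red and Green (later released outside of Japan as Pokémon Red and Blue)."

# re.DOTALL lets "." match the newline inside the text, so the greedy ".*"
# can run to the very end before backtracking to the last "Pokémon ".
match = re.search(r"^.*(Pokémon .*)$", text, re.DOTALL)
result = match.group(1) if match else None
print(result)  # Pokémon Red and Blue).
```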

Creating a more dynamic pattern

import re
text = "Pokémon is a media franchise managed by The Pokémon Company, a Japanese consortium between Nintendo, Game Freak, and Creatures.\nThe franchise began as Pokémon Red and Green (later released outside of Japan as Pokémon Red and Blue)."
word = "and"
# re.escape protects any regex metacharacters in the word;
# the greedy .* backtracks to the last occurrence of it.
pattern = r"^.*(" + re.escape(word) + r".*)$"
regex = re.compile(pattern, re.DOTALL)
result = regex.search(text).group(1)
print(result)

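Result

and Blue).

If the word can also be the last word in the sentence, you can assert that what follows on the right is not a non-whitespace character by using a negative lookahead, (?!\S).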

^.*(Pokémon(?!\S).*)$

Regex demo
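The lookahead can also be combined with a dynamically built pattern. The following is a minimal sketch; the helper name tail_from_last and the sample strings are assumptions made for illustration and are not part of the original answer.

```python
import re

def tail_from_last(word, text):
    """Return the text from the last occurrence of `word` that is followed
    by whitespace or the end of the text, or None if there is none."""
    # Greedy .* runs to the end and backtracks to the last place where
    # `word` matches and is not followed by a non-whitespace character.
    pattern = re.compile(r"^.*(" + re.escape(word) + r"(?!\S).*)$", re.DOTALL)
    match = pattern.search(text)
    return match.group(1) if match else None

print(tail_from_last("and", "salt and pepper, candy and"))  # and
print(tail_from_last("and", "bands and brandy"))            # and brandy
print(tail_from_last("and", "brandy"))                      # None
```

Note that (?!\S) only constrains the right-hand side of the word, so an occurrence at the end of a longer word (for example "and" in "band") would still be accepted.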
