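r - R regex to extract TV show names from a text file

Problem description

I am trying to use R to extract TV show names from a txt file.

I have loaded the file and assigned it to a variable named txt. Now I am trying to use a regular expression to pull out the information I want.

The lines I want to extract follow this pattern: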

SHOW: Game of Thrones 7:00 PM EST
SHOW: The Outsider 3:00 PM EST
SHOW: Don't Be a Menace to South Central While Drinking Your Juice In The Hood 10:00 AM EST

And so on. There are about 320 shows, and I want to extract all 320 of them.

So far, I have come up with this:

library(stringr)
pattern <- "SHOW:\\s\\w*"
str_extract_all(txt, pattern)

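However, it does not extract the whole title as I expected. For example, it extracts "SHOW: Game" instead of "SHOW: Game of Thrones". If I only wanted that one show, I could use "SHOW:\\s\\w*\\s\\w*\\s\\w*" to match the number of words, but I want to extract every show in txt, including those with longer and shorter names.

How should I write the regular expression to get the expected result?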

Tags: r, regex, television

Does this work, using lookarounds:

str_extract(st, '(?<=SHOW: )(.*)(?= \\d{1,2}:.. [PA]M ...)')
[1] "Game of Thrones"                                                         
[2] "The Outsider"                                                            
[3] "Don't Be a Menace to South Central While Drinking Your Juice In The Hood"
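As a rough sketch of how the lookaround pattern behaves, the example below applies a slightly stricter variant to the three sample lines from the question. The stricter time and time-zone pieces (`\\d{2}`, `[AP]M`, `[A-Z]{3}`) are my own assumption about the file format, not something the question confirms. It also shows `str_extract_all()` on a single string, in case the whole file was read into one value rather than one element per line.

```r
library(stringr)

# Sample lines copied from the question; a real file would normally be
# read with readLines(), which gives one element per line.
st <- c(
  "SHOW: Game of Thrones 7:00 PM EST",
  "SHOW: The Outsider 3:00 PM EST",
  "SHOW: Don't Be a Menace to South Central While Drinking Your Juice In The Hood 10:00 AM EST"
)

# (?<=SHOW: )  lookbehind: the match must be preceded by "SHOW: "
# .*           the title (greedy, but it backtracks until the lookahead fits)
# (?= ...)     lookahead: the match must be followed by " H:MM AM/PM TZ"
# Assumption: times look like 7:00 PM and the zone is three capital letters.
pattern <- "(?<=SHOW: ).*(?= \\d{1,2}:\\d{2} [AP]M [A-Z]{3})"

# One title per input line (NA for lines that do not match).
titles <- str_extract(st, pattern)

# If the whole file sits in one string, collect every match instead.
txt <- paste(st, collapse = "\n")
titles_all <- str_extract_all(txt, pattern)[[1]]

print(titles)
```

Because neither lookaround consumes characters, only the title itself is returned. By default `.` does not cross line breaks, so each match stays on its own line even in the single-string case.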

With the SHOW: prefix kept:

str_extract(st, '(.*)(?= \\d{1,2}:.. [PA]M ...)')
[1] "SHOW: Game of Thrones"                                                         
[2] "SHOW: The Outsider"                                                            
[3] "SHOW: Don't Be a Menace to South Central While Drinking Your Juice In The Hood"

Data:

st
[1] "SHOW: Game of Thrones 7:00 PM EST"                                                          
[2] "SHOW: The Outsider 3:00 PM EST"                                                             
[3] "SHOW: Don't Be a Menace to South Central While Drinking Your Juice In The Hood 10:00 AM EST"
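If stringr is not available, a capture group with base R's `sub()` should give the same titles. This is only a sketch: the vector `file_lines` stands in for the result of `readLines()` on your own file, and its first element is a hypothetical non-show line added to illustrate the filtering step.

```r
# Stand-in for readLines(<your file>): one element per line of the file.
file_lines <- c(
  "(a hypothetical line that is not a show)",
  "SHOW: Game of Thrones 7:00 PM EST",
  "SHOW: The Outsider 3:00 PM EST",
  "SHOW: Don't Be a Menace to South Central While Drinking Your Juice In The Hood 10:00 AM EST"
)

# The parentheses capture the title; the rest of the pattern anchors it
# between "SHOW: " and a trailing " H:MM AM/PM TZ".
# Assumption: every show line ends with a time such as "7:00 PM EST".
show_pattern <- "^SHOW: (.*) \\d{1,2}:\\d{2} [AP]M [A-Z]{3}$"

# Keep only the lines that match, then replace each line by its captured title.
show_lines <- grep(show_pattern, file_lines, value = TRUE, perl = TRUE)
titles <- sub(show_pattern, "\\1", show_lines, perl = TRUE)

print(titles)
```

Filtering with `grep()` first matters because `sub()` returns non-matching lines unchanged, which would otherwise leak unrelated text into the result.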
