R - Regex to extract text between parentheses that contain a keyword

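Problem description

I need to extract the text between parentheses, but only when a keyword appears inside those parentheses.

So if I have a string that looks like this: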

('one', 'CARDINAL'), ('Castro', 'PERSON'), ('Latin America', 'LOC'), ('Somoza', 'PERSON')

and my keyword is "LOC", I want to extract only ('Latin America', 'LOC') and none of the others.

Any help is appreciated!

Here is a sample of my dataset, a csv file:

,speech_id,sentence,date,speaker,file,parsed_text,named_entities
0,950094636,Let me state that the one sure way we can make it easy for Castro to continue to gain converts in Latin America is if we continue to support regimes of the ilk of the Somoza family,19770623,Mr. OBEY,06231977.txt,Let me state that the one sure way we can make it easy for Castro to continue to gain converts in Latin America is if we continue to support regimes of the ilk of the Somoza family,"[('one', 'CARDINAL'), ('Castro', 'PERSON'), ('Latin America', 'LOC'), ('Somoza', 'PERSON')]"
1,950094636,That is how we encourage the growth of communism,19770623,Mr. OBEY,06231977.txt,That is how we encourage the growth of communism,[]
2,950094636,That is how we discourage the growth of democracy in Latin America,19770623,Mr. OBEY,06231977.txt,That is how we discourage the growth of democracy in Latin America,"[('Latin America', 'LOC')]"
3,950094636,Mr Chairman,19770623,Mr. OBEY,06231977.txt,Mr Chairman,[]
4,950094636,given the speeches I have made lately about the press,19770623,Mr. OBEY,06231977.txt,given the speeches I have made lately about the press,[]
5,950094636,I am not one,19770623,Mr. OBEY,06231977.txt,I am not one,[]
6,950094636,I suppose,19770623,Mr. OBEY,06231977.txt,I suppose,[]

I tried to extract the parentheses that contain the word LOC:

regex <- "(?=\\().*? \'LOC.*?(?<=\\))"
  
  
filtered_df$clean_NE <- str_extract_all(filtered_df$named_entities, regex)

The regex above does not work. Thanks!

Tags: r, regex

Solution

You can use

str_extract_all(filtered_df$named_entities, "\\([^()]*'LOC'[^()]*\\)")

Pattern details:

  • \( - a ( character
  • [^()]* - zero or more characters other than ( and )
  • 'LOC' - the literal string 'LOC'
  • [^()]* - zero or more characters other than ( and )
  • \) - a ) character.

An R demo:

library(stringr)
x <- "[('one', 'CARDINAL'), ('Castro', 'PERSON'), ('Latin America', 'LOC'), ('Somoza', 'PERSON')]"
str_extract_all(x, "\\([^()]*'LOC'[^()]*\\)")
# => [1] "('Latin America', 'LOC')"
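
Since the question assigns the result to a data frame column, it may help to see what happens with several rows at once. The following is only a sketch: the three values are copied from the named_entities column in the question, while the collapsing step and the "; " separator are assumptions for illustration, not part of the original answer.

```r
library(stringr)

# Three sample values taken from the named_entities column in the question
named_entities <- c(
  "[('one', 'CARDINAL'), ('Castro', 'PERSON'), ('Latin America', 'LOC'), ('Somoza', 'PERSON')]",
  "[]",
  "[('Latin America', 'LOC')]"
)

pattern <- "\\([^()]*'LOC'[^()]*\\)"

# str_extract_all() returns a list: one character vector per input element,
# and an empty vector for elements with no match
matches <- str_extract_all(named_entities, pattern)

# Assumed post-processing: collapse each element into a single string,
# so rows without a match become ""
clean_NE <- vapply(matches, paste, character(1), collapse = "; ")
clean_NE
```

Keeping the list as returned is equally valid; collapsing is just one way to end up with a plain character column.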

As a bonus, to get just Latin America, you can use

str_extract_all(x, "[^']+(?=',\\s*'LOC'\\))")
# => [1] "Latin America"

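Here, [^']+(?=',\s*'LOC'\)) matches one or more characters other than ' that are immediately followed by ',, zero or more whitespace characters, and then the 'LOC') string. Because the trailing part sits inside a lookahead, it is checked but not included in the match.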
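
The same lookahead pattern can be built for any label. Below is a minimal sketch in base R, assuming the label contains no regex metacharacters; the helper name extract_by_label is hypothetical and does not come from the original answer.

```r
# Hypothetical helper: returns the entity text of every tuple tagged with `label`
extract_by_label <- function(x, label) {
  # Assumes `label` has no regex metacharacters (true for LOC, PERSON, CARDINAL)
  pattern <- paste0("[^']+(?=',\\s*'", label, "'\\))")
  # perl = TRUE is required because base R's default regex engine lacks lookaheads
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

x <- "[('one', 'CARDINAL'), ('Castro', 'PERSON'), ('Latin America', 'LOC'), ('Somoza', 'PERSON')]"

extract_by_label(x, "LOC")[[1]]
# => [1] "Latin America"
extract_by_label(x, "PERSON")[[1]]
# => [1] "Castro" "Somoza"
```

Like str_extract_all, regmatches with gregexpr returns a list with one element per input string, hence the [[1]] when passing a single string.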

