Extract names from a string using a list of names with grepl and a loop, and add them to a new column in R

Problem description

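I have a dataset with one column containing names and another column describing what that person did during the day. I am trying to work out, in R, who met whom on that day. I created a vector containing the names that appear in the dataset and used grepl in a loop to identify where those names occur in the column describing people's activities.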

name <- c("Dupont","Dupuy","Smith") 

activity <- c("On that day, he had lunch with Dupuy in London.", 
              "She had lunch with Dupont and then went to Brighton to meet Smith.", 
              "Smith remembers that he was tired on that day.")

met_with <- c("Dupont","Dupuy","Smith")

df <- data.frame(name, activity, met_with = NA)

for (i in 1:length(met_with)) {
  df$met_with <- ifelse(grepl(met_with[i], df$activity), met_with[i], df$met_with)
}

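However, this solution is unsatisfactory for two reasons. First, when a person met more than one other person (e.g. Dupuy in my example), I cannot extract more than one name. Second, I cannot tell R to skip the person's own name when the activity column uses that name instead of a pronoun (e.g. Smith).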

Ideally, I would like df to look like this:

  name         activity                                            met_with                             
  Dupont       On that day, he had lunch with Dupuy in London.     Dupuy
  Dupuy        She had lunch with Dupont and then (...).           Dupont Smith
  Smith        Smith remembers that he was tired on that day.      NA

I am cleaning the strings in order to build an edge list and a node list for a later network analysis.

Thanks

Tags: r, string, loops, grepl, edge-list

Solution


Same logic as @Gki's answer, but using stringr functions and mapply instead of a loop.

library(stringr)

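# Build one alternation pattern that matches any known name as a whole word
pat <- str_c('\\b', df$name, '\\b', collapse = '|')
# Extract every name found in each row, drop the person's own name, and join the rest with spaces
df$met_with <- mapply(function(x, y) str_c(setdiff(x, y), collapse = ' '),
                      str_extract_all(df$activity, pat), df$name)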

df

#    name                                                           activity
#1 Dupont                    On that day, he had lunch with Dupuy in London.
#2  Dupuy She had lunch with Dupont and then went to Brighton to meet Smith.
#3  Smith                     Smith remembers that he was tired on that day.

#      met_with
#1        Dupuy
#2 Dupont Smith
#3             
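
Note that the row for Smith ends up as an empty string rather than the NA requested in the question. Since the stated goal is an edge list, the following is a minimal sketch of one way to get both. It assumes the same example data, swaps the stringr calls for base R equivalents (paste0, gregexpr, regmatches) so that it runs without extra packages, and introduces the names met and edges purely for illustration.

```r
# Example data from the question
name <- c("Dupont", "Dupuy", "Smith")
activity <- c("On that day, he had lunch with Dupuy in London.",
              "She had lunch with Dupont and then went to Brighton to meet Smith.",
              "Smith remembers that he was tired on that day.")
df <- data.frame(name, activity, stringsAsFactors = FALSE)

# One alternation pattern with word boundaries around every known name
pat <- paste0("\\b", df$name, "\\b", collapse = "|")

# All names found in each activity string, as a list of character vectors
found <- regmatches(df$activity, gregexpr(pat, df$activity, perl = TRUE))

# Drop each person's own name; keep a list so every person met stays separate
# (met is a hypothetical helper name, not part of the original answer)
met <- Map(setdiff, found, df$name)

# Collapsed column for display; rows with nobody met become NA instead of ""
df$met_with <- vapply(met, function(x) {
  if (length(x) == 0) NA_character_ else paste(x, collapse = " ")
}, character(1))

# Edge list: one row per (person, person met) pair
edges <- data.frame(from = rep(df$name, lengths(met)),
                    to   = unlist(met),
                    stringsAsFactors = FALSE)
```

With the example data, edges has three rows (Dupont to Dupuy, Dupuy to Dupont, Dupuy to Smith), and Smith contributes no edge. Keeping met as a list avoids having to split the space-separated met_with column again later, which would also break for names that themselves contain spaces.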
