Summing rows that contain strings from a reference table in R

Problem description

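I have a list of strings stored as rows of a table, and I want to determine how often each of them occurs in the rows of another data table in R. At the same time, I want to sum the values of the rows that contain each string.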

For example, my reference table with the list of strings looks like this:

+-----------------------------+
|String                       |
+-----------------------------+
|Dixon                        |
+-----------------------------+
|Nina Kraviz                  |
+-----------------------------+
|DJ Tennis                    |
+-----------------------------+

The table I want to analyze looks like this:

+--------------------------------+
|String                |Score    |
+--------------------------------+
|Nina Kraviz @ Hyde    |100      |
+--------------------------------+
|DJ Tennis?            |200      |
+--------------------------------+
|From Dixon            |100      |
+--------------------------------+
|From Kevin Saunderson |100      |
+--------------------------------+
|Dixon                 |300      |
+--------------------------------+
|Nina Kraviz           |200      |
+--------------------------------+

I would like my result table to look like this:

+---------------------------------+
|String             |Score        |
+---------------------------------+
|Dixon              |400          |
+---------------------------------+
|Nina Kraviz        |300          |
+---------------------------------+
|DJ Tennis          |200          |
+---------------------------------+

I tried n-grams and tokenization, but that did not work well because an artist's name can consist of one, two, or three words. Any help would be greatly appreciated.

Tags: r, string, join, nlp, sum

Solution


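We can filter the rows of the second data.frame by partial match: build a single regular expression from the reference strings, extract the matching name from each row with str_extract, drop the rows that match nothing, and sum Score by name.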

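library(dplyr)
library(stringr)

# Build one pattern: any reference string, delimited by word boundaries
pat <- str_c("\\b(", str_c(df1$String, collapse = "|"), ")\\b")

df2 %>%
  # Replace each String with the reference string it contains (NA if none)
  group_by(String = str_extract(String, pat)) %>%
  filter(!is.na(String)) %>%
  summarise(Score = sum(Score, na.rm = TRUE))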
# A tibble: 3 x 2
#  String      Score
#  <chr>       <dbl>
#1 Dixon         400
#2 DJ Tennis     200
#3 Nina Kraviz   300
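
The question also asks how often each string occurs, which the pipeline above does not report. A minimal sketch of one way to get both figures follows, assuming the same sample data; the column name Count and the object name res are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(stringr)

# Sample data copied from the question
df1 <- data.frame(String = c("Dixon", "Nina Kraviz", "DJ Tennis"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  String = c("Nina Kraviz @ Hyde", "DJ Tennis?", "From Dixon",
             "From Kevin Saunderson", "Dixon", "Nina Kraviz"),
  Score = c(100, 200, 100, 100, 300, 200),
  stringsAsFactors = FALSE
)

# Same word-bounded alternation pattern as in the solution
pat <- str_c("\\b(", str_c(df1$String, collapse = "|"), ")\\b")

res <- df2 %>%
  mutate(String = str_extract(String, pat)) %>%  # NA when nothing matches
  filter(!is.na(String)) %>%
  group_by(String) %>%
  # Count = number of matching rows, Score = sum of their scores
  summarise(Count = n(), Score = sum(Score, na.rm = TRUE))

print(res)
```

Note that str_extract returns only the first match in each row, so a row that mentions two reference strings is credited to just one of them.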

Data

df1 <- structure(list(String = c("Dixon", "Nina Kraviz", "DJ Tennis"
)), class = "data.frame", row.names = c(NA, -3L))

df2 <- structure(list(String = c("Nina Kraviz @ Hyde", "DJ Tennis?", 
"From Dixon", "From Kevin Saunderson", "Dixon", "Nina Kraviz"
), Score = c(100, 200, 100, 100, 300, 200)), class = "data.frame", 
row.names = c(NA, 
-6L))
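
As an alternative to the dplyr pipeline in the solution, the same totals can be computed in base R by testing each reference string separately. This is only a sketch under a few assumptions: summarise_one and res_base are hypothetical names, and the \Q...\E quoting relies on perl = TRUE so that a name containing regex metacharacters is matched literally.

```r
# Sample data copied from the question
df1 <- data.frame(String = c("Dixon", "Nina Kraviz", "DJ Tennis"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  String = c("Nina Kraviz @ Hyde", "DJ Tennis?", "From Dixon",
             "From Kevin Saunderson", "Dixon", "Nina Kraviz"),
  Score = c(100, 200, 100, 100, 300, 200),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one reference string, find the rows of df2 that
# contain it as a whole word, then report the row count and the score total.
summarise_one <- function(name) {
  # \Q...\E quotes the name literally (PCRE); \b requires word boundaries
  hit <- grepl(paste0("\\b\\Q", name, "\\E\\b"), df2$String, perl = TRUE)
  data.frame(String = name, Count = sum(hit), Score = sum(df2$Score[hit]),
             stringsAsFactors = FALSE)
}

# One result row per reference string, in the order of the reference table
res_base <- do.call(rbind, lapply(df1$String, summarise_one))
print(res_base)
```

Because every reference string is checked on its own, a row that mentions two artists contributes to both totals, and a reference string with no matches still appears with a count and score of 0.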
