Mapping topics to reviews in R

Problem description

I have two datasets: review data and topic data.

The dput of my review data:

structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved", 
"Sports and physical exercise need to be given importance"), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

The dput of my topic data:

structure(list(word = structure(2:1, .Label = c("canteen food", 
"sports and physical"), class = "factor"), Topic = structure(2:1, .Label = c("Canteen", 
"Sports "), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

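The dput of my expected output. I want to find the words from the topic data that occur in each review and map the corresponding topic onto the review data: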

structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved", 
"Sports and physical exercise need to be given importance"), class = "factor"), 
    Topic = structure(2:1, .Label = c("Canteen", "Sports "), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

Tags: r, dplyr, text-mining, tm, tidytext

Solution


Amateur here. I did this in base R rather than dplyr, because I'm not the best with join functions.

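First, initialize your data frames. I added a few more examples to make sure everything works. I also chose not to use factors, because they make assigning strings later on messy (assigning a string that is not an existing level produces NA).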

# initialize your dfs
review <- data.frame("Review" = c("Canteen Food could be improved", 
                                  "Sports and physical exercise need to be given importance",
                                  "canteen food x2",
                                  "this is my sports and physical",
                                  "SPORTS AND PHYSICAL",
                                  "meme",
                                  "canteen and food",
                                  "this is my meme",
                                  "memethis"
                                  ),
                     stringsAsFactors = F)

topic <- data.frame("word" = c("canteen food", "sports and physical", "meme"), 
                    "Topic" = c("Canteen", "Sports", "meme_cat"),
                    stringsAsFactors = F)
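
If you are starting from factor-based data frames like the ones in the question, one way to get plain strings is to convert the factor columns to character before matching. The snippet below is only a sketch: `review_q` is a hypothetical name used here for a copy of the question's review data, rebuilt with `factor()` for brevity.

```r
# Hypothetical copy of the question's review data, with Review stored as a factor
review_q <- data.frame(
  Review = factor(c("Sports and physical exercise need to be given importance",
                    "Canteen Food could be improved"))
)

# Convert every factor column to character; other columns are left untouched.
# Assigning into review_q[] keeps the data.frame structure.
review_q[] <- lapply(review_q, function(col) {
  if (is.factor(col)) as.character(col) else col
})

str(review_q)
```

After this step the columns are ordinary character vectors, so new strings can be assigned without worrying about factor levels.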

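Then use nested for loops to go through the words you want, find the reviews whose text matches, and assign the associated topic. Initialize everything before the loops.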

# initialize new column to write into in loop
review <- cbind(review, "Topic" = rep(NA, nrow(review)))

# initialize the match vector before the loop: one flag per review
a <- rep(FALSE, nrow(review))

# loop over the words in topic and look for matches in each review;
# on a match, set review$Topic to that word's Topic
for (i in seq_len(nrow(topic))) {
  for (j in seq_len(nrow(review))) {
    a[j] <- grepl(topic$word[i], review$Review[j], ignore.case = TRUE)
  }
  if (any(a)) {
    review$Topic[a] <- topic$Topic[i]
  }
}

review
#                                                    Review    Topic
#1                           Canteen Food could be improved  Canteen
#2 Sports and physical exercise need to be given importance   Sports
#3                                          canteen food x2  Canteen
#4                           this is my sports and physical   Sports
#5                                      SPORTS AND PHYSICAL   Sports
#6                                                     meme meme_cat
#7                                         canteen and food     <NA>
#8                                          this is my meme meme_cat
#9                                                 memethis meme_cat
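
Note that row 9 ("memethis") is tagged meme_cat because `grepl` matches substrings. If that is not what you want, a possible variation is to wrap each word in `\\b` word boundaries, and because `grepl` is vectorized over the text, the inner loop can be dropped as well. This is a sketch rather than a drop-in replacement: it uses a reduced sample of the data above and assumes the topic words contain no regex metacharacters.

```r
# Reduced sample of the data used above
review <- data.frame(Review = c("Canteen Food could be improved",
                                "this is my meme",
                                "memethis",
                                "canteen and food"),
                     stringsAsFactors = FALSE)

topic <- data.frame(word = c("canteen food", "sports and physical", "meme"),
                    Topic = c("Canteen", "Sports", "meme_cat"),
                    stringsAsFactors = FALSE)

# Start with no topic assigned
review$Topic <- NA_character_

for (i in seq_len(nrow(topic))) {
  # Assumption: topic$word[i] is plain text, so it is safe to use as a regex
  pattern <- paste0("\\b", topic$word[i], "\\b")
  # grepl checks all reviews at once and returns one logical per row
  hit <- grepl(pattern, review$Review, ignore.case = TRUE)
  review$Topic[hit] <- topic$Topic[i]
}

review
```

As in the loop version, if a review matches more than one topic word, the topic that appears later in `topic` overwrites the earlier one.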
