Counting word frequencies across multiple columns in R

Question

I have a data frame in R with several columns of multi-word text responses. It looks like this:

1a        1b             1c       2a          2b             2c
student   job prospects  money    professors  students       campus
future    career         unsure   my grades   opportunities  university
success   reputation     my job   earnings    courses        unsure

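I would like to count word frequencies across columns 1a, 1b, and 1c combined, and likewise across columns 2a, 2b, and 2c combined.

At the moment I am using this code to count the word frequencies in each column separately. (Column names that start with a digit have to be wrapped in backticks.)

data.frame(table(unlist(strsplit(tolower(dat$`1a`), " "))))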

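Ideally, I would like to merge the two groups of columns into two columns and then use the same code to count word frequencies, but I am open to other options.

The merged columns would look like this: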

1              2
student        professors
future         my grades
success        earnings
job prospects  students
career         opportunities
reputation     courses
money          campus
unsure         university
my job         unsure

Tags: r, dataframe, text, nlp

Solution


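Here is one approach using the dplyr and tidyr packages. As an aside, column names that start with a digit are best avoided; naming them a1, a2, ... will make things easier in the long run.

library(dplyr)
library(tidyr)

df %>% 
  gather(variable, value) %>%                                # wide to long: one row per response
  mutate(variable = substr(variable, 1, 1)) %>%              # keep only the group digit: "1a" -> "1"
  mutate(id = ave(variable, variable, FUN = seq_along)) %>%  # running index within each group
  spread(variable, value)                                    # long to wide: one column per group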

  id             1             2
1  1       student    professors
2  2        future     my grades
3  3       success      earnings
4  4 job prospects      students
5  5        career opportunities
6  6    reputation       courses
7  7         money        campus
8  8        unsure    university
9  9        my job        unsure
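
The reshaped result still has to be fed into the word count from the question. As a rough sketch that stays in base R (so it is not the dplyr/tidyr pipeline above), the columns can be grouped by the leading digit of their names, stacked, and counted per group. The helper `count_words` and the names `groups`, `combined`, and `freqs` are made up for this illustration, and the sample data is rebuilt inline so the block runs on its own.

```r
# Sample data from the question, rebuilt here so the sketch is self-contained.
df <- data.frame(
  `1a` = c("student", "future", "success"),
  `1b` = c("job prospects", "career", "reputation"),
  `1c` = c("money", "unsure", "my job"),
  `2a` = c("professors", "my grades", "earnings"),
  `2b` = c("students", "opportunities", "courses"),
  `2c` = c("campus", "university", "unsure"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical helper: lower-case, split on single spaces, tabulate.
count_words <- function(x) {
  words <- unlist(strsplit(tolower(x), " ", fixed = TRUE))
  as.data.frame(table(word = words), responseName = "freq",
                stringsAsFactors = FALSE)
}

# Group column names by their first character:
# "1" -> 1a, 1b, 1c and "2" -> 2a, 2b, 2c.
groups <- split(names(df), substr(names(df), 1, 1))

# Stack each group's columns into a single column, column by column.
combined <- data.frame(
  lapply(groups, function(cols) unlist(df[cols], use.names = FALSE)),
  check.names = FALSE, stringsAsFactors = FALSE
)

# One frequency table per combined column.
freqs <- lapply(combined, count_words)
freqs[["1"]]
```

On the sample data, `job` comes out with a count of 2 in group 1, because it occurs in both "job prospects" and "my job". Splitting on a single space is the same assumption the question's own code makes; responses with punctuation or repeated spaces would need extra cleaning.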

Data:

df <- structure(list(`1a` = c("student", "future", "success"), `1b` = c("job prospects", 
"career", "reputation"), `1c` = c("money", "unsure", "my job"
), `2a` = c("professors", "my grades", "earnings"), `2b` = c("students", 
"opportunities", "courses"), `2c` = c("campus", "university", 
"unsure")), .Names = c("1a", "1b", "1c", "2a", "2b", "2c"), class = "data.frame", row.names = c(NA, 
-3L))
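
The advice about column names can be illustrated with a short base R sketch. It assumes every name has the form digit-then-letter, as in the sample data, and swaps the two parts so that `1a` becomes `a1`. The regular expression and the renaming scheme are illustrative choices, not part of the original answer.

```r
# Sample data with the original, non-syntactic column names.
df <- data.frame(
  `1a` = c("student", "future", "success"),
  `1b` = c("job prospects", "career", "reputation"),
  `1c` = c("money", "unsure", "my job"),
  `2a` = c("professors", "my grades", "earnings"),
  `2b` = c("students", "opportunities", "courses"),
  `2c` = c("campus", "university", "unsure"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Swap "<digit><letter>" to "<letter><digit>", e.g. "1a" -> "a1".
names(df) <- sub("^([0-9])([a-z])$", "\\2\\1", names(df))

# Syntactic names work with `$` without backticks.
word_freq <- data.frame(table(unlist(strsplit(tolower(df$a1), " "))))
```

With names like these, the `substr(variable, 1, 1)` step in the pipeline would pick out the letter rather than the group digit, so it would need to take the second character instead.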
