How to use quanteda on aggregated data?

Question

Consider this example:

tibble(text = c('a grande latte with soy milk',
                'black coffee no room'),
       repetition = c(100, 2)) 
# A tibble: 2 x 2
  text                         repetition
  <chr>                             <dbl>
1 a grande latte with soy milk        100
2 black coffee no room                  2

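The data mean that the sentence "a grande latte with soy milk" appears 100 times in my dataset. Storing that redundancy would waste memory, which is why I have the repetition variable.

Still, I would like the dfm from quanteda to reflect this, because the sparseness of the dfm gives me some room to keep that information. That is, how can I still have 100 rows for the first text in the dfm? Just using the following code does not take repetition into account: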

tibble(text = c('a grande latte with soy milk',
                'black coffee no room'),
       repetition = c(100, 2)) %>% 
  corpus() %>% 
  tokens() %>% 
  dfm()
Document-feature matrix of: 2 documents, 10 features (50.0% sparse).
2 x 10 sparse Matrix of class "dfm"
       features
docs    a grande latte with soy milk black coffee no room
  text1 1      1     1    1   1    1     0      0  0    0
  text2 0      0     0    0   0    0     1      1  1    1

Tags: r, quanteda

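Answer

Assuming your data.frame is called df1, you can use cbind to add a column to the dfm. This may not give you the result you want, though, because repetition is then stored as if it were one more feature. The other two options below may be better.

cbind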

library(quanteda)
library(dplyr)  # provides tibble() and %>%

df1 <- tibble(text = c('a grande latte with soy milk',
                       'black coffee no room'),
              repetition = c(100, 2))

my_dfm <- df1 %>%
  corpus() %>%
  tokens() %>%
  dfm() %>%
  cbind(repetition = df1$repetition) # add column to dfm with name repetition

my_dfm

Document-feature matrix of: 2 documents, 11 features (45.5% sparse).
2 x 11 sparse Matrix of class "dfm"
       features
docs    a grande latte with soy milk black coffee no room repetition
  text1 1      1     1    1   1    1     0      0  0    0        100
  text2 0      0     0    0   0    0     1      1  1    1          2
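
One reason the cbind route may not give the desired result is that the bound column becomes an ordinary feature, so feature-level summaries treat the counts as if a word called repetition occurred in the texts. A minimal sketch, assuming a reasonably recent quanteda version; plain_dfm and bound_dfm are names made up for this illustration, and the extra column is passed here as a one-column matrix with a column name rather than as a named argument:

```r
library(quanteda)

df1 <- data.frame(text = c("a grande latte with soy milk",
                           "black coffee no room"),
                  repetition = c(100, 2),
                  stringsAsFactors = FALSE)

# Plain dfm built from the text column only
plain_dfm <- dfm(tokens(corpus(df1, text_field = "text")))

# Bind the counts as an extra column (hypothetical variable names)
extra <- matrix(df1$repetition, ncol = 1,
                dimnames = list(NULL, "repetition"))
bound_dfm <- cbind(plain_dfm, extra)

# The counts now sit among the features ...
"repetition" %in% featnames(bound_dfm)

# ... so frequency summaries include them alongside real words
topfeatures(bound_dfm, n = 3)
```

If the counts should stay separate from the vocabulary, the docvars or multiplication options are likely the safer choice.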

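docvars

You can also attach the data with the docvars function. The data is then added to the dfm, but it is stored less visibly, in a slot of the dfm class (accessible with @).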

docvars(my_dfm, "repetition") <- df1$repetition
docvars(my_dfm)

      repetition
text1        100
text2          2
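
As a side note, corpus() keeps the non-text columns of a data frame as document variables, and tokens() and dfm() carry them along, so the counts may already be attached without a separate assignment. A hedged sketch of that behaviour; plain_dfm and frequent are illustrative names, and the threshold of 50 is arbitrary:

```r
library(quanteda)

df1 <- data.frame(text = c("a grande latte with soy milk",
                           "black coffee no room"),
                  repetition = c(100, 2),
                  stringsAsFactors = FALSE)

# repetition is not the text field, so it becomes a docvar
plain_dfm <- dfm(tokens(corpus(df1, text_field = "text")))

# The counts travel with the documents
docvars(plain_dfm, "repetition")

# Docvars can be used in subsetting; rows and counts stay aligned
frequent <- dfm_subset(plain_dfm, repetition > 50)
ndoc(frequent)
docvars(frequent, "repetition")
```

Keeping the counts as a docvar leaves the feature columns untouched, which is useful when the weighting should be applied only at a later step.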

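Multiplication

Multiply the dfm by the repetition vector, so that each row is scaled by its own count. Note that my_dfm here is the plain dfm, built without the cbind column; otherwise the repetition column would be multiplied as well.

my_dfm <- df1 %>% corpus() %>% tokens() %>% dfm()  # plain dfm, no cbind column
my_dfm * df1$repetition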

Document-feature matrix of: 2 documents, 10 features (50.0% sparse).
2 x 10 sparse Matrix of class "dfm"
       features
docs      a grande latte with soy milk black coffee no room
  text1 100    100   100  100 100  100     0      0  0    0
  text2   0      0     0    0   0    0     2      2  2    2
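
To check that the multiplied dfm really reflects the aggregated counts, the column totals can be inspected. This is a minimal sketch rather than the only way to do it; plain_dfm, weighted_dfm and totals are illustrative names. It relies on the vector having exactly one entry per document, in document order:

```r
library(quanteda)

df1 <- data.frame(text = c("a grande latte with soy milk",
                           "black coffee no room"),
                  repetition = c(100, 2),
                  stringsAsFactors = FALSE)

plain_dfm <- dfm(tokens(corpus(df1, text_field = "text")))

# Guard against silent recycling: one weight per document
stopifnot(length(df1$repetition) == ndoc(plain_dfm))

# Scale each document's row by its repetition count
weighted_dfm <- plain_dfm * df1$repetition

# Corpus-level frequencies now account for the repetitions
totals <- colSums(weighted_dfm)
totals[c("latte", "coffee")]
```

The result still has one row per distinct text, so the memory saving from aggregation is kept while the frequencies match the unaggregated data.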
