Correlating word pairs with the widyr package, but with no sections to group by

Problem description

I'm trying to learn the pairwise_cor() function from the widyr package, following the example in the book Text Mining with R. In that example, the authors compute correlations within a single book, split into sections. However, I'd like to look at words across all the books, rather than comparing words within just one. If I want to see how often words are associated across the entire dataset of books, i.e. how strongly words are related throughout the text rather than split up by sections of one book, how can I do that? Many thanks.

library(dplyr)
library(tidytext)
library(janeaustenr)

austen_section_words <- austen_books() %>%
  filter(book == "Pride & Prejudice") %>% ### Can I remove the filter of a specific book?
  mutate(section = row_number() %/% 10) %>%
  filter(section > 0) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word)
austen_section_words

library(widyr)

# count words co-occurring within sections
word_pairs <- austen_section_words %>%
  pairwise_count(word, section, sort = TRUE) ### here the section grouping must affect the correlations; I'd like to see correlations across all books, not within one

word_pairs %>%
  filter(item1 == "darcy")

# we need to filter for at least relatively common words first
word_cors <- austen_section_words %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, section, sort = TRUE)

Tags: r, text-mining, tidytext

Solution
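A minimal sketch of one possible approach (not from the original thread, so treat it as an assumption): pairwise_cor() always needs some co-occurrence unit as its feature column, so rather than dropping the grouping entirely, you can remove the filter to a single book and build 10-line sections across the whole corpus. Combining the book name with a per-book section index keeps a section from straddling two books; the names `austen_all_words` and `word_cors_all` are illustrative, not from the question.

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(janeaustenr)
library(widyr)

# Sections across ALL books: number 10-line sections within each book,
# then combine book + section index into one grouping key so a section
# never spans two books.
austen_all_words <- austen_books() %>%
  group_by(book) %>%
  mutate(section = row_number() %/% 10) %>%
  ungroup() %>%
  filter(section > 0) %>%
  unite(section, book, section) %>%   # e.g. a key like "Emma_12"
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word)

# Correlations across the entire corpus, with sections as the unit
word_cors_all <- austen_all_words %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, section, sort = TRUE)
```

The correlations still measure co-occurrence within sections; what changes is that the sections now come from every book, so the resulting pairs reflect the whole dataset rather than Pride & Prejudice alone.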
