How to find and plot the total frequency of multiple phrases?

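Problem description

I have a corpus, and I am trying to find the frequency of several phrases, totaled by year, and plot it. For example, if the phrases "American economy" and "Canadian economy" are each mentioned 2 times in 2004, I want this to give a frequency of 4 for 2004.

I have managed to do this for single tokens, but I run into trouble when trying it with phrases. This is the code I used for single tokens.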

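library("quanteda")
library("quanteda.textstats")  # textstat_frequency()
library("dplyr")               # %>%, group_by(), summarize()
library("ggplot2")

# df is a data frame with a "text" column and a "Year" column
a_corpus <- corpus(df, text_field = "text")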

my_dict <- dictionary(list(america = c("America", "President")))
                      
freq_grouped_creators <- textstat_frequency(dfm(tokens(a_corpus)), 
                               groups = a_corpus$Year)

freq_word_creators <- subset(freq_grouped_creators, freq_grouped_creators$feature %in% my_dict$america)

# collapsing rows by year to total frequencies for tokens
freq_word_creators_2 <- freq_word_creators %>% 
                           group_by(group) %>%
                           summarize(Sum_frequency = sum(frequency))

# plotting
ggplot(freq_word_creators_2, aes(x = group, y = Sum_frequency)) +
    geom_point() +
    scale_y_continuous(limits = c(0, 300), breaks = c(seq(0, 300, 30))) +
    xlab(NULL) +
    ylab("Frequency") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

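Tags: r, nlp, quanteda, frequency-analysis

Solution

There is no need to manipulate the frequencies in dplyr. A simpler approach is to select the phrases, build a dfm from them, and convert it to a data.frame for direct use with ggplot2.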

library("quanteda")
## Package version: 3.0.9000
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
library("quanteda.textstats")

a_corpus <- tail(data_corpus_inaugural, 10)

economic_phrases <- c("middle class", "social security", "strong economy")
toks <- tokens(a_corpus)
toks <- tokens_compound(toks, phrase(economic_phrases), concatenator = " ") %>%
  tokens_keep(economic_phrases)
dfmat <- dfm(toks)
dfmat
## Document-feature matrix of: 10 documents, 2 features (65.00% sparse) and 4 docvars.
##               features
## docs           middle class social security
##   1985-Reagan             0               0
##   1989-Bush               0               0
##   1993-Clinton            0               0
##   1997-Clinton            2               0
##   2001-Bush               0               1
##   2005-Bush               0               1
## [ reached max_ndoc ... 4 more documents ]

freq_word_creators_2 <- data.frame(convert(dfmat, to = "data.frame"), Year = dfmat$Year)

# plotting
library("ggplot2")
ggplot(freq_word_creators_2, aes(x = Year, y = middle.class)) +
  geom_point() +
  # scale_y_continuous(limits = c(0, 300), breaks = c(seq(0, 300, 30))) +
  xlab(NULL) +
  ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
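
The plot above shows only one phrase (middle.class), while the question asks for the total across all phrases per year. A minimal sketch of one way to get that total, using the same sample corpus and phrases: group the dfm by Year with dfm_group(), then sum the remaining features per group with ntoken(). The names dfmat_year and phrase_totals are illustrative and not part of the original answer.

```r
library("quanteda")
library("ggplot2")

# Same sample corpus and phrases as in the answer
a_corpus <- tail(data_corpus_inaugural, 10)
economic_phrases <- c("middle class", "social security", "strong economy")

# Join each phrase into a single token, then drop every other token
toks <- tokens(a_corpus)
toks <- tokens_compound(toks, phrase(economic_phrases), concatenator = " ")
toks <- tokens_keep(toks, economic_phrases)
dfmat <- dfm(toks)

# Collapse documents that share a Year into one row
# (matters when a corpus has several documents per year)
dfmat_year <- dfm_group(dfmat, groups = docvars(dfmat, "Year"))

# Only the phrases are left as features, so the token count per row
# is the total phrase frequency for that year
phrase_totals <- data.frame(
  Year = as.integer(docnames(dfmat_year)),
  Sum_frequency = unname(ntoken(dfmat_year))
)

# Plot the yearly totals
ggplot(phrase_totals, aes(x = Year, y = Sum_frequency)) +
  geom_point() +
  xlab(NULL) +
  ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

For the question's own corpus, replacing data_corpus_inaugural and economic_phrases with a_corpus and the phrases of interest should be enough, assuming the corpus has a Year docvar as in the question's code.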

