In R, how do I count specific words in a corpus?

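Problem description

I need to count the frequency of specific words, and there are many of them. I know how to do this by putting all the words into one group (see below), but I would like a separate count for each word.

This is what I have so far: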

library(quanteda)
#function to count 
strcount <- function(x, pattern, split){unlist(lapply(strsplit(x, split),function(z) na.omit(length(grep(pattern, z)))))}
txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."
df<-data.frame(txt)
mydict<-dictionary(list(all_terms=c("clouds","storms")))
corp <- corpus(df, text_field = 'txt')
#count terms and save output to "overview"
overview<-dfm(corp,dictionary = mydict)
overview<-convert(overview, to ='data.frame')

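As you can see, the counts for "clouds" and "storms" end up in the "all_terms" category of the resulting data.frame. Is there an easy way to get the counts for all terms in "mydict" in individual columns, without writing code for each individual term?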

E.g.
clouds, storms
1, 1

Rather than 
all_terms
2

Tags: r, nlp, data-science, quanteda

Solution

You want to use the dictionary values as a pattern in tokens_select(), rather than using them in a lookup function, which is what dfm(x, dictionary = ...) does. Here's how:

library("quanteda")
## Package version: 2.1.2

txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."

mydict <- dictionary(list(all_terms = c("clouds", "storms")))

This creates a dfm in which each column is a term rather than a dictionary key:

dfmat <- tokens(txt) %>%
  tokens_select(mydict) %>%
  dfm()

dfmat
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
##        features
## docs    clouds storms
##   text1      1      1
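
For contrast, here is a minimal sketch of the difference between looking up and selecting with the same dictionary. It assumes quanteda is installed; txt_short is a shortened stand-in for the original text, and dfmat_key and dfmat_terms are names made up for this illustration.

```r
library("quanteda")

# Shortened stand-in text (an assumption for illustration, not the original txt)
txt_short <- "the oath is taken amidst gathering clouds and raging storms"
mydict <- dictionary(list(all_terms = c("clouds", "storms")))
toks <- tokens(txt_short)

# Lookup replaces each matched token with its dictionary key,
# so both words collapse into the single feature "all_terms"
dfmat_key <- dfm(tokens_lookup(toks, mydict))
featnames(dfmat_key)
## [1] "all_terms"

# Selection keeps the matched tokens themselves,
# so every dictionary value gets its own feature
dfmat_terms <- dfm(tokens_select(toks, mydict))
featnames(dfmat_terms)
## [1] "clouds" "storms"
```

The only change between the two pipelines is tokens_lookup() versus tokens_select(), which is why the dictionary itself does not need to be restructured.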

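You can turn this into a data.frame of counts in two ways (in quanteda 3 and later, textstat_frequency() is provided by the separate quanteda.textstats package):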

convert(dfmat, to = "data.frame")
##   doc_id clouds storms
## 1  text1      1      1

textstat_frequency(dfmat)
##   feature frequency rank docfreq group
## 1  clouds         1    1       1   all
## 2  storms         1    1       1   all

While a dictionary is a valid input for a pattern (see ?pattern), you could also pass the character vector of values directly to tokens_select():

# no need for dictionary
tokens(txt) %>%
  tokens_select(c("clouds", "storms")) %>%
  dfm()
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
##        features
## docs    clouds storms
##   text1      1      1
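
As a dependency-free cross-check, the same per-term counts can be sketched in base R. This is not the quanteda approach above but a swapped-in alternative: it assumes that lowercasing and splitting on non-letters is an adequate tokenizer for the text, and txt_short, terms, words, counts, and wide are names invented for this example.

```r
# One sentence from the original text, used as a stand-in
txt_short <- "Yet, every so often the oath is taken amidst gathering clouds and raging storms."
terms <- c("clouds", "storms")

# Lowercase, then split on runs of non-letters to get rough word tokens
words <- strsplit(tolower(txt_short), "[^a-z]+")[[1]]

# Count exact matches for each term; vapply() names the result after the terms
counts <- vapply(terms, function(term) sum(words == term), integer(1))

# One row, with one column per term
wide <- as.data.frame(as.list(counts))
wide
##   clouds storms
## 1      1      1
```

Terms that never occur still get a column with a count of 0, which keeps the output shape stable across documents.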
