How to transform a Document Term Matrix in R?

Problem description

Hello, I have a document term matrix that I transformed with the tidy() function, and that works perfectly. I want to plot a word cloud based on word frequency. My transformed table looks like this:

> head(Wcloud.Data)
# A tibble: 6 x 3
  document term       count
  <chr>    <chr>      <dbl>
1 1        accept         1
2 1        access         1
3 1        accomplish     1
4 1        account        4
5 1        accur          2
6 1        achiev         1

I have 33,647,383 observations, so it's a very big data frame. If I use the max() function I get a very high number (64116), but no word in my data frame has a frequency of 64116. Also, if I plot the data frame in Shiny with wordcloud(), it plots the same words several times. Sorting the count column does not work either: sort(Wcloud.Data$count, decreasing = TRUE). So something is not correct, but I don't know what it is or how to solve it. Does anybody have an idea?

This is the summary of my document term matrix before transforming it into a data frame:

> observations.tf
<<DocumentTermMatrix (documents: 76717, terms: 4234)>>
Non-/sparse entries: 33647383/291172395
Sparsity           : 90%
Maximal term length: 15
Weighting          : term frequency (tf)

Update: I added pictures of my data frame and of the output.

[Image: Dataframe]

[Image: Output]

Tags: r, dataframe, dataset, transformation, word-cloud

Solution


Using dplyr, you can do the following:

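library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
library("dplyr")  # provides %>%, group_by() and summarise()

# Small example in the same long format as the tidy() output
Wcloud.Data <- data.frame(Document = c(rep(1, 6)),
                          term = c("accept", "access", "accomplish", "account", "accur", "achiev"),
                          count = c(1, 1, 1, 4, 2, 1))

# Sum the counts over all documents so that each term appears only once
Data <- Wcloud.Data %>%
  group_by(term) %>%
  summarise(Frequency = sum(count))

set.seed(1234)  # makes the word cloud layout reproducible
wordcloud(words = Data$term, freq = Data$Frequency, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))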
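The question also mentions that sorting did not seem to work. A minimal base R sketch of the same aggregation, followed by an explicit reordering, might look like the following. The toy data frame and the names toy and term_freq are made up for illustration and are not from the original post. The sketch relies on the fact that sort() returns a sorted copy of a vector and does not change the data frame itself.

```r
# Toy data in the same long format as the tidy() output: one row per
# (document, term) pair. The values are invented for illustration.
toy <- data.frame(
  document = c("1", "1", "2", "2", "3"),
  term     = c("account", "access", "account", "accept", "account"),
  count    = c(4, 1, 2, 1, 3),
  stringsAsFactors = FALSE
)

# Collapse to one row per term by summing the counts across documents.
term_freq <- aggregate(count ~ term, data = toy, FUN = sum)
names(term_freq)[2] <- "Frequency"

# sort() only returns a new vector, so reorder the rows with order()
# and assign the result back to keep terms and frequencies together.
term_freq <- term_freq[order(term_freq$Frequency, decreasing = TRUE), ]
rownames(term_freq) <- NULL

print(term_freq)
```

Because every term now occurs in exactly one row, passing term_freq$term and term_freq$Frequency to wordcloud() should no longer draw the same word several times.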


Alternatively, the quanteda and tibble libraries can help you create a term frequency matrix. Here is an example of how to use them:

library(tibble)
library(quanteda)
# tibble() replaces the deprecated data_frame()
Data <- tibble(text = c("Chinese Beijing Chinese",
                        "Chinese Chinese Shanghai",
                        "this is china",
                        "china is here",
                        "hello china",
                        "Chinese Beijing Chinese",
                        "Chinese Chinese Shanghai",
                        "this is china",
                        "china is here",
                        "hello china",
                        "Kyoto Japan",
                        "Tokyo Japan Chinese",
                        "Kyoto Japan",
                        "Tokyo Japan Chinese",
                        "Kyoto Japan",
                        "Tokyo Japan Chinese",
                        "Kyoto Japan",
                        "Tokyo Japan Chinese",
                        "japan"))
# Recent quanteda versions expect tokens rather than raw text in dfm()
DocTerm <- quanteda::dfm(quanteda::tokens(Data$text))
DocTerm
# Document-feature matrix of: 19 documents, 11 features (78.5% sparse).
# 19 x 11 sparse Matrix of class "dfm"
# features
# docs     chinese beijing shanghai this is china here hello kyoto japan tokyo
# text1        2       1        0    0  0     0    0     0     0     0     0
# text2        2       0        1    0  0     0    0     0     0     0     0
# text3        0       0        0    1  1     1    0     0     0     0     0
# text4        0       0        0    0  1     1    1     0     0     0     0
# text5        0       0        0    0  0     1    0     1     0     0     0
# text6        2       1        0    0  0     0    0     0     0     0     0
# text7        2       0        1    0  0     0    0     0     0     0     0
# text8        0       0        0    1  1     1    0     0     0     0     0
# text9        0       0        0    0  1     1    1     0     0     0     0
# text10       0       0        0    0  0     1    0     1     0     0     0
# text11       0       0        0    0  0     0    0     0     1     1     0
# text12       1       0        0    0  0     0    0     0     0     1     1
# text13       0       0        0    0  0     0    0     0     1     1     0
# text14       1       0        0    0  0     0    0     0     0     1     1
# text15       0       0        0    0  0     0    0     0     1     1     0
# text16       1       0        0    0  0     0    0     0     0     1     1
# text17       0       0        0    0  0     0    0     0     1     1     0
# text18       1       0        0    0  0     0    0     0     0     1     1
# text19       0       0        0    0  0     0    0     0     0     1     0

# Convert to a data frame and drop the first column (the document id),
# so that only the term counts remain
Mat <- quanteda::convert(DocTerm, to = "data.frame")[, -1]
Result <- colSums(Mat) # This is what you are interested in
names(Result) <- colnames(Mat)
# > Result
#  chinese  beijing shanghai     this       is    china     here    hello    kyoto    japan    tokyo 
#       12        2        2        2        4        6        2        2        4        9        4 
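To feed column totals like these into a word cloud, it can help to turn the named vector into a sorted two-column table. The sketch below uses a small hand-made base R matrix as a stand-in for a document-term matrix. The matrix m and the names totals and freq_table are illustrative assumptions, not part of the original answer.

```r
# Small stand-in for a document-term matrix (rows = documents,
# columns = terms). The values are invented for illustration.
m <- matrix(
  c(2, 1, 0,
    0, 0, 1,
    1, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("text1", "text2", "text3"),
                  c("chinese", "beijing", "japan"))
)

# Total frequency of each term over all documents.
totals <- colSums(m)

# One row per term, most frequent first.
freq_table <- data.frame(
  term      = names(totals),
  Frequency = as.numeric(totals),
  stringsAsFactors = FALSE
)
freq_table <- freq_table[order(freq_table$Frequency, decreasing = TRUE), ]
rownames(freq_table) <- NULL

print(freq_table)
```

The term and Frequency columns can then be passed as the words and freq arguments of wordcloud(), in the same way as in the dplyr example.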
