r - How to transform a Document Term Matrix in R?
Question
Hello, I have a document term matrix that I transformed with the tidy() function, and that works perfectly. I want to plot a word cloud based on the frequency of each word. My transformed table looks like this:
> head(Wcloud.Data)
# A tibble: 6 x 3
document term count
<chr> <chr> <dbl>
1 1 accept 1
2 1 access 1
3 1 accomplish 1
4 1 account 4
5 1 accur 2
6 1 achiev 1
I have 33,647,383 observations, so it is a very big data frame. If I use the max() function I get a very high number (64116), but no word in my data frame has a frequency of 64116. Also, if I plot the data frame in Shiny with wordcloud(), it plots the same words several times. Sorting the count column does not work either: sort(Wcloud.Data$count, decreasing = TRUE). So something is not correct, but I don't know what it is or how to solve it. Does anybody have an idea?
This is the summary of my document term matrix before transforming it into a data frame:
> observations.tf
<<DocumentTermMatrix (documents: 76717, terms: 4234)>>
Non-/sparse entries: 33647383/291172395
Sparsity : 90%
Maximal term length: 15
Weighting : term frequency (tf)
Update: I added a picture of my data frame.
Solution
Using dplyr, you can do the following:
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
library("dplyr") # provides %>%, group_by() and summarise()
# Small stand-in for the tidied document term matrix
Wcloud.Data <- data.frame(Document = rep(1, 6),
                          term = c("accept", "access", "accomplish", "account", "accur", "achiev"),
                          count = c(1, 1, 1, 4, 2, 1))
# Sum the counts of each term over all documents
Data <- Wcloud.Data %>%
  group_by(term) %>%
  summarise(Frequency = sum(count))
set.seed(1234) # makes the word cloud layout reproducible
wordcloud(words = Data$term, freq = Data$Frequency, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
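The aggregation step matters because a tidied document term matrix holds one row per document/term pair, so a term that occurs in many documents shows up in many rows and wordcloud() draws it once per row. Note also that sort() on a column only returns a sorted vector; it does not reorder the data frame. Here is a minimal base R sketch of both points, using a made-up toy table (tidy_counts and its values are hypothetical, not taken from the question's data):

```r
# Toy tidy table (hypothetical values): one row per document/term pair
tidy_counts <- data.frame(
  document = c("1", "1", "2", "2", "3"),
  term     = c("account", "access", "account", "accept", "account"),
  count    = c(4, 1, 3, 2, 5),
  stringsAsFactors = FALSE
)

# Sum the counts of each term across all documents
term_freq <- aggregate(count ~ term, data = tidy_counts, FUN = sum)

# Reorder the whole table by total count, highest first
term_freq <- term_freq[order(term_freq$count, decreasing = TRUE), ]
rownames(term_freq) <- NULL

print(term_freq)
```

After this step each term appears exactly once, so term_freq$term and term_freq$count can be passed to wordcloud() as words and freq.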
Alternatively, the quanteda library (together with tibble) can help you create a term frequency matrix. Here is an example of how to use it:
library(tibble)
library(quanteda)
Data <- tibble(text = c("Chinese Beijing Chinese", # tibble() replaces the deprecated data_frame()
"Chinese Chinese Shanghai",
"this is china",
"china is here",
'hello china',
"Chinese Beijing Chinese",
"Chinese Chinese Shanghai",
"this is china",
"china is here",
'hello china',
"Kyoto Japan",
"Tokyo Japan Chinese",
"Kyoto Japan",
"Tokyo Japan Chinese",
"Kyoto Japan",
"Tokyo Japan Chinese",
"Kyoto Japan",
"Tokyo Japan Chinese",
'japan'))
DocTerm <- quanteda::dfm(quanteda::tokens(Data$text)) # quanteda 3 and later expect tokens as input to dfm()
DocTerm
# Document-feature matrix of: 19 documents, 11 features (78.5% sparse).
# 19 x 11 sparse Matrix of class "dfm"
# features
# docs chinese beijing shanghai this is china here hello kyoto japan tokyo
# text1 2 1 0 0 0 0 0 0 0 0 0
# text2 2 0 1 0 0 0 0 0 0 0 0
# text3 0 0 0 1 1 1 0 0 0 0 0
# text4 0 0 0 0 1 1 1 0 0 0 0
# text5 0 0 0 0 0 1 0 1 0 0 0
# text6 2 1 0 0 0 0 0 0 0 0 0
# text7 2 0 1 0 0 0 0 0 0 0 0
# text8 0 0 0 1 1 1 0 0 0 0 0
# text9 0 0 0 0 1 1 1 0 0 0 0
# text10 0 0 0 0 0 1 0 1 0 0 0
# text11 0 0 0 0 0 0 0 0 1 1 0
# text12 1 0 0 0 0 0 0 0 0 1 1
# text13 0 0 0 0 0 0 0 0 1 1 0
# text14 1 0 0 0 0 0 0 0 0 1 1
# text15 0 0 0 0 0 0 0 0 1 1 0
# text16 1 0 0 0 0 0 0 0 0 1 1
# text17 0 0 0 0 0 0 0 0 1 1 0
# text18 1 0 0 0 0 0 0 0 0 1 1
# text19 0 0 0 0 0 0 0 0 0 1 0
Mat <- quanteda::convert(DocTerm, to = "data.frame")[, -1] # Convert to a data frame and drop the first column, which holds the document id rather than counts
Result <- colSums(Mat) # This is what you are interested in
names(Result) <- colnames(Mat)
# > Result
# chinese beijing shanghai this is china here hello kyoto japan tokyo
# 12 2 2 2 4 6 2 2 4 9 4
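Because the object in the question is a tm DocumentTermMatrix, the same per-term totals can probably be taken straight from the sparse matrix, without first expanding it into a data frame with millions of rows. The sketch below assumes the slam package (installed as a dependency of tm) and builds a tiny two-document corpus purely for illustration; docs, dtm and freq are hypothetical names.

```r
library(tm)

# Tiny illustrative corpus; the real input would be the existing
# DocumentTermMatrix (observations.tf in the question)
docs <- VCorpus(VectorSource(c("account access account",
                               "account accept")))
dtm <- DocumentTermMatrix(docs)

# Column sums on the sparse representation: one total per term
freq <- sort(slam::col_sums(dtm), decreasing = TRUE)

print(freq)
```

The resulting named vector can be fed to wordcloud() as words = names(freq) and freq = freq.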