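r - DocumentTermMatrix / LDA raises the non-zero entry error even though no documents are empty
Problem description
I am trying my first LDA model in R and ran into this error: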
Error in LDA(Corpus_clean_dtm, k, method = "Gibbs", control = list(nstart = nstart, : Each row of the input matrix needs to contain at least one non-zero entry
Here is my model code, which includes some standard preprocessing steps:
library(tm)
library(topicmodels)
library(textstem)
df_withduplicateID <- data.frame(
  doc_id = c("2095/1", "2836/1", "2836/1", "2836/1", "9750/2",
             "13559/1", "19094/1", "19053/1", "20215/1", "20215/1"),
  text = c("He do subjects prepared bachelor juvenile ye oh.",
           "He feelings removing informed he as ignorant we prepared.",
           "He feelings removing informed he as ignorant we prepared.",
           "He feelings removing informed he as ignorant we prepared.",
           "Fond his say old meet cold find come whom. ",
           "Wonder matter now can estate esteem assure fat roused.",
           ".Am performed on existence as discourse is.",
           "Moment led family sooner cannot her window pulled any.",
           "Why resolution one motionless you him thoroughly.",
           "Why resolution one motionless you him thoroughly.")
)
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, tolower)
  corpus <- tm_map(corpus, lemmatize_strings)
  return(corpus)
}
# Keep the first row for each doc_id (the data frame has no column named ID)
df <- df_withduplicateID[!duplicated(df_withduplicateID$doc_id), ]
Corpus <- Corpus(DataframeSource(df))
Corpus_clean <- clean_corpus(Corpus)
Corpus_clean_dtm <- DocumentTermMatrix(Corpus_clean)
burnin <- 4000
iter <- 2000
thin <- 500
seed <- list(203, 500, 623, 1001, 765)
nstart <- 5
best <- TRUE
k <- 5
LDAresult_1683 <- LDA(Corpus_clean_dtm, k, method = "Gibbs",
                      control = list(nstart = nstart, seed = seed, best = best,
                                     burnin = burnin, iter = iter, thin = thin))
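After some searching, it looks like my DocumentTermMatrix may contain empty documents (this has come up in earlier questions), which is what triggers this error message.
I then removed the empty documents and re-ran the LDA model, and everything went smoothly. No errors were thrown.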
rowTotals <- apply(Corpus_clean_dtm , 1, sum)
Corpus_clean_dtm.new <- Corpus_clean_dtm[rowTotals >0, ]
Corpus_clean_dtm.empty <- Corpus_clean_dtm[rowTotals <= 0, ]
Corpus_clean_dtm.empty$dimnames$Docs
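I then manually looked up the row IDs in Corpus_clean_dtm.empty (which holds all the empty-document entries) and matched them to the same IDs (and row numbers) in Corpus_clean. It turned out that these documents are not really "empty": each "empty" document contains at least 20 characters.
Am I missing something here?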
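As a side note on the row-total check, the sketch below shows one way a document with visible text can still end up as an all-zero row. It uses a hypothetical toy corpus (not the data above) and relies on DocumentTermMatrix() dropping tokens shorter than three characters by default. It is not meant as a diagnosis of the error in this question, only as a way to list the affected rows and compare them against the corpus text.

```r
library(tm)

# Hypothetical toy corpus: the second document has text, but every token
# is shorter than 3 characters, so the default wordLengths = c(3, Inf)
# setting of DocumentTermMatrix() leaves it without any terms.
texts <- c("topic models need words", "he as we", "more words here")
corpus <- VCorpus(VectorSource(texts))
dtm <- DocumentTermMatrix(corpus)

# slam::row_sums() works on the sparse matrix directly, without
# converting it to a dense matrix first.
row_totals <- slam::row_sums(dtm)
empty_rows <- unname(which(row_totals == 0))

# Show the raw text behind each all-zero row, then drop those rows.
for (i in empty_rows) print(content(corpus[[i]]))
dtm_nonempty <- dtm[row_totals > 0, ]
```

Printing the text behind each zero row makes it easier to tell documents that were filtered down to nothing apart from documents that were mangled by an earlier cleaning step.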
Solution
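After more digging, and inspired by a related discussion (correct me if I am wrong), I believe the problem I described is caused by an actual bug in the tm package. After converting my data frame with VCorpus() instead of Corpus(), and wrapping every cleaning step in content_transformer(), I could lemmatize all documents and apply DocumentTermMatrix() to the clean corpus without any errors. If I do not wrap the cleaning steps in content_transformer(), my VCorpus() object comes back from cleaning as a list rather than a corpus structure. LDA() does not throw any error either.
For future reference, I am using tm version 0.7-3.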
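To illustrate what the content_transformer() wrapper changes, here is a minimal sketch on a hypothetical two-document corpus (not the data from the question). It assumes tm 0.7.x behaviour, where tm_map() on a VCorpus replaces each document with whatever the function returns.

```r
library(tm)

# Hypothetical two-document corpus, used only for illustration.
corpus <- VCorpus(VectorSource(c("Hello World", "Second DOCUMENT here")))

# Wrapped: content_transformer() applies tolower() to the text and writes
# the result back into the document, so each element stays a
# PlainTextDocument with its metadata intact.
wrapped <- tm_map(corpus, content_transformer(tolower))
print(class(wrapped[[1]]))
print(content(wrapped[[1]]))

# Bare: tolower() returns a plain character vector, and that vector
# replaces the document object inside the corpus.
bare <- tm_map(corpus, tolower)
print(class(bare[[1]]))
```

Comparing class() on an element before and after each cleaning step is a quick way to spot the step that stripped the document structure.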
library(tm)
library(topicmodels)
library(textstem)
df_withduplicateID <- data.frame(
  doc_id = c("2095/1", "2836/1", "2836/1", "2836/1", "9750/2",
             "13559/1", "19094/1", "19053/1", "20215/1", "20215/1"),
  text = c("He do subjects prepared bachelor juvenile ye oh.",
           "He feelings removing informed he as ignorant we prepared.",
           "He feelings removing informed he as ignorant we prepared.",
           "He feelings removing informed he as ignorant we prepared.",
           "Fond his say old meet cold find come whom. ",
           "Wonder matter now can estate esteem assure fat roused.",
           ".Am performed on existence as discourse is.",
           "Moment led family sooner cannot her window pulled any.",
           "Why resolution one motionless you him thoroughly.",
           "Why resolution one motionless you him thoroughly.")
)
clean_corpus <- function(corpus) {
  # content_transformer() keeps each element a text document
  corpus <- tm_map(corpus, content_transformer(stripWhitespace))
  corpus <- tm_map(corpus, content_transformer(removePunctuation))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, content_transformer(lemmatize_strings))
  return(corpus)
}
# Keep the first row for each doc_id (the data frame has no column named ID)
df <- df_withduplicateID[!duplicated(df_withduplicateID$doc_id), ]
Corpus <- VCorpus(DataframeSource(df), readerControl = list(reader = reader(DataframeSource(df)), language = "en"))
Corpus_clean <- clean_corpus(Corpus)
Corpus_clean_dtm <- DocumentTermMatrix(Corpus_clean)
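As a final check before fitting, it can help to assert that no row of the matrix is empty, so that a problem shows up at the preprocessing stage rather than inside LDA(). The sketch below uses a hypothetical four-document corpus and much shorter Gibbs settings than the question, purely to keep it quick to run; the values are assumptions, not recommendations.

```r
library(tm)
library(topicmodels)

# Hypothetical toy corpus; every document keeps at least one term.
texts <- c("apples pears plums grapes",
           "engines wheels brakes gears",
           "apples grapes plums cherries",
           "wheels gears engines pistons")
corpus <- VCorpus(VectorSource(texts))
dtm <- DocumentTermMatrix(corpus)

# Guard: stop early if any document has no terms left.
stopifnot(all(slam::row_sums(dtm) > 0))

# Short Gibbs run with a single start and a fixed seed.
fit <- LDA(dtm, k = 2, method = "Gibbs",
           control = list(seed = 203L, burnin = 100, iter = 200, thin = 200))

# Most likely topic for each document.
print(topics(fit))
```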