DocumentTermMatrix / LDA raises a non-zero entry error even though no documents are empty

Problem description

I am trying out my first LDA model in R and ran into this error:

Error in LDA(Corpus_clean_dtm, k, method = "Gibbs", control = list(nstart = nstart,  :    Each row of the input matrix needs to contain at least one non-zero entry

Here is my model code, which includes some standard preprocessing steps:

library(tm)
library(topicmodels)
library(textstem)


df_withduplicateID <- data.frame(
  doc_id = c("2095/1", "2836/1", "2836/1", "2836/1", "9750/2", 
    "13559/1", "19094/1", "19053/1", "20215/1", "20215/1"), 
  text = c("He do subjects prepared bachelor juvenile ye oh.", 
    "He feelings removing informed he as ignorant we prepared.",
    "He feelings removing informed he as ignorant we prepared.",
    "He feelings removing informed he as ignorant we prepared.",
    "Fond his say old meet cold find come whom. ",
    "Wonder matter now can estate esteem assure fat roused.",
    ".Am performed on existence as discourse is.", 
    "Moment led family sooner cannot her window pulled any.",
    "Why resolution one motionless you him thoroughly.", 
    "Why resolution one motionless you him thoroughly.")     
)


clean_corpus <- function(corpus){
                  corpus <- tm_map(corpus, stripWhitespace)
                  corpus <- tm_map(corpus, removePunctuation)
                  corpus <- tm_map(corpus, tolower)
                  corpus <- tm_map(corpus, lemmatize_strings)
                  return(corpus)
                }

# Drop rows with a repeated doc_id (the column is named doc_id, not ID)
df <- subset(df_withduplicateID, !duplicated(subset(df_withduplicateID, select = doc_id)))
Corpus <- Corpus(DataframeSource(df))
Corpus_clean <- clean_corpus(Corpus)
Corpus_clean_dtm <- DocumentTermMatrix(Corpus_clean)


burnin <- 4000
iter <- 2000
thin <- 500
seed <-list(203,500,623,1001,765)
nstart <- 5
best <- TRUE
k <- 5

LDAresult_1683 <- LDA(Corpus_clean_dtm, k, method = "Gibbs", 
  control = list(nstart = nstart, seed = seed, best = best, 
  burnin = burnin, iter = iter, thin = thin))

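After some searching, it looks like my DocumentTermMatrix may contain empty documents (as previously mentioned here and here), which would cause this error message.

I then removed the empty documents and re-ran the LDA model, and everything went smoothly. No errors were thrown.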

rowTotals <- apply(Corpus_clean_dtm , 1, sum)
Corpus_clean_dtm.new <- Corpus_clean_dtm[rowTotals >0, ]
Corpus_clean_dtm.empty <- Corpus_clean_dtm[rowTotals <= 0, ]
Corpus_clean_dtm.empty$dimnames$Docs

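I then manually looked up the row numbers and IDs listed in Corpus_clean_dtm.empty (all of the supposedly empty document entries) and matched them against the same IDs (and row numbers) in Corpus_clean. It turns out these documents are not really "empty": each "empty" document contains at least 20 characters.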
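The manual lookup described above could be scripted. The following is a minimal sketch on a made-up two-document corpus; the names `docs`, `dtm`, `row_totals`, and `flagged` are illustrative and not part of my original code. It also shows one known way a non-blank document can end up with an all-zero row: by default DocumentTermMatrix() drops terms shorter than three characters. Whether that is what happened with my data is not established here.

```r
library(tm)
library(slam)

# Toy corpus (hypothetical data): the second document is not blank,
# but every token in it has fewer than 3 characters.
docs <- VCorpus(VectorSource(c(
  "wonder matter now can estate",
  "he as we is on"
)))

# Default control uses wordLengths = c(3, Inf), so 2-letter terms are dropped.
dtm <- DocumentTermMatrix(docs)

# Row totals without converting the sparse matrix to a dense one.
row_totals <- unname(slam::row_sums(dtm))
flagged <- which(row_totals == 0)

# Print the text of each document whose row is all zero.
for (i in flagged) {
  cat("Doc", Docs(dtm)[i], "has", nchar(as.character(docs[[i]])),
      "characters:", as.character(docs[[i]]), "\n")
}
```

Printing the text next to the row total makes it easier to see whether a flagged document is truly blank or only lost all of its terms during tokenization.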

Am I missing something here?

Tags: r, text, tm, lda, topic-modeling

Solution


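After some more digging, and inspired by the discussion here (correct me if I'm wrong), I think the problem I asked about is caused by an actual bug in the tm package. After converting my data frame with VCorpus() instead of Corpus(), and adding the content_transformer() wrapper to every cleaning step, I can lemmatize all documents and apply DocumentTermMatrix() to the cleaned corpus without any errors. If I do not apply the content_transformer() wrapper in the cleaning process, my VCorpus() object comes back after cleaning as a list rather than as a corpus structure. LDA() no longer throws an error either.

For future reference, I am using tm version 0.7-3.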
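A minimal sketch of that difference, on a made-up two-document corpus (the names `docs`, `wrapped`, and `bare` are illustrative and not from my original code). It only inspects what tm_map() returns with and without the wrapper. What gets printed for the unwrapped case may differ between tm versions, so no output is shown for it.

```r
library(tm)

# Toy corpus (hypothetical data), built explicitly as a VCorpus.
docs <- VCorpus(VectorSource(c("He Do Subjects.", "Fond His Say.")))

# With the wrapper: tolower() is applied to the content of each document,
# and each element is still a text document object.
wrapped <- tm_map(docs, content_transformer(tolower))
print(inherits(wrapped[[1]], "TextDocument"))
print(as.character(wrapped[[1]]))

# Without the wrapper: tolower() is applied to the document object itself.
# Inspect the classes to see what structure comes back in your tm version.
bare <- tm_map(docs, tolower)
print(class(bare))
print(class(bare[[1]]))
```

If the unwrapped elements are no longer text documents, later steps such as DocumentTermMatrix() have nothing document-like to work with, which is why I wrap every cleaning step.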

library(tm)
library(topicmodels)
library(textstem)


df_withduplicateID <- data.frame(
  doc_id = c("2095/1", "2836/1", "2836/1", "2836/1", "9750/2", 
    "13559/1", "19094/1", "19053/1", "20215/1", "20215/1"), 
  text = c("He do subjects prepared bachelor juvenile ye oh.", 
    "He feelings removing informed he as ignorant we prepared.",
    "He feelings removing informed he as ignorant we prepared.",
    "He feelings removing informed he as ignorant we prepared.",
    "Fond his say old meet cold find come whom. ",
    "Wonder matter now can estate esteem assure fat roused.",
    ".Am performed on existence as discourse is.", 
    "Moment led family sooner cannot her window pulled any.",
    "Why resolution one motionless you him thoroughly.", 
    "Why resolution one motionless you him thoroughly.")     
)


# Wrap every step in content_transformer() so that tm_map() modifies the
# content of each document and keeps the document object itself.
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(stripWhitespace))
  corpus <- tm_map(corpus, content_transformer(removePunctuation))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, content_transformer(lemmatize_strings))
  return(corpus)
}

# Drop rows with a repeated doc_id (the column is named doc_id, not ID).
df <- subset(df_withduplicateID, !duplicated(subset(df_withduplicateID, select = doc_id)))
# Build a VCorpus explicitly instead of calling Corpus().
Corpus <- VCorpus(DataframeSource(df), readerControl = list(reader = reader(DataframeSource(df)), language = "en"))
Corpus_clean <- clean_corpus(Corpus)
Corpus_clean_dtm <- DocumentTermMatrix(Corpus_clean)
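
As a sanity check before calling LDA(), one can confirm that no row of the matrix is all zero. The sketch below does this on a made-up toy corpus; the names `toy`, `toy_dtm`, and `fit` are illustrative, and the values of k, burnin, and iter are deliberately small placeholders, not the settings from the question.

```r
library(tm)
library(topicmodels)
library(slam)

# Toy corpus (hypothetical data) built as a VCorpus.
toy <- VCorpus(VectorSource(c(
  "Wonder matter now can estate esteem assure.",
  "Moment led family sooner cannot her window.",
  "Why resolution one motionless you him thoroughly.",
  "Fond his say old meet cold find come whom."
)))
toy <- tm_map(toy, content_transformer(tolower))
toy <- tm_map(toy, removePunctuation)
toy_dtm <- DocumentTermMatrix(toy)

# LDA() requires at least one non-zero entry in every row.
stopifnot(all(slam::row_sums(toy_dtm) > 0))

# Small placeholder settings so the sketch runs quickly.
fit <- LDA(toy_dtm, k = 2, method = "Gibbs",
           control = list(seed = 203, burnin = 100, iter = 200))

# Most likely topic for each document.
print(topics(fit))
```

If the stopifnot() check fails, the documents behind the zero rows can be listed with Docs(toy_dtm)[slam::row_sums(toy_dtm) == 0] before deciding whether to drop them.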
