How can I remove zero entries from a DFM when the matrix is too large for ordinary operations?

Problem description

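I have the following problem: I convert a corpus into a dfm, and this dfm has some zero entries (documents with no remaining features) that I need to remove before fitting an LDA model. I usually do it like this: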

OutDfm <- dfm_trim(
  dfm(corpus,
      tolower = TRUE,
      remove = c(stopwords("english"), stopwords("german"),
                 stopwords("french"), stopwords("italian")),
      remove_punct = TRUE,
      remove_numbers = TRUE,
      remove_separators = TRUE,
      stem = TRUE,
      verbose = TRUE),
  min_docfreq = 5)

Creating a dfm from a corpus input...
   ... lowercasing
   ... found 272,912 documents, 112,588 features
   ... removed 613 features
   ... stemming features (English)
, trimmed 27491 feature variants
   ... created a 272,912 x 84,515 sparse dfm
   ... complete. 
Elapsed time: 78.7 seconds.


# remove zero-entries
raw.sum=apply(OutDfm,1,FUN=sum)
which(raw.sum == 0)
OutDfm = OutDfm[raw.sum!=0,]

However, when I try to run the last operation, I get:

Error in asMethod(object) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

which suggests that the matrix is too large to operate on.

Has anyone run into this problem and solved it? Is there an alternative strategy for removing the zero entries?

Many thanks!

Tags: r, dataframe, lda, quanteda

Solution


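Your call to apply with sum converts the dfm from a sparse matrix into a dense matrix in order to compute the row sums. With 272,912 documents and 84,515 features, that dense matrix is far too large, which is what triggers the Cholmod error.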
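To make the size problem concrete, here is a rough back-of-the-envelope calculation in R, using the dimensions from the log in the question. It assumes a dense numeric matrix at 8 bytes per cell. The comparison with the 32-bit integer limit is only meant to illustrate the scale, not to pin down the exact check CHOLMOD performs.

```r
# Dimensions reported in the dfm creation log of the question
n_docs  <- 272912
n_feats <- 84515

# Number of cells a dense copy would need (doubles, so no integer overflow)
n_cells <- n_docs * n_feats

# Largest value representable by R's 32-bit integer type, for comparison
int_max <- .Machine$integer.max

# Approximate memory for a dense numeric matrix, assuming 8 bytes per cell
dense_gb <- n_cells * 8 / 1e9

cat("cells:", format(n_cells, big.mark = ","), "\n")
cat("exceeds 32-bit integer range:", n_cells > int_max, "\n")
cat("approx. dense size in GB:", round(dense_gb, 1), "\n")
```

That is roughly 2.3e10 cells, or about 185 GB as a dense matrix, so any approach that densifies the dfm is not viable here.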

Either use slam::row_sums, since the slam functions work on sparse matrices, or, better still, just use quanteda::dfm_subset to select all documents that have more than 0 tokens.

dfm_subset(OutDfm, ntoken(OutDfm) > 0)
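If you prefer to stay close to the original row-sum approach, a third option (not mentioned above, so treat it as a substitute technique) is Matrix::rowSums, which has methods for sparse matrices. The sketch below uses a small hypothetical sparse matrix built with the Matrix package rather than a real dfm. It assumes the same pattern carries over to a dfm because quanteda's dfm class is built on Matrix's sparse classes.

```r
library(Matrix)

# Hypothetical 4 x 3 document-feature counts; document "d3" has no tokens
m <- sparseMatrix(
  i = c(1, 1, 2, 4),
  j = c(1, 3, 2, 1),
  x = c(2, 1, 5, 3),
  dims = c(4, 3),
  dimnames = list(paste0("d", 1:4), c("f1", "f2", "f3"))
)

# Matrix::rowSums works on the sparse representation directly,
# so no dense copy of the matrix is created
row_totals <- Matrix::rowSums(m)

# Keep only rows with at least one non-zero entry; the result stays sparse
keep <- as.vector(row_totals > 0)
kept <- m[keep, , drop = FALSE]

print(rownames(kept))
print(is(kept, "sparseMatrix"))
```

For an actual dfm, dfm_subset remains the simpler choice, since it also keeps the docvars aligned with the remaining documents.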

An example showing how this works, using ntoken(x) > 5000 as the condition:

library(quanteda)
x <- corpus(data_corpus_inaugural)
x <- dfm(x)
x
Document-feature matrix of: 58 documents, 9,360 features (91.8% sparse) and 4 docvars.
                 features
docs              fellow-citizens  of the senate and house representatives : among vicissitudes
  1789-Washington               1  71 116      1  48     2               2 1     1            1

# subset based on amount of tokens.
dfm_subset(x, ntoken(x) > 5000)
Document-feature matrix of: 3 documents, 9,360 features (84.1% sparse) and 4 docvars.
               features
docs            fellow-citizens  of the senate and house representatives : among vicissitudes
  1841-Harrison              11 604 829      5 231     1               4 1     3            0
