r - R quanteda package: how do I filter values in a dfm_tfidf?
Problem description
I have a dfm_tfidf and want to filter out the values that fall below a certain threshold.
Code:
library("quanteda")
dfmat2 <-
  matrix(c(1, 1, 2, 1, 0, 0, 1, 1, 0, 0, 2, 3),
         byrow = TRUE, nrow = 2,
         dimnames = list(docs = c("document1", "document2"),
                         features = c("this", "is", "a", "sample",
                                      "another", "example"))) %>%
  as.dfm()
#it works
dfmat2 %>% dfm_trim(min_termfreq = 3)
#it does not work
dfm_tfidf(dfmat2) %>% dfm_trim( min_termfreq = 1)
# "Warning message: In dfm_trim.dfm(., min_termfreq = 1) : dfm has been previously weighted"
Question: how can I filter out values in a dfm_tfidf?
Thanks.
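Solution
Here is a function that does this directly on the sparse matrix, based on an absolute minimum value. Be careful, though: absolute tf-idf values are not very meaningful across different dfm objects.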
library("quanteda")
## Package version: 2.1.1
dfmat2 <-
  matrix(c(1, 1, 2, 1, 0, 0, 1, 1, 0, 0, 2, 3),
    byrow = TRUE, nrow = 2,
    dimnames = list(
      docs = c("document1", "document2"),
      features = c(
        "this", "is", "a", "sample",
        "another", "example"
      )
    )
  ) %>%
  as.dfm()
# function to trim features based on absolute minimum threshold
# operating directly on sparse matrix
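dfm_trimabs <- function(x, min) {
  # triplet form exposes one zero-based column index (@j) per stored value (@x)
  xt <- as(x, "dgTMatrix")
  # largest stored value of each feature across all documents
  maxvals <- sapply(split(xt@x, featnames(x)[xt@j + 1]), max)
  # keep only the features whose maximum reaches the threshold
  dfm_keep(x, names(maxvals)[maxvals >= min])
}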
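The rule this function applies can be sketched on an ordinary dense matrix in base R: a feature is kept when its largest value in any document is at least the threshold. The helper name `trim_abs_dense` is hypothetical, and the matrix is typed in by hand from the tf-idf weights of this example, so this is only an illustration of the logic, not a replacement for the sparse version.

```r
# tf-idf weights of the example, entered by hand as a plain dense matrix
w <- matrix(c(0, 0, 0.60206, 0.30103, 0, 0,
              0, 0, 0, 0, 0.60206, 0.90309),
            byrow = TRUE, nrow = 2,
            dimnames = list(docs = c("document1", "document2"),
                            features = c("this", "is", "a", "sample",
                                         "another", "example")))

# hypothetical helper: keep the columns whose maximum reaches `min`
trim_abs_dense <- function(m, min) {
  maxvals <- apply(m, 2, max)          # per-feature maximum over documents
  m[, maxvals >= min, drop = FALSE]    # drop = FALSE keeps the matrix shape
}

kept <- trim_abs_dense(w, min = 0.5)
colnames(kept)  # features kept: a, another, example
```

A dense matrix is fine for a toy example like this one; for a real corpus the sparse approach avoids materializing every zero.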
Now compare the example above before and after applying it:
# before trimming
dfm_tfidf(dfmat2)
## Document-feature matrix of: 2 documents, 6 features (33.3% sparse).
##            features
## docs        this is       a  sample another example
##   document1    0  0 0.60206 0.30103 0       0
##   document2    0  0 0       0       0.60206 0.90309
# after trimming
dfm_tfidf(dfmat2) %>%
dfm_trimabs(min = 0.5)
## Document-feature matrix of: 2 documents, 3 features (50.0% sparse).
##            features
## docs              a another example
##   document1 0.60206 0       0
##   document2 0       0.60206 0.90309
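To illustrate the caveat about absolute tf-idf values, here is a minimal base R sketch that recomputes the weights by hand as count × log10(N / document frequency), which reproduces the numbers shown above. The helper `tfidf_by_hand` and the extra `document3` row are assumptions made up for this illustration. Adding one document changes the weight of "a" in document1 even though its count is unchanged, so a fixed threshold does not carry over from one dfm to another.

```r
# hypothetical helper: raw count times log10(N / document frequency)
tfidf_by_hand <- function(m) {
  idf <- log10(nrow(m) / colSums(m > 0))
  sweep(m, 2, idf, `*`)
}

feats <- c("this", "is", "a", "sample", "another", "example")
counts2 <- matrix(c(1, 1, 2, 1, 0, 0,
                    1, 1, 0, 0, 2, 3),
                  byrow = TRUE, nrow = 2,
                  dimnames = list(c("document1", "document2"), feats))

# same two documents plus an invented third one containing only "this"
counts3 <- rbind(counts2, document3 = c(1, 0, 0, 0, 0, 0))

w2 <- tfidf_by_hand(counts2)
w3 <- tfidf_by_hand(counts3)

round(w2["document1", "a"], 5)  # 0.60206, matching the output above
round(w3["document1", "a"], 5)  # 0.95424: same count, different weight
```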