r - DocumentTermMatrix returns 0 for every term when a dictionary is applied
Problem description
I have a corpus of 600 text files. I want to extract every number that follows the term mim, then create a document term matrix to find the frequency of each number per file. The code below extracts all the wanted terms, but the document term matrix contains only zeros. My corpus is a simple text-file corpus that contains nothing but text. This is my code:
library("tm")
library("stringr")
mim<-stringr::str_extract_all(DBcorp,"(mim)[[:blank:]]*[[:digit:]]+")
#extract numbers
mim<-stringr::str_extract_all(mim,"[[:digit:]]+")
#set the result as list + delete duplicated extracted terms
mim<-unique(unlist(mim[[1]]))
mim
[1] "608106" "606843" "103600" "231550"
class(mim)
[1] "character"
#document term matrix
dtm_mim <- DocumentTermMatrix(DBcorp, control=list(dictionary=mim))
# turn document term matrix into data.frame
df_mim <- data.frame(DOC = dtm_mim$dimnames$Docs, as.matrix(dtm_mim), row.names = NULL , check.names = FALSE)
df_mim
608106 606843 103600 231550
1.txt 0 0 0 0
2.txt 0 0 0 0
3.txt 0 0 0 0
This is a sample of my data. When I build the corpus this way, it works well:
docs = c(doc1 = "mim 608106 letters 123 mim 606843 letters 1 letters 123456789 ",
doc2 = "letters letters 1 mim 231550 123 letters",
doc3 = "mim 103600 letters 123456")
docs<-Corpus(VectorSource(docs))
But when I put each document in a separate text file and read the files into a corpus, the extraction fails:
DBcorp <- VCorpus(DirSource("C:/Users/Desktop/files"))
> DBcorp
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 154
Solution
Try the code below. If you want to apply a function to a tm corpus, it is best to use lapply (or tm_map). The resulting dtm contains only the terms that appear in mim.
# note the use of simplify = TRUE. This makes sure you do not get a warning in the line after this one.
mim <- lapply(DBcorp, stringr::str_extract_all, "(mim)[[:blank:]]*[[:digit:]]+", simplify = TRUE)
mim <- lapply(mim, stringr::str_extract_all, "[[:digit:]]+")
mim <- unique(unlist(mim))
dtm_mim <- DocumentTermMatrix(DBcorp, control = list(dictionary = mim))
df_mim <- data.frame(DOC = dtm_mim$dimnames$Docs, as.matrix(dtm_mim), row.names = NULL , check.names = FALSE)
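To sanity-check the counts independently of tm, the same extraction and per-document tally can be sketched in base R. This is only a cross-check on the sample strings from the question, not a replacement for the dtm approach above; the names pattern, ids, dictionary and counts are illustrative and do not come from the original code.

```r
# Sample documents copied from the question; with real data these strings
# would be the contents of the text files.
docs <- c(
  doc1 = "mim 608106 letters 123 mim 606843 letters 1 letters 123456789 ",
  doc2 = "letters letters 1 mim 231550 123 letters",
  doc3 = "mim 103600 letters 123456"
)

# Find every "mim <digits>" occurrence per document, then strip the "mim" prefix.
pattern <- "mim[[:blank:]]*[[:digit:]]+"
hits <- regmatches(docs, gregexpr(pattern, docs))
ids <- lapply(hits, function(x) sub("^mim[[:blank:]]*", "", x))
names(ids) <- names(docs)

# Dictionary of unique identifiers across all documents.
dictionary <- unique(unlist(ids))

# Tally each identifier per document: rows are documents, columns are identifiers.
counts <- t(vapply(
  ids,
  function(x) as.integer(table(factor(x, levels = dictionary))),
  integer(length(dictionary))
))
colnames(counts) <- dictionary
print(counts)
```

If the tm-based data frame still shows zeros where this tally shows non-zero counts for the same text, the problem is more likely in how the corpus is read or tokenized than in the regular expression.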