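r - Stop words remaining in the corpus after cleaning
Problem description
I am trying to remove the stop word "the" from my corpus, but not all instances are being removed.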
library(RCurl)
library(tm)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/1.txt"
file1 <- getURL(url)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/2.txt"
file2 <- getURL(url)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/3.txt"
file3 <- getURL(url)
shakespeare <- VCorpus(VectorSource(c(file1,file2,file3)))
list<-inspect(
DocumentTermMatrix(shakespeare,list(dictionary = c("the","thee")))
)
shakespeare <- tm_map(shakespeare, stripWhitespace)
shakespeare <- tm_map(shakespeare, stemDocument)
shakespeare <- tm_map(shakespeare, removePunctuation)
tm_map(shakespeare, content_transformer(tolower))
#taken directly from tm documentation
shakespeare <- tm_map(shakespeare, removeWords, c(stopwords("english"),"the"))
list<-inspect(
DocumentTermMatrix(shakespeare,list(dictionary = c("the","thee")))
)
The first inspect call shows:
    Terms
Docs   the thee
   1 11665  752
   2 11198  660
   3  4866  382
And the second, after cleaning:
    Terms
Docs  the thee
   1 1916 1298
   2 1711 1140
   3  760  740
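What am I missing about the logic of removeWords that makes it skip all of these instances of "the"?
Edit
With a slight change to the call, and by making the removeWords call the first cleaning step, I was able to get the instances of "the" down to under 1000: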
shakespeare <- tm_map(shakespeare, removeWords, c(stopwords("english"),"the","The"))
This gives me:
Docs the thee
   1 145  752
   2 130  660
   3  71  382
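Still, I would like to know why I cannot seem to eliminate all of them.
Solution
Here is reproducible code that results in 0 instances of "the". I fixed your typo (the result of the tolower step was never assigned back to shakespeare) and used your code from before the edit.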
library(RCurl)
library(tm)
library(SnowballC)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/1.txt"
file1 <- getURL(url)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/2.txt"
file2 <- getURL(url)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/3.txt"
file3 <- getURL(url)
shakespeare <- VCorpus(VectorSource(c(file1,file2,file3)))
list<-inspect(
DocumentTermMatrix(shakespeare,list(dictionary = c("the","thee")))
)
This results in:
<<DocumentTermMatrix (documents: 3, terms: 2)>>
Non-/sparse entries: 6/0
Sparsity : 0%
Maximal term length: 4
Weighting : term frequency (tf)
Sample :
    Terms
Docs   the thee
   1 11665  752
   2 11198  660
   3  4866  382
And after cleaning, with the typo fixed:
shakespeare <- tm_map(shakespeare, stripWhitespace)
shakespeare <- tm_map(shakespeare, stemDocument)
shakespeare <- tm_map(shakespeare, removePunctuation)
shakespeare = tm_map(shakespeare, content_transformer(tolower)) ## FIXED TYPO
#taken directly from tm documentation
shakespeare <- tm_map(shakespeare, removeWords, c(stopwords("english"),"the"))
list<-inspect(
DocumentTermMatrix(shakespeare,list(dictionary = c("the","thee")))
)
Which results in:
<<DocumentTermMatrix (documents: 3, terms: 2)>>
Non-/sparse entries: 3/3
Sparsity : 50%
Maximal term length: 4
Weighting : term frequency (tf)
Sample :
    Terms
Docs the thee
   1   0 1298
   2   0 1140
   3   0  740
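To see why the missing assignment matters, here is a minimal base R sketch rather than the tm code itself. It assumes that word removal is case-sensitive and matches whole words only. The helper remove_words_sketch and the sample sentence are hypothetical, invented purely for illustration; they are not part of tm. As far as I know, DocumentTermMatrix lower-cases terms by default, which would explain why a capitalised "The" left in the corpus still shows up in the "the" column.

```r
# Hypothetical stand-in for a case-sensitive, whole-word remover.
# This is an illustration only, not tm's removeWords implementation.
remove_words_sketch <- function(x, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

# Made-up sample text, not taken from the Shakespeare files.
doc <- "The king saw the queen and thee"

# Mistake: the lower-cased copy is computed but never stored, so doc is unchanged
# and the capitalised "The" survives the removal step.
tolower(doc)
without_assignment <- remove_words_sketch(doc, "the")

# Fix: assign the lower-cased copy back before removing words.
doc <- tolower(doc)
with_assignment <- remove_words_sketch(doc, "the")

print(without_assignment)
print(with_assignment)
```

In this sketch "thee" is left alone in both cases because the pattern only matches whole words, which is consistent with "thee" still being counted after cleaning in the output above.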