Stop words remaining in corpus after cleaning

Question

I am trying to remove the stop word "the" from my corpus, but not all instances are being removed.

library(RCurl)
library(tm)

url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/1.txt"
file1 <- getURL(url)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/2.txt"
file2 <- getURL(url)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/3.txt"
file3 <- getURL(url)


shakespeare <- VCorpus(VectorSource(c(file1,file2,file3)))

list<-inspect(
  DocumentTermMatrix(shakespeare,list(dictionary = c("the","thee")))
)
shakespeare <- tm_map(shakespeare, stripWhitespace)
shakespeare <- tm_map(shakespeare, stemDocument)
shakespeare <- tm_map(shakespeare, removePunctuation)
tm_map(shakespeare, content_transformer(tolower))
#taken directly from tm documentation
shakespeare <- tm_map(shakespeare, removeWords, c(stopwords("english"),"the"))
list<-inspect(
  DocumentTermMatrix(shakespeare,list(dictionary = c("the","thee")))
)

The first inspect call shows:

    Terms
Docs   the thee
   1 11665  752
   2 11198  660
   3  4866  382

The second, after cleaning:

    Terms
Docs  the thee
   1 1916 1298
   2 1711 1140
   3  760  740

What am I missing about the logic of removeWords that makes it skip all of these instances of "the"?

Edit

By slightly changing the call and making the removeWords call the first cleaning step, I was able to get the instances of "the" down below 1000:

shakespeare <- tm_map(shakespeare, removeWords, c(stopwords("english"),"the","The"))

This gives me:

Docs the thee
   1 145  752
   2 130  660
   3  71  382

Still, I would like to know why I cannot seem to eliminate all of them.

Tags: r

Answer


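Here is reproducible code that results in 0 instances of "the". I fixed the typo in your code (the result of the tolower step was never assigned back to the corpus) and used your code from before the edit.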

library(RCurl)
library(tm)
library(SnowballC)

url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/1.txt"
file1 <- getURL(url)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/2.txt"
file2 <- getURL(url)
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/3.txt"
file3 <- getURL(url)


shakespeare <- VCorpus(VectorSource(c(file1,file2,file3)))

list<-inspect(
  DocumentTermMatrix(shakespeare,list(dictionary = c("the","thee")))
)

This results in:

<<DocumentTermMatrix (documents: 3, terms: 2)>>
Non-/sparse entries: 6/0
Sparsity           : 0%
Maximal term length: 4
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs   the thee
   1 11665  752
   2 11198  660
   3  4866  382

And after cleaning, with the typo fixed:

shakespeare <- tm_map(shakespeare, stripWhitespace)
shakespeare <- tm_map(shakespeare, stemDocument)
shakespeare <- tm_map(shakespeare, removePunctuation)
shakespeare <- tm_map(shakespeare, content_transformer(tolower)) ## FIXED TYPO: the result is now assigned back to the corpus
#taken directly from tm documentation
shakespeare <- tm_map(shakespeare, removeWords, c(stopwords("english"),"the"))
list<-inspect(
  DocumentTermMatrix(shakespeare,list(dictionary = c("the","thee")))
)

This results in:

<<DocumentTermMatrix (documents: 3, terms: 2)>>
Non-/sparse entries: 3/3
Sparsity           : 50%
Maximal term length: 4
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs the thee
   1   0 1298
   2   0 1140
   3   0  740
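
The reason the one-line fix matters: word removal matches case-sensitively, so as long as the lowercasing result is thrown away, every capitalized "The" survives the removeWords step. As far as I can tell from the tm documentation, DocumentTermMatrix lowercases terms by default when counting, which would explain why those leftover "The" tokens still show up in the "the" column. The snippet below is only a minimal base R sketch of that behaviour, not tm's actual implementation. The helpers remove_words_sketch and count_the and the sample sentence are made up for illustration.

```r
# Hypothetical helper: a simplified, case-sensitive stand-in for word removal.
# It deletes whole-word matches of the given words and nothing else.
remove_words_sketch <- function(x, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

# Hypothetical helper: count whole-word "the", ignoring case
# (roughly what a lowercasing term count would report).
count_the <- function(x) {
  hits <- gregexpr("\\bthe\\b", x, ignore.case = TRUE, perl = TRUE)
  lengths(regmatches(x, hits))
}

# Made-up sample text with two capitalized and two lowercase occurrences.
text <- "The king and the queen. The end of the play."

# Lowercasing result discarded (as in the question): "The" is left behind.
no_lower <- remove_words_sketch(text, "the")

# Lowercasing result kept (as in the fixed code): nothing is left behind.
with_lower <- remove_words_sketch(tolower(text), "the")

print(count_the(no_lower))
print(count_the(with_lower))
```

Under these assumptions, the first count stays above zero because only the lowercase occurrences were removed, while the second drops to zero once the text is lowercased before removal.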
