r - Stemming with the `hunspell` dictionary
Problem description
From "Stemming Words" I took the following custom stemming function:
stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]
  if (length(stems) == 0) { # if there are no stems, use the original term
    stem <- term
  } else { # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }
  stem
}
It uses the hunspell dictionary for stemming (via the corpus package).
I tried this function on the following sentences.
sentences<-c("We're taking proactive steps to tackle ...",
"A number of measures we are taking to support ...",
"We caught him committing an indecent act.")
Then I ran the following:
library(qdap)
library(tm)
sentences <- iconv(sentences, "latin1", "ASCII", sub="")
sentences <- gsub('http\\S+\\s*', '', sentences)
sentences <- bracketX(sentences,bracket='all')
sentences <- gsub("[[:punct:]]", "",sentences)
sentences <- removeNumbers(sentences)
sentences <- tolower(sentences)
# Stemming
library(corpus)
stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]
  if (length(stems) == 0) { # if there are no stems, use the original term
    stem <- term
  } else { # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }
  stem
}
sentences=text_tokens(sentences, stemmer = stem_hunspell)
sentences = lapply(sentences, removeWords, stopwords('en'))
sentences = lapply(sentences, stripWhitespace)
I cannot explain the results:
[[1]]
[1] "" "taking" "active" "step" "" "tackle"
[[2]]
[1] "" "numb" "" "measure" "" "" "taking" ""
[9] "support"
[[3]]
[1] "" "caught" "" "committing" "" "decent"
[7] "act"
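For example, why do "commit" and "take" still appear in their ing-form? And why does "number" become "numb"?
Solution
I think the answer is mainly that this is simply how hunspell stems. We can check this with a simpler example: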
hunspell::hunspell_stem("taking")
#> [[1]]
#> [1] "taking"
hunspell::hunspell_stem("committing")
#> [[1]]
#> [1] "committing"
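The ing-form is the only option hunspell offers. That does not make much sense to me either, and my suggestion would be to use a different stemmer. While we are at it, I think you could also benefit from switching to quanteda instead of tm: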
library(quanteda)
sentences <- c("We're taking proactive steps to tackle ...",
"A number of measures we are taking to support ...",
"We caught him committing an indecent act.")
tokens(sentences, remove_numbers = TRUE) %>%
tokens_tolower() %>%
tokens_wordstem()
#> Tokens consisting of 3 documents.
#> text1 :
#> [1] "we'r" "take" "proactiv" "step" "to" "tackl" "."
#> [8] "." "."
#>
#> text2 :
#> [1] "a" "number" "of" "measur" "we" "are" "take"
#> [8] "to" "support" "." "." "."
#>
#> text3 :
#> [1] "we" "caught" "him" "commit" "an" "indec" "act" "."
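In my opinion the workflow is much cleaner, and the results make more sense to me. quanteda uses the SnowballC package for stemming here, which you could integrate into your tm workflow if you want. A tokens object holds the texts in the same order as the input object, but tokenized (i.e., split into words).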
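As a rough sketch of what calling SnowballC directly might look like (for instance on words coming out of a tm pipeline), assuming the SnowballC package is installed. The `words` vector is made up for illustration and is not part of the original workflow:

```r
library(SnowballC)

# Illustrative input only: a few words taken from the example sentences
words <- c("taking", "committing", "measures")

# wordStem() applies the Snowball stemmer for the chosen language
stemmed <- wordStem(words, language = "english")
stemmed
```

The stems should match what tokens_wordstem() produced above ("take", "commit", "measur"), since quanteda relies on the same stemmer.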
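If you still want to use hunspell, you can do so with the following function, which clears up some of the problems you seem to have run into ("number" now comes out correctly):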
stem_hunspell <- function(toks) {
  # look up each type (unique token) in the dictionary and keep the first stem
  stems <- vapply(hunspell::hunspell_stem(types(toks)), "[", 1, FUN.VALUE = character(1))
  # if there is no stem (NA or empty string), use the original term
  no_stem <- is.na(stems) | nchar(stems) == 0
  stems[no_stem] <- types(toks)[no_stem]
  tokens_replace(toks, types(toks), stems, valuetype = "fixed")
}
tokens(sentences, remove_numbers = TRUE) %>%
tokens_tolower() %>%
stem_hunspell()
#> Tokens consisting of 3 documents.
#> text1 :
#> [1] "we're" "taking" "active" "step" "to" "tackle" "." "."
#> [9] "."
#>
#> text2 :
#> [1] "a" "number" "of" "measure" "we" "are" "taking"
#> [8] "to" "support" "." "." "."
#>
#> text3 :
#> [1] "we" "caught" "him" "committing" "an"
#> [6] "decent" "act" "."
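The difference between the two hunspell-based functions appears to come down to which candidate stem is kept when the dictionary offers several: the function in the question keeps the last one, the one here keeps the first. The sketch below illustrates that choice in base R only. The `candidates` list is hypothetical and merely stands in for hunspell::hunspell_stem() output (the "number"/"numb" pair is inferred from the results shown above, not verified against a dictionary), and `pick_stem` is a helper name invented for this example:

```r
# Hypothetical candidate stems, standing in for hunspell::hunspell_stem() output
candidates <- list(
  number = c("number", "numb"),  # assumed: two candidates, "numb" listed last
  taking = "taking",             # a single candidate: the ing-form itself
  xyz    = character(0)          # no candidate at all
)

# Keep either the first or the last candidate; fall back to the term itself
pick_stem <- function(term, stems, pick = c("first", "last")) {
  pick <- match.arg(pick)
  if (length(stems) == 0) return(term)
  if (pick == "first") stems[[1]] else stems[[length(stems)]]
}

first <- mapply(pick_stem, names(candidates), candidates,
                MoreArgs = list(pick = "first"))
last  <- mapply(pick_stem, names(candidates), candidates,
                MoreArgs = list(pick = "last"))

first
last
```

Under these assumptions, keeping the first candidate leaves "number" intact, while keeping the last one turns it into "numb". Neither choice helps with "taking", because the ing-form is the only candidate on offer.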