Specific list of stopwords in quanteda

Problem description

I want to use quanteda to remove a specific list of stopwords.

I am using this:

df <- data.frame(data = c("Here is an example text and why I write it", "I can explain and here you but I can help as I would like to help"))
mystopwords <- c("is","an")
corpus<- dfm(tokens_remove(tokens(df$data, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE), remove = c(stopwords(language = "el", source = "misc"), mystopwords), ngrams = c(4,6)))

But I get this error:

> Error in tokens_select(x, ..., selection = "remove") : 
  unused arguments (remove = c(stopwords(language = "en", source = "misc"), stopwords1), ngrams = c(4, 6))

What is the correct way to use the mystopwords list in quanteda?

Tags: r, quanteda

Solution


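Building on @phiver's answer, this is the standard way to remove specific tokens in quanteda. The error occurs because tokens_remove() has no remove or ngrams argument; the tokens to drop are passed through pattern. There is no need to use stopwords() here, since you are supplying the vector of tokens to remove yourself and the pattern argument accepts a vector. Use valuetype = 'fixed' so the entries are matched literally.

I used dplyr (for the %>% pipe) to make the code more readable, but you don't have to.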

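library(quanteda)
library(dplyr)  # only needed for the %>% pipe

df <- data.frame(data = c("Here is an example text and why I write it", 
                          "I can explain and here you but I can help as I would like to help"),
                 stringsAsFactors = FALSE)

# Custom stopwords to remove
mystopwords <- c("is","an")

corpus <- 
  tokens(df$data,
         remove_punct = TRUE, 
         remove_numbers = TRUE, 
         remove_symbols = TRUE) %>%
  # Drop the custom stopwords, matched as literal strings
  tokens_remove(pattern = mystopwords,
                valuetype = 'fixed') %>%
  # Build 4-grams and 6-grams with tokens_ngrams(); newer quanteda
  # releases do not accept an ngrams argument in dfm()
  tokens_ngrams(n = c(4, 6)) %>%
  dfm()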
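
If the goal is to remove a built-in stopword list together with the custom words, as the call in the question attempts, the two vectors can be concatenated and passed to pattern. This is a minimal sketch, assuming the default English list returned by stopwords("en") stands in for whatever language and source you actually need. The names txt, all_stopwords, toks_clean and dfm_clean are illustrative only and do not come from the original answer.

```r
library(quanteda)

# Same example texts as in the question
txt <- c("Here is an example text and why I write it",
         "I can explain and here you but I can help as I would like to help")

# Custom stopwords from the question
mystopwords <- c("is", "an")

# Assumption: the default English list; swap in the language/source you need
all_stopwords <- c(stopwords("en"), mystopwords)

toks <- tokens(txt,
               remove_punct = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE)

# Remove every entry of the combined vector, matched as literal strings
toks_clean <- tokens_remove(toks, pattern = all_stopwords, valuetype = "fixed")

# Document-feature matrix of the remaining tokens
dfm_clean <- dfm(toks_clean)
print(featnames(dfm_clean))
```

Because pattern takes a plain character vector, any further words can be appended with c() in the same way.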
