r - Using a specific list of stopwords in quanteda
Problem description
I want to remove a specific list of stopwords using quanteda.
This is what I use:
df <- data.frame(data = c("Here is an example text and why I write it", "I can explain and here you but I can help as I would like to help"))
mystopwords <- c("is","an")
corpus<- dfm(tokens_remove(tokens(df$data, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE), remove = c(stopwords(language = "el", source = "misc"), mystopwords), ngrams = c(4,6)))
But I get this error:
> Error in tokens_select(x, ..., selection = "remove") :
unused arguments (remove = c(stopwords(language = "en", source = "misc"), stopwords1), ngrams = c(4, 6))
What is the correct way to use the mystopwords list in quanteda?
Solution
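Building on @phiver's answer, this is the standard way to remove specific tokens in quanteda. The error occurs because tokens_remove() has no remove argument (nor an ngrams argument); the tokens to drop are passed through pattern. There is no need to call stopwords(), because you are supplying the vector of tokens to remove yourself and the pattern argument accepts a vector; use valuetype = 'fixed' instead.
I use dplyr to make the code more readable, but you don't have to.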
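library(quanteda)
library(dplyr)  # loaded only for the %>% pipe

df <- data.frame(data = c("Here is an example text and why I write it",
                          "I can explain and here you but I can help as I would like to help"),
                 stringsAsFactors = FALSE)
mystopwords <- c("is", "an")

corpus <-
  tokens(df$data,
         remove_punct = TRUE,
         remove_numbers = TRUE,
         remove_symbols = TRUE) %>%
  # drop the custom stopwords, matched as literal strings
  tokens_remove(pattern = mystopwords,
                valuetype = 'fixed') %>%
  # build 4-grams and 6-grams on the tokens; recent quanteda releases
  # no longer accept an ngrams argument in dfm()
  tokens_ngrams(n = c(4, 6)) %>%
  dfm()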
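The question's original attempt also tried to combine a built-in stopword list with the custom one. Since both are plain character vectors, they can be concatenated and passed to pattern in a single call. The following is a minimal sketch under a few assumptions: it uses the English list returned by stopwords("en") purely for illustration (the question used a different language and source), it skips the n-gram step because few tokens survive the English list in these short sentences, and the names txt, all_stopwords, toks and result are illustrative rather than taken from the original answer.

```r
library(quanteda)

txt <- c("Here is an example text and why I write it",
         "I can explain and here you but I can help as I would like to help")
mystopwords <- c("is", "an")

# Assumption for illustration: the English built-in list; swap in the
# language/source you actually need.
all_stopwords <- c(stopwords("en"), mystopwords)

toks <- tokens(txt,
               remove_punct = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE)

# One call removes both the built-in and the custom stopwords
toks <- tokens_remove(toks, pattern = all_stopwords, valuetype = "fixed")

# Inspect the surviving tokens before building the document-feature matrix
print(toks)

result <- dfm(toks)
```

Printing the tokens before calling dfm() is a cheap way to confirm that the removal did what you expected.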