R: remove stop words from text without tokenizing the data or converting it to a list

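Question

I need to remove stop words from text without tokenizing the object or converting it to a list. I get an error when I use the rm_stopwords function. Can anyone help?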

test<- data.frame(words = c("hello there, everyone", "the most amazing planet"), id = 1:2)
test$words <- rm_stopwords(test$words, tm::stopwords("english"), separate = F, unlist = T)
#Error in `$<-.data.frame`(`*tmp*`, words, value = c("hello", "everyone",  : 
  #replacement has 4 rows, data has 2

#I want something like this, where the stopwords are removed but the rest of the formatting remains intact (e.g. punctuation) 

#                words     id
#1    hello  , everyone     1
#2    amazing planet        2

Tags: r, text, tidyverse, tidyr, stop-words

Solution


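Judging by the error message, the assignment fails because, with unlist = T, rm_stopwords returns one flat vector of the four remaining words, which cannot be assigned to a two-row column. Try the approach below, which produces output close to what you want. You can use tidytext functions to split the text into tokens, filter out the stop words, and then paste the remaining tokens back together per id into a data frame close to the one you expect. Here is the code: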

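library(tidytext)
library(tidyverse)
# Data (keep words as character, not factor)
test <- data.frame(words = c("hello there, everyone", "the most amazing planet"),
                   id = 1:2, stringsAsFactors = FALSE)
# Unnest into one token per row, keeping punctuation as separate tokens
l1 <- test %>% unnest_tokens(word, words, strip_punct = FALSE)
# Stop-word vector (requires the tm package to be installed)
vec <- tm::stopwords("english")
# Drop the tokens that are stop words
l1 <- l1[!(l1$word %in% vec), ]
# Re-aggregate by id into one string per row
l2 <- l1 %>% group_by(id) %>% summarise(text = paste(word, collapse = " "))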

Output:

# A tibble: 2 x 2
     id text            
  <int> <chr>           
1     1 hello , everyone
2     2 amazing planet  
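
If you would rather not tokenize at all, a regex-based route may also work. The following is only a minimal sketch and not part of the original answer: remove_stopwords is a hypothetical helper written for illustration, and stop_vec is a small hand-picked list standing in for tm::stopwords("english"). It assumes the stop words contain no regex metacharacters, and it uses only base R.

```r
# Hypothetical helper (an assumption, not from the original answer):
# delete stop words in place with one regex, leaving punctuation untouched.
remove_stopwords <- function(x, stopwords) {
  # Put longer entries first so that e.g. "i'm" is tried before "i"
  stopwords <- stopwords[order(nchar(stopwords), decreasing = TRUE)]
  # One alternation wrapped in word boundaries
  pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
  out <- gsub(pattern, "", x, ignore.case = TRUE, perl = TRUE)
  # Collapse the runs of spaces left behind and trim both ends
  trimws(gsub("\\s+", " ", out))
}

test <- data.frame(words = c("hello there, everyone", "the most amazing planet"),
                   id = 1:2, stringsAsFactors = FALSE)
# Small illustrative stop-word list (assumption); swap in your own vector
stop_vec <- c("there", "the", "most")
test$words <- remove_stopwords(test$words, stop_vec)
print(test)
#              words id
# 1 hello , everyone  1
# 2   amazing planet  2
```

Because the function returns one string per input element, the result can be assigned straight back to the column, so the row count never changes.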
