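r - R: Can I extract groups of words from each sentence (line) and create a data frame (or matrix)?
Problem description
To extract words from sentences, I have been building a separate vector for each word, like this: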
hello <- NULL
for (i in seq_along(text)) {
  # Extract "Hello"/"hello" from sentence i; [Hh] accepts either case
  hello[i] <- as.character(regmatches(text[i], gregexpr("[Hh]ello", text[i])))
}
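But I have more than 25 words to extract, and repeating this for each one makes for very long code. Is it possible to extract a whole set of words from the text data in one go?
The data below is just a made-up sample.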
words <- c("[Hh]ello", "you", "so", "tea", "egg")
text <- c("Hello! How's you and how did saturday go?",
          "hello, I was just texting to see if you'd decided to do anything later",
          "U dun say so early.",
          "WINNER!! As a valued network customer you have been selected",
          "Lol you're always so convincing.",
          "Did you catch the bus ? Are you frying an egg ? ",
          "Did you make a tea and egg?")
subsets <- NULL
for (i in seq_along(text)) {
  # .....???
}
The expected output is:
[1] Hello you
[2] hello you
[3] so
[4] you
[5] you so
[6] you you egg
[7] you tea egg
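Solution
In base R, you can join the words into a single alternation pattern and extract every whole-word match from each sentence: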
regmatches(text, gregexpr(sprintf("\\b(%s)\\b", paste0(words, collapse = "|")), text))
[[1]]
[1] "Hello" "you"
[[2]]
[1] "hello" "you"
[[3]]
[1] "so"
[[4]]
[1] "you"
[[5]]
[1] "you" "so"
[[6]]
[1] "you" "you" "egg"
[[7]]
[1] "you" "tea" "egg"
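The question title also asks for a data frame or matrix, which the list returned by `regmatches()` does not give directly. The following is a minimal sketch of one way to reshape it, not necessarily what the asker had in mind: a long data frame with one row per match, and a sentence-by-word count table. The names `pattern`, `matches`, `long`, and `counts` are illustrative, and lowercasing the matches so that "Hello" and "hello" share a column is an assumption.

```r
# Sample data from the question (character class written as [Hh])
words <- c("[Hh]ello", "you", "so", "tea", "egg")
text <- c("Hello! How's you and how did saturday go?",
          "hello, I was just texting to see if you'd decided to do anything later",
          "U dun say so early.",
          "WINNER!! As a valued network customer you have been selected",
          "Lol you're always so convincing.",
          "Did you catch the bus ? Are you frying an egg ? ",
          "Did you make a tea and egg?")

# One alternation pattern covering every word, anchored at word boundaries
pattern <- sprintf("\\b(%s)\\b", paste0(words, collapse = "|"))
matches <- regmatches(text, gregexpr(pattern, text))

# Long format: one row per match, tagged with the index of its sentence
long <- data.frame(
  sentence = rep(seq_along(matches), lengths(matches)),
  word     = unlist(matches),
  stringsAsFactors = FALSE
)

# Wide format: rows are sentences, columns are (lowercased) words, cells are counts.
# The factor levels keep a row for sentences that have no match at all.
counts <- table(factor(long$sentence, levels = seq_along(text)),
                tolower(long$word))

# A plain data frame version of the count table
counts_df <- as.data.frame.matrix(counts)
```

The long format is convenient for filtering or joining, while the count table answers "which sentence contains which word, and how often" at a glance.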
Or, if you would rather have a single string per sentence:
trimws(gsub(sprintf(".*?\\b(%s)\\b.*?|.*$", paste0(words, collapse = "|")), "\\1 ", text))
[1] "Hello you" "hello you" "so" "you" "you so" "you you egg"
[7] "you tea egg"
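If the `gsub()` pattern feels hard to read, the same one-string-per-sentence result can probably be reached by collapsing each element of the `regmatches()` list instead. This is a sketch under that assumption, not the answer's original method. The name `subsets` is borrowed from the question, and the last sentence in `text` is a hypothetical addition to show what happens when nothing matches.

```r
words <- c("[Hh]ello", "you", "so", "tea", "egg")
text <- c("Hello! How's you and how did saturday go?",
          "U dun say so early.",
          "Did you make a tea and egg?",
          "No keywords here.")   # hypothetical sentence with no match

# Whole-word alternation pattern built from the word list
pattern <- sprintf("\\b(%s)\\b", paste0(words, collapse = "|"))
matches <- regmatches(text, gregexpr(pattern, text))

# Collapse the matches of each sentence into one space-separated string;
# a sentence without matches yields an empty string
subsets <- vapply(matches, paste, character(1), collapse = " ")
subsets
```

Because `vapply()` is told to expect exactly one string per sentence, `subsets` always has the same length as `text`, so it can be added as a column with `data.frame(text, subsets)`.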