All co-occurrences of the words in a string in an R tibble (co-occurrences, bigrams, dplyr)

Problem description

I have a data frame in this format:

df <- data.frame(names = c('perform data cleansing', 'information categorisation', 'write batch record documentation'))
                             names
1           perform data cleansing
2       information categorisation
3 write batch record documentation

I would like to get this, with every co-occurrence of two words listed:

                             names           tokens1              tokens2
1           perform data cleansing           perform                 data
1           perform data cleansing              data            cleansing 
1           perform data cleansing         cleansing              perform
2       information categorisation       information       categorisation
3 write batch record documentation             write                batch
3 write batch record documentation             write               record
3 write batch record documentation             write        documentation 
3 write batch record documentation             batch               record 
3 write batch record documentation             batch        documentation 
3 write batch record documentation            record        documentation 

So, for a string of n words, you get n x (n-1) / 2 co-occurrences.
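As a quick check of that count, here is a minimal base R sketch. The helper name `count_pairs` is hypothetical (it is not part of the original post), and it assumes words are separated by single spaces.

```r
# Hypothetical helper: number of unordered word pairs in a string,
# assuming words are separated by single spaces.
count_pairs <- function(s) {
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  n <- length(words)
  n * (n - 1) / 2
}

count_pairs("perform data cleansing")            # 3 words
count_pairs("write batch record documentation")  # 4 words

# combn() enumerates the same pairs, one per column of the result,
# so its column count should agree with the formula.
ncol(combn(c("write", "batch", "record", "documentation"), 2))
```

This matches the desired output above: 3 rows for the first string, 1 for the second, and 6 for the third.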

Tags: r, dataframe, dplyr, nlp

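Solution

We can split 'names' on spaces, loop over the elements of the resulting list with map, use combn to get the combinations of two words at a time, and then unnest the list column.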

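library(tidyverse)
df %>%
   # split each name into words, then build every 2-word combination
   mutate(tokens = strsplit(names, " ") %>%
                     map(~ combn(.x, m = 2, simplify = FALSE))) %>%
   # one row per combination; 'tokens' stays a list-column of word pairs
   unnest(cols = tokens)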
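To see what the inner step does for a single string, the following base R sketch applies strsplit and combn to one name only. The variable names `words` and `pairs` are illustrative.

```r
# Split one name into words, then enumerate all 2-word combinations.
words <- strsplit("perform data cleansing", " ")[[1]]
pairs <- combn(words, m = 2, simplify = FALSE)

# 'pairs' is a list with one length-2 character vector per combination.
length(pairs)
pairs[[1]]
str(pairs)
```

With simplify = FALSE, combn returns a list rather than a matrix, which is why the pipeline above ends up with a list-column that has to be unnested.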

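If we need two separate 'tokens' columns, we paste each word combination from combn together, unnest, and then separate 'tokens' into two columns by splitting at the delimiter that was used to paste the words together.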

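df %>%
    mutate(tokens = strsplit(names, " ") %>%
                      map(~ combn(.x, m = 2,
                                  FUN = function(x) paste(x[1], x[2], sep = "-"),
                                  simplify = FALSE))) %>%
    unnest(cols = tokens) %>%   # one row per pasted pair (still a list-column)
    unnest(cols = tokens) %>%   # flatten the list-column to character
    # split at the same "-" delimiter that paste() used
    separate(tokens, into = c('tokens1', 'tokens2'), sep = "-")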
#                               names     tokens1        tokens2
#1            perform data cleansing     perform           data
#2            perform data cleansing     perform      cleansing
#3            perform data cleansing        data      cleansing
#4        information categorisation information categorisation
#5  write batch record documentation       write          batch
#6  write batch record documentation       write         record
#7  write batch record documentation       write  documentation
#8  write batch record documentation       batch         record
#9  write batch record documentation       batch  documentation
#10 write batch record documentation      record  documentation
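The paste-then-separate step relies on "-" never appearing inside a word. As an alternative sketch (not the answer's method), the same table can be built in base R directly from the matrix that combn returns by default. The helper name `pairs_for` is hypothetical, and the sketch assumes every name contains at least two space-separated words, since combn raises an error otherwise.

```r
df <- data.frame(names = c("perform data cleansing",
                           "information categorisation",
                           "write batch record documentation"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: all word pairs of one string as a data frame.
pairs_for <- function(s) {
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  m <- combn(words, 2)  # character matrix with 2 rows, one column per pair
  data.frame(names = s, tokens1 = m[1, ], tokens2 = m[2, ],
             stringsAsFactors = FALSE)
}

# Build the pairs for every name and stack them row-wise.
result <- do.call(rbind, lapply(df$names, pairs_for))
result
```

Because the two words are never glued into one string, no delimiter has to be chosen or split on afterwards.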

Data

df <- structure(list(names = c("perform data cleansing", 
   "information categorisation", 
 "write batch record documentation")), class = "data.frame",
  row.names = c("1", "2", "3"))
