How to group comma-separated variables in the same column?

Question

Here is my fake data:

#> id   column                 
#> 1    blue, red, dog, cat
#> 2    red, blue, dog
#> 3    blue      
#> 4    red
#> 5    dog, cat   
#> 6    cat
#> 7    red, cat
#> 8    dog
#> 9    cat, red
#> 10   blue, cat

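For example, I want to tell R that dog and cat = animal, and that red and blue = colour. Basically, I want to count the number of rows that contain only animals, only colours, or both (and eventually get the percentages).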

#> id   column                 newcolumn
#> 1    blue, red, dog, cat    both
#> 2    red, blue, dog         both
#> 3    blue                   colour
#> 4    red                    colour
#> 5    dog, cat               animal
#> 6    cat                    animal
#> 7    red, cat               both
#> 8    dog                    animal
#> 9    cat, red               both
#> 10   blue, cat              both

So far, I have only been able to total the number of occurrences of red, blue, dog and cat by doing the following:

column.string<-paste(df$column, collapse=",")
column.vector<-strsplit(column.string, ",")[[1]]
column.vector.clean<-gsub(" ", "", column.vector)
table(column.vector.clean)

Thank you very much for your help. Here is my sample fake data:

df <- data.frame(id = c(1:10), 
                 column = c("blue, red, dog, cat", "red, blue, dog", "blue", "red", "dog, cat", "cat", "red, cat", "dog", "cat, red", "blue, cat"))

Tags: r

Answer


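You can define all the possible animals and colours in two vectors, then split column on commas and test each row. A row whose items are all in animal is labelled 'animal', a row whose items are all in colour is labelled 'colour', and anything else is labelled 'both':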

animal <- c('dog', 'cat')
colour <- c('red', 'blue')

df$newcolumn <- sapply(strsplit(df$column, ',\\s*'), function(x) {
                 x <- x[x != "NA"]
                 if(!length(x)) return(NA)
                 if(all(x %in% animal)) 'animal'
                 else if(all(x %in% colour)) 'colour'
                 else 'both'
                 })

df
#   id              column newcolumn
#1   1 blue, red, dog, cat      both
#2   2      red, blue, dog      both
#3   3                blue    colour
#4   4                 red    colour
#5   5            dog, cat    animal
#6   6                 cat    animal
#7   7            red, cat      both
#8   8                 dog    animal
#9   9            cat, red      both
#10 10           blue, cat      both
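One thing to keep in mind with the else 'both' branch above: any item that appears in neither vector (a typo such as "bird", say) would also end up labelled 'both'. If that matters for your data, a stricter variant could check membership explicitly. The following is only a sketch; the helper name classify_strict and the 'unknown' label are assumptions made for illustration, not part of the original answer.

```r
animal <- c("dog", "cat")
colour <- c("red", "blue")

# Hypothetical helper: classify a single comma-separated string.
classify_strict <- function(s) {
  x <- trimws(strsplit(s, ",")[[1]])   # split on commas, drop surrounding spaces
  x <- x[nzchar(x)]                    # drop empty pieces
  if (!length(x)) return(NA_character_)
  # Assumption: anything outside both vectors is flagged rather than called 'both'
  if (!all(x %in% c(animal, colour))) return("unknown")
  has_animal <- any(x %in% animal)
  has_colour <- any(x %in% colour)
  if (has_animal && has_colour) "both"
  else if (has_animal) "animal"
  else "colour"
}

labels <- vapply(c("dog, cat", "red", "blue, cat", "dog, bird"),
                 classify_strict, character(1), USE.NAMES = FALSE)
labels
# "animal" "colour" "both" "unknown"
```

On the sample data in the question this should give the same labels as the sapply version, since every item there is one of dog, cat, red or blue.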

To calculate the proportions, you can use prop.table together with table:

prop.table(table(df$newcolumn, useNA = "ifany"))

#animal   both colour 
#   0.3    0.5    0.2 
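Since the question asks for both counts and percentages, the two can be collected in one small summary table. This is a minimal base R sketch that rebuilds the classification so it runs on its own; the names classify and summary_df are assumptions introduced here for illustration.

```r
# Sample data from the question
df <- data.frame(
  id = 1:10,
  column = c("blue, red, dog, cat", "red, blue, dog", "blue", "red",
             "dog, cat", "cat", "red, cat", "dog", "cat, red", "blue, cat"),
  stringsAsFactors = FALSE
)
animal <- c("dog", "cat")
colour <- c("red", "blue")

# Hypothetical helper applying the same rules as the answer
classify <- function(x) {
  if (all(x %in% animal)) "animal"
  else if (all(x %in% colour)) "colour"
  else "both"
}
df$newcolumn <- vapply(strsplit(df$column, ",\\s*"), classify, character(1))

counts <- table(df$newcolumn)                      # number of rows per label
percentages <- round(100 * prop.table(counts), 1)  # share of rows, in percent

summary_df <- data.frame(
  newcolumn = names(counts),
  n = as.vector(counts),
  percent = as.vector(percentages)
)
summary_df
#   newcolumn n percent
# 1    animal 3      30
# 2      both 5      50
# 3    colour 2      20
```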

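Using dplyr (with separate_rows from tidyr), we can split column into one row per item, create newcolumn for each id based on the same conditions, and then calculate the proportions. The animal and colour vectors defined above are reused here.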

library(dplyr)

df %>%
  tidyr::separate_rows(column, sep = ',\\s*') %>%
  group_by(id) %>%
  summarise(newcolumn = case_when(all(column %in% animal) ~ 'animal', 
                                  all(column %in% colour) ~ 'colour', 
                                  TRUE ~ 'both'),
            column = toString(column)) %>%
  count(newcolumn) %>%
  mutate(n = n/sum(n))
