首页 > 解决方案 > Create new columns based on comma-separated values in another column in R

问题描述

I have some data similar to that below.

df <- data.frame(id = 1:5, tags = c("A,B,AB,C", "C", "AB,E", NA, "B,C"))
df

#   id     tags
# 1  1 A,B,AB,C
# 2  2        C
# 3  3     AB,E
# 4  4     <NA>
# 5  5      B,C

I'd like to create a new dummy variable column for each tag in the "tags" column, resulting in a dataframe like the following:

correct_df <- data.frame(id = 1:5, 
                     tags = c("A,B,AB,C", "C", "AB,E", NA, "B,C"),
                     A = c(1, 0, 0, 0, 0),
                     B = c(1, 0, 0, 0, 1),
                     C = c(1, 1, 0, 0, 1),
                     E = c(0, 0, 1, 0, 0),
                     AB = c(1, 0, 1, 0, 0)
                    )
correct_df

#   id     tags A B C E AB
# 1  1 A,B,AB,C 1 1 1 0  1
# 2  2        C 0 0 1 0  0
# 3  3     AB,E 0 0 0 1  1
# 4  4     <NA> 0 0 0 0  0
# 5  5      B,C 0 1 1 0  0

One of the challenges is ensuring that the "A" column has 1 only for the "A" tag, so that it doesn't has 1 for the "AB" tag, for example. The following won't work for this reason, since "A" gets 1 for the "AB" tag:

df <- df %>%
  mutate(A = ifelse(grepl("A", tags, fixed = T), 1, 0))
df

#   id     tags A
# 1  1 A,B,AB,C 1
# 2  2        C 0
# 3  3     AB,E 1 < Incorrect
# 4  4     <NA> 0
# 5  5      B,C 0

Another challenge is doing this programmatically. I can probably deal with a solution that manually creates a column for each tag, but a solution that doesn't assume which tag columns need to be created beforehand is best, since there can potentially be many different tags. Is there some relatively simple solution that I'm overlooking?

标签: rdataframe

解决方案


这是否有效:

> library(tidyr)
> library(dplyr)
> df %>% separate_rows(tags) %>% mutate(A = case_when(tags == 'A' ~ 1, TRUE ~ 0),
+                                       B = case_when(tags == 'B' ~ 1, TRUE ~  0),
+                                       C = case_when(tags == 'C' ~ 1,  TRUE ~  0),
+                                       E = case_when(tags == 'E' ~ 1,  TRUE ~ 0),
+                                       AB = case_when(tags == 'AB' ~ 1,  TRUE ~ 0)) %>% 
+   group_by(id) %>% mutate(tags = toString(tags)) %>% group_by(id, tags) %>% summarise(across(A:AB, sum))
`summarise()` regrouping output by 'id' (override with `.groups` argument)
# A tibble: 5 x 7
# Groups:   id [5]
     id tags            A     B     C     E    AB
  <int> <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
1     1 A, B, AB, C     1     1     1     0     1
2     2 C               0     0     1     0     0
3     3 AB, E           0     0     0     1     1
4     4 NA              0     0     0     0     0
5     5 B, C            0     1     1     0     0
> 

推荐阅读