首页 > 解决方案 > 合并满足条件的唯一 group_by 行并使用 tidyverse 进行汇总

问题描述

我想使用 tidyverse 工具来执行以下操作:

对于下面的数据框,对于(“<em>yr”、“<em>bin”、“<em>range”)的每个唯一行,我想合并“<em>class”具有值的行“ B182E”或“B182G”通过以下操作合并为一行:保持“ yr ”、“ bin ”、“ range ”和“ miles ”的值不变,否则创建一个新变量“ class2 ”,其值变为“B182E_B182G”它保留了 "" 如果不是“B182E”或“B182G”,而对于变量“<em>new_count”和“<em>old_count”,通过对各自行求和来更新它们的值,例如“<em>new_count”变成“<em第 5 行中的 >new_count” 加上第 6 行中的“<em>new_count” 即 20 + 12 = 32,而“<em>old_count” 成为第 5 行中的“<em>old_count” 加上第 5 行中的“<em>old_count” 的总和第 6 行,即 30 + 14 = 44。

同样,将“<em>class”的值为“K760”或“K758”的行合并为一行,并以与上述类似的方式将新创建的“<em>class2”的值更新为“K760_K758” .

所有与上面要合并的“<em>class”值不匹配的行都保持不变。

以第 5 行和第 6 行为例,它们分别代表“<em>yr”、“<em>bin”和“<em>range”的唯一行,即 2005、2、40-49,所以这两行是按照上述条件合并为一行。

我怎样才能做到这一点?

df <- structure(list(yr = c(2005, 2005, 2005, 2005, 2005, 2005, 2005, 
2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 
2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 
2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 
2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 
2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 
2005, 2005, 2005, 2005, 2005, 2005, 2006, 2006, 2006, 2006, 2006, 
2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 
2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 
2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 
2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 
2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 
2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006), bin = c("2", 
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", 
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", 
"2", "2", "2", "2", "2", "2", "2", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2", 
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", 
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", 
"2", "2", "2", "2", "2"), range = c("40-49", "40-49", "40-49", 
"40-49", "40-49", "40-49", "40-49", "40-49", "40-49", "40-49", 
"40-49", "40-49", "40-49", "40-49", "40-49", "40-49", "40-49", 
"50-59", "50-59", "50-59", "50-59", "50-59", "50-59", "50-59", 
"50-59", "50-59", "50-59", "50-59", "50-59", "50-59", "50-59", 
"50-59", "50-59", "50-59", "40-49", "40-49", "40-49", "40-49", 
"40-49", "40-49", "40-49", "40-49", "40-49", "40-49", "40-49", 
"40-49", "40-49", "40-49", "40-49", "40-49", "40-49", "50-59", 
"50-59", "50-59", "50-59", "50-59", "50-59", "50-59", "50-59", 
"50-59", "50-59", "50-59", "50-59", "50-59", "50-59", "50-59", 
"50-59", "50-59", "40-49", "40-49", "40-49", "40-49", "40-49", 
"40-49", "40-49", "40-49", "40-49", "40-49", "40-49", "40-49", 
"40-49", "40-49", "40-49", "40-49", "40-49", "50-59", "50-59", 
"50-59", "50-59", "50-59", "50-59", "50-59", "50-59", "50-59", 
"50-59", "50-59", "50-59", "50-59", "50-59", "50-59", "50-59", 
"50-59", "40-49", "40-49", "40-49", "40-49", "40-49", "40-49", 
"40-49", "40-49", "40-49", "40-49", "40-49", "40-49", "40-49", 
"40-49", "40-49", "40-49", "40-49", "50-59", "50-59", "50-59", 
"50-59", "50-59", "50-59", "50-59", "50-59", "50-59", "50-59", 
"50-59", "50-59", "50-59", "50-59", "50-59", "50-59", "50-59"
), miles = c(528496, 528496, 528496, 528496, 528496, 528496, 
528496, 528496, 528496, 528496, 528496, 528496, 528496, 528496, 
528496, 528496, 528496, 604761, 604761, 604761, 604761, 604761, 
604761, 604761, 604761, 604761, 604761, 604761, 604761, 604761, 
604761, 604761, 604761, 604761, 547232, 547232, 547232, 547232, 
547232, 547232, 547232, 547232, 547232, 547232, 547232, 547232, 
547232, 547232, 547232, 547232, 547232, 613025, 613025, 613025, 
613025, 613025, 613025, 613025, 613025, 613025, 613025, 613025, 
613025, 613025, 613025, 613025, 613025, 613025, 557216.5, 557216.5, 
557216.5, 557216.5, 557216.5, 557216.5, 557216.5, 557216.5, 557216.5, 
557216.5, 557216.5, 557216.5, 557216.5, 557216.5, 557216.5, 557216.5, 
557216.5, 604720.5, 604720.5, 604720.5, 604720.5, 604720.5, 604720.5, 
604720.5, 604720.5, 604720.5, 604720.5, 604720.5, 604720.5, 604720.5, 
604720.5, 604720.5, 604720.5, 604720.5, 537808.5, 537808.5, 537808.5, 
537808.5, 537808.5, 537808.5, 537808.5, 537808.5, 537808.5, 537808.5, 
537808.5, 537808.5, 537808.5, 537808.5, 537808.5, 537808.5, 537808.5, 
597005.5, 597005.5, 597005.5, 597005.5, 597005.5, 597005.5, 597005.5, 
597005.5, 597005.5, 597005.5, 597005.5, 597005.5, 597005.5, 597005.5, 
597005.5, 597005.5, 597005.5), class = c("B181", "B181E", "B181G", 
"B182", "B182E", "B182G", "C220", "C221", "C240", "K70", "K703", 
"K743", "K746", "K754", "K758", "K760", "K769", "B181", "B181E", 
"B181G", "B182", "B182E", "B182G", "C220", "C221", "C240", "K70", 
"K703", "K743", "K746", "K754", "K758", "K760", "K769", "B181", 
"B181E", "B181G", "B182", "B182E", "B182G", "C220", "C221", "C240", 
"K70", "K703", "K743", "K746", "K754", "K758", "K760", "K769", 
"B181", "B181E", "B181G", "B182", "B182E", "B182G", "C220", "C221", 
"C240", "K70", "K703", "K743", "K746", "K754", "K758", "K760", 
"K769", "B181", "B181E", "B181G", "B182", "B182E", "B182G", "C220", 
"C221", "C240", "K70", "K703", "K743", "K746", "K754", "K758", 
"K760", "K769", "B181", "B181E", "B181G", "B182", "B182E", "B182G", 
"C220", "C221", "C240", "K70", "K703", "K743", "K746", "K754", 
"K758", "K760", "K769", "B181", "B181E", "B181G", "B182", "B182E", 
"B182G", "C220", "C221", "C240", "K70", "K703", "K743", "K746", 
"K754", "K758", "K760", "K769", "B181", "B181E", "B181G", "B182", 
"B182E", "B182G", "C220", "C221", "C240", "K70", "K703", "K743", 
"K746", "K754", "K758", "K760", "K769"), new_count = c(84, 24, 
16, 226, 20, 12, NA, NA, 5, 60, 24, 24, 27, 16, NA, 22, 78, 58, 
1, 4, 168, 5, 22, 20, 18, 15, 122, 72, 37, 75, 26, 0, 53, 149, 
117, 12, 45, 567, NA, 29, 12, 7, NA, 149, 77, NA, 54, NA, NA, 
44, 99, 124, 0, 8, 453, NA, 55, 54, 9, 10, 307, 167, 5, 120, 
9, NA, 59, 115, 137, NA, 7, 481, NA, 29, 13, 6, NA, 134, 65, 
NA, 62, 7, NA, 57, 93, 135, 23, 7, 444, 5, 79, 61, 18, 13, 323, 
190, 5, 135, 12, 0, 57, 117, 71, 0, NA, 204, NA, 7, 6, NA, NA, 
52, 22, 13, 24, 17, NA, 23, 86, 47, NA, NA, 172, NA, 24, 16, 
9, 15, 128, 87, 33, 77, 31, NA, 65, 158), old_count = c(211, 
52, 37, 1095, 30, 14, 8, 6, 8, 111, 52, 82, 53, 37, 8, 27, 102, 
120, 3, 4, 865, 5, 32, 30, 23, 16, 282, 168, 230, 149, 50, NA, 
71, 214, 349, 54, 67, 2102, NA, 39, 15, 9, 8, 263, 147, 6, 102, 
11, NA, 65, 123, 293, NA, 9, 1785, 1, 73, 70, 15, 15, 621, 369, 
27, 222, 20, NA, 81, 155, 407, NA, 9, 2065, 5, 46, 19, 8, NA, 
263, 143, 7, 113, 17, NA, 76, 126, 335, NA, 11, 2067, 5, 110, 
89, 21, 19, 632, 383, 24, 225, 21, NA, 79, 151, 234, 0, 0, 1061, 
NA, 9, 11, 5, 5, 106, 52, 85, 52, 41, 6, 31, 118, 133, NA, NA, 
957, NA, 40, 29, 14, 20, 282, 180, 224, 137, 59, 6, 81, 212)), row.names = c(NA, 
-136L), class = "data.frame") 

标签: rdplyrtidyverse

解决方案


您本质上希望通过yr, bin, range,使您的行独一无二class2。先创建class2,然后您可以轻松地进行分组和总结:

df %>%
  mutate(class2 = case_when(
    class %in% c("B181E", "B181G") ~ "B181E_B181G",
    class %in% c("K760", "K758") ~ "K760_K758",
    TRUE ~ class)
  ) %>%
  group_by(yr, bin, range, class2) %>%
  summarise(
    new_count = sum(new_count, na.rm = TRUE),
    old_count = sum(old_count, na.rm = TRUE)
  )
#> # A tibble: 120 x 6
#> # Groups:   yr, bin, range [8]
#> yr bin   range class2      new_count old_count
#> <dbl> <chr> <chr> <chr>           <dbl>     <dbl>
#> 1  2005 1     40-49 B181              117       349
#> 2  2005 1     40-49 B181E_B181G        57       121
#> 3  2005 1     40-49 B182              567      2102
#> 4  2005 1     40-49 B182E               0         0
#> 5  2005 1     40-49 B182G              29        39
#> 6  2005 1     40-49 C220               12        15
#> 7  2005 1     40-49 C221                7         9
#> 8  2005 1     40-49 C240                0         8
#> 9  2005 1     40-49 K70               149       263
#> 10  2005 1     40-49 K703               77       147
#> # … with 110 more rows

推荐阅读