Remove duplicates based on multiple columns, but keep the "most" complete version of each duplicate (the one with the fewest NAs)

Problem description

I have a data frame that looks like this

Month  Day  Year  Color  Weather  Location  Transporation  ID
Jan    Tue  2020  Blue   Warm     Hospital  NA             1
Jan    Tue  2020  Blue   Warm     NA        NA             1
Jan    Tue  2020  Blue   NA       NA        NA             1
Feb    Thu  2020  Red    NA       NA        NA             2
Feb    Thu  2020  Red    Warm     NA        NA             2
Feb    Thu  2020  Red    Warm     Garden    Run            2
Mar    Thu  2020  Red    Cold     Desk      Bus            3

I want it to look like this

Month  Day  Year  Color  Weather  Location  Transporation  ID
Jan    Tue  2020  Blue   Warm     Hospital  NA             1
Feb    Thu  2020  Red    Warm     Garden    Run            2
Mar    Thu  2020  Red    Cold     Desk      Bus            3

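Basically, I want to decide whether a row is a duplicate by looking at three columns, c(ID, Month, Color). Once the duplicates are identified, I want to drop the rows with the most NAs, i.e. the "least complete" ones, since they have fewer columns filled in.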

Tags: r, dplyr, duplicates

Solution

After grouping by the columns of interest, we can use order() on is.na() to select the first non-NA element of each remaining column

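library(dplyr)
dat %>%
    # rows that share Month, Day and Year are treated as duplicates
    group_by(Month, Day, Year) %>%
    # order(is.na(.)) puts non-NA values first, so first() returns a
    # non-NA value whenever the group has one for that column
    summarise(across(everything(), ~ first(.[order(is.na(.))])), .groups = 'drop')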

Output

# A tibble: 3 x 8
  Month Day    Year Color Weather Location Transporation    ID
  <chr> <chr> <dbl> <chr> <chr>   <chr>    <chr>         <dbl>
1 Feb   Thu    2020 Red   Warm    Garden   Run               2
2 Jan   Tue    2020 Blue  Warm    Hospital <NA>              1
3 Mar   Thu    2020 Red   Cold    Desk     Bus               3
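
Note that the summarise() approach above picks the first non-NA value column by column, so in principle it can combine values from several rows of a group, and it groups by Month, Day and Year rather than the c(ID, Month, Color) key from the question. If the goal is to keep one whole original row per group, namely the one with the fewest NAs, a minimal sketch could look like the following. The helper column n_na and the result name most_complete are made up for illustration, slice_min() assumes dplyr 1.0.0 or later, and the example data is restated compactly so the snippet runs on its own.

```r
library(dplyr)

# Same example data as in the question, written compactly
dat <- data.frame(
  Month = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Mar"),
  Day = c("Tue", "Tue", "Tue", "Thu", "Thu", "Thu", "Thu"),
  Year = 2020,
  Color = c("Blue", "Blue", "Blue", "Red", "Red", "Red", "Red"),
  Weather = c("Warm", "Warm", NA, NA, "Warm", "Warm", "Cold"),
  Location = c("Hospital", NA, NA, NA, NA, "Garden", "Desk"),
  Transporation = c(NA, NA, NA, NA, NA, "Run", "Bus"),
  ID = c(1, 1, 1, 2, 2, 2, 3)
)

most_complete <- dat %>%
  # n_na (hypothetical helper column): number of NA cells in each row
  mutate(n_na = rowSums(is.na(dat))) %>%
  # duplicates are rows that share ID, Month and Color
  group_by(ID, Month, Color) %>%
  # keep exactly one row per group: the one with the fewest NAs
  # (with_ties = FALSE keeps the first such row if several tie)
  slice_min(n_na, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-n_na)

print(most_complete)
```

For the example data both approaches return the same three rows, because within every group the most complete row already holds every non-NA value found in that group. They would differ if, say, one duplicate had only Weather filled in and another had only Location.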

Data

dat <- structure(list(Month = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", 
"Mar"), Day = c("Tue", "Tue", "Tue", "Thu", "Thu", "Thu", "Thu"
), Year = c(2020, 2020, 2020, 2020, 2020, 2020, 2020), Color = c("Blue", 
"Blue", "Blue", "Red", "Red", "Red", "Red"), Weather = c("Warm", 
"Warm", NA, NA, "Warm", "Warm", "Cold"), Location = c("Hospital", 
NA, NA, NA, NA, "Garden", "Desk"), Transporation = c(NA, NA, 
NA, NA, NA, "Run", "Bus"), ID = c(1, 1, 1, 2, 2, 2, 3)), class = "data.frame", row.names = c(NA, 
-7L))
