Delete observations based on a pattern in R

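Question

I have a data frame of football injuries. Unfortunately, for each injury I have several teams to choose from. Here is what part of the data frame looks like: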

df_x = data.frame(injury_id=c(250, 250, 100, 328, 328, 329, 329, 330, 330, 15, 5106, 5106, 5106),
 player_id=c(109, 109, 39728, 2374, 2374, 2374, 2374, 2374, 2374, 26, 59016, 59016, 59016), 
 season=c(2011, 2011, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2012, 2012, 2012), 
 inury_from=c("2011-09-13", "2011-09-13", "2011-03-03", "2011-04-21", "2011-04-21", "2010-11-23", "2010-11-23", "2010-10-01", "2010-10-01", "2011-02-24", "2012-09-16", "2012-09-16", "2012-09-16"),
 injury_until=c("2011-09-27", "2011-09-27", "2011-03-17", "2011-08-31", "2011-08-31", "2011-03-14", "2011-03-14", "2010-11-22", "2010-11-22", "2011-02-28", "2012-10-28", "2012-10-28", "2012-10-28"),
 team_id=c(1, 2, 3, 4, 5, 4, 5, 4, 5, 6, 7, 8, 9),
 member_since=c("1998-07-01", NA, "2009-07-01", "2008-07-01", NA, "2008-07-01", NA, "2008-07-01", NA, "2002-07-01", "2012-07-01", "2013-01-01", "2011-07-01"))

My goal is to have only one row per injury_id. The result should be the following data frame:

df_result_x = data.frame(injury_id=c(250, 100, 328, 329, 330, 15, 5106),
 player_id=c(109, 39728, 2374, 2374, 2374, 26, 59016),
 season=c(2011, 2010, 2010, 2010, 2010, 2010, 2012),
 inury_from=c("2011-09-13", "2011-03-03", "2011-04-21", "2010-11-23", "2010-10-01", "2011-02-24", "2012-09-16"),
 injury_until=c("2011-09-27", "2011-03-17", "2011-08-31", "2011-03-14", "2010-11-22", "2011-02-28", "2012-10-28"),
 team_id=c(1, 3, 4, 4, 4, 6, 7),
 member_since=c("1998-07-01", "2009-07-01", "2008-07-01", "2008-07-01", "2008-07-01", "2002-07-01", "2012-07-01"))

The algorithm for choosing among observations that share an injury_id:

[image: selection algorithm, not reproduced]

Can I do this with pipes, or do I have to use a loop?

Thank you.

Update, 10 November 2020:

df_x2 = data.frame(injury_id=c(250, 250, 100, 328, 328, 329, 329, 330, 330, 15, 5106, 5106, 5106),
                  player_id=c(109, 109, 39728, 2374, 2374, 2374, 2374, 2374, 2374, 26, 59016, 59016, 59016),
                  season=c(2011, 2011, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2012, 2012, 2012),
                  inury_from=c("2011-09-13", "2011-09-13", "2011-03-03", "2011-04-21", "2011-04-21", "2010-11-23", "2010-11-23", "2010-10-01", "2010-10-01", "2011-02-24", "2012-09-16", "2012-09-16", "2012-09-16"),
                  injury_until=c("2011-09-27", "2011-09-27", "2011-03-17", "2011-08-31", "2011-08-31", "2011-03-14", "2011-03-14", "2010-11-22", "2010-11-22", "2011-02-28", "2012-10-28", "2012-10-28", "2012-10-28"),
                  team_id=c(1, 2, 3, 4, 5, 4, 5, 4, 5, 6, 8, 9, 7),
                  member_since=c("1998-07-01", NA, "2009-07-01", "2008-07-01", NA, "2008-07-01", NA, "2008-07-01", NA, "2002-07-01", "2013-01-01", "2011-07-01", "2012-12-31"))

Tags: r, loops, pipe, row, delete-row

Answer

We can use slice after grouping by 'injury_id':

library(dplyr)
df_x %>%
    group_by(injury_id) %>%
    slice(1) %>%   # keep the first row of each injury_id group
    ungroup()

Or with distinct:

df_x %>%
      distinct(injury_id, .keep_all = TRUE)
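Both approaches keep whichever row appears first for each 'injury_id', so the result depends on row order. The following is only a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from the question) to show that behaviour when an NA row happens to come first:

```r
library(dplyr)

# Hypothetical toy data: for injury 328 the NA row comes first
toy <- data.frame(
  injury_id    = c(250, 250, 328, 328),
  team_id      = c(1, 2, 5, 4),
  member_since = c("1998-07-01", NA, NA, "2008-07-01")
)

# First row per group via group_by() + slice()
by_slice <- toy %>%
  group_by(injury_id) %>%
  slice(1) %>%
  ungroup()

# First row per injury_id via distinct()
by_distinct <- toy %>%
  distinct(injury_id, .keep_all = TRUE)

# Both keep team 1 for injury 250 and team 5 (the NA row) for injury 328
print(by_slice)
print(by_distinct)
```

In this toy case neither call looks at 'member_since' at all, which is why the row with the missing date survives for injury 328.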

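Or, if the NA elements are not in order, arrange by 'injury_id', then by a logical vector flagging the NA elements in 'member_since' (so that the NAs come last), and then by 'member_since' converted to Date. After that, use distinct to select the first row for each unique 'injury_id':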

df_x %>%
    arrange(injury_id, is.na(member_since), as.Date(member_since)) %>%
    distinct(injury_id, .keep_all = TRUE)
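To see what the sort keys do, here is a small sketch on made-up data (`toy` and its values are hypothetical). It assumes the same column names as the question and only illustrates the ordering; it is not a statement about which row the asker ultimately wants:

```r
library(dplyr)

# Hypothetical toy data, deliberately shuffled, with NA rows first
toy <- data.frame(
  injury_id    = c(328, 328, 328, 250, 250),
  team_id      = c(5, 4, 6, 2, 1),
  member_since = c(NA, "2008-07-01", "2005-07-01", NA, "1998-07-01")
)

# Sort by injury_id, then non-NA before NA, then by date ascending
sorted <- toy %>%
  arrange(injury_id, is.na(member_since), as.Date(member_since))

# Keep the first row per injury_id after sorting
result <- sorted %>%
  distinct(injury_id, .keep_all = TRUE)

print(sorted)
print(result)
```

Two consequences are visible here: among the non-NA rows of an injury, the one with the earliest 'member_since' is kept (team 6 for injury 328), and the result comes back sorted by 'injury_id' rather than in the original row order.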

Update

Based on the comments:

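df_x %>%
    # drop teams without a known joining date
    filter(!is.na(member_since)) %>%
    mutate(injury_until = as.Date(injury_until), 
          member_since = as.Date(member_since)) %>% 
    # days between joining the team and the end of the injury
    mutate(ind = injury_until - member_since) %>% 
    group_by(injury_id)  %>%
    # per injury, keep the row with the smallest positive difference
    filter(ind == min(ind[ind > 0])) %>%
    select(-ind)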

Output:

# A tibble: 7 x 7
# Groups:   injury_id [7]
#  injury_id player_id season inury_from injury_until team_id member_since
#      <dbl>     <dbl>  <dbl> <chr>      <date>         <dbl> <date>      
#1       250       109   2011 2011-09-13 2011-09-27         1 1998-07-01  
#2       100     39728   2010 2011-03-03 2011-03-17         3 2009-07-01  
#3       328      2374   2010 2011-04-21 2011-08-31         4 2008-07-01  
#4       329      2374   2010 2010-11-23 2011-03-14         4 2008-07-01  
#5       330      2374   2010 2010-10-01 2010-11-22         4 2008-07-01  
#6        15        26   2010 2011-02-24 2011-02-28         6 2002-07-01  
#7      5106     59016   2012 2012-09-16 2012-10-28         7 2012-07-01  
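As a worked check of the filter step, the sketch below recomputes the difference by hand for injury_id 5106, using the three rows of df_x copied from the question. The standalone vectors are just for illustration and are not part of the answer's pipeline:

```r
# Rows of df_x for injury_id 5106 (values copied from the question)
injury_until <- as.Date("2012-10-28")
member_since <- as.Date(c("2012-07-01", "2013-01-01", "2011-07-01"))
team_id      <- c(7, 8, 9)

# Days between joining each team and the end of the injury
ind <- injury_until - member_since
print(ind)

# Keep the team with the smallest positive difference,
# i.e. the one joined most recently before the injury ended
keep <- ind == min(ind[ind > 0])
print(team_id[keep])
```

The differences come out as 119, -65 and 485 days, so team 8 (joined after the injury ended) is excluded and team 7 is kept, matching the last row of the output. With the dates from df_x2, teams 8 and 7 both have negative differences, so the same steps would keep team 9.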