Is there R code to remove duplicate events for the same ID based on conditions across two or three columns?

Problem description

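I have a data frame with thousands of ids. Each id has several events, and each event has enrolment dates, a course and a record. The course is categorical: module 1, module 2, module 3, module 4, module 5 and withdrawn (from any module). For example, a few rows look like this: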

id  event   enrolment date  Enrolment to    course   record
1   42      2012-07-01      2013-06-30      module 5    2
1   42      2018-07-01      2019-06-30    **module 4**  1
1   43      2012-07-01      2013-06-30      module 5    2
1   43      2018-07-01      2019-06-30    **module 4**  1
2   50      2017-04-01      2018-03-31    **module 5**  2
2   50      2017-07-01      2018-03-31      module 4    1
2   34      2017-04-01      2018-03-31    **module 5**  2
2   34      2017-07-01      2018-03-31      module 4    1
3   23      2014-08-20      2015-07-20      module 5    1
3   23      2014-08-20      2015-07-20      module 4    2
3   23      2015-07-04      2016-06-04  **withdrawn**   3
4   13      2017-09-01      2018-08-01      module 4    1
4   13      2017-09-01      2018-08-01  **module 5**    2
4   23      2017-09-01      2018-08-01      module 4    1
4   23      2017-09-01      2018-08-01  **module 5**    2

I want to keep rows 2, 4, 5, 7, 11, 13 and 15 of the data frame (education mistakenly assigned module 4 to events 50 and 34).

I would like the data to look like this:

id  event   status_date Course  record
1   42  2018-07-01  module 4    1
1   43  2018-07-01  module 4    1
2   50  2017-04-01  module 5    2
2   34  2017-04-01  module 5    2
3   23  2015-07-04  withdrawn   3
4   13  2017-09-01  module 5    2
4   23  2017-09-01  module 5    2

Tags: r

Solution


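If I have understood all the requirements correctly, here is a function that returns the position of the row to keep within each group: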

library(dplyr)

# Return the position of the row to keep within one id/event group.
select_dates <- function(start, end, course) {
  if (n_distinct(start) == 1) {
    # All start dates are identical: keep the first "module5" row
    # (which.max() falls back to row 1 if there is none)
    which.max(course == "module5")
  } else {
    # Rows whose enrolment is still open at the latest start date
    inds <- max(start) < end
    if (any(course[inds] == "module5") && all(course[inds] != "withdrawn")) {
      # An open "module5" row exists and nothing is withdrawn: keep it
      which.max(inds & course == "module5")
    } else {
      # Otherwise keep the open row with the latest start date
      which.max(start == max(start[inds]))
    }
  }
}
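
To see what the function relies on, here is a minimal base-R sketch of its two building blocks, using the dates for id 3, event 23 from the example data. The variable names `inds` and `pick` are only for illustration.

```r
# Dates and courses for id 3, event 23 (copied from the example data)
start  <- as.Date(c("2014-08-20", "2014-08-20", "2015-07-04"))
end    <- as.Date(c("2015-07-20", "2015-07-20", "2016-06-04"))
course <- c("module5", "module4", "withdrawn")

# Rows whose enrolment is still open at the latest start date
inds <- max(start) < end
print(inds)          # TRUE TRUE TRUE

# which.max() on a logical vector gives the position of the first TRUE
pick <- which.max(start == max(start[inds]))
print(course[pick])  # "withdrawn"

# Caveat: if nothing matches, which.max() still returns 1
none <- which.max(c("module4", "module4") == "module5")
print(none)          # 1
```

The last call shows why a group with identical start dates and no "module5" row would silently keep its first row.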

We then apply it to each `id` and `event` group:

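df %>%
  # convert the date columns from character to Date (across() needs dplyr >= 1.0.0)
  mutate(across(c(enrolment_date, Enrolment_to), as.Date)) %>%
  group_by(id, event) %>%
  # keep the single row chosen by select_dates() in each group
  slice(select_dates(enrolment_date, Enrolment_to, course))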

#     id event enrolment_date Enrolment_to course    record
#  <int> <int> <date>         <date>       <chr>      <int>
#1     1    42 2018-07-01     2019-06-30   module4        1
#2     1    43 2018-07-01     2019-06-30   module4        1
#3     2    34 2017-04-01     2018-03-31   module5        2
#4     2    50 2017-04-01     2018-03-31   module5        2
#5     3    23 2015-07-04     2016-06-04   withdrawn      3
#6     4    13 2017-09-01     2018-08-01   module5        2
#7     4    23 2017-09-01     2018-08-01   module5        2

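Note that you will need to change the strings in the function ("module5" and "withdrawn") and the column names (enrolment_date and Enrolment_to) to match what is in your data. For example, the table in the question writes "module 5" with a space, while the data below uses "module5".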
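
Rather than editing the function, one option is to normalise the data so that it matches the names the function expects. The sketch below is an assumption about what the raw data might look like: `raw` is a hypothetical one-row data frame with spaces in the column names and course labels, as in the question's table. Adjust it to your own data.

```r
# Hypothetical raw data with spaces in column names and course labels
raw <- data.frame(
  "enrolment date" = "2017-04-01",
  "Enrolment to"   = "2018-03-31",
  course           = "module 5",
  check.names      = FALSE,   # keep the names exactly as written
  stringsAsFactors = FALSE
)

# "enrolment date" -> "enrolment_date", "Enrolment to" -> "Enrolment_to"
names(raw) <- gsub(" ", "_", names(raw), fixed = TRUE)

# "module 5" -> "module5"
raw$course <- gsub(" ", "", raw$course, fixed = TRUE)

print(names(raw))
print(raw$course)
```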

Data

df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 
3L, 4L, 4L, 4L, 4L), event = c(42L, 42L, 43L, 43L, 50L, 50L, 
34L, 34L, 23L, 23L, 23L, 13L, 13L, 23L, 23L), enrolment_date = c("2012-07-01", 
"2018-07-01", "2012-07-01", "2018-07-01", "2017-04-01", "2017-07-01", 
"2017-04-01", "2017-07-01", "2014-08-20", "2014-08-20", "2015-07-04", 
"2017-09-01", "2017-09-01", "2017-09-01", "2017-09-01"), Enrolment_to = c("2013-06-30", 
"2019-06-30", "2013-06-30", "2019-06-30", "2018-03-31", "2018-03-31", 
"2018-03-31", "2018-03-31", "2015-07-20", "2015-07-20", "2016-06-04", 
"2018-08-01", "2018-08-01", "2018-08-01", "2018-08-01"), course = c("module5", 
"module4", "module5", "module4", "module5", "module4", "module5", 
"module4", "module5", "module4", "withdrawn", "module4", "module5", 
"module4", "module5"), record = c(2L, 1L, 2L, 1L, 2L, 1L, 2L, 
1L, 1L, 2L, 3L, 1L, 2L, 1L, 2L)), class = "data.frame", row.names = c(NA, -15L))
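
As a quick sanity check, the pipeline keeps one row per id/event pair, so the number of distinct pairs should equal the number of rows in the result. A minimal sketch, with the `id` and `event` columns copied from the data above into a hypothetical data frame `keys`:

```r
# id and event columns copied from the example data
keys <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4),
  event = c(42, 42, 43, 43, 50, 50, 34, 34, 23, 23, 23, 13, 13, 23, 23)
)

# Number of distinct id/event groups, i.e. the expected number of rows kept
n_groups <- nrow(unique(keys))
print(n_groups)  # 7
```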
