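r - Is there R code to remove duplicate events for the same ID using conditions on two or three different columns?
Problem description
I have a data frame with thousands of ids; each id has several events, each with an enrolment date, a course and a record. Course is categorical: module 1, module 2, module 3, module 4, module 5 and withdrawn (from any module). For example, a few rows look like this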
id event enrolment date Enrolment to course record
1 42 2012-07-01 2013-06-30 module 5 2
1 42 2018-07-01 2019-06-30 **module 4** 1
1 43 2012-07-01 2013-06-30 module 5 2
1 43 2018-07-01 2019-06-30 **module 4** 1
2 50 2017-04-01 2018-03-31 **module 5** 2
2 50 2017-07-01 2018-03-31 module 4 1
2 34 2017-04-01 2018-03-31 **module 5** 2
2 34 2017-07-01 2018-03-31 module 4 1
3 23 2014-08-20 2015-07-20 module 5 1
3 23 2014-08-20 2015-07-20 module 4 2
3 23 2015-07-04 2016-06-04 **withdrawn** 3
4 13 2017-09-01 2018-08-01 module 4 1
4 13 2017-09-01 2018-08-01 **module 5** 2
4 23 2017-09-01 2018-08-01 module 4 1
4 23 2017-09-01 2018-08-01 **module 5** 2
I want to keep rows 2, 4, 5, 7, 11, 13 and 15 in the data frame (education mistakenly assigned module 4 to events 50 and 34).
I would like the data to look like this
id event status_date Course record
1 42 2018-07-01 module 4 1
1 43 2018-07-01 module 4 1
2 50 2017-04-01 module 5 2
2 34 2017-04-01 module 5 2
3 23 2015-07-04 withdrawn 3
4 13 2017-09-01 module 5 2
4 23 2017-09-01 module 5 2
Solution
If I have understood all the requirements correctly, here is a function that selects the correct row within each group
library(dplyr)

select_dates <- function(start, end, course) {
  # If all rows share the same start date, return the row with "module5"
  # (which.max() falls back to the first row if none matches)
  if (n_distinct(start) == 1)
    which.max(course == "module5")
  else {
    # Rows whose enrolment period is still open at the latest start date
    inds <- max(start) < end
    # If any open course is "module5" and none is "withdrawn"
    if (any(course[inds] == "module5") && all(course[inds] != "withdrawn"))
      # return the open row with "module5"
      which.max(inds & course == "module5")
    else
      # otherwise return the open row with the latest start date
      which.max(start == max(start[inds]))
  }
}
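To make the branching easier to follow, here is a small worked example of the same steps on a single group (id 3, event 23 from the sample data). It uses base R only; the variable names `use_module5` and `picked` are mine, introduced just for illustration.

```r
# Group id = 3, event = 23 from the sample data
start  <- as.Date(c("2014-08-20", "2014-08-20", "2015-07-04"))
end    <- as.Date(c("2015-07-20", "2015-07-20", "2016-06-04"))
course <- c("module5", "module4", "withdrawn")

# Rows whose enrolment period is still open at the latest start date
inds <- max(start) < end

# The "module5" shortcut applies only if no open row is "withdrawn"
use_module5 <- any(course[inds] == "module5") && all(course[inds] != "withdrawn")

# The shortcut does not apply here, so take the open row with the latest start
picked <- which.max(start == max(start[inds]))
course[picked]
```

All three rows are still open on 2015-07-04, one of them is "withdrawn", so the fallback branch picks the third row.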
We then apply it to each id and event group:
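df %>%
  # convert both date columns from character to Date
  mutate(across(c(enrolment_date, Enrolment_to), as.Date)) %>%
  group_by(id, event) %>%
  # keep the single row chosen by select_dates() in each group
  slice(select_dates(enrolment_date, Enrolment_to, course))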
# id event enrolment_date Enrolment_to course record
# <int> <int> <date> <date> <chr> <int>
#1 1 42 2018-07-01 2019-06-30 module4 1
#2 1 43 2018-07-01 2019-06-30 module4 1
#3 2 34 2017-04-01 2018-03-31 module5 2
#4 2 50 2017-04-01 2018-03-31 module5 2
#5 3 23 2015-07-04 2016-06-04 withdrawn 3
#6 4 13 2017-09-01 2018-08-01 module5 2
#7 4 23 2017-09-01 2018-08-01 module5 2
Note that you need to change the strings ("module5" and "withdrawn") and the column names (enrolment_date and Enrolment_to) in the function to match what is in your data.
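As an alternative to editing the strings inside the function, you could normalise the course labels first. The question shows labels such as "module 5" (with a space) while the function compares against "module5". A minimal sketch, assuming the only differences are whitespace and letter case; the capitalised and trailing-space variants below are made up for illustration:

```r
# Hypothetical raw labels; only "module 5" style spacing appears in the question
course_raw <- c("module 5", "Module 4", "withdrawn ")

# Drop all whitespace and lower-case, so labels match "module5" / "withdrawn"
course_clean <- tolower(gsub("\\s+", "", course_raw))
course_clean
```

You would apply the same transformation to the course column of your data frame before calling the function.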
Data
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 4L), event = c(42L, 42L, 43L, 43L, 50L, 50L,
34L, 34L, 23L, 23L, 23L, 13L, 13L, 23L, 23L), enrolment_date = c("2012-07-01",
"2018-07-01", "2012-07-01", "2018-07-01", "2017-04-01", "2017-07-01",
"2017-04-01", "2017-07-01", "2014-08-20", "2014-08-20", "2015-07-04",
"2017-09-01", "2017-09-01", "2017-09-01", "2017-09-01"), Enrolment_to = c("2013-06-30",
"2019-06-30", "2013-06-30", "2019-06-30", "2018-03-31", "2018-03-31",
"2018-03-31", "2018-03-31", "2015-07-20", "2015-07-20", "2016-06-04",
"2018-08-01", "2018-08-01", "2018-08-01", "2018-08-01"), course = c("module5",
"module4", "module5", "module4", "module5", "module4", "module5",
"module4", "module5", "module4", "withdrawn", "module4", "module5",
"module4", "module5"), record = c(2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 1L, 2L, 3L, 1L, 2L, 1L, 2L)), class = "data.frame", row.names = c(NA, -15L))
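As a cross-check, the sketch below recomputes which row numbers are kept, using base R instead of dplyr (`split()` and `lapply()` in place of `group_by()` and `slice()`). The helper `select_dates_base` is my own restatement of the selection logic, not part of the answer's code, and the data is the same sample rewritten compactly with only the columns the logic needs.

```r
# Sample data (dates already converted to Date)
df <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4),
  event = c(42, 42, 43, 43, 50, 50, 34, 34, 23, 23, 23, 13, 13, 23, 23),
  enrolment_date = as.Date(c(
    "2012-07-01", "2018-07-01", "2012-07-01", "2018-07-01",
    "2017-04-01", "2017-07-01", "2017-04-01", "2017-07-01",
    "2014-08-20", "2014-08-20", "2015-07-04",
    "2017-09-01", "2017-09-01", "2017-09-01", "2017-09-01")),
  Enrolment_to = as.Date(c(
    "2013-06-30", "2019-06-30", "2013-06-30", "2019-06-30",
    "2018-03-31", "2018-03-31", "2018-03-31", "2018-03-31",
    "2015-07-20", "2015-07-20", "2016-06-04",
    "2018-08-01", "2018-08-01", "2018-08-01", "2018-08-01")),
  course = c("module5", "module4", "module5", "module4",
             "module5", "module4", "module5", "module4",
             "module5", "module4", "withdrawn",
             "module4", "module5", "module4", "module5"),
  stringsAsFactors = FALSE
)

# Base R restatement of the selection logic (hypothetical helper name)
select_dates_base <- function(start, end, course) {
  if (length(unique(start)) == 1) return(which.max(course == "module5"))
  inds <- max(start) < end
  if (any(course[inds] == "module5") && all(course[inds] != "withdrawn")) {
    which.max(inds & course == "module5")
  } else {
    which.max(start == max(start[inds]))
  }
}

# Row numbers per (id, event) group, then the row chosen in each group
groups <- split(seq_len(nrow(df)), list(df$id, df$event), drop = TRUE)
kept <- sort(unlist(lapply(groups, function(rows) {
  rows[select_dates_base(df$enrolment_date[rows],
                         df$Enrolment_to[rows],
                         df$course[rows])]
}), use.names = FALSE))
kept
```

The resulting row numbers are 2, 4, 5, 7, 11, 13 and 15, which are the rows the question asks to keep.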