Defining which status to keep when removing duplicate IDs in a data.table

Question

I have the following dataset:

library(data.table)
DT <- data.table(
      id = c(1,2,3,4,4,5,6,6,7),
      date  = c("2013-11-22","2017-01-24","2020-02-10","2011-01-03"
               ,"2011-01-03","2012-04-03","2010-09-03","2010-09-03"
               ,"2010-05-03"),                         
      status = c("Never","Current","Former",NA,"Former"
                , NA,"Never","Former","Current")
     )

I want to keep a single row per unique id and remove the duplicates.

The desired output looks like this:

   id       date  status
1:  1 2013-11-22   Never
2:  2 2017-01-24 Current
3:  3 2020-02-10  Former
4:  4 2011-01-03  Former
5:  5 2012-04-03    <NA>
6:  6 2010-09-03  Former
7:  7 2010-05-03 Current

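The original dataset has many more rows and columns, so a data.table solution would save time. Some ids also appear more than once. I previously tried keeping the row with the latest date for each id. However, too many of those rows have NA as the status while another row with an earlier date has an actual status entry.

How can I define which status should be kept when the id is the same?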

Tags: r, duplicates, data.table

Answer


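We can convert status to an ordered factor with the levels given in order of preference, use it together with id in order(), and then take the unique rows by id. unique() keeps the first row of each id, and order() places NA last by default, so a missing status is kept only when an id has no other entry.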

library(data.table)
unique(DT[order(id, ordered(status, c("Former", "Current", "Never")))], by = 'id')
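
The same idea can also be written as a small reusable helper. The sketch below is only an illustration: keep_by_priority and its priority argument are hypothetical names that do not come from the answer, and the preference order "Former", "Current", "Never" is the one used in the one-liner above. It ranks each status with match(), sorts with base order() so that NA ranks go last, and keeps the first row per id.

```r
library(data.table)

DT <- data.table(
  id     = c(1, 2, 3, 4, 4, 5, 6, 6, 7),
  date   = c("2013-11-22", "2017-01-24", "2020-02-10", "2011-01-03",
             "2011-01-03", "2012-04-03", "2010-09-03", "2010-09-03",
             "2010-05-03"),
  status = c("Never", "Current", "Former", NA, "Former",
             NA, "Never", "Former", "Current")
)

# Hypothetical helper (not part of data.table): keep one row per id,
# preferring the status that appears earliest in `priority`.
keep_by_priority <- function(dt, priority) {
  # Position of each status in `priority`; NA or unlisted statuses give NA
  status_rank <- match(dt$status, priority)
  # Sort by id, then by rank; NA ranks are placed last within each id
  row_order <- order(dt$id, status_rank, na.last = TRUE)
  # unique() with `by` keeps the first row of each id
  unique(dt[row_order], by = "id")
}

result <- keep_by_priority(DT, c("Former", "Current", "Never"))
print(result)
```

On the sample data this should reproduce the desired output from the question. If the real data uses a different preference, only the priority vector needs to change.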
