首页 > 解决方案 > 按时间窗口条件组替换缺失值

问题描述

我按 分组subject_id

变量 acpa 在 所在的行上测量row_type == BIOBANK。我想替换<NA>其他行上的 ,但只能在给定的时间窗口中,使用该index_date变量。

这是我的桌子:

   subject_id index_date      row_type acpa
 1:        155 2010-05-12 BASELINE_info <NA>
 2:        155 2010-05-12 BASELINE_info <NA>
 3:        155 2010-05-12       BIOBANK    N
 4:        155 2010-05-12 PHYSICAL_exam <NA>
 5:        155 2010-09-29     FOLLOW_UP <NA>
 6:        155 2011-11-30     FOLLOW_UP <NA>
 7:        155 2013-06-01     FOLLOW_UP <NA>
 8:        155 2014-06-01     FOLLOW_UP <NA>
 9:        155 2015-06-01     FOLLOW_UP <NA>
10:        155 2016-08-31     FOLLOW_UP <NA>
11:        568 2012-04-07 BASELINE_info <NA>
12:        568 2012-04-07 BASELINE_info <NA>
13:        568 2012-04-07     FOLLOW_UP <NA>
14:        568 2012-04-19       BIOBANK    H
15:        568 2012-04-19 PHYSICAL_exam <NA>
16:        568 2013-06-01     FOLLOW_UP <NA>
17:        568 2013-12-12       BIOBANK    H
18:        568 2013-12-12 PHYSICAL_exam <NA>
19:        568 2014-04-27     FOLLOW_UP <NA>
20:        568 2014-05-22       RA_DIAG <NA>
21:        568 2014-05-28 PHYSICAL_exam <NA>
22:        568 2018-11-06       BIOBANK    L
23:        568 2018-11-06 PHYSICAL_exam <NA>
24:        568 2018-11-27     FOLLOW_UP <NA>
25:        568 2019-06-20     FOLLOW_UP <NA>
    subject_id index_date      row_type acpa

例如,在第 3 行,acpa=“N”。我想执行诸如应用之类的操作na.locf(),但仅适用于对应的index_date.

library(date.table)
library(zoo)
TABLE[, acpa := na.locf(acpa, na.rm = F), by = subject_id] 

我怎么能建立一个rule论据na.locf()所以它只能填补我的时间窗口中的空白?

我想要的输出如下所示:

   subject_id index_date      row_type acpa
 1:        155 2010-05-12 BASELINE_info    N
 2:        155 2010-05-12 BASELINE_info    N
 3:        155 2010-05-12       BIOBANK    N
 4:        155 2010-05-12 PHYSICAL_exam    N
 5:        155 2010-09-29     FOLLOW_UP    N
 6:        155 2011-11-30     FOLLOW_UP <NA>
 7:        155 2013-06-01     FOLLOW_UP <NA>
 8:        155 2014-06-01     FOLLOW_UP <NA>
 9:        155 2015-06-01     FOLLOW_UP <NA>
10:        155 2016-08-31     FOLLOW_UP <NA>
11:        568 2012-04-07 BASELINE_info    H
12:        568 2012-04-07 BASELINE_info    H
13:        568 2012-04-07     FOLLOW_UP    H
14:        568 2012-04-19       BIOBANK    H
15:        568 2012-04-19 PHYSICAL_exam    H
16:        568 2013-06-01     FOLLOW_UP    H
17:        568 2013-12-12       BIOBANK    H
18:        568 2013-12-12 PHYSICAL_exam    H
19:        568 2014-04-27     FOLLOW_UP    H
20:        568 2014-05-22       RA_DIAG    H
21:        568 2014-05-28 PHYSICAL_exam    H
22:        568 2018-11-06       BIOBANK    L
23:        568 2018-11-06 PHYSICAL_exam    L
24:        568 2018-11-27     FOLLOW_UP    L
25:        568 2019-06-20     FOLLOW_UP    L
    subject_id index_date      row_type acpa

如果由于时间窗口重叠而发生冲突,我会保留最后一个(之前的)非 NA 的值。

标签: rdataframeconditional-statementsna

解决方案


我的数据经理同事提出了这个建议:

从前瞻性预测开始。

    DT[!is.na(acpa), variable_t := index_date] # variable_t copied as the index date of the variable to impute
    DT[, acpa_c := na.locf(acpa, na.rm = F), by = subject_id] # locf for the variable in a new variable
    DT[, variable_t := na.locf(variable_t, na.rm = F), by = subject_id] # locf of the variable_time
    DT[, delay := as.numeric(difftime(index_date, variable_t ,units = "days"))] # calculating on each row the delay 
    DT[delay > 365, acpa_c := NA] # cutting off values above my threshold

然后,向后投影也是如此。

    DT[, acpa_c := na.locf(acpa_c , na.rm = F, fromLast = T),by = subject_id] ### na.lofb, in the corrected variable
    DT[!is.na(acpa), variable_t := index_date] # reset of the temporary variable_time, so we can projet it backward
    DT[, variable_t := na.locf(ratio_t, na.rm = F, fromLast = T),by = subject_id] # locb of the variable_time
    DT[, delay := as.numeric(difftime(index_date, variable_t , units = "days"))] # delay
    DT[delay < -180, acpa_c := NA] # cutting off values above my threshold

如果一切正常,则用新值替换。

DT[, acpa := acpa_c]
DT[, acpa_c := NULL]

推荐阅读