Improving code execution time with data.table and a for-loop

Problem description

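Question: How can I make the for loop in the code below run more efficiently? For this toy example it finishes in a reasonable time. However, unique_ids will be a vector of roughly 8,000 entries, and the for loop slows the computation down considerably. Any ideas? Many thanks!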

Goal: Retroactively classify each IID, for every day, as hop or top according to the calculation logic in the for loop.

Initial data:

   IID      ENTRY     FINISH     TARGET max_finish_target_date
1:   1 2020-02-11 2020-02-19 2020-02-15             2020-02-19
2:   2 2020-02-13 2020-02-17 2020-02-19             2020-02-19

Final (target) data:

       IID      Dates ind_frist
 1:      1 2020-02-10             
 2:      1 2020-02-11 hop
 3:      1 2020-02-12 hop
 4:      1 2020-02-13 hop
 5:      1 2020-02-14 hop
 6:      1 2020-02-15 hop
 7:      1 2020-02-16 top
 8:      1 2020-02-17 top
 9:      1 2020-02-18 top
10:      1 2020-02-19 top
11:      2 2020-02-10             
12:      2 2020-02-11             
13:      2 2020-02-12             
14:      2 2020-02-13 hop
15:      2 2020-02-14 hop
16:      2 2020-02-15 hop
17:      2 2020-02-16 hop
18:      2 2020-02-17 hop
19:      2 2020-02-18             
20:      2 2020-02-19             
21:      3 2020-02-10             
22:      3 2020-02-11             
23:      3 2020-02-12             
24:      3 2020-02-13             
25:      3 2020-02-14             
26:      3 2020-02-15 hop
27:      3 2020-02-16 hop
28:      3 2020-02-17 top
29:      3 2020-02-18 top
30:      3 2020-02-19 top

Code

rm(list = ls())

library(data.table)
library(lubridate)  # provides the date parsers used below, e.g. ymd()

# Some sample start data
initial_dt <- data.table(IID = c(1, 2, 3),
                         ENTRY = c("2020-02-11", "2020-02-13", "2020-02-15"),
                         FINISH = c("2020-02-19", "2020-02-17", ""),
                         TARGET = c("2020-02-15", "2020-02-19", "2020-02-16"))

initial_dt[, ":="(ENTRY = ymd(ENTRY),
                  FINISH = ymd(FINISH),
                  TARGET = ymd(TARGET))]

# Records without a FINISH date are treated as still open: use today's date
initial_dt[is.na(FINISH), FINISH := Sys.Date()]


initial_dt[, max_finish_target_date := pmax(FINISH, TARGET)]


# Specify target data shape and output format
unique_ids <- c(1, 2, 3) 

dts <- seq(as.Date("2020-02-10"), Sys.Date(), by = "days")

ids <- rep(unique_ids, each = length(dts))
len <- length(unique_ids)

final_dt <- data.table(IID = ids,
                       Dates = rep(dts, times = len))

# Calculation logic
# QUESTION: How can I make this part below run more efficiently and less time costly?
for (d_id in unique_ids){
  final_dt[(IID == d_id) & (Dates %between% c(initial_dt[IID == d_id, ENTRY], initial_dt[IID == d_id, max_finish_target_date])), 
                ind_frist := ifelse((Dates > initial_dt[IID == d_id, TARGET]) & (Dates <= initial_dt[IID == d_id, max_finish_target_date]), 
                                    "hop", 
                                    "top")]
}

Tags: r, performance, for-loop, data.table

Solution


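Your loop does not produce the output you show. The following non-equi joins do produce that output, and they can easily be adapted to other rules (for example, the rules from the for loop):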

final_dt <- CJ(IID = initial_dt[["IID"]], Dates = dts)
final_dt[initial_dt, ind_frist := "hop", on = .(IID, Dates >= ENTRY, Dates <= FINISH)]
final_dt[initial_dt, ind_frist := "top", on = .(IID, Dates > TARGET, Dates <= FINISH)]

These joins should be very fast.
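
To get a rough feel for how the two joins behave at the size mentioned in the question, the following is a minimal timing sketch rather than a benchmark. Everything in it is an assumption made for illustration: the synthetic table big_dt, the 81-day date window, and the randomly generated ENTRY, FINISH and TARGET values. Only the figure of about 8,000 IDs comes from the question. Timings depend on the machine, so none are quoted here.

```r
library(data.table)

set.seed(1)        # reproducible synthetic data
n_ids <- 8000L     # roughly the number of IDs mentioned in the question

# Assumed date window; the real window runs from 2020-02-10 to the current date
dts <- seq(as.Date("2020-02-10"), as.Date("2020-04-30"), by = "days")

# Hypothetical records: random ENTRY, with FINISH and TARGET some days after ENTRY
big_dt <- data.table(IID = seq_len(n_ids))
big_dt[, ENTRY  := min(dts) + sample(0:30, n_ids, replace = TRUE)]
big_dt[, FINISH := ENTRY + sample(1:30, n_ids, replace = TRUE)]
big_dt[, TARGET := ENTRY + sample(1:30, n_ids, replace = TRUE)]

# One row per (IID, date) combination
big_final <- CJ(IID = big_dt[["IID"]], Dates = dts)

# Time the same two update joins used in the answer
elapsed <- system.time({
  big_final[big_dt, ind_frist := "hop", on = .(IID, Dates >= ENTRY, Dates <= FINISH)]
  big_final[big_dt, ind_frist := "top", on = .(IID, Dates > TARGET, Dates <= FINISH)]
})
print(elapsed)
```

Each join updates big_final by reference in a single pass over all IDs, so the work no longer involves one subsetting step per ID as in the loop.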

Result:

#    IID      Dates ind_frist
# 1:   1 2020-02-10      <NA>
# 2:   1 2020-02-11       hop
# 3:   1 2020-02-12       hop
# 4:   1 2020-02-13       hop
# 5:   1 2020-02-14       hop
# 6:   1 2020-02-15       hop
# 7:   1 2020-02-16       top
# 8:   1 2020-02-17       top
# 9:   1 2020-02-18       top
#10:   1 2020-02-19       top
#11:   2 2020-02-10      <NA>
#12:   2 2020-02-11      <NA>
#13:   2 2020-02-12      <NA>
#14:   2 2020-02-13       hop
#15:   2 2020-02-14       hop
#16:   2 2020-02-15       hop
#17:   2 2020-02-16       hop
#18:   2 2020-02-17       hop
#19:   2 2020-02-18      <NA>
#20:   2 2020-02-19      <NA>
#21:   3 2020-02-10      <NA>
#22:   3 2020-02-11      <NA>
#23:   3 2020-02-12      <NA>
#24:   3 2020-02-13      <NA>
#25:   3 2020-02-14      <NA>
#26:   3 2020-02-15       hop
#27:   3 2020-02-16       hop
#28:   3 2020-02-17       top
#29:   3 2020-02-18       top
#30:   3 2020-02-19       top
#    IID      Dates ind_frist
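
The answer mentions that the joins can be adapted to other rules, such as the ones written in the for loop. As a hedged illustration, here is one possible way to do that. The loop labels every date in [ENTRY, max_finish_target_date] as "hop" when it falls after TARGET and as "top" otherwise. The names today, loop_rule_dt, bounds, lower and upper are hypothetical, and the current date is pinned to 2020-02-19 (an assumption) so that the example is reproducible without lubridate.

```r
library(data.table)

# Assumption: "today" is fixed so the example gives the same result on any day
today <- as.Date("2020-02-19")

initial_dt <- data.table(
  IID    = c(1, 2, 3),
  ENTRY  = as.Date(c("2020-02-11", "2020-02-13", "2020-02-15")),
  FINISH = as.Date(c("2020-02-19", "2020-02-17", NA)),
  TARGET = as.Date(c("2020-02-15", "2020-02-19", "2020-02-16"))
)
initial_dt[is.na(FINISH), FINISH := today]
initial_dt[, max_finish_target_date := pmax(FINISH, TARGET)]

dts <- seq(as.Date("2020-02-10"), today, by = "days")
loop_rule_dt <- CJ(IID = initial_dt[["IID"]], Dates = dts)

# Step 1: every date in [ENTRY, max_finish_target_date] starts as "top"
# (the else-branch of the loop's ifelse)
loop_rule_dt[initial_dt, ind_frist := "top",
             on = .(IID, Dates >= ENTRY, Dates <= max_finish_target_date)]

# Step 2: dates strictly after TARGET, still inside the window, become "hop".
# The lower bound is the later of ENTRY and the day after TARGET.
bounds <- initial_dt[, .(IID,
                         lower = pmax(ENTRY, TARGET + 1),
                         upper = max_finish_target_date)]
loop_rule_dt[bounds, ind_frist := "hop", on = .(IID, Dates >= lower, Dates <= upper)]

print(loop_rule_dt)
```

The second join overwrites part of what the first one wrote, so the order of the two statements matters. Precomputing lower keeps each join condition to a single lower and upper bound per ID.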
