r - Using data.table and a for loop: improving code execution time
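Problem description
Question: How can I make the for loop in the code below run more efficiently? For this toy example it runs in a reasonable time. However, unique_ids will be a vector of roughly 8000 entries, and the for loop slows the computation down considerably. Any ideas? Many thanks!
Goal: Retroactively classify each day of every IID as hop or top, following the calculation logic in the for loop.
Initial data: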
IID ENTRY FINISH TARGET max_finish_target_date
1: 1 2020-02-11 2020-02-19 2020-02-15 2020-02-19
2: 2 2020-02-13 2020-02-17 2020-02-19 2020-02-19
Final (target) data:
IID Dates ind_frist
1: 1 2020-02-10
2: 1 2020-02-11 hop
3: 1 2020-02-12 hop
4: 1 2020-02-13 hop
5: 1 2020-02-14 hop
6: 1 2020-02-15 hop
7: 1 2020-02-16 top
8: 1 2020-02-17 top
9: 1 2020-02-18 top
10: 1 2020-02-19 top
11: 2 2020-02-10
12: 2 2020-02-11
13: 2 2020-02-12
14: 2 2020-02-13 hop
15: 2 2020-02-14 hop
16: 2 2020-02-15 hop
17: 2 2020-02-16 hop
18: 2 2020-02-17 hop
19: 2 2020-02-18
20: 2 2020-02-19
21: 3 2020-02-10
22: 3 2020-02-11
23: 3 2020-02-12
24: 3 2020-02-13
25: 3 2020-02-14
26: 3 2020-02-15 hop
27: 3 2020-02-16 hop
28: 3 2020-02-17 top
29: 3 2020-02-18 top
30: 3 2020-02-19 top
Code
rm(list = ls())
library(data.table)
library(lubridate)  # provides ymd(), used below

# Some sample start data
initial_dt <- data.table(IID = c(1, 2, 3),
                         ENTRY = c("2020-02-11", "2020-02-13", "2020-02-15"),
                         FINISH = c("2020-02-19", "2020-02-17", ""),
                         TARGET = c("2020-02-15", "2020-02-19", "2020-02-16"))
# Parse the date strings; the empty FINISH becomes NA (with a parse warning)
initial_dt[, `:=`(ENTRY = ymd(ENTRY),
                  FINISH = ymd(FINISH),
                  TARGET = ymd(TARGET))]
# Open-ended entries are treated as finishing today
initial_dt[is.na(FINISH), FINISH := Sys.Date()]
initial_dt[, max_finish_target_date := pmax(FINISH, TARGET)]

# Specify target data shape and output format: one row per IID and day
unique_ids <- c(1, 2, 3)
dts <- seq(as.Date("2020-02-10"), Sys.Date(), by = "days")
ids <- rep(unique_ids, each = length(dts))
len <- length(unique_ids)
final_dt <- data.table(IID = ids,
                       Dates = rep(dts, times = len))

# Calculation logic
# QUESTION: How can I make this part below run more efficiently and less time costly?
for (d_id in unique_ids){
  final_dt[(IID == d_id) & (Dates %between% c(initial_dt[IID == d_id, ENTRY], initial_dt[IID == d_id, max_finish_target_date])),
           ind_frist := ifelse((Dates > initial_dt[IID == d_id, TARGET]) & (Dates <= initial_dt[IID == d_id, max_finish_target_date]),
                               "hop",
                               "top")]
}
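Solution
Your loop does not produce the output you show. The following non-equi joins do produce that output, and they can easily be adapted to other rules (such as the rules from the for loop):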
final_dt <- CJ(IID = initial_dt[["IID"]], Dates = dts)
final_dt[initial_dt, ind_frist := "hop", on = .(IID, Dates >= ENTRY, Dates <= FINISH)]
final_dt[initial_dt, ind_frist := "top", on = .(IID, Dates > TARGET, Dates <= FINISH)]
These joins should be very fast.
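To get a feel for how the two joins behave at the size mentioned in the question, here is a rough timing sketch. It is only an illustration under stated assumptions: the 8000 IDs, the date range, and the random ENTRY/FINISH/TARGET offsets are all synthetic, and big_dt, big_final, and elapsed are made-up names that do not appear in the question.

```r
library(data.table)

set.seed(1)      # synthetic data, purely illustrative
n_ids <- 8000L   # roughly the number of IDs mentioned in the question
dts <- seq(as.Date("2020-02-10"), as.Date("2020-03-31"), by = "days")

# Hypothetical lookup table with random windows inside the date range
big_dt <- data.table(IID = seq_len(n_ids))
big_dt[, ENTRY  := min(dts) + sample(0:20, .N, replace = TRUE)]
big_dt[, FINISH := ENTRY + sample(1:20, .N, replace = TRUE)]
big_dt[, TARGET := ENTRY + sample(1:20, .N, replace = TRUE)]

# One row per IID and day, as in the answer
big_final <- CJ(IID = big_dt[["IID"]], Dates = dts)

# Time the same two non-equi update joins used in the answer
elapsed <- system.time({
  big_final[big_dt, ind_frist := "hop", on = .(IID, Dates >= ENTRY, Dates <= FINISH)]
  big_final[big_dt, ind_frist := "top", on = .(IID, Dates > TARGET, Dates <= FINISH)]
})
print(elapsed)
```

The elapsed time depends on the machine and the data.table version, so no figure is given here. The point of the sketch is that the work is done in two vectorized joins rather than in one subsetting pass per ID.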
Result of the two joins on the sample data:
# IID Dates ind_frist
# 1: 1 2020-02-10 <NA>
# 2: 1 2020-02-11 hop
# 3: 1 2020-02-12 hop
# 4: 1 2020-02-13 hop
# 5: 1 2020-02-14 hop
# 6: 1 2020-02-15 hop
# 7: 1 2020-02-16 top
# 8: 1 2020-02-17 top
# 9: 1 2020-02-18 top
#10: 1 2020-02-19 top
#11: 2 2020-02-10 <NA>
#12: 2 2020-02-11 <NA>
#13: 2 2020-02-12 <NA>
#14: 2 2020-02-13 hop
#15: 2 2020-02-14 hop
#16: 2 2020-02-15 hop
#17: 2 2020-02-16 hop
#18: 2 2020-02-17 hop
#19: 2 2020-02-18 <NA>
#20: 2 2020-02-19 <NA>
#21: 3 2020-02-10 <NA>
#22: 3 2020-02-11 <NA>
#23: 3 2020-02-12 <NA>
#24: 3 2020-02-13 <NA>
#25: 3 2020-02-14 <NA>
#26: 3 2020-02-15 hop
#27: 3 2020-02-16 hop
#28: 3 2020-02-17 top
#29: 3 2020-02-18 top
#30: 3 2020-02-19 top
# IID Dates ind_frist
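As an illustration of adapting the joins to other rules, the sketch below mirrors the rule written in the question's for loop: the window runs from ENTRY to max_finish_target_date, dates after TARGET are labelled "hop", and the remaining dates in the window are labelled "top". This is one possible reading of the loop, not part of the original answer. The names today, hop_from, and loop_rule_dt are hypothetical, and today is pinned to 2020-02-19 so that the result is reproducible.

```r
library(data.table)

# Assumption: "today" is pinned so the example does not depend on the run date
today <- as.Date("2020-02-19")

initial_dt <- data.table(
  IID    = c(1, 2, 3),
  ENTRY  = as.Date(c("2020-02-11", "2020-02-13", "2020-02-15")),
  FINISH = as.Date(c("2020-02-19", "2020-02-17", NA)),
  TARGET = as.Date(c("2020-02-15", "2020-02-19", "2020-02-16"))
)
initial_dt[is.na(FINISH), FINISH := today]
initial_dt[, max_finish_target_date := pmax(FINISH, TARGET)]
# Hypothetical helper column: first date inside the window that is after TARGET
initial_dt[, hop_from := pmax(ENTRY, TARGET + 1)]

dts <- seq(as.Date("2020-02-10"), today, by = "days")
loop_rule_dt <- CJ(IID = initial_dt[["IID"]], Dates = dts)

# The whole window ENTRY..max_finish_target_date first gets the loop's "else" label
loop_rule_dt[initial_dt, ind_frist := "top",
             on = .(IID, Dates >= ENTRY, Dates <= max_finish_target_date)]
# Dates after TARGET (still inside the window) are then overwritten with "hop"
loop_rule_dt[initial_dt, ind_frist := "hop",
             on = .(IID, Dates >= hop_from, Dates <= max_finish_target_date)]

print(loop_rule_dt)
```

With the sample data, IID 1 comes out as "top" from 2020-02-11 to 2020-02-15 and "hop" from 2020-02-16 to 2020-02-19, which is the target output with the labels swapped. That is consistent with the remark that the loop as written does not produce the output shown in the question.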