首页 > 解决方案 > 在两个时间戳之间左加入 R

问题描述

我的目标是对匹配项和时间戳在BETWEEN和表中的intervals位置执行左连接bike_idcreated_atrecordsstartendintervals

> class(records)
[1] "data.table" "data.frame"
> class(intervals)
[1] "data.table" "data.frame"

> records
  bike_id          created_at         resolved_at
1   28780 2019-05-03 08:29:18 2019-05-03 08:35:37
2   28780 2019-05-03 21:05:28 2019-05-03 21:07:28
3   28780 2019-05-04 21:13:39 2019-05-04 21:15:40
4   28780 2019-05-07 17:24:20 2019-05-07 17:26:39
5   28780 2019-05-08 11:34:32 2019-05-08 12:16:44
6   28780 2019-05-08 23:38:39 2019-05-08 23:40:36


> intervals
   bike_id               start                 end id
1:   28780 2019-05-03 04:44:45 2019-05-03 16:58:56  1
2:   28780 2019-05-04 07:07:39 2019-05-04 14:48:29  2
3:   28780 2019-05-07 23:28:32 2019-05-08 12:56:24  3
4:   28780 2019-05-10 06:06:21 2019-05-10 13:12:08  4
5:   28780 2019-05-12 05:21:24 2019-05-12 11:35:52  5
6:   28780 2019-05-13 08:44:54 2019-05-13 12:28:31  6

在这种情况下,输出看起来像

> output
  bike_id          created_at         resolved_at   id
1   28780 2019-05-03 08:29:18 2019-05-03 08:35:37    1
2   28780 2019-05-03 21:05:28 2019-05-03 21:07:28  NULL   
3   28780 2019-05-04 21:13:39 2019-05-04 21:15:40  NULL
4   28780 2019-05-07 17:24:20 2019-05-07 17:26:39  NULL
5   28780 2019-05-08 11:34:32 2019-05-08 12:16:44  NULL
6   28780 2019-05-08 23:38:39 2019-05-08 23:40:36  NULL

我已经尝试使用这里发布的解决方案,但这tidyverse会导致 R 内存不足(尽管两个表中的记录量只有大约 100K)

fuzzy_left_join(
 records, intervals,
  by = c(
    "bike_id" = "bike_id",
    "created_at" = "start",
    "created_at" = "end"
    ),
  match_fun = list(`==`, `>=`, `<=`)
  ) %>%
  select(id, bike_id = bike_id.x, created_at, start, end)

这会引发错误:Error: vector memory exhausted (limit reached?)

是否有另一种方法可以使用滚动加入data.table甚至在基础 R 中使用merge()?什么是通过 id 连接两个数据帧的好方法以及连接表中其他两个数据帧之间的时间戳?

这是数据

dput(intervals)
structure(list(bike_id = c(28780L, 28780L, 28780L, 28780L, 28780L, 
28780L), start = structure(c(1556858685, 1556953659, 1557271712, 
1557468381, 1557638484, 1557737094), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), end = structure(c(1556902736, 1556981309, 
1557320184, 1557493928, 1557660952, 1557750511), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), id = c(1, 2, 3, 4, 5, 6)), row.names = c(NA, 
-6L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x1030056e0>)

dput(records)
structure(list(bike_id = c(28780L, 28780L, 28780L, 28780L, 28780L, 
28780L), created_at = structure(c(1556872158.796, 1556917528.845, 
1557004419.928, 1557249860.939, 1557315272.396, 1557358719.333
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), resolved_at = structure(c(1556872537.867, 
1556917648.118, 1557004540.056, 1557249999.892, 1557317804.183, 
1557358836.202), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, 
6L), class = "data.frame")

标签: rdplyrdata.tabletidyverse

解决方案


我们可以使用data.tablenonequi join

library(data.table)
setDT(records)[intervals, on = .(bike_id, created_at >= start, created_at <= end)]

推荐阅读