r - Summing data over overlapping time intervals
Problem description
I am trying to create summary statistics for a set of filters that operate in a lead/lag arrangement.
A quick description of lead/lag:
When a new filter comes online, it is placed in the lag position, meaning water passes through it after going through the primary (a.k.a. lead) filter. When the current lead filter becomes clogged, the current lag filter moves into the lead position. In short, a filter starts in the lag position and then bumps up to the lead position.
Visually, you can picture it like this (diagram not reproduced here):
What I need to do is total the flow over the entire time a single filter is online, whether in the lead or the lag position.
Here is some sample data:
structure(list(record_timestamp = structure(c(1608192000, 1608192060,1608192120, 1608192180, 1608192240, 1608192300, 1608192360, 1608192420,1608192480, 1608192540, 1608192600, 1608192660, 1608192720, 1608192780,1608192840, 1608192900, 1608192960, 1608193020, 1608193080, 1608193140,1608193200, 1608193260, 1608193320, 1608193380, 1608193440, 1608193500,1608193560, 1608193620, 1608193680, 1608193740, 1608193800), class = c("POSIXct","POSIXt"), tzone = "UTC"), flow = c(20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10), lag_start = structure(c(1608192000,1608192000, 1608192000, 1608192000, 1608192000, 1608192000, 1608192000,1608192000, 1608192000, 1608192000, 1608192000, 1608192660, 1608192660,1608192660, 1608192660, 1608192660, 1608192660, 1608192660, 1608192660,1608192660, 1608192660, 1608193260, 1608193260, 1608193260, 1608193260,1608193260, 1608193260, 1608193260, 1608193260, 1608193260, 1608193260), class = c("POSIXct", "POSIXt"), tzone = "UTC"), lead_start = c("#N/A","#N/A", "#N/A", "#N/A", "#N/A", "#N/A", "#N/A", "#N/A", "#N/A","#N/A", "#N/A", "12/17/2020 8:11", "12/17/2020 8:11", "12/17/2020 8:11","12/17/2020 8:11", "12/17/2020 8:11", "12/17/2020 8:11", "12/17/2020 8:11","12/17/2020 8:11", "12/17/2020 8:11", "12/17/2020 8:11", "12/17/2020 8:21","12/17/2020 8:21", "12/17/2020 8:21", "12/17/2020 8:21", "12/17/2020 8:21","12/17/2020 8:21", "12/17/2020 8:21", "12/17/2020 8:21", "12/17/2020 8:21","12/17/2020 8:21")), class = c("spec_tbl_df", "tbl_df", "tbl","data.frame"), row.names = c(NA, -31L), spec = structure(list(cols = list(record_timestamp = structure(list(), class = c("collector_character","collector")), flow = structure(list(), class = c("collector_double","collector")), polish_start = structure(list(), class = c("collector_character", "collector")), lead_start = structure(list(), class = c("collector_character","collector"))), default = structure(list(), class = 
c("collector_guess","collector")), skip = 1), class = "col_spec"))
My thought was to "unnest" them and accept that there will be duplicate timestamps, but each row would be associated with only one filter. Any ideas on how to do that? The unnested DF would look like:
structure(list(record_timestamp = structure(c(1608192000, 1608192060,1608192120, 1608192180, 1608192240, 1608192300, 1608192360, 1608192420,1608192480, 1608192540, 1608192600, 1608192660, 1608192720, 1608192780,1608192840, 1608192900, 1608192960, 1608193020, 1608193080, 1608193140,1608193200, 1608192660, 1608192720, 1608192780, 1608192840, 1608192900,1608192960, 1608193020, 1608193080, 1608193140, 1608193200, 1608193260,1608193320, 1608193380, 1608193440, 1608193500, 1608193560,1608193620,1608193680, 1608193740, 1608193800), class = c("POSIXct", "POSIXt"), tzone = "UTC"), flow = c(20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), lag_start = structure(c(1608192000, 1608192000, 1608192000,1608192000, 1608192000, 1608192000, 1608192000, 1608192000,1608192000,1608192000, 1608192000, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1608192660, 1608192660, 1608192660, 1608192660, 1608192660, 1608192660,1608192660, 1608192660, 1608192660, 1608192660, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"), lead_start = structure(c(NA, NA, NA, NA, NA, NA, NA, NA,NA, NA, NA, 1608192660, 1608192660, 1608192660, 1608192660,1608192660, 1608192660, 1608192660, 1608192660, 1608192660,1608192660, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1608193260,1608193260, 1608193260, 1608193260, 1608193260, 1608193260,1608193260, 1608193260, 1608193260, 1608193260), class = c("POSIXct","POSIXt"), tzone = "UTC"), filter_id = c(1, 1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -41L), spec = structure(list(cols = list(record_timestamp = structure(list(), class = c("collector_character","collector")), flow = structure(list(), class = c("collector_double","collector")), polish_start = structure(list(), class = c("collector_character","collector")), lead_start = structure(list(), class = c("collector_character", "collector")), filter_id = structure(list(), class = c("collector_double","collector"))), default = structure(list(), class = c("collector_guess","collector")), skip = 1), class = "col_spec"))
However, I realized this would roughly double the size of the data I'm working with, which is already several years of one-minute readings. So if there is a way to do this without doubling up the timestamps, that would be preferred.
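For what it's worth, the unnesting idea can be sketched roughly as follows on a toy version of the data. This is only an illustration: the values are invented, and `lead_start` is assumed to have already been parsed to POSIXct (the `"#N/A"` strings becoming `NA`).

```r
library(dplyr)

# Toy stand-in for the real data: three filters, two readings each.
# lead_start is assumed already parsed to POSIXct ("#N/A" -> NA).
t0 <- as.POSIXct("2020-12-17 08:00", tz = "UTC")
df <- data.frame(
  record_timestamp = t0 + 60 * 0:5,
  flow             = c(20, 20, 15, 15, 10, 10),
  lag_start        = t0 + c(0, 0, 120, 120, 240, 240),
  lead_start       = t0 + c(NA, NA, 120, 120, 240, 240)
)

# Each filter's history is its lag phase plus the later rows where it
# has moved into the lead position, so stack the two phases.
lag_phase <- df %>%
  mutate(filter_id = dense_rank(lag_start))

lead_phase <- df %>%
  filter(!is.na(lead_start)) %>%
  mutate(filter_id = dense_rank(lead_start))

unnested <- bind_rows(lag_phase, lead_phase)

# Per-filter totals over both phases
per_filter <- unnested %>%
  group_by(filter_id) %>%
  summarise(total_flow = sum(flow))
```

As noted, this duplicates most rows; the answer below instead sums each `lag_start` block once and adds adjacent blocks, avoiding the duplication.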
Finally, the end goal is a small summary DF that looks like this:
Filter ID | Total Flow
----------------------------
1 | 370
2 | 250
... | ...
Solution
Thanks for the additional information. It looks like you can simply group_by the lag_start time. Then you can total the flow for each of those blocks (which is the flow while one filter is in the lag position and the previous filter is in the lead position). After that, you can assign filter numbers in order, and the total flow for each filter is the sum of flow_per_position for its own row and the following row. Does this give the expected result?
library(dplyr)

df %>%
  group_by(lag_start) %>%
  summarise(flow_per_position = sum(flow)) %>%
  mutate(filter_id = row_number(),
         total_filter_flow = flow_per_position + lead(flow_per_position, default = 0))
Output
lag_start flow_per_position filter_id total_filter_flow
<dttm> <dbl> <int> <dbl>
1 2020-12-17 08:00:00 220 1 370
2 2020-12-17 08:11:00 150 2 250
3 2020-12-17 08:21:00 100 3 100
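To get from that output to the summary DF requested in the question, keep just the filter id and its total. A minimal sketch, re-entering the printed values by hand since it only illustrates the final select (`res` is a hypothetical name for the pipeline's result; note that row 3's total covers only filter 3's lag phase so far):

```r
library(dplyr)

# `res` stands in for the result of the pipeline above; the values are
# copied from its printed output.
res <- data.frame(
  lag_start = as.POSIXct(c("2020-12-17 08:00", "2020-12-17 08:11",
                           "2020-12-17 08:21"), tz = "UTC"),
  flow_per_position = c(220, 150, 100),
  filter_id = 1:3,
  total_filter_flow = c(370, 250, 100)
)

# Keep only the columns the question asked for, renaming on the way
summary_df <- res %>%
  select(filter_id, total_flow = total_filter_flow)
```

Filters 1 and 2 match the desired table (370 and 250).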