Summarising values over overlapping time periods

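Problem description

I am trying to summarise values over overlapping time periods. I can only use the tidyr, ggplot2 and dplyr libraries. Base R is preferred, though.

My data looks like this, although it usually has about 100 records: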

df <- structure(list(Start = structure(c(1546531200, 1546531200, 546531200, 1546638252.6316, 1546549800, 1546534800, 1546545600, 1546531200, 1546633120, 1547065942.1053), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Stop = structure(c(1546770243.1579, 1546607400, 1547110800, 1546670652.6316, 1547122863.1579, 1546638252.6316, 1546878293.5579, 1546416000, 1546849694.4, 1547186400), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Value = c(12610, 520, 1500, 90, 331380, 27300, 6072, 4200, 61488, 64372)), .Names = c("Start", "Stop", "Value"), row.names = c(41L, 55L, 25L, 29L, 38L, 28L, 1L, 20L, 14L, 31L), class = c("tbl_df", "tbl", "data.frame"))

head(df) and str(df) give:

          Start                Stop      Value
2019-01-03 16:00:00 2019-01-06 10:24:03  12610
2019-01-03 16:00:00 2019-01-04 13:10:00    520
2019-01-03 16:00:00 2019-01-10 09:00:00   1500
2019-01-04 21:44:12 2019-01-05 06:44:12     90
2019-01-03 21:10:00 2019-01-10 12:21:03 331380
2019-01-03 17:00:00 2019-01-04 21:44:12  27300

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   10 obs. of  3 variables:
 $ Start: POSIXct, format: "2019-01-03 16:00:00" "2019-01-03 16:00:00" ...
 $ Stop : POSIXct, format: "2019-01-06 10:24:03" "2019-01-04 13:10:00" ...
 $ Value: num  12610 520 1500 90 331380 ...

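So there are overlapping time periods, each with a "Start" date, a "Stop" date and an assigned value. For any given record, the value applies between df$Start and df$Stop; outside that range the value is 0.

I want to create another data frame from which I can show how these values add up and change over time. The desired output would look like this (the "sum" column is made up):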

> head(df2)
              timestamp     sum
"2019-01-02 09:00:00 CET"   14352
"2019-01-03 17:00:00 CET"   6253
"2019-01-03 18:00:00 CET"   23465
"2019-01-03 21:00:00 CET"   3241
"2019-01-03 22:10:00 CET"   23235
"2019-01-04 14:10:00 CET"   123321

To get the unique timestamps:

timestamps <- sort(unique(c(df$`Start`, df$`Stop`)))

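With the df2 data frame I could easily draw the plot with ggplot, but how do I get these sums?

I think I should iterate over the df data frame, either with a custom function or with some built-in summarising function, that would work like this: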

fnct <- function(date, min, max, value) {
  if (date >= min && date <= max) {
    a <- value
  } else {
    a <- 0
  }
  return(a)
}

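...so that for every given date from timestamps it iterates over df and gives me the sum of values for that timestamp.

It looks really simple, and I am probably missing something very basic.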
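The iteration described above could be sketched in base R roughly as follows. This is only an illustration on a small made-up data frame (the three intervals and their values are assumptions, not the data from the question); it reuses the names df, timestamps and df2 from the question and treats both interval ends as inclusive, as fnct does.

```r
# Small made-up example (NOT the data from the question)
df <- data.frame(
  Start = as.POSIXct(c("2019-01-03 16:00:00", "2019-01-03 17:00:00",
                       "2019-01-04 09:00:00"), tz = "UTC"),
  Stop  = as.POSIXct(c("2019-01-04 12:00:00", "2019-01-03 20:00:00",
                       "2019-01-05 00:00:00"), tz = "UTC"),
  Value = c(10, 5, 2)
)

# Unique interval boundaries, as in the question
timestamps <- sort(unique(c(df$Start, df$Stop)))

# For each timestamp, add up the values of all records whose
# [Start, Stop] interval contains it (both ends inclusive)
sums <- vapply(
  seq_along(timestamps),
  function(i) {
    ts <- timestamps[i]
    sum(df$Value[df$Start <= ts & ts <= df$Stop])
  },
  numeric(1)
)

df2 <- data.frame(timestamp = timestamps, sum = sums)
print(df2)
```

This compares every timestamp against every record, so the work grows with the number of timestamps times the number of records, which should be unproblematic for around 100 records.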

Tags: r

Solution


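In the past I have tried to solve similar problems with tidyverse/base R, but nothing came close to the speed that data.table offers for this kind of operation, so I encourage you to give it a try.

For problems like this, my favourite is the foverlaps() function from the data.table package. With this function you can (quickly!) perform an overlap join. If you need more flexibility in your join than foverlaps() offers, a non-equi join (again using data.table) is probably the best (and fastest!) option. But foverlaps() will do here (I guess).

I used the sample data you provided, but filtered out the rows where Stop <= Start (probably a typo in your sample data). When df$Start is not before df$Stop, foverlaps gives a warning and will not execute.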
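For comparison, the non-equi join mentioned above might look roughly like the sketch below. The tables periods and days and their contents are made up for illustration. The join conditions are my reading of "a day lies entirely within a period", not something taken from the answer's own code.

```r
library(data.table)

# Made-up periods with values (hypothetical data)
periods <- data.table(
  Start = as.POSIXct(c("2019-01-01 12:00:00", "2019-01-02 00:00:00"), tz = "UTC"),
  Stop  = as.POSIXct(c("2019-01-04 12:00:00", "2019-01-03 00:00:00"), tz = "UTC"),
  Value = c(100, 7)
)

# Made-up daily bins: each day runs from 00:00:00 to 23:59:59
days <- data.table(
  Start = as.POSIXct(c("2019-01-01", "2019-01-02", "2019-01-03"), tz = "UTC")
)
days[, Stop := Start + 86400 - 1]

# Non-equi join: keep a period for a day only if it starts at or before
# the day's start and stops at or after the day's end; sum per day
res <- periods[days,
               on = .(Start <= Start, Stop >= Stop),
               .(Value = sum(Value, na.rm = TRUE)),
               by = .EACHI]
print(res)
```

Days without any matching period yield NA for Value inside the join, so sum(..., na.rm = TRUE) turns them into 0.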
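To see which rows such a filter would drop, a quick base R check along these lines may help. The two rows below are copied from the sample data (the records with Value 520 and 4200); the name bad is just a hypothetical helper.

```r
# Two rows from the sample data: the second one stops before it starts
df <- data.frame(
  Start = as.POSIXct(c("2019-01-03 16:00:00", "2019-01-03 16:00:00"), tz = "UTC"),
  Stop  = as.POSIXct(c("2019-01-04 13:10:00", "2019-01-02 08:00:00"), tz = "UTC"),
  Value = c(520, 4200)
)

# Rows whose interval is empty or reversed
bad <- df[df$Stop <= df$Start, ]
print(bad)
```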

library( data.table )

#create data.table with periods you wish to summarise on
#NB: UTC is used as timezone, since this is also the case in the sample data provided!!
dt.dates <- data.table( id = paste0( "Day", 1:31 ),
                        Start = seq( as.POSIXct( "2019-01-01 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ),
                                     as.POSIXct( "2019-01-31 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ),
                                     by = "1 days"),
                        Stop = seq( as.POSIXct( "2019-01-02 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ) - 1,
                                    as.POSIXct( "2019-02-01 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ) - 1,
                                    by = "1 days") )

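If you want to summarise by hour, minute, second or year instead of by day, just change the values (and the step size) in the dt.dates data.table so that they match your periods.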
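As an illustration of changing the step size, hourly bins could be built like this. The names bin.start and dt.hours and the single-day range are assumptions for the sketch; widen the range to cover your data.

```r
library(data.table)

# Hypothetical hourly bins covering one day (2019-01-03, UTC)
bin.start <- seq(as.POSIXct("2019-01-03 00:00:00", tz = "UTC"),
                 as.POSIXct("2019-01-03 23:00:00", tz = "UTC"),
                 by = "1 hour")

dt.hours <- data.table(
  id    = paste0("Hour", seq_along(bin.start)),
  Start = bin.start,
  Stop  = bin.start + 3600 - 1  # each bin ends one second before the next starts
)
print(head(dt.hours))
```

dt.hours can then take the place of dt.dates in the overlap join.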

#set df as data.table
dt <- as.data.table( df )
#filter out any row where Stop is smaller than Start
dt <- dt[ Start < Stop, ]

#perform overlap join
#first set keys
setkey(dt, Start, Stop)
#then perform join
result <- foverlaps( dt.dates, dt, type = "within" )
#summarise
result[, .( Value = sum( Value , na.rm = TRUE ) ), by = .(Day = i.Start) ]
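Note that type = "within" only counts a period for a day when the whole day lies inside that period. If partially covered days should count as well, type = "any" is the other option foverlaps() offers. The sketch below contrasts the two on made-up data; periods, days, res.within and res.any are hypothetical names.

```r
library(data.table)

# One made-up period from noon on 1 Jan to noon on 2 Jan
periods <- data.table(
  Start = as.POSIXct("2019-01-01 12:00:00", tz = "UTC"),
  Stop  = as.POSIXct("2019-01-02 12:00:00", tz = "UTC"),
  Value = 10
)
setkey(periods, Start, Stop)

# Two full days as bins
days <- data.table(
  Start = as.POSIXct(c("2019-01-01", "2019-01-02"), tz = "UTC")
)
days[, Stop := Start + 86400 - 1]

# "within": the day must lie completely inside the period
res.within <- foverlaps(days, periods, type = "within")
# "any": any overlap between day and period counts
res.any <- foverlaps(days, periods, type = "any")

print(res.within$Value)
print(res.any$Value)
```

Neither day is fully covered by the period, so "within" finds no match for either day, while "any" matches the period to both.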

Output

#            Day  Value
#  1: 2019-01-01   1500
#  2: 2019-01-02   1500
#  3: 2019-01-03   1500
#  4: 2019-01-04 351562
#  5: 2019-01-05 413050
#  6: 2019-01-06 400440
#  7: 2019-01-07 332880
#  8: 2019-01-08 332880
#  9: 2019-01-09 332880
# 10: 2019-01-10  64372
# 11: 2019-01-11      0
# 12: 2019-01-12      0
# 13: 2019-01-13      0
# 14: 2019-01-14      0
# 15: 2019-01-15      0
# 16: 2019-01-16      0
# 17: 2019-01-17      0
# 18: 2019-01-18      0
# 19: 2019-01-19      0
# 20: 2019-01-20      0
# 21: 2019-01-21      0
# 22: 2019-01-22      0
# 23: 2019-01-23      0
# 24: 2019-01-24      0
# 25: 2019-01-25      0
# 26: 2019-01-26      0
# 27: 2019-01-27      0
# 28: 2019-01-28      0
# 29: 2019-01-29      0
# 30: 2019-01-30      0
# 31: 2019-01-31      0
#            Day  Value

Plot

#summarise for plot
result.plot <- result[, .( Value = sum( Value , na.rm = TRUE ) ), by = .(Day = i.Start) ]
library( ggplot2 )
ggplot( data = result.plot, aes( x = Day, y = Value ) ) + geom_col()
