首页 > 解决方案 > Postgresql 连续时间之间的平均间隔——按星期几

问题描述

我试图获得 /difference of times / 沿我的时间戳变量的平均值和中位数,以开发相对于前一个时间戳的数据是否“迟到”的预期。

也就是说——现在和最近的时间戳之间的差异是否超过了当天时间戳与其前一个时间戳之间的平均时间间隔?(或多或少,我希望创建到达过程的 lambda/mu 项,具体来说,我希望这个平均值/中位数对星期几敏感,因为每个 DOW的到达过程看起来都不同。 )

我的表如下所示:

create table Events (
  id serial NOT NULL PRIMARY KEY,
  datetime timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP
  );
  

insert into Events (datetime) VALUES 
('2021-03-28 00:00:00-06'),  -- say sunday is sparse, look it's every 8 hours-ISH, GIVE OR TAKE....
('2021-03-28 08:00:00-06'),
('2021-03-28 16:00:00-06'),
('2021-03-28 23:59:00-06'),

('2021-04-04 01:00:00-06'),
('2021-04-04 09:00:00-06'),
('2021-04-04 16:30:00-06'),

('2021-03-29 00:05:00-06'),  -- say monday is dense, look, it's approximately hourly
('2021-03-29 01:04:00-06'),
('2021-03-29 02:03:00-06'),
('2021-03-29 03:02:00-06'),
('2021-03-29 04:01:00-06'),
('2021-03-29 05:00:00-06'),
('2021-03-29 06:01:00-06'),
('2021-03-29 07:02:00-06'),
('2021-03-29 08:03:00-06'),
('2021-03-29 09:04:00-06'),
('2021-03-29 10:05:00-06'),
('2021-03-29 11:01:00-06'),
('2021-03-29 12:02:00-06'),
('2021-03-29 13:03:00-06'),
('2021-03-29 14:04:00-06'),
('2021-03-29 15:05:00-06'),
('2021-03-29 16:04:00-06'),
('2021-03-29 17:03:00-06'),
('2021-03-29 18:02:00-06'),
('2021-03-29 19:01:00-06'),
('2021-03-29 20:01:00-06'),
('2021-03-29 21:02:00-06'),
('2021-03-29 22:03:00-06'),
('2021-03-29 23:05:00-06'),
('2021-03-29 23:59:00-06'),


('2021-04-05 00:04:00-06'),
('2021-04-05 01:03:00-06'),
('2021-04-05 02:02:00-06'),
('2021-04-05 03:01:00-06'),
('2021-04-05 04:01:00-06'),
('2021-04-05 05:03:00-06'),
('2021-04-05 06:04:00-06'),
('2021-04-05 07:01:00-06'),
('2021-04-05 08:02:00-06'),
('2021-04-05 09:03:00-06'),
('2021-04-05 10:04:00-06'),
('2021-04-05 11:05:00-06'),
('2021-04-05 12:02:00-06'),
('2021-04-05 13:03:00-06'),
('2021-04-05 14:04:00-06'),
('2021-04-05 15:02:00-06'),
('2021-04-05 16:05:00-06'),
('2021-04-05 17:03:00-06'),
('2021-04-05 18:02:00-06'),
('2021-04-05 19:01:00-06'),
('2021-04-05 20:01:00-06'),
('2021-04-05 21:02:00-06'),
('2021-04-05 22:03:00-06'),
('2021-04-05 23:05:00-06'),
('2021-04-05 23:59:00-06'
); 

我的暂定查询如下所示:

with subset as (
  select * from Events
  where extract (dow from datetime::timestamptz AT TIME ZONE 'America/Chicago'  = '0') -- subset on today is sunday
)
select 
  LAG(datetime,1) over (ORDER BY datetime) as lagged_interval,
  extract (epoch from AVG(LAG(datetime,1) over (ORDER BY datetime))) as average_interval_in_seconds,
  extract (epoch from percentile_cont(0.5) within group (LAG(datetime,1) over (ORDER BY datetime))) as median_interval_in_seconds,
  max(datetime) as most_recent
from subset;

你可以在 db-fiddle中看到这个。

我难过的地方是:

  1. LAG是区分时间戳与前一个时间戳的正确运算符吗?这是一种 rowid 函数吗?不同或更好的窗口函数?我现在得到的错误是:

    查询错误:错误:“LAG”处或附近的语法错误

但是我对此有点困惑,因为根据那个文档它看起来是正确的。

  1. 如何处理分区中一天中的第一次和最后一次?如果我只是对周日做一个子集(有时区转换)不希望本周周日的开始回顾上周周日的结束,那会太长,我应该在我的外部添加一个 whereFROM子句排除第一个观察,除了滞后?

更新

我认为这个查询更接近,它至少返回:

with subset as (
select * 
from Events
where extract (dow from datetime::timestamptz AT TIME ZONE 'America/Chicago')  = '0' -- subset on today is sunday
)
(select
  max(most_recent),
  AVG(ss.diff)/60 as avg_in_minutes,
  percentile_cont(0.5) 
    within group(order by ss.diff)/60 as q50_in_minutes,
  percentile_cont(0.05) 
    within group(order by ss.diff)/60 as q05_in_minutes,
  percentile_cont(0.95) 
    within group(order by ss.diff)/60 as q95_in_minutes
from (
  select 
  max(datetime) over (order by datetime) as most_recent,
  extract(epoch from datetime) as t1,
  extract(epoch from LAG(datetime,1) over (ORDER BY datetime)) as t0,
  extract(epoch from datetime) - extract(epoch from LAG(datetime,1) over (ORDER BY datetime)) as diff
  from subset
) as ss);

结果看起来不错!

最大限度 avg_in_minutes q50_in_minutes q05_in_minutes q95_in_minutes
2021-04-04T22:30:00.000Z 2214 480 456 7439.999999999998

通过:https ://www.db-fiddle.com/f/PqdiwBNnnKx3bUw1ALgt2/0

但是,这个查询真是一团糟!现在我不清楚如何在我的子集中对分类变量进行分区以更普遍地做到这一点。例如,如果我的起始表有一个分类变量:

create table Events (
  id serial NOT NULL PRIMARY KEY,
  datetime timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
  category text default 'A'
  );

我希望这些组按类别划分:https ://www.db-fiddle.com/f/xr6JjG4KuuNVFLJCTof9UL/0

标签: postgresqlintervalsdate-difference

解决方案


尝试这个:

架构 (PostgreSQL v13)

create table Events (
  id serial NOT NULL PRIMARY KEY,
  datetime timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
  category text default 'A'
  );
  
 insert into Events (datetime, category) VALUES 
('2021-03-28 00:00:00-06', 'B'),
('2021-03-28 00:02:00-06', 'B');

  
insert into Events (datetime) VALUES 
('2021-03-28 00:00:00-06'),  -- say sunday is sparse, look it's every 8 hours-ISH, GIVE OR TAKE....
('2021-03-28 08:00:00-06'),
('2021-03-28 16:00:00-06'),
('2021-03-28 23:59:00-06'),

('2021-04-04 01:00:00-06'),
('2021-04-04 09:00:00-06'),
('2021-04-04 16:30:00-06'),

('2021-03-29 00:05:00-06'),  -- say monday is dense, look, it's hourly
('2021-03-29 01:04:00-06'),
('2021-03-29 02:03:00-06'),
('2021-03-29 03:02:00-06'),
('2021-03-29 04:01:00-06'),
('2021-03-29 05:00:00-06'),
('2021-03-29 06:01:00-06'),
('2021-03-29 07:02:00-06'),
('2021-03-29 08:03:00-06'),
('2021-03-29 09:04:00-06'),
('2021-03-29 10:05:00-06'),
('2021-03-29 11:01:00-06'),
('2021-03-29 12:02:00-06'),
('2021-03-29 13:03:00-06'),
('2021-03-29 14:04:00-06'),
('2021-03-29 15:05:00-06'),
('2021-03-29 16:04:00-06'),
('2021-03-29 17:03:00-06'),
('2021-03-29 18:02:00-06'),
('2021-03-29 19:01:00-06'),
('2021-03-29 20:01:00-06'),
('2021-03-29 21:02:00-06'),
('2021-03-29 22:03:00-06'),
('2021-03-29 23:05:00-06'),
('2021-03-29 23:59:00-06'),


('2021-04-05 00:04:00-06'),
('2021-04-05 01:03:00-06'),
('2021-04-05 02:02:00-06'),
('2021-04-05 03:01:00-06'),
('2021-04-05 04:01:00-06'),
('2021-04-05 05:03:00-06'),
('2021-04-05 06:04:00-06'),
('2021-04-05 07:01:00-06'),
('2021-04-05 08:02:00-06'),
('2021-04-05 09:03:00-06'),
('2021-04-05 10:04:00-06'),
('2021-04-05 11:05:00-06'),
('2021-04-05 12:02:00-06'),
('2021-04-05 13:03:00-06'),
('2021-04-05 14:04:00-06'),
('2021-04-05 15:02:00-06'),
('2021-04-05 16:05:00-06'),
('2021-04-05 17:03:00-06'),
('2021-04-05 18:02:00-06'),
('2021-04-05 19:01:00-06'),
('2021-04-05 20:01:00-06'),
('2021-04-05 21:02:00-06'),
('2021-04-05 22:03:00-06'),
('2021-04-05 23:05:00-06'),
('2021-04-05 23:59:00-06'
); 

查询 #1

with subset as (
select * 
from Events
where extract (dow from datetime::timestamptz AT TIME ZONE 'America/Chicago')  = '0' 
)
(select 
  category,
  max(most_recent),
  AVG(ss.diff)/60 as avg_in_minutes,
  percentile_cont(0.5) 
    within group(order by ss.diff)/60 as q50_in_minutes,
  percentile_cont(0.05) 
    within group(order by ss.diff)/60 as q05_in_minutes,
  percentile_cont(0.95) 
    within group(order by ss.diff)/60 as q95_in_minutes
from (
  select 
  category,
  max(datetime) over (partition by category order by datetime) as most_recent,
  extract(epoch from datetime) as t1,
  extract(epoch from LAG(datetime,1) over (ORDER BY datetime)) as t0,
  extract(epoch from datetime) - extract(epoch from LAG(datetime,1) over (ORDER BY datetime)) as diff
  from subset
) as ss
group by category);
类别 最大限度 avg_in_minutes q50_in_minutes q05_in_minutes q95_in_minutes
一个 2021-04-04T22:30:00.000Z 1844.6666666666667 479 112.5 7005
2021-03-28T06:02:00.000Z 2 2 2 2

在 DB Fiddle 上查看


推荐阅读