postgresql - Postgresql 连续时间之间的平均间隔——按星期几
问题描述
我试图获得 /difference of times / 沿我的时间戳变量的平均值和中位数,以开发相对于前一个时间戳的数据是否“迟到”的预期。
也就是说——现在和最近的时间戳之间的差异是否超过了当天时间戳与其前一个时间戳之间的平均时间间隔?(或多或少,我希望创建到达过程的 lambda/mu 项,具体来说,我希望这个平均值/中位数对星期几敏感,因为每个 DOW的到达过程看起来都不同。 )
我的表如下所示:
create table Events (
id serial NOT NULL PRIMARY KEY,
datetime timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP
);
insert into Events (datetime) VALUES
('2021-03-28 00:00:00-06'), -- say sunday is sparse, look it's every 8 hours-ISH, GIVE OR TAKE....
('2021-03-28 08:00:00-06'),
('2021-03-28 16:00:00-06'),
('2021-03-28 23:59:00-06'),
('2021-04-04 01:00:00-06'),
('2021-04-04 09:00:00-06'),
('2021-04-04 16:30:00-06'),
('2021-03-29 00:05:00-06'), -- say monday is dense, look, it's approximately hourly
('2021-03-29 01:04:00-06'),
('2021-03-29 02:03:00-06'),
('2021-03-29 03:02:00-06'),
('2021-03-29 04:01:00-06'),
('2021-03-29 05:00:00-06'),
('2021-03-29 06:01:00-06'),
('2021-03-29 07:02:00-06'),
('2021-03-29 08:03:00-06'),
('2021-03-29 09:04:00-06'),
('2021-03-29 10:05:00-06'),
('2021-03-29 11:01:00-06'),
('2021-03-29 12:02:00-06'),
('2021-03-29 13:03:00-06'),
('2021-03-29 14:04:00-06'),
('2021-03-29 15:05:00-06'),
('2021-03-29 16:04:00-06'),
('2021-03-29 17:03:00-06'),
('2021-03-29 18:02:00-06'),
('2021-03-29 19:01:00-06'),
('2021-03-29 20:01:00-06'),
('2021-03-29 21:02:00-06'),
('2021-03-29 22:03:00-06'),
('2021-03-29 23:05:00-06'),
('2021-03-29 23:59:00-06'),
('2021-04-05 00:04:00-06'),
('2021-04-05 01:03:00-06'),
('2021-04-05 02:02:00-06'),
('2021-04-05 03:01:00-06'),
('2021-04-05 04:01:00-06'),
('2021-04-05 05:03:00-06'),
('2021-04-05 06:04:00-06'),
('2021-04-05 07:01:00-06'),
('2021-04-05 08:02:00-06'),
('2021-04-05 09:03:00-06'),
('2021-04-05 10:04:00-06'),
('2021-04-05 11:05:00-06'),
('2021-04-05 12:02:00-06'),
('2021-04-05 13:03:00-06'),
('2021-04-05 14:04:00-06'),
('2021-04-05 15:02:00-06'),
('2021-04-05 16:05:00-06'),
('2021-04-05 17:03:00-06'),
('2021-04-05 18:02:00-06'),
('2021-04-05 19:01:00-06'),
('2021-04-05 20:01:00-06'),
('2021-04-05 21:02:00-06'),
('2021-04-05 22:03:00-06'),
('2021-04-05 23:05:00-06'),
('2021-04-05 23:59:00-06'
);
我的暂定查询如下所示:
with subset as (
select * from Events
where extract (dow from datetime::timestamptz AT TIME ZONE 'America/Chicago' = '0') -- subset on today is sunday
)
select
LAG(datetime,1) over (ORDER BY datetime) as lagged_interval,
extract (epoch from AVG(LAG(datetime,1) over (ORDER BY datetime))) as average_interval_in_seconds,
extract (epoch from percentile_cont(0.5) within group (LAG(datetime,1) over (ORDER BY datetime))) as median_interval_in_seconds,
max(datetime) as most_recent
from subset;
你可以在 db-fiddle中看到这个。
我难过的地方是:
LAG是区分时间戳与前一个时间戳的正确运算符吗?这是一种 rowid 函数吗?不同或更好的窗口函数?我现在得到的错误是:
查询错误:错误:“LAG”处或附近的语法错误
但是我对此有点困惑,因为根据那个文档它看起来是正确的。
- 如何处理分区中一天中的第一次和最后一次?如果我只是对周日做一个子集(有时区转换)不希望本周周日的开始回顾上周周日的结束,那会太长,我应该在我的外部添加一个 where
FROM
子句排除第一个观察,除了滞后?
更新
我认为这个查询更接近,它至少返回:
with subset as (
select *
from Events
where extract (dow from datetime::timestamptz AT TIME ZONE 'America/Chicago') = '0' -- subset on today is sunday
)
(select
max(most_recent),
AVG(ss.diff)/60 as avg_in_minutes,
percentile_cont(0.5)
within group(order by ss.diff)/60 as q50_in_minutes,
percentile_cont(0.05)
within group(order by ss.diff)/60 as q05_in_minutes,
percentile_cont(0.95)
within group(order by ss.diff)/60 as q95_in_minutes
from (
select
max(datetime) over (order by datetime) as most_recent,
extract(epoch from datetime) as t1,
extract(epoch from LAG(datetime,1) over (ORDER BY datetime)) as t0,
extract(epoch from datetime) - extract(epoch from LAG(datetime,1) over (ORDER BY datetime)) as diff
from subset
) as ss);
结果看起来不错!
最大限度 | avg_in_minutes | q50_in_minutes | q05_in_minutes | q95_in_minutes |
---|---|---|---|---|
2021-04-04T22:30:00.000Z | 2214 | 480 | 456 | 7439.999999999998 |
通过:https ://www.db-fiddle.com/f/PqdiwBNnnKx3bUw1ALgt2/0
但是,这个查询真是一团糟!现在我不清楚如何在我的子集中对分类变量进行分区以更普遍地做到这一点。例如,如果我的起始表有一个分类变量:
create table Events (
id serial NOT NULL PRIMARY KEY,
datetime timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
category text default 'A'
);
我希望这些组按类别划分:https ://www.db-fiddle.com/f/xr6JjG4KuuNVFLJCTof9UL/0
解决方案
尝试这个:
架构 (PostgreSQL v13)
create table Events (
id serial NOT NULL PRIMARY KEY,
datetime timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
category text default 'A'
);
insert into Events (datetime, category) VALUES
('2021-03-28 00:00:00-06', 'B'),
('2021-03-28 00:02:00-06', 'B');
insert into Events (datetime) VALUES
('2021-03-28 00:00:00-06'), -- say sunday is sparse, look it's every 8 hours-ISH, GIVE OR TAKE....
('2021-03-28 08:00:00-06'),
('2021-03-28 16:00:00-06'),
('2021-03-28 23:59:00-06'),
('2021-04-04 01:00:00-06'),
('2021-04-04 09:00:00-06'),
('2021-04-04 16:30:00-06'),
('2021-03-29 00:05:00-06'), -- say monday is dense, look, it's hourly
('2021-03-29 01:04:00-06'),
('2021-03-29 02:03:00-06'),
('2021-03-29 03:02:00-06'),
('2021-03-29 04:01:00-06'),
('2021-03-29 05:00:00-06'),
('2021-03-29 06:01:00-06'),
('2021-03-29 07:02:00-06'),
('2021-03-29 08:03:00-06'),
('2021-03-29 09:04:00-06'),
('2021-03-29 10:05:00-06'),
('2021-03-29 11:01:00-06'),
('2021-03-29 12:02:00-06'),
('2021-03-29 13:03:00-06'),
('2021-03-29 14:04:00-06'),
('2021-03-29 15:05:00-06'),
('2021-03-29 16:04:00-06'),
('2021-03-29 17:03:00-06'),
('2021-03-29 18:02:00-06'),
('2021-03-29 19:01:00-06'),
('2021-03-29 20:01:00-06'),
('2021-03-29 21:02:00-06'),
('2021-03-29 22:03:00-06'),
('2021-03-29 23:05:00-06'),
('2021-03-29 23:59:00-06'),
('2021-04-05 00:04:00-06'),
('2021-04-05 01:03:00-06'),
('2021-04-05 02:02:00-06'),
('2021-04-05 03:01:00-06'),
('2021-04-05 04:01:00-06'),
('2021-04-05 05:03:00-06'),
('2021-04-05 06:04:00-06'),
('2021-04-05 07:01:00-06'),
('2021-04-05 08:02:00-06'),
('2021-04-05 09:03:00-06'),
('2021-04-05 10:04:00-06'),
('2021-04-05 11:05:00-06'),
('2021-04-05 12:02:00-06'),
('2021-04-05 13:03:00-06'),
('2021-04-05 14:04:00-06'),
('2021-04-05 15:02:00-06'),
('2021-04-05 16:05:00-06'),
('2021-04-05 17:03:00-06'),
('2021-04-05 18:02:00-06'),
('2021-04-05 19:01:00-06'),
('2021-04-05 20:01:00-06'),
('2021-04-05 21:02:00-06'),
('2021-04-05 22:03:00-06'),
('2021-04-05 23:05:00-06'),
('2021-04-05 23:59:00-06'
);
查询 #1
with subset as (
select *
from Events
where extract (dow from datetime::timestamptz AT TIME ZONE 'America/Chicago') = '0'
)
(select
category,
max(most_recent),
AVG(ss.diff)/60 as avg_in_minutes,
percentile_cont(0.5)
within group(order by ss.diff)/60 as q50_in_minutes,
percentile_cont(0.05)
within group(order by ss.diff)/60 as q05_in_minutes,
percentile_cont(0.95)
within group(order by ss.diff)/60 as q95_in_minutes
from (
select
category,
max(datetime) over (partition by category order by datetime) as most_recent,
extract(epoch from datetime) as t1,
extract(epoch from LAG(datetime,1) over (ORDER BY datetime)) as t0,
extract(epoch from datetime) - extract(epoch from LAG(datetime,1) over (ORDER BY datetime)) as diff
from subset
) as ss
group by category);
类别 | 最大限度 | avg_in_minutes | q50_in_minutes | q05_in_minutes | q95_in_minutes |
---|---|---|---|---|---|
一个 | 2021-04-04T22:30:00.000Z | 1844.6666666666667 | 479 | 112.5 | 7005 |
乙 | 2021-03-28T06:02:00.000Z | 2 | 2 | 2 | 2 |