python - Counting unique values within a time window
Problem description
My data looks like this (more than 100,000 rows):
timestamp Location person
2017-09-04 08:07:00 UTC A x
2017-09-04 08:08:00 UTC B y
2017-09-04 08:09:00 UTC A y
2017-09-04 08:07:00 UTC A x
2017-09-04 08:27:00 UTC B x
What I want:
Location Nr_of_persons_working_at_the_same_time
A 2
B 1
Explanation
timestamp Location person
2017-09-04 08:07:00 UTC A x <--- first action in A by person x
2017-09-04 08:08:00 UTC B y <--- different first action in B by person y
2017-09-04 08:09:00 UTC A y <--- second action in A, but could be different action as person x might be gone
2017-09-04 08:07:00 UTC A x <--- person x is still there, so count of persons in A is 2
2017-09-04 08:27:00 UTC B x <--- not a different action, person x coming in after 20 minutes, count of persons working at the same time remains 1
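Context
I want to find out how many people (person) are working at the same location (Location) at the same time. To do that, I look at a time window (timestamp) of at most 10 minutes and check whether the people are really working simultaneously or are just taking over each other's shift within that time frame. I get the data through a SQL query and can process it with either SQL or Python. SQL is preferred.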
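Attempted solutions
- Grouping by Location and timestamp, which leads to "hard cuts"
- It probably needs a so-called window function. But after ordering by timestamp, how do I keep the locations from getting mixed up?
Note: I could also try doing this in Python if that is easier, but I would rather not, given the size of the dataset and the limited options for running it in the cloud.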
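The two points above can be illustrated with a minimal Python sketch (standard library only). The sample rows and the helper names `fixed_bucket` and `previous_in_location` are assumptions made for illustration, not part of the original data or of any library. The first helper shows the "hard cut": two events only 2 minutes apart fall into different fixed 10-minute buckets. The second shows that sorting by (location, timestamp) before looking at the previous row keeps the locations apart, which is roughly what PARTITION BY Location does in a window function.

```python
from datetime import datetime
from itertools import groupby

# Hypothetical sample rows: (timestamp, location, person).
rows = [
    (datetime(2017, 9, 4, 8, 9), "A", "x"),
    (datetime(2017, 9, 4, 8, 11), "A", "y"),
    (datetime(2017, 9, 4, 8, 10), "B", "z"),
]


def fixed_bucket(ts, minutes=10):
    """Floor a timestamp to the start of its fixed-size bucket (the 'hard cut')."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)


def previous_in_location(rows):
    """Pair each row with the previous row of the same location.

    Sorting by (location, timestamp) and grouping by location plays the role
    of LAG(...) OVER (PARTITION BY Location ORDER BY timestamp).
    """
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))
    pairs = []
    for _, group in groupby(ordered, key=lambda r: r[1]):
        prev = None  # the first row of each location has no predecessor
        for row in group:
            pairs.append((row, prev))
            prev = row
    return pairs


# 08:09 and 08:11 are only 2 minutes apart but land in different buckets.
buckets = [fixed_bucket(ts) for ts, loc, _ in rows if loc == "A"]
print(buckets[0] != buckets[1])

# The row from location B never sees a row from location A as its predecessor.
for row, prev in previous_in_location(rows):
    print(row[1], row[2], "previous:", prev[2] if prev else None)
```

Comparing each row with its predecessor inside the same location avoids the fixed bucket boundaries, at the cost of having to sort the data per location first.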
Solution
This should work:
-- Sample data from the question (datetime_diff and ifnull as available in, e.g., BigQuery-style SQL)
with mytable as (
    select cast('2017-09-04 08:07:00' as datetime) as _timestamp, 'A' as Location, 'x' as person union all
    select cast('2017-09-04 08:08:00' as datetime) as _timestamp, 'B' as Location, 'y' as person union all
    select cast('2017-09-04 08:09:00' as datetime) as _timestamp, 'A' as Location, 'y' as person union all
    select cast('2017-09-04 08:07:00' as datetime) as _timestamp, 'A' as Location, 'x' as person union all
    select cast('2017-09-04 08:27:00' as datetime) as _timestamp, 'B' as Location, 'x' as person
),
-- Partitioning by Location keeps the locations from being mixed up.
-- Note: prev_timestamp holds the first timestamp of the location (first_value),
-- another_person holds the person of the previous row in that location (lag).
sorted_entry as (
    select *,
        ifnull(first_value(_timestamp) over (partition by Location order by _timestamp), _timestamp) as prev_timestamp,
        ifnull(lag(person) over (partition by Location order by _timestamp), person) as another_person
    from mytable
),
-- Flag a row when a different person shows up within 10 minutes of prev_timestamp.
flagged as (
    select *,
        case
            when person <> another_person then (
                case when datetime_diff(_timestamp, prev_timestamp, minute) <= 10 then 1
                else 0 end
            )
            else 0
        end as flag
    from sorted_entry
)
-- The first person at each location counts as 1; every flagged row adds one more.
select Location, sum(flag) + 1 as _count
from flagged
group by Location
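Since the question allows Python as a fallback, the logic of the query above can also be sketched in plain Python. This is a minimal sketch under stated assumptions rather than a definitive implementation: the function name `count_concurrent` and the tuple layout (timestamp, location, person) are hypothetical, ties on the timestamp are broken by person name (the query leaves that order unspecified), and the window is compared as an exact time difference, which matches a minute-based difference only when the timestamps have no seconds.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def count_concurrent(rows, window=timedelta(minutes=10)):
    """Mirror the query: per location, return 1 plus the number of rows whose
    person differs from the previous row's person and whose timestamp lies
    within `window` of the location's first timestamp."""
    by_location = defaultdict(list)
    for ts, location, person in rows:
        by_location[location].append((ts, person))  # partition by Location

    result = {}
    for location, events in by_location.items():
        events.sort()                # order by timestamp (ties broken by person)
        first_ts = events[0][0]      # first_value(_timestamp)
        prev_person = events[0][1]   # lag(person), defaulting to the row's own person
        flags = 0
        for ts, person in events:
            if person != prev_person and ts - first_ts <= window:
                flags += 1
            prev_person = person
        result[location] = flags + 1
    return result


# Sample data from the question.
rows = [
    (datetime(2017, 9, 4, 8, 7), "A", "x"),
    (datetime(2017, 9, 4, 8, 8), "B", "y"),
    (datetime(2017, 9, 4, 8, 9), "A", "y"),
    (datetime(2017, 9, 4, 8, 7), "A", "x"),
    (datetime(2017, 9, 4, 8, 27), "B", "x"),
]

print(count_concurrent(rows))  # {'A': 2, 'B': 1}
```

On the sample data this gives 2 for location A and 1 for location B, the result asked for in the question. Like the query, it counts changes of person rather than distinct persons, so it may overcount when the same two people alternate several times within the window.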