sql - 如何通过 BigQuery 中的每个用户使用线性插值填充不规则缺失的时间序列值?
问题描述
我有每个用户不规则地缺少时间序列值的数据,我想使用 BigQuery 标准 SQL 使用线性插值将其转换为一定间隔。
+------+---------------------+-------+
| name | time | value |
+------+---------------------+-------+
| Jane | 2020-11-14 09:01:00 | 3 |
| Jane | 2020-11-14 09:05:00 | 5 |
| Jane | 2020-11-14 09:07:00 | 1 |
| Jane | 2020-11-14 09:09:00 | 8 |
| Jane | 2020-11-14 09:10:00 | 4 |
| Kay | 2020-11-14 09:01:00 | 7 |
| Kay | 2020-11-14 09:04:00 | 1 |
| Kay | 2020-11-14 09:05:00 | 10 |
| Kay | 2020-11-14 09:09:00 | 6 |
| Kay | 2020-11-14 09:10:00 | 7 |
+------+---------------------+-------+
我想将其转换如下:
+------+---------------------+-------+-----------------+
| name | time | value | |
+------+---------------------+-------+-----------------+
| Jane | 2020-11-14 09:01:00 | 3 | |
| Jane | 2020-11-14 09:02:00 | 3.5 | <= interpolaetd |
| Jane | 2020-11-14 09:03:00 | 4 | <= interpolaetd |
| Jane | 2020-11-14 09:04:00 | 4.5 | <= interpolaetd |
| Jane | 2020-11-14 09:05:00 | 5 | |
| Jane | 2020-11-14 09:06:00 | 3 | <= interpolaetd |
| Jane | 2020-11-14 09:07:00 | 1 | |
| Jane | 2020-11-14 09:08:00 | 4.5 | <= interpolaetd |
| Jane | 2020-11-14 09:09:00 | 8 | |
| Jane | 2020-11-14 09:10:00 | 4 | |
| Kay | 2020-11-14 09:01:00 | 7 | |
| Kay | 2020-11-14 09:02:00 | 5 | <= interpolaetd |
| Kay | 2020-11-14 09:03:00 | 3 | <= interpolaetd |
| Kay | 2020-11-14 09:04:00 | 1 | |
| Kay | 2020-11-14 09:05:00 | 10 | |
| Kay | 2020-11-14 09:06:00 | 9 | <= interpolaetd |
| Kay | 2020-11-14 09:07:00 | 8 | <= interpolaetd |
| Kay | 2020-11-14 09:08:00 | 7 | <= interpolaetd |
| Kay | 2020-11-14 09:09:00 | 6 | |
| Kay | 2020-11-14 09:10:00 | 7 | |
+------+---------------------+-------+-----------------+
我可以问你一些聪明的解决方案吗?
补充:这是这个 stackoverflow question的应用问题。它非常相似但不同之处在于该数据是时间序列数据,并且它具有每个用户的名称。
谢谢你。
解决方案
This is not very different from your previous question. Starting from the accepted answer, you could do:
select name, time,
ifnull(value, start_value + (end_value - start_value) / (end_tick - start_tick) * (time - start_tick)) as value_interpolated
from (
select name, time, value,
first_value(tick ignore nulls ) over win1 as start_tick,
first_value(value ignore nulls) over win1 as start_value,
first_value(tick ignore nulls ) over win2 as end_tick,
first_value(value ignore nulls) over win2 as end_value,
from (
select name, time, t.time as tick, value
from (
select name, generate_array(min(time), max(time)) times
from `project.dataset.table`
group by name
)
cross join unnest(times) time
left join `project.dataset.table` t using(name, time)
)
window
win1 as (partition by name order by time desc rows between current row and unbounded following),
win2 as (partition by name order by time rows between current row and unbounded following)
)
推荐阅读
- excel - 用于从多个 csv 中选择特定列并导入到一个工作表的宏
- angular - Angular - 如何在数据网格中显示数组数组?
- windows - perl on Windows: Can't do inplace edit on file: File exists
- python - Keras 网络产生逆向预测
- javascript - 单击按钮时的中心 Div
- java - 如何将 ServerHttpRequest/ServerHttpResponse 正文转换为字符串
- javascript - How to merge object inside object
- r - 子集列表以提取多个元素
- uvm - If I have a fixed size array , how do I write a constraint so that each multi-bit element of the array after randomization is an odd number
- c# - Unity 游戏在 IOS 上崩溃,但在 Android 上完美运行