首页 > 解决方案 > SQL/BigQuery:如何避免对一个组的多个非连续成员进行分组?

问题描述

我遇到了一个我自己似乎无法解决的问题。我按位置和时间戳对行进行分组,并为实体保持静止的实例查找第一个和最后一个时间戳。问题是对于我当前的代码,当实体返回到它之前的位置时,SQL 将行组合在一起。

在我的示例中,实体在 2020-05-24 05:22:00 位于位置 -66.89 10.5002,然后在 2020-05-24 11:13:00 返回到该位置。当前查询的结果使该实体看起来一直在该位置,尽管中间的行清楚地表明它已移动。这是一个概念问题,我真的不知道如何在 SQL 中解决。我在 Big Query 中执行此操作,但我记得在 SQL Server 中遇到了类似的问题。

代码:

with selection as (
select 1 as id,TIMESTAMP '2020-05-24 11:13:00' as timestamp_, 'POINT(-66.89 10.5002)' as geom
union all select
1,TIMESTAMP '2020-05-24 05:22:00','POINT(-66.89 10.5002)'
union all select
1,TIMESTAMP '2020-05-24 05:25:00','POINT(-66.8881 10.4994)'
union all select
1,TIMESTAMP '2020-05-24 09:14:00','POINT(-66.8888 10.4958)'
union all select
1,TIMESTAMP '2020-05-24 07:37:00 UTC','POINT(-66.8873 10.5)'
union all select
1, TIMESTAMP'2020-05-24 07:52:00 UTC','POINT(-66.8873 10.5)'
)

select id,timestamp_,geom,
first_value(timestamp_)
    OVER (PARTITION BY id,geom ORDER BY timestamp_ ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS interval_start,
last_value(timestamp_)
    OVER (PARTITION BY id,geom ORDER BY timestamp_ ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS interval_end,
FROM
selection order by id,timestamp_

结果。注意第一行和最后一行的 interval_start 和 interval_end

ID 时间戳_ 几何 间隔开始 间隔结束
1 2020-05-24 05:22:00 UTC 点(-66.89 10.5002) 2020-05-24 05:22:00 UTC 2020-05-24 11:13:00 UTC
1 2020-05-24 05:25:00 UTC 点(-66.8881 10.4994) 2020-05-24 05:25:00 UTC 2020-05-24 05:25:00 UTC
1 2020-05-24 07:37:00 UTC 点(-66.8873 10.5) 2020-05-24 07:37:00 UTC 2020-05-24 07:52:00 UTC
1 2020-05-24 07:52:00 UTC 点(-66.8873 10.5) 2020-05-24 07:37:00 UTC 2020-05-24 07:52:00 UTC
1 2020-05-24 09:14:00 UTC 点(-66.8888 10.4958) 2020-05-24 09:14:00 UTC 2020-05-24 09:14:00 UTC
1 2020-05-24 11:13:00 UTC 点(-66.89 10.5002) 2020-05-24 05:22:00 UTC 2020-05-24 11:13:00 UTC

期望的结果:

ID 时间戳_ 几何 间隔开始 间隔结束
1 2020-05-24 05:22:00 UTC 点(-66.89 10.5002) 2020-05-24 05:22:00 UTC 2020-05-24 05:22:00 UTC
1 2020-05-24 05:25:00 UTC 点(-66.8881 10.4994) 2020-05-24 05:25:00 UTC 2020-05-24 05:25:00 UTC
1 2020-05-24 07:37:00 UTC 点(-66.8873 10.5) 2020-05-24 07:37:00 UTC 2020-05-24 07:52:00 UTC
1 2020-05-24 07:52:00 UTC 点(-66.8873 10.5) 2020-05-24 07:37:00 UTC 2020-05-24 07:52:00 UTC
1 2020-05-24 09:14:00 UTC 点(-66.8888 10.4958) 2020-05-24 09:14:00 UTC 2020-05-24 09:14:00 UTC
1 2020-05-24 11:13:00 UTC 点(-66.89 10.5002) 2020-05-24 11:13:00 UTC 2020-05-24 11:13:00 UTC

标签: sqlgoogle-bigquery

解决方案


考虑下面

with selection as (
  select 1 as id,TIMESTAMP '2020-05-24 11:13:00' as timestamp_, 'POINT(-66.89 10.5002)' as geom union all select
  1,TIMESTAMP '2020-05-24 05:22:00','POINT(-66.89 10.5002)' union all select
  1,TIMESTAMP '2020-05-24 05:25:00','POINT(-66.8881 10.4994)' union all select
  1,TIMESTAMP '2020-05-24 09:14:00','POINT(-66.8888 10.4958)' union all select
  1,TIMESTAMP '2020-05-24 07:37:00 UTC','POINT(-66.8873 10.5)' union all select
  1, TIMESTAMP'2020-05-24 07:52:00 UTC','POINT(-66.8873 10.5)'
), pregrouped_selection as (
  select id, timestamp_, geom, 
    countif(flag) over(partition by id order by timestamp_) grp
  from (
    select id, timestamp_, geom,
      geom != ifnull(lag(geom) over(partition by id order by timestamp_), geom) flag,
    from selection 
  )
  order by id, timestamp_
)
select id,timestamp_,geom,
first_value(timestamp_)
    OVER (PARTITION BY id,grp ORDER BY timestamp_ ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS interval_start,
last_value(timestamp_)
    OVER (PARTITION BY id,grp ORDER BY timestamp_ ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS interval_end,
FROM
pregrouped_selection order by id,timestamp_    

带输出

在此处输入图像描述

如您所见 - 我几乎 100% 保留了您的原始查询 - 只是替换geomgrp内部over()语句 AND 从中pregrouped_selection计算组号 - grp


推荐阅读