首页 > 解决方案 > Bigquery - 使用基于计数的过滤器优化查询

问题描述

我有一个(工作)查询,它根据给定字段的计数过滤一个表,该字段由同一个表上的另一个字段分组,但感觉没有优化。

baseTable

code,id,event
55d718,ABAjH0j7yZmVBMhdDf4ab5,eventA
55d718,ABAjH0j7yZmVBMhdDf4ab5,eventB
55d718,ABAjH0j7yZmVBMhdDf4ab5,eventA
55d718,ABAjH0ggMvi-k5z8pbyR8_,eventA
98b1de,ABAjH0gkRy7s1enuFYGgzf,eventC
...

我 - 概念上 - 做的是id用于过滤的“黑名单” baseTable:任何id包含在黑名单中的行都应该被过滤。进入黑名单的规则是给定代码和特定事件的 id count 大于 1eventA

Expected result

code, eventA (count), eventB (count), eventC (count)
55d718, 352, 18, 12
98b1de, 846, 78, 65

预期的结果是对所有不同代码的每个事件的计数,但是对于给定的并且(特别是不是其他)id出现超过 1 次的任何事件都必须从最终结果中完全删除。在示例数据中有超过 1 行的代码,因此不得包含具有的单行(即使对于其他事件)。 codeevent = eventAeventAABAjH0j7yZmVBMhdDf4ab555d718eventAfilterTableid = ABAjH0j7yZmVBMhdDf4ab5

工作(但未优化)查询:

WITH 

  baseTable as (select
    REGEXP_EXTRACT(cs_uri,r'code=([^&]*)') AS code,
    REGEXP_EXTRACT(cs_uri,r'id=([^&]*)') AS id,
    REGEXP_EXTRACT(cs_uri,r'event=([^&]*)') AS event
    from  `my-project.raw_data` 
    ),


  filteredTable as (SELECT *
  FROM baseTable
  where id = 'abcd' AND id NOT IN (
  SELECT
    id
  from baseTable where (event = 'eventA')  and code = 'abcd'
  group by id
  having count(id ) > 1  
 )
 )

  SELECT eventA , eventB from 
  (Select count(id) from filteredTable where event = 'eventA') as eventA,
  (Select count(id) from filteredTable where event = 'eventB') as eventB
  ... other queries on filteredTable

我添加的每个请求都会filteredTable增加处理数据的大小baseTable,有没有更好的方法?

更好的(我认为)查询是直接设置 a having code='XXXX'baseTable因为我真的不需要同时查询多个代码,但问题是我不能这样做group by id, code, event,否则无法正确设置黑名单(因为这样就不可能得到 的计数eventA)。

标签: sqlgoogle-bigquery

解决方案


由于您没有提供示例数据和原始表中的行数,因此我目前只能看到一个优化点。

您不必在最后一部分编写子查询。因此,您可以将它们更改为count(case when condition then id end). 此外,由于您想要每个代码的输出,您应该使用group by code

另一件事是,我认为您在过滤表中过度过滤了案例。据我从您的解释中可以理解,您不应该为 abcd 值过滤 id 列。

因此,您可以将此代码用于预期输出。例如代码,我更改了您的基表代码。所以不要忘记使用正则表达式将其更改为您的原始代码:)

WITH 
baseTable as (
  SELECT '55d718' AS code, 'ABAjH0j7yZmVBMhdDf4ab5' as id,'eventA' as event UNION ALL
  SELECT '55d718' AS code, 'ABAjH0j7yZmVBMhdDf4ab5' as id,'eventB' as event UNION ALL
  SELECT '55d718' AS code, 'ABAjH0j7yZmVBMhdDf4ab5' as id,'eventA' as event UNION ALL
  SELECT '55d718' AS code, 'ABAjH0ggMvi-k5z8pbyR8_' as id,'eventA' as event UNION ALL
  SELECT '98b1de' AS code, 'ABAjH0gkRy7s1enuFYGgzf' as id,'eventC' as event
),
blackList as (
  SELECT id
  from baseTable where (event = 'eventA')  and code = 'abcd'
  group by id
  having count(id ) > 1  
),
filteredTable as (
  SELECT baseTable.*
  FROM baseTable
  LEFT JOIN blackList USING (id)
  where blackList.id IS NULL
)
SELECT 
  code, 
  count(case when event='eventA' then id end) as eventA,
  count(case when event='eventB' then id end) as eventB
from filteredTable
group by code

推荐阅读