首页 > 解决方案 > 选择与表中的字段相关的每一行

问题描述

假设我们有一个名为 track 的表,如下所示:

path   user_id   item_id
------------------------
view   A         I1
view   A         I3
buy    B         I2
view   C         I3
view   A         I4
buy    A         I1
view   B         I5
view   B         I4

我们想为那些购买过东西的用户选择用户查看历史记录,从而产生类似的输出

user_id  view_history   item_bought
-----------------------------------
A        [I1, I4, I3]   [I1]
B        [I4, I5]       [I2]

在这里订单并不重要,我们只想跟踪用户在购买时看到的内容。我尝试过的(使用一些 HIVE 符号):

    WITH cus_bought AS(
    SELECT 
        user_id
        ,COLLECT_LIST(item_id) item_bought
    FROM
        tracks
    WHERE
        path = 'buy'
    GROUP BY
        user_id
),
views AS(
    SELECT 
         t1.user_id
         ,COLLECT_LIST(item_id) event_data
    FROM 
        tracks t1
    WHERE
        path = 'view'
        AND EXISTS (
                    SELECT NULL
                    FROM 
                        tracks t2
                    WHERE
                        t2.user_id = t1.user_id
                   )
    GROUP BY 
        t1.user_id
)
SELECT c.user_id
       ,v.event_data
       ,c.item_bought
FROM
    cus_bought c
JOIN
    views v
ON c.user_id = v.user_id

是否有另一种方法来执行此查询?我面对的桌子很大,所以 JOIN 非常昂贵。自加入可以避免吗?我想知道某种旋转是否会帮助我,但不会在那个方向上取得成功。最后,在第二个 CTE 中使用 EXISTS 子句的原因是,考虑到表的大小,我希望 JOIN 处理尽可能少的行(实际上,我尝试在加入之前获得 1-1 关系)。

任何帮助,将不胜感激。

标签: sqlperformancehivehiveql

解决方案


你可以简单地使用collect_set()吗?

select user,
       collect_set(case when path = 'view' then item_id) as views,
       collect_set(case when path = 'buy' then item_id) as buys
from tracks t
group by user;

编辑:

如果您只想要拥有buy或不希望他们成为外部聚合一部分的用户,则可以将它们过滤掉。我不确定这是否更快,但是:

select user,
       collect_set(case when t.path = 'view' then t.item_id) as views,
       collect_set(case when t.path = 'buy' then t.item_id) as buys
from tracks t join
     (select distinct t2.user
      from tracks t2
      where t2.path = 'buy'
     ) t2
     on t.user = t2.user
group by user;

我不知道distinctand的开销是否join超过了聚合的开销,但值得一试。

您还可以使用窗口函数:

select user,
       collect_set(case when t.path = 'view' then t.item_id) as views,
       collect_set(case when t.path = 'buy' then t.item_id) as buys
from (select t.*,
            sum(case when t.path = 'buy' then 1 else 0 end) over (partition by user) as num_buys
      from tracks t
     ) t
where num_buys > 0
group by user;

推荐阅读