How do I count & get the name of category after grouping column in hive

Problem description

I have a table with 3 columns:

user_id
category
view_hour

I wrote a query to select each user_id that watched only one category (Drama, Variety, or pop).

Here is my query :

select p.user_id
from (
    select user_id, category, sum(view_hour) as view_hours
    from kcp_01
    group by user_id, category) p
group by p.user_id
having count(*) = 1;

But I want to get the number of users who watched only one category, along with that category.

Tags: sql, hive

Solution


You can get the users who watched only one category using:

select user_id, min(category) as category, sum(view_hour)
from kcp_01
group by user_id
having min(category) = max(category);

Note that min(category) is the user's category, because there is only one category for that user.
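
To see why the min/max comparison works, here is a minimal sketch that runs the same query on a few made-up rows. It uses Python's built-in sqlite3 module rather than Hive, and the sample data and the helper name single_category_users are hypothetical; only the table and column names come from the question.

```python
import sqlite3

# Hypothetical sample rows: (user_id, category, view_hour)
ROWS = [
    ("u1", "Drama", 2.0),
    ("u1", "Drama", 1.5),
    ("u2", "Drama", 1.0),
    ("u2", "pop", 3.0),
    ("u3", "Variety", 4.0),
]


def single_category_users(rows):
    """Return (user_id, category, total_hours) for users with exactly one category."""
    conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
    conn.execute("create table kcp_01 (user_id text, category text, view_hour real)")
    conn.executemany("insert into kcp_01 values (?, ?, ?)", rows)
    result = conn.execute(
        """
        select user_id, min(category) as category, sum(view_hour)
        from kcp_01
        group by user_id
        having min(category) = max(category)
        order by user_id
        """
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in single_category_users(ROWS):
        print(row)
```

In this sample, u2 has both Drama and pop, so its min and max categories differ and it is filtered out; u1 and u3 each have a single category and are kept.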

In Hive, you may need to do the following:

select user_id, min(category) as min_category,
       max(category) as max_category, sum(view_hour)
from kcp_01
group by user_id
having min_category = max_category;

If you want the number of such users for each category, then a subquery is helpful:

select u.min_category as category, count(*) as number_single_users
from (select user_id, min(category) as min_category, max(category) as max_category, sum(view_hour) as view_hours
      from kcp_01
      group by user_id
      having min_category = max_category
     ) u
group by u.min_category;

Another approach is simply to use not exists:

select k.category, count(distinct k.user_id)
from kcp_01 k
where not exists (select 1
                  from kcp_01 k2
                  where k2.user_id = k.user_id and
                        k2.category <> k.category
                 )
group by k.category;

As I recall, Hive once had trouble with count(distinct), so the approach with two group bys may well work better.
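
As a sanity check, the two counting approaches can be compared on made-up data. The sketch below again uses Python's sqlite3 module as a stand-in for Hive, so it says nothing about Hive performance; the sample rows, the alias only_category, and the helper run_query are hypothetical.

```python
import sqlite3

# Hypothetical sample rows: (user_id, category, view_hour)
ROWS = [
    ("u1", "Drama", 2.0),
    ("u1", "Drama", 1.5),
    ("u2", "Drama", 1.0),
    ("u2", "pop", 3.0),
    ("u3", "Variety", 4.0),
    ("u4", "Drama", 0.5),
]

# Approach 1: group by user, keep single-category users, then group by category.
GROUP_BY_SQL = """
select u.only_category, count(*) as number_single_users
from (select user_id, min(category) as only_category
      from kcp_01
      group by user_id
      having min(category) = max(category)
     ) u
group by u.only_category
order by u.only_category
"""

# Approach 2: keep rows whose user has no row in a different category.
NOT_EXISTS_SQL = """
select k.category, count(distinct k.user_id)
from kcp_01 k
where not exists (select 1
                  from kcp_01 k2
                  where k2.user_id = k.user_id and
                        k2.category <> k.category
                 )
group by k.category
order by k.category
"""


def run_query(sql, rows):
    """Load rows into an in-memory kcp_01 table and return the query result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table kcp_01 (user_id text, category text, view_hour real)")
    conn.executemany("insert into kcp_01 values (?, ?, ?)", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(run_query(GROUP_BY_SQL, ROWS))
    print(run_query(NOT_EXISTS_SQL, ROWS))
```

Both queries return the same per-category counts on this sample. Note that count(distinct k.user_id) is needed in the second query because a user such as u1 can have several rows in the same category.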

