sql - How do I count and get the name of the category after grouping a column in Hive
Problem description
I have a table with 3 columns:
user_id
category
view_hour
I wrote a query to select each user_id that watched only one category (Drama, Variety, or pop).
Here is my query:
select p.user_id
from (
    select user_id, category, sum(view_hour) as view_hours
    from kcp_01
    group by user_id, category) p
group by p.user_id
having count(*) = 1
But I want to get each category together with the number of users who watched only that category.
Solution
You can get the users who watched only one category with the following:
select user_id, min(category) as category, sum(view_hour)
from kcp_01
group by user_id
having min(category) = max(category);
Note that min(category) is the user's category, because there is only one category.
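To see why comparing min(category) with max(category) isolates single-category users, here is a minimal sketch that runs the same query on a few made-up rows. It uses Python's built-in sqlite3 module as a stand-in for Hive, purely so the example can be run locally; the sample rows are hypothetical and not from the question.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("create table kcp_01 (user_id integer, category text, view_hour real)")

# Hypothetical sample rows: user 1 watches only Drama (two rows),
# user 2 watches two categories, user 3 watches only pop.
conn.executemany(
    "insert into kcp_01 values (?, ?, ?)",
    [
        (1, "Drama", 2.0),
        (1, "Drama", 1.5),
        (2, "Drama", 1.0),
        (2, "Variety", 3.0),
        (3, "pop", 4.0),
    ],
)

# Within one user's group, the smallest and largest category values are
# equal only when every row has the same category.
single_category_users = conn.execute(
    """
    select user_id, min(category) as category, sum(view_hour) as view_hours
    from kcp_01
    group by user_id
    having min(category) = max(category)
    order by user_id
    """
).fetchall()

print(single_category_users)
```

User 2 is dropped because its smallest category (Drama) differs from its largest (Variety); users 1 and 3 remain, each with their single category and total hours.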
In Hive, you might need to do the following instead, referring to the aggregates by their aliases in the having clause:
select user_id, min(category) as min_category,
max(category) as max_category, sum(view_hour)
from kcp_01
group by user_id
having min_category = max_category;
If you want the number of such users for each category, a subquery is helpful:
select min_category as category, count(*) as number_single_users
from (select user_id, min(category) as min_category, max(category) as max_category, sum(view_hour) as view_hours
      from kcp_01
      group by user_id
      having min_category = max_category
     ) u
group by min_category;
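The subquery approach can be sketched end to end as follows. As an assumption for illustration, Python's built-in sqlite3 module stands in for Hive, and the sample rows are hypothetical. The inner query keeps one row per single-category user; the outer query counts those rows per category.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("create table kcp_01 (user_id integer, category text, view_hour real)")

# Hypothetical sample rows: users 1 and 4 watch only Drama, user 3 watches
# only pop, and user 2 watches two categories.
conn.executemany(
    "insert into kcp_01 values (?, ?, ?)",
    [
        (1, "Drama", 2.0),
        (1, "Drama", 1.5),
        (2, "Drama", 1.0),
        (2, "Variety", 3.0),
        (3, "pop", 4.0),
        (4, "Drama", 0.5),
    ],
)

# Inner query: one row per user whose min and max category coincide.
# Outer query: number of such users per category.
users_per_category = conn.execute(
    """
    select min_category as category, count(*) as number_single_users
    from (select user_id,
                 min(category) as min_category,
                 max(category) as max_category
          from kcp_01
          group by user_id
          having min_category = max_category
         ) u
    group by min_category
    order by min_category
    """
).fetchall()

print(users_per_category)
```

Because the inner query already returns exactly one row per user, a plain count(*) in the outer query is enough and no count(distinct) is needed.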
Another approach is to simply use not exists:
select k.category, count(distinct k.user_id)
from kcp_01 k
where not exists (select 1
from kcp_01 k2
where k2.user_id = k.user_id and
k2.category <> k.category
)
group by k.category;
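The not exists variant filters rows rather than users, which is why it needs count(distinct). A minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for Hive and hypothetical sample rows, shows the difference by computing count(*) alongside it; whether a given Hive version accepts this correlated subquery is not verified here.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("create table kcp_01 (user_id integer, category text, view_hour real)")

# Hypothetical sample rows: user 1 has two Drama rows, so a plain row count
# would count that user twice.
conn.executemany(
    "insert into kcp_01 values (?, ?, ?)",
    [
        (1, "Drama", 2.0),
        (1, "Drama", 1.5),
        (2, "Drama", 1.0),
        (2, "Variety", 3.0),
        (3, "pop", 4.0),
        (4, "Drama", 0.5),
    ],
)

# Keep a row only if the same user has no row in a different category,
# then count users (distinct) and rows (plain) per category.
counts = conn.execute(
    """
    select k.category,
           count(distinct k.user_id) as number_single_users,
           count(*) as number_rows
    from kcp_01 k
    where not exists (select 1
                      from kcp_01 k2
                      where k2.user_id = k.user_id and
                            k2.category <> k.category
                     )
    group by k.category
    order by k.category
    """
).fetchall()

print(counts)
```

For Drama, three rows survive the filter but they belong to only two users, so count(distinct k.user_id) is the figure that answers the question.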
As I recall, once upon a time Hive had trouble handling count(distinct), so the two group by approaches might work better.