首页 > 解决方案 > SQL Impala 通过最常见的值减少 id

问题描述

给定如下表:

+--+------------------+-----------+
|id|     diagnosis_age|  diagnosis|
+--+------------------+-----------+
| 1|2.1843037179180302| 315.320000|
| 1|  2.80033330216659| 315.320000|
| 1|   2.8222365762732| 315.320000|
| 1|  5.64822705794013| 325.320000|
| 1| 5.686557787521759| 335.320000|
| 2|  5.70572315231258| 315.320000|
| 2| 5.724888517103389| 315.320000|
| 3| 5.744053881894209| 315.320000|
| 3|5.7604813374292005| 315.320000|
| 3|  5.77993740687426| 315.320000|
+--+------------------+-----------+

我想通过选择诊断最频繁的诊断和年龄将每个 id 减少到一行。

输出看起来像:

+--+------------------+-----------+
|id|     diagnosis_age|  diagnosis|
+--+------------------+-----------+
| 1|2.1843037179180302| 315.320000|
| 2|  5.70572315231258| 315.320000|
| 3| 5.744053881894209| 315.320000|
+--+------------------+-----------+

我试过这样的事情:

SELECT id, diagnosis, age,
COUNT(distinct diagnosis) OVER (partition by id)
FROM table

但我真的没有任何工作!

使用以下代码:

wc=Window().partitionBy("id", "diagnosis").orderBy("diagnosis_age")
wc2=Window().partitionBy("id")
ddfc.withColumn("count", F.count("diagnosis").over(wc))\
  .withColumn("max", F.max("count").over(wc2))\
  .filter("count=max")\
.groupBy("id").agg(F.first("diagnosis_age").alias("diagnosis_age"), F.first("diagnosis").alias("diagnosis"))\
  .orderBy("id")\
.groupBy("diagnosis")\
.count()\
.orderBy("count", ascending = False)\
.show(5)

我得到:

+--------------+-----+
|diagnosis     |count|
+--------------+-----+
|         V20.2|22179|
|    382.900000|12985|
|    389.900000|11333|
|    381.810000| 7448|
|    493.900000| 3249|
+--------------+-----+

而您的代码(修改为按诊断分组并获得 5 个最常见的诊断)输出:

+--------------+------+
|diagnosis     |count |
+--------------+------+
| 389.900000   |13268 | 
| 382.900000   | 7572 | 
| V20.2        | 6193 | 
| 381.810000   | 4735 | 
| V72.19       | 4115 | 
+--------------+------+

感谢帮助。

标签: sqlgroup-byfrequencyimpalapartition

解决方案


在统计学中,这是最常见的值,称为众数。

您可以使用聚合和row_number()

select id, diagnosis, age
from (select id, diagnosis, min(age) as age, count(*) as cnt,
             row_number() over (partition by id order by count(*) desc) as seqnum
      from t
      group by id, diagnosis
     ) da
where seqnum = 1;

推荐阅读