How to group by a condition in PySpark

Problem Description

How can I group by a condition in PySpark?

Here is some sample data:

+-----+-------+-------------+------------+
| zip | state | Agegrouping | patient_id |
+-----+-------+-------------+------------+
| 123 | x     | Adult       |        123 |
| 124 | x     | Children    |        231 |
| 123 | x     | Children    |        456 |
| 156 | x     | Adult       |        453 |
| 124 | y     | Adult       |         34 |
| 432 | y     | Adult       |         23 |
| 234 | y     | Children    |         13 |
| 432 | z     | Children    |         22 |
| 234 | z     | Adult       |         44 |
+-----+-------+-------------+------------+

I then want the data to look like this:

+-----+-------+-------+----------+------------+
| zip | state | Adult | Children | patient_id |
+-----+-------+-------+----------+------------+
| 123 | x     |     1 |        1 |          2 |
| 124 | x     |     1 |        1 |          2 |
| 156 | x     |     1 |        0 |          1 |
| 432 | y     |     1 |        1 |          2 |
| 234 | z     |     1 |        1 |          2 |
+-----+-------+-------+----------+------------+

How can I achieve this?

Tags: sql, pyspark, count, apache-spark-sql, pivot

Solution

Here is the Spark SQL version. Counting `1` for matching rows and `null` otherwise works because `count` ignores nulls:

df.createOrReplaceTempView('table')

spark.sql('''
    select zip, state,
           count(if(Agegrouping = 'Adult', 1, null)) as adult,
           count(if(Agegrouping = 'Children', 1, null)) as children,
           count(1) as patient_id
    from table
    group by zip, state
''').show()

+---+-----+-----+--------+----------+
|zip|state|adult|children|patient_id|
+---+-----+-----+--------+----------+
|123|    x|    1|       1|         2|
|156|    x|    1|       0|         1|
|234|    z|    1|       0|         1|
|432|    z|    0|       1|         1|
|234|    y|    0|       1|         1|
|124|    y|    0|       0|         1|
|124|    x|    0|       1|         1|
|432|    y|    1|       0|         1|
+---+-----+-----+--------+----------+
