Generate a complex PySpark table involving Array and Pivot

Problem Description

I have a table in the following format:

+-------+--------+
|Column1|Column2 |
+-------+--------+
|[A, 1] |X       |
|[A, 1] |Y       |
|[B, 2] |Y       |
|[B, 2] |Z       |
|[C, 1] |X       |
|[C, 1] |Z       |
+-------+--------+
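
(For reproduction, a minimal sketch of how such a DataFrame could be constructed — assuming Column1 is an array of strings and Column2 a plain string:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data matching the table above;
# Column1 is inferred as array<string>, Column2 as string.
data = [
    (["A", "1"], "X"), (["A", "1"], "Y"),
    (["B", "2"], "Y"), (["B", "2"], "Z"),
    (["C", "1"], "X"), (["C", "1"], "Z"),
]
df = spark.createDataFrame(data, ["Column1", "Column2"])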

I need a table with the following result:

+-------+-------+-------+-------+
|       |[A, 1] |[B, 2] |[C, 1] |
+-------+-------+-------+-------+
|[A, 1] |[X, Y] |[Y]    |[X]    |
|[B, 2] |[Y]    |[Y, Z] |[Z]    |
|[C, 1] |[X]    |[Z]    |[X, Z] |
+-------+-------+-------+-------+

Or, better yet, a result like this:

+-------+-------+-------+-------+
|       |[A, 1] |[B, 2] |[C, 1] |
+-------+-------+-------+-------+
|[A, 1] |2      |1      |1      |
|[B, 2] |1      |2      |1      |
|[C, 1] |1      |1      |2      |
+-------+-------+-------+-------+

Tags: arrays, pyspark, pivot

Solution


This will be expensive, especially for big data, but you can do it with a self-join plus pivot: joining the table to itself on Column2 pairs up every two Column1 keys that share a value, and the pivot counts those shared values per pair.

from pyspark.sql.functions import count

# Self-join on Column2 to pair every two Column1 keys that share
# a Column2 value, then pivot the right-hand key into columns and
# count the shared values in each cell.
df.alias("l").join(df.alias("r"), on="Column2")\
    .select("l.Column1", "r.Column1")\
    .groupBy("l.Column1")\
    .pivot("r.Column1")\
    .agg(count("r.Column1"))\
    .show()
#+-------+------+------+------+
#|Column1|[A, 1]|[B, 2]|[C, 1]|
#+-------+------+------+------+
#| [A, 1]|     2|     1|     1|
#| [B, 2]|     1|     2|     1|
#| [C, 1]|     1|     1|     2|
#+-------+------+------+------+
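
For the first desired output (arrays of the shared Column2 values instead of counts), the same self-join works with collect_set in place of count — a sketch under the same assumptions:

from pyspark.sql.functions import collect_set

# Same pairing as above, but gather the shared Column2 values
# into a set per (left key, right key) cell.
df.alias("l").join(df.alias("r"), on="Column2")\
    .select("l.Column1", "r.Column1", "Column2")\
    .groupBy("l.Column1")\
    .pivot("r.Column1")\
    .agg(collect_set("Column2"))\
    .show()

The aggregation choice only changes the cell contents; the quadratic self-join still dominates the cost.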
