How to convert an array of structs into multiple columns?

Question

I have a DataFrame with this schema:

root (original)
 |-- entries: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- col1: string (nullable = false)
 |    |    |-- col2: string (nullable = true) 

How can I flatten it into this shape?

root (derived)
 |-- col1: string (nullable = false)
 |-- col2: string (nullable = true)
 |-- col3: string (nullable = false)
 |-- col4: string (nullable = true)
 |-- ...

where the names of col1...n are the values of [col1 from original] and the values of col1...n are the values of [col2 from original].

Example:

+--------------------------------------------+
|entries                                     |
+--------------------------------------------+
|[[a1, 1], [a2, P], [a4, N]]                 |
|[[a1, 1], [a2, O], [a3, F], [a4, 1], [a5, 1]]|
+--------------------------------------------+

I want to produce the following dataset:

+----+----+-----+----+-----+
| a1 | a2 | a3  | a4 | a5  |
+----+----+-----+----+-----+
| 1  | P  | null| N  | null|
| 1  | O  | F   | 1  | 1   |
+----+----+-----+----+-----+

Tags: apache-spark, apache-spark-sql

Solution

You can do this with a combination of explode and pivot. To do that, you first need to create a row_id:

import org.apache.spark.sql.functions.{explode, first, monotonically_increasing_id}
import spark.implicits._

val df = Seq(
  Seq(("a1", "1"), ("a2", "P"), ("a4", "N")),
  Seq(("a1", "1"), ("a2", "O"), ("a3", "F"), ("a4", "1"), ("a5", "1"))
).toDF("arr")
  .select($"arr".cast("array<struct<col1:string,col2:string>>"))

df
  .withColumn("row_id", monotonically_increasing_id())
  .select($"row_id", explode($"arr"))
  .select($"row_id", $"col.*")
  .groupBy($"row_id").pivot($"col1").agg(first($"col2"))
  .drop($"row_id")
  .show()

which gives:

+---+---+----+---+----+
| a1| a2|  a3| a4|  a5|
+---+---+----+---+----+
|  1|  P|null|  N|null|
|  1|  O|   F|  1|   1|
+---+---+----+---+----+
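To see what the explode + pivot combination does conceptually, here is a minimal sketch in plain Scala collections, with no Spark required. The object and method names (PivotSketch, pivot) are hypothetical, chosen for illustration only: each row is a list of (key, value) pairs, the distinct keys become the columns, and missing keys become null (here, None), mirroring the first($"col2") aggregation above.

```scala
// Conceptual sketch of the pivot step using plain Scala collections.
// Each input row is a Seq of (key, value) pairs; pivoting turns the
// distinct keys into columns and fills gaps with None.
object PivotSketch {
  def pivot(rows: Seq[Seq[(String, String)]]): (Seq[String], Seq[Seq[Option[String]]]) = {
    // Distinct keys across all rows, sorted -- these become the columns.
    val cols = rows.flatten.map(_._1).distinct.sorted
    // For each row, look up each column's value (a "first"-style pick,
    // since toMap keeps one value per key).
    val data = rows.map { row =>
      val m = row.toMap
      cols.map(m.get)
    }
    (cols, data)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Seq("a1" -> "1", "a2" -> "P", "a4" -> "N"),
      Seq("a1" -> "1", "a2" -> "O", "a3" -> "F", "a4" -> "1", "a5" -> "1")
    )
    val (cols, data) = pivot(rows)
    println(cols.mkString(" | "))
    data.foreach(r => println(r.map(_.getOrElse("null")).mkString(" | ")))
  }
}
```

The row_id in the Spark version plays the role that the outer Seq index plays here: it remembers which exploded entries belonged to the same original row so groupBy can reassemble them.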
