首页 > 解决方案 > 在 Spark 的 Dataframe 中展平数组

问题描述

如何将数组展平为包含列 [a,b,c,d,e] 的数据框

root
 |-- arry: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a string (nullable = true)
 |    |    |-- b: long (nullable = true)
 |    |    |-- c: string (nullable = true)
 |    |    |-- d: string (nullable = true)
 |    |    |-- e: long (nullable = true)

任何帮助表示赞赏。

标签: scalaapache-sparkapache-spark-sql

解决方案


假设您有一个具有以下结构的 json:

{
  "array": [
    {
      "a": "asdf",
      "b": 1234,
      "c": "a",
      "d": "str",
      "e": 1234
    },
    {
      "a": "asdf",
      "b": 1234,
      "c": "a",
      "d": "str",
      "e": 1234
    },
    {
      "a": "asdf",
      "b": 1234,
      "c": "a",
      "d": "str",
      "e": 1234
    }
  ]
}
  1. 读取文件
scala> val nested = spark.read.option("multiline",true).json("nested.json")
nested: org.apache.spark.sql.DataFrame = [array: array<struct<a:string,b:bigint,c:string,d:string,e:bigint>>]
  1. 检查架构
scala> nested.printSchema
root
 |-- array: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: string (nullable = true)
 |    |    |-- b: long (nullable = true)
 |    |    |-- c: string (nullable = true)
 |    |    |-- d: string (nullable = true)
 |    |    |-- e: long (nullable = true)
  1. 使用explode功能
scala> nested.select(explode($"array").as("exploded")).select("exploded.*").show
+----+----+---+---+----+
|   a|   b|  c|  d|   e|
+----+----+---+---+----+
|asdf|1234|  a|str|1234|
|asdf|1234|  a|str|1234|
|asdf|1234|  a|str|1234|
+----+----+---+---+----+

推荐阅读