How to convert a DataFrame column type from string to (array and struct) in Spark

Problem description

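I have a DataFrame with the following schema, where "name" is a string column whose value is a complex JSON document containing arrays and structs.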

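With the string data type I cannot parse the data and write it out as rows, so I am trying to convert the column type and then apply explode to parse the data.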

Current:
root
 |-- id: string (nullable = true)
 |-- partitionNo: string (nullable = true)
 |-- name: string (nullable = true)

After conversion:

Expected:
root
 |-- id: string (nullable = true)
 |-- partitionNo: string (nullable = true)
 |-- name: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- extension: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- url: string (nullable = true)
 |    |    |    |    |-- valueMetadata: struct (nullable = true)
 |    |    |    |    |-- modifiedDateTime: string (nullable = true)
 |    |    |    |    |-- code: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
 |    |    |-- firstName: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)

Tags: json, scala, apache-spark, apache-spark-sql

Solution


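You can use from_json, but you need to provide a schema. The schema can be inferred automatically from a sample value with schema_of_json, although the code is somewhat convoluted: when the schema is passed as a column, from_json expects a literal, so the inferred schema string is first collected to the driver and then wrapped in lit. Note that the schema is inferred from the first row only, so that row should be representative of the data: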

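import org.apache.spark.sql.functions.{from_json, lit, schema_of_json}
import spark.implicits._ // enables the $"..." column syntax; assumes a SparkSession named spark

// Take one sample JSON string from the first row; the schema is inferred from this value only.
val sampleJson = df.select($"name").head().getString(0)

// Infer the schema of the sample and collect it to the driver as a DDL string.
val schemaDdl = df.select(schema_of_json(lit(sampleJson))).head().getString(0)

// Parse the string column using the inferred schema, passed as a literal column.
val df2 = df.withColumn("name", from_json($"name", lit(schemaDdl)))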
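
If the structure of the JSON is known in advance, the schema could also be written out by hand instead of being inferred from the first row, and the resulting array can then be exploded as the question intends. The following is a minimal sketch under that assumption. The local SparkSession, the sample row, and the simplified schema (only lastName and firstName) are illustrative and do not come from the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Hypothetical local session and sample data, for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("from-json-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("1", "0", """[{"lastName":"Doe","firstName":["Jane","Ann"]}]""")
).toDF("id", "partitionNo", "name")

// Hand-written schema covering a simplified subset of the fields in the question.
val nameSchema = ArrayType(
  StructType(Seq(
    StructField("lastName", StringType),
    StructField("firstName", ArrayType(StringType))
  ))
)

// Parse the JSON string into array<struct<...>>.
val parsed = df.withColumn("name", from_json($"name", nameSchema))

// explode produces one row per array element; struct fields are then addressable by name.
val exploded = parsed
  .withColumn("name", explode($"name"))
  .select($"id", $"partitionNo", $"name.lastName", $"name.firstName")

exploded.show(false)
```

An explicit schema avoids depending on whichever row happens to come first, at the cost of having to keep the schema in sync with the data by hand.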
