How to expand each row of a DataFrame with a list of lists into columns - pyspark

Problem description

I have a DataFrame like the following:

+--------------------+
|                pas1|
+--------------------+
|[[[[H, 5, 16, 201...|
|[, 1956-09-22, AD...|
|[, 1961-03-19, AD...|
|[, 1962-02-09, AD...|
+--------------------+

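I want to extract a few columns from each of the 4 rows above and create a DataFrame like the one below. The column names should come from the schema rather than being hard-coded as column1, column2, and so on.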

+--------+-----------+
| gender | givenName |
+--------+-----------+
|      a |       b   |
|      a |       b   |
|      a |       b   |
|      a |       b   |
+--------+-----------+

pas1 - schema
root
|-- pas1: struct (nullable = true)
|    |-- contactList: struct (nullable = true)
|    |    |-- contact: array (nullable = true)
|    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |-- contactTypeCode: string (nullable = true)
|    |    |    |    |-- contactMediumTypeCode: string (nullable = true)
|    |    |    |    |-- contactTypeID: string (nullable = true)
|    |    |    |    |-- lastUpdateTimestamp: string (nullable = true)
|    |    |    |    |-- contactInformation: string (nullable = true)
|    |-- dateOfBirth: string (nullable = true)
|    |-- farePassengerTypeCode: string (nullable = true)
|    |-- gender: string (nullable = true)
|    |-- givenName: string (nullable = true)
|    |-- groupDepositIndicator: string (nullable = true)
|    |-- infantIndicator: string (nullable = true)
|    |-- lastUpdateTimestamp: string (nullable = true)
|    |-- passengerFOPList: struct (nullable = true)
|    |    |-- passengerFOP: struct (nullable = true)
|    |    |    |-- fopID: string (nullable = true)
|    |    |    |-- lastUpdateTimestamp: string (nullable = true)
|    |    |    |-- fopFreeText: string (nullable = true)
|    |    |    |-- fopSupplementaryInfoList: struct (nullable = true)
|    |    |    |    |-- fopSupplementaryInfo: array (nullable = true)
|    |    |    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |    |    |-- type: string (nullable = true)
|    |    |    |    |    |    |-- value: string (nullable = true)

Thank you for your help.

Tags: pyspark, pyspark-sql

Solution


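If you only want to extract a few columns from a DataFrame that contains a struct, you can select the nested fields directly by their dotted path: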

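from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName('Test').getOrCreate()
# Small stand-in for the real data: one row whose pas1 column is a struct.
df = spark.sparkContext.parallelize([Row(pas1=Row(gender='a', givenName='b'))]).toDF()

# Nested fields are addressed as 'struct.field'; the output columns are named gender and givenName.
df.select('pas1.gender', 'pas1.givenName').show()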

If you instead want to flatten the whole DataFrame, this question should help: How to expand a nested Struct column into multiple columns?

