Transposing each record into multiple rows in a PySpark dataframe

Problem Description

I want to transpose each record in a PySpark dataframe into multiple rows, one row per level column.

Here is my dataframe:

+--------+-------------+--------------+------------+------+
|level_1 |level_2      |level_3       |level_4     |UNQ_ID|
+--------+-------------+--------------+------------+------+
|D  Group|Investments  |ORB           |ECM         |1     |
|E  Group|Investment   |Origination   |Execution   |2     |
+--------+-------------+--------------+------------+------+

The desired dataframe is:

+--------+---------------+------+
|level   |name           |UNQ_ID|
+--------+---------------+------+
|level_1 |D  Group       |1     |
|level_1 |E  Group       |2     |
|level_2 |Investments    |1     |
|level_2 |Investment     |2     |
|level_3 |ORB            |1     |
|level_3 |Origination    |2     |
|level_4 |ECM            |1     |
|level_4 |Execution      |2     |
+--------+---------------+------+

Tags: apache-spark, pyspark, apache-spark-sql

Solution

A simpler approach using the stack function:

# stack(n, label1, value1, ..., labelN, valueN) emits one output row per
# (label, value) pair, turning the four level columns into four rows.
output_df = df.selectExpr(
    'stack(4, "level_1", level_1, "level_2", level_2, "level_3", level_3, "level_4", level_4) as (level, name)',
    'UNQ_ID'
)
output_df.show()

# +-------+-----------+------+
# |  level|       name|UNQ_ID|
# +-------+-----------+------+
# |level_1|    D Group|     1|
# |level_2|Investments|     1|
# |level_3|        ORB|     1|
# |level_4|        ECM|     1|
# |level_1|    E Group|     2|
# |level_2| Investment|     2|
# |level_3|Origination|     2|
# |level_4|  Execution|     2|
# +-------+-----------+------+
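The four column names are hard-coded in the stack expression above. A small sketch (assuming the same `level_*` column names as in the question) shows how to build the expression dynamically, so the same answer works for any number of level columns:

```python
# Build the stack() expression from a list of column names instead of
# hard-coding each pair. Column names here match the question's dataframe.
level_cols = ["level_1", "level_2", "level_3", "level_4"]

# Each column contributes a ('label', value) pair to stack().
pairs = ", ".join(f"'{c}', {c}" for c in level_cols)
stack_expr = f"stack({len(level_cols)}, {pairs}) as (level, name)"

print(stack_expr)
# stack(4, 'level_1', level_1, 'level_2', level_2, 'level_3', level_3, 'level_4', level_4) as (level, name)

# With a SparkSession and the question's df available, apply it exactly
# as in the answer above:
# output_df = df.selectExpr(stack_expr, "UNQ_ID")
```

Note that on Spark 3.4 and later, the built-in `DataFrame.melt` (unpivot) method offers the same wide-to-long reshape without writing an SQL expression by hand.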
