Generating n columns in a DataFrame based on multiple values

Question

I created a DataFrame from a table like this:

from pyspark.sql.functions import collect_list

df = spark.sql("select * from test")  # the table has two columns: id and name
df2 = df.groupby('id').agg(collect_list('name').alias('name'))
df2.show()
|id|name|
|44038:4572|[0032477212299451]|
|44038:5439|[00324772, 0032477, 003247, 00324]|
|44038:4429|[0032477212299308]|

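Up to this point everything is correct: for one id I can store multiple names (values). But when I try to create columns dynamically from those values, it does not work.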

df3 = df2.select([df2.id] + [df2.name[i] for i in range(length)])

Output:

|id |name[0]|
|44038:4572|0032477212299451|
|44038:5439|00324772|
|44038:4429|0032477212299308|

Expected output in the DataFrame:

|id|name[0]|name[1]|name[2]|name[3]|
|44038:4572|0032477212299451|null|null|null|
|44038:5439|00324772|0032477|003247|00324|
|44038:4429|0032477212299308|null|null|null|

The nulls must then be replaced with 0.

Tags: apache-spark, pyspark, apache-spark-sql

Answer


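You are better off doing this with pivot rather than collect_list: number the names within each id, pivot on that number, and fill the gaps with "0".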

from pyspark.sql import functions as F, Window

df2 = (df.withColumn('rn', F.row_number().over(Window.partitionBy('id').orderBy(F.desc('name'))))
         .groupBy('id')
         .pivot('rn')
         .agg(F.first('name'))
         .fillna("0")
      )

df2.show()
+----------+----------------+-------+------+-----+
|        id|               1|      2|     3|    4|
+----------+----------------+-------+------+-----+
|44038:4572|0032477212299451|      0|     0|    0|
|44038:5439|        00324772|0032477|003247|00324|
|44038:4429|0032477212299308|      0|     0|    0|
+----------+----------------+-------+------+-----+
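
To see what the row_number, pivot, and fillna steps do to the data, here is a minimal plain-Python sketch of the same reshaping, run outside Spark on rows taken from the question. The function name pivot_names and its fill parameter are made up for this illustration and are not part of any Spark API.

```python
from collections import defaultdict


def pivot_names(rows, fill="0"):
    """Spread each id's names across numbered slots, padding short groups.

    Hypothetical helper for illustration only; it mimics the Spark pipeline
    above on a small in-memory list of (id, name) pairs.
    """
    grouped = defaultdict(list)
    for id_, name in rows:
        grouped[id_].append(name)

    # The largest group decides how many columns the pivot produces.
    width = max((len(names) for names in grouped.values()), default=0)

    result = {}
    for id_, names in grouped.items():
        ranked = sorted(names, reverse=True)  # mirrors orderBy(F.desc('name'))
        result[id_] = ranked + [fill] * (width - len(ranked))  # mirrors fillna("0")
    return result


rows = [
    ("44038:4572", "0032477212299451"),
    ("44038:5439", "00324772"),
    ("44038:5439", "0032477"),
    ("44038:5439", "003247"),
    ("44038:5439", "00324"),
    ("44038:4429", "0032477212299308"),
]
table = pivot_names(rows)
```

Each id ends up with exactly as many slots as the largest group, which is why the pivot does not need the column count to be known in advance.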

If you want nicer column names, you can do this:

df3 = df2.toDF('id', *[f'name{i}' for i in range(len(df2.columns) - 1)])
df3.show()
+----------+----------------+-------+------+-----+
|        id|           name0|  name1| name2|name3|
+----------+----------------+-------+------+-----+
|44038:4572|0032477212299451|      0|     0|    0|
|44038:5439|        00324772|0032477|003247|00324|
|44038:4429|0032477212299308|      0|     0|    0|
+----------+----------------+-------+------+-----+
