pyspark: create multiple columns from values of an existing column

Problem description

I have a dataframe like this:

+----+-----+-----+
|id  |point|count|
+----+-----+-----+
|id_1|5    |9    |
|id_2|5    |1    |
|id_3|4    |3    |
|id_1|3    |3    |
|id_2|4    |3    |
+----+-----+-----+

The id-point pairs are unique. I want to group by id and create a column for each value of the point column, filled with the corresponding value from the count column, like this:

+----+-------+-------+-------+
|id  |point_3|point_4|point_5|
+----+-------+-------+-------+
|id_1|3      |0      |9      |
|id_2|0      |3      |1      |
|id_3|0      |3      |0      |
+----+-------+-------+-------+

Any guidance on how or where to start would be appreciated. I feel like I've been stuck on this for a while.

Tags: python, apache-spark, pyspark

Solution


We can use pivot to achieve the desired result:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Sample dataframe
in_values = [("id_1", 5, 9), ("id_2", 5, 1), ("id_3", 4, 3), ("id_1", 3, 3), ("id_2", 4, 3)]
in_df = spark.createDataFrame(in_values, "id string, point int, count int")

# Pivot on point: each distinct point value becomes a column,
# holding the aggregated count for that (id, point) pair.
out_df = in_df.groupBy("id").pivot("point").agg(F.sum("count"))

# Replace null (missing id-point combinations) by 0
out_df = out_df.na.fill(0)

# Prefix the pivoted columns with "point_"
columns_to_rename = out_df.columns
columns_to_rename.remove("id")

for c in columns_to_rename:
    out_df = out_df.withColumnRenamed(c, f"point_{c}")

out_df.show()

+----+-------+-------+-------+
|  id|point_3|point_4|point_5|
+----+-------+-------+-------+
|id_2|      0|      3|      1|
|id_1|      3|      0|      9|
|id_3|      0|      3|      0|
+----+-------+-------+-------+
