apache-spark - Pyspark agg function to "explode" rows into columns
Problem description
Basically, I have a dataframe that looks like this:
+----+-------+------+------+
| id | index | col1 | col2 |
+----+-------+------+------+
| 1 | a | a11 | a12 |
+----+-------+------+------+
| 1 | b | b11 | b12 |
+----+-------+------+------+
| 2 | a | a21 | a22 |
+----+-------+------+------+
| 2 | b | b21 | b22 |
+----+-------+------+------+
and my desired output is this:
+----+--------+--------+--------+--------+
| id | col1_a | col1_b | col2_a | col2_b |
+----+--------+--------+--------+--------+
| 1 | a11 | b11 | a12 | b12 |
+----+--------+--------+--------+--------+
| 2 | a21 | b21 | a22 | b22 |
+----+--------+--------+--------+--------+
So basically I want to "explode" the index column into new columns after I groupby id. Btw, the id counts are the same and each id has the same set of index values. I'm using pyspark.
Solution
You can get the desired output with pivot.
from pyspark.sql import functions as F
df = spark.createDataFrame([[1,"a","a11","a12"],[1,"b","b11","b12"],[2,"a","a21","a22"],[2,"b","b21","b22"]],["id","index","col1","col2"])
df.show()
+---+-----+----+----+
| id|index|col1|col2|
+---+-----+----+----+
| 1| a| a11| a12|
| 1| b| b11| b12|
| 2| a| a21| a22|
| 2| b| b21| b22|
+---+-----+----+----+
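Conceptually, pivoting on index groups the rows by id and turns each index value into its own set of columns. The following is a plain-Python model of that reshaping, not Spark code; the function name pivot_rows and the tuple layout of rows are assumptions made for illustration only.

```python
# Same sample data as the DataFrame above: (id, index, col1, col2)
rows = [
    (1, "a", "a11", "a12"),
    (1, "b", "b11", "b12"),
    (2, "a", "a21", "a22"),
    (2, "b", "b21", "b22"),
]


def pivot_rows(rows, value_cols=("col1", "col2")):
    """Group by id and spread each index value into its own columns."""
    wide = {}
    for row_id, index, *values in rows:
        out = wide.setdefault(row_id, {"id": row_id})
        for col, value in zip(value_cols, values):
            # setdefault keeps the first value seen for a cell,
            # which is the role F.first plays in the Spark version
            out.setdefault(f"{col}_{index}", value)
    return list(wide.values())


result = pivot_rows(rows)
for row in result:
    print(row)
```

Each id ends up as one wide record, which is the shape the Spark pivot produces for this data.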
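Using pivot
df3 = df.groupBy("id").pivot("index").agg(F.first(F.col("col1")), F.first(F.col("col2")))
# pivot emits the columns grouped by index value: col1 and col2 for "a", then for "b"
collist = ["id", "col1_a", "col2_a", "col1_b", "col2_b"]
Rename the columns
df3.toDF(*collist).show()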
+---+------+------+------+------+
| id|col1_a|col2_a|col1_b|col2_b|
+---+------+------+------+------+
| 1| a11| a12| b11| b12|
| 2| a21| a22| b21| b22|
+---+------+------+------+------+
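Note that the columns come out as col1_a, col2_a, col1_b, col2_b. Rearrange them (for example with select) to match the order you need.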
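Rather than typing the renamed list by hand, both column orders can be derived from the index values and the value columns. This is a minimal sketch in plain Python; the helper names pivot_column_names and desired_column_names are hypothetical, and it assumes pivot lists its output columns grouped by sorted index value, as in the output above.

```python
from itertools import product


def pivot_column_names(pivot_values, value_cols):
    """Names in the order pivot emits them: grouped by index value."""
    return [f"{col}_{val}" for val, col in product(sorted(pivot_values), value_cols)]


def desired_column_names(pivot_values, value_cols):
    """Names in the order the question asks for: grouped by value column."""
    return [f"{col}_{val}" for col, val in product(value_cols, sorted(pivot_values))]


pivot_values = ["a", "b"]
value_cols = ["col1", "col2"]

# List to pass to toDF(...) for renaming, matching the pivot output order
collist = ["id"] + pivot_column_names(pivot_values, value_cols)
# List to pass to select(...) afterwards to reorder the columns
ordered = ["id"] + desired_column_names(pivot_values, value_cols)

print(collist)
print(ordered)
```

With df3 from above, df3.toDF(*collist).select(*ordered).show() should then print the columns in the order asked for in the question.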