PySpark column sum with transpose

Problem description

I have a dataframe that looks like -

+---+---+---+---+
| id| w1| w2| w3|
+---+---+---+---+
|  1|100|150|200|
|  2|200|400|500|
|  3|500|600|150|
+---+---+---+---+

I want the output to look like -

full   total_amt
 w1       800
 w2       1150
 w3       850

My code is -

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 100, 150, 200), (2, 200, 400, 500), (3, 500, 600, 150)],
    ("id", "w1", "w2", "w3"))

# Append a grand-total row; Spark widens the integer id column to
# string so it can hold the literal 'All'.
res = df.unionAll(
    df.select([
        F.lit('All').alias('id'),
        F.sum(df.w1).alias('w1'),
        F.sum(df.w2).alias('w2'),
        F.sum(df.w3).alias('w3'),
    ]))
res.show()

But the output gives me -

+---+---+----+---+
| id| w1|  w2| w3|
+---+---+----+---+
|  1|100| 150|200|
|  2|200| 400|500|
|  3|500| 600|150|
|All|800|1150|850|
+---+---+----+---+

I think I need to create a pivot after the aggregation. All of the fields are numeric.

Tags: pyspark, pyspark-sql, pyspark-dataframes

Solution

A quick solution could be

>>> df.createOrReplaceTempView('df')

>>> spark.sql('''
...    select 'w1' as full, sum(w1) as total  from df 
...    union
...    select 'w2' as full, sum(w2) as total  from df 
...    union
...    select 'w3' as full, sum(w3) as total  from df 
... ''').show()
+----+-----+                                                                    
|full|total|
+----+-----+
|  w2| 1150|
|  w3|  850|
|  w1|  800|
+----+-----+
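
If you want to stay in the DataFrame API, the stack() SQL function unpivots the three columns in a single scan of the table, instead of the three scans the union above runs. A minimal sketch, assuming the same df as before (the names unpivoted and amt are only illustrative):

>>> unpivoted = df.selectExpr(
...     "stack(3, 'w1', w1, 'w2', w2, 'w3', w3) as (full, amt)")
>>> unpivoted.groupBy('full').agg(F.sum('amt').alias('total_amt')) \
...     .orderBy('full').show()
+----+---------+
|full|total_amt|
+----+---------+
|  w1|      800|
|  w2|     1150|
|  w3|      850|
+----+---------+

Note that plain union in SQL implies a distinct; since the three branches here can never produce duplicate rows, union all would do the same job slightly cheaper. On Spark 3.4+ the same unpivot is also available directly as DataFrame.unpivot (alias melt).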
