Merging two rows of a DataFrame in Scala Spark

Problem description

I have a DataFrame containing records that are identical except for the amount field. I want to merge them into a single row whose amount field holds the sum of the two amounts. How would I do that in Scala?

I get the dataframe from a database:

val my_table = spark.read.table("table.myTable")
val df = my_table
  .filter(col("ID") === "10")
  .filter(col("CENT") === "20")
  .filter(col("PROD") === "122")
  .filter(col("CONTR").isin("0004", "0005", "0006"))
  .select("ID", "CENT", "PROD", "CONTR", "COD", "DATE", "AMOUNT")
  .distinct()

df.show()


---------+--------+--------------+------------+-------------+-----------+--------+
ID       | CENT   | PROD         |CONTR       |COD          | DATE      | Amount |    
---------+--------+--------------+------------+-------------+-----------+--------+
10       |20      |122           |0004        |COD1         |2006-11-04 | 150.0  |
10       |20      |122           |0004        |COD1         |2006-11-04 | 300.0  |
10       |20      |122           |0005        |COD2         |2012-10-17 | 100.0  |
10       |20      |122           |0006        |COD3         |2015-12-05 | 500.0  |
---------+--------+--------------+------------+-------------+-----------+--------+

Expected:

---------+--------+--------------+------------+-------------+-----------+--------+
ID       | CENT   | PROD         |CONTR       |COD          | DATE      | Amount |    
---------+--------+--------------+------------+-------------+-----------+--------+
10       |20      |122           |0004        |COD1         |2006-11-04 | 450.0  |
10       |20      |122           |0005        |COD2         |2012-10-17 | 100.0  |
10       |20      |122           |0006        |COD3         |2015-12-05 | 500.0  |
---------+--------+--------------+------------+-------------+-----------+--------+

Tags: scala, apache-spark

Solution


The code below groups by every column except amount, runs a sum aggregation on the amount column, and orders the result by cod:

import org.apache.spark.sql.SparkSession

object GroupBy {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = List(
      Bean3(10, 20, 122, "0004", "COD1", "2006-11-04", 150.0),
      Bean3(10, 20, 122, "0004", "COD1", "2006-11-04", 300.0),
      Bean3(10, 20, 122, "0005", "COD2", "2012-10-17", 100.0),
      Bean3(10, 20, 122, "0006", "COD3", "2015-12-05", 500.0)
    ).toDF

    // Group by every column except amount, then sum amount.
    val groupByCol = df.columns.diff(Array("id", "amount"))
    df.groupBy("id", groupByCol: _*).sum("amount")
      .withColumnRenamed("sum(amount)", "amount")
      .orderBy("cod")
      .show()
  }

}

case class Bean3(id: Int, cent: Int, prod: Int, contr: String, cod: String, date: String, amount: Double)


+---+----+----+-----+----+----------+------+
| id|cent|prod|contr| cod|      date|amount|
+---+----+----+-----+----+----------+------+
| 10|  20| 122| 0004|COD1|2006-11-04| 450.0|
| 10|  20| 122| 0005|COD2|2012-10-17| 100.0|
| 10|  20| 122| 0006|COD3|2015-12-05| 500.0|
+---+----+----+-----+----+----------+------+
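As a variant (a minimal sketch, not part of the original answer), the same aggregation can be written with `agg` and `alias`, which avoids the `withColumnRenamed` step; the object name `GroupByAgg` is illustrative, and the `Bean3` case class from above is reused:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object GroupByAgg {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = List(
      Bean3(10, 20, 122, "0004", "COD1", "2006-11-04", 150.0),
      Bean3(10, 20, 122, "0004", "COD1", "2006-11-04", 300.0),
      Bean3(10, 20, 122, "0005", "COD2", "2012-10-17", 100.0),
      Bean3(10, 20, 122, "0006", "COD3", "2015-12-05", 500.0)
    ).toDF

    // Group by every column except amount; alias() names the summed
    // column "amount" directly, so no rename is needed afterwards.
    val groupCols = df.columns.diff(Array("amount")).map(df.col)
    df.groupBy(groupCols: _*)
      .agg(sum("amount").alias("amount"))
      .orderBy("cod")
      .show()
  }

}
```

This produces the same three summed rows as the answer above.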
