How do I sum a column and add the summed column to a Spark DataFrame?

Problem description

I have a Spark DataFrame that looks like this:

val someDF5 = Seq(
  ("202003101750", "202003101700",122),
  ("202003101800", "202003101700",12),
  ("202003101750", "202003101700",42),
  ("202003101810", "202003101700",2)
).toDF("number", "word","value")

I derive a num_records column by doing the following:

val DF1 = someDF5.groupBy("number","word").agg(count("*").alias("num_records"))

DF1:

+------------+------------+-----------+
|      number|        word|num_records|
+------------+------------+-----------+
|202003101750|202003101700|          2|
|202003101800|202003101700|          1|
|202003101810|202003101700|          1|
+------------+------------+-----------+

How can I add another column, total_records, that holds the total of num_records across the whole DataFrame? For example, this is what I expect:

+------------+------------+-----------+-------------+
|      number|        word|num_records|total_records|
+------------+------------+-----------+-------------+
|202003101750|202003101700|          2|            4|
|202003101800|202003101700|          1|            4|
|202003101810|202003101700|          1|            4|
+------------+------------+-----------+-------------+

Note: total_records should keep updating as num_records changes.

Tags: scala, apache-spark, apache-spark-sql

Solution


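Add a withColumn that holds the row count of the original DataFrame; that is all you need. Since each num_records value counts the rows in its group, the total of num_records equals someDF5.count.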

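import org.apache.spark.sql.functions.{count, lit}
import spark.implicits._  // needed for toDF; already in scope in spark-shell

val someDF5 = Seq(
  ("202003101750", "202003101700", 122),
  ("202003101800", "202003101700", 12),
  ("202003101750", "202003101700", 42),
  ("202003101810", "202003101700", 2)
).toDF("number", "word", "value")

// Count rows per (number, word), then attach the overall row count as a constant column
val DF1 = someDF5.groupBy("number", "word").agg(count("*").alias("num_records"))
  .withColumn("total_records", lit(someDF5.count))
DF1.show

Result: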

+------------+------------+-----------+-------------+
|      number|        word|num_records|total_records|
+------------+------------+-----------+-------------+
|202003101750|202003101700|          2|            4|
|202003101800|202003101700|          1|            4|
|202003101810|202003101700|          1|            4|
+------------+------------+-----------+-------------+

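Because someDF5.count is evaluated each time this code runs, rerunning it after records are added picks up the new total. For example, with one more row: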

val someDF5 = Seq(
  ("202003101750", "202003101700", 122),
  ("202003101800", "202003101700", 12),
  ("202003101750", "202003101700", 42),
  ("202003101810", "202003101700", 2),
  ("202003101810", "22222222", 222)
).toDF("number", "word", "value")

val DF1 = someDF5.groupBy("number", "word").agg(count("*").alias("num_records"))
  .withColumn("total_records", lit(someDF5.count))
DF1.show

Result:

+------------+------------+-----------+-------------+
|      number|        word|num_records|total_records|
+------------+------------+-----------+-------------+
|202003101750|202003101700|          2|            5|
|202003101800|202003101700|          1|            5|
|202003101810|202003101700|          1|            5|
|202003101810|    22222222|          1|            5|
+------------+------------+-----------+-------------+
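
As an alternative to the answer above (not part of it), the total can also be computed by literally summing num_records with a window function, which avoids the separate count action on someDF5. The following is a minimal sketch, assuming Spark 2.x or later and a local session created only for illustration; the names withTotal and the app name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, sum}

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("total-records-sketch") // hypothetical app name
  .getOrCreate()
import spark.implicits._

val someDF5 = Seq(
  ("202003101750", "202003101700", 122),
  ("202003101800", "202003101700", 12),
  ("202003101750", "202003101700", 42),
  ("202003101810", "202003101700", 2)
).toDF("number", "word", "value")

val DF1 = someDF5.groupBy("number", "word").agg(count("*").alias("num_records"))

// An empty partitionBy() places every row in a single window,
// so sum("num_records") over it is the grand total repeated on each row.
val withTotal = DF1.withColumn(
  "total_records",
  sum("num_records").over(Window.partitionBy())
)

withTotal.show()
```

With the four sample rows, each of the three grouped rows should carry total_records = 4. Note that a window without partitioning moves all rows to a single partition, so this suits small aggregated results like DF1 rather than very large DataFrames.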

