scala - How do I sum a column and add the sum as a new column to a Spark DataFrame?
Problem description
I have a Spark DataFrame that looks like this:
val someDF5 = Seq(
("202003101750", "202003101700",122),
("202003101800", "202003101700",12),
("202003101750", "202003101700",42),
("202003101810", "202003101700",2)
).toDF("number", "word","value")
I add a num_records column by doing the following:
val DF1 = someDF5.groupBy("number","word").agg(count("*").alias("num_records"))
DF1:
+------------+------------+-----------+
|      number|        word|num_records|
+------------+------------+-----------+
|202003101750|202003101700|          2|
|202003101800|202003101700|          1|
|202003101810|202003101700|          1|
+------------+------------+-----------+
How can I add another column, total_records, that holds the total of num_records across the whole DataFrame? For example, this is what I expect:
+------------+------------+-----------+-------------+
|      number|        word|num_records|total_records|
+------------+------------+-----------+-------------+
|202003101750|202003101700|          2|            4|
|202003101800|202003101700|          1|            4|
|202003101810|202003101700|          1|            4|
+------------+------------+-----------+-------------+
Note: total_records should keep updating as num_records changes.
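Solution
Add a withColumn call that attaches the row count of the original DataFrame as a literal. Every input row falls into exactly one (number, word) group, so someDF5.count equals the sum of num_records. That is all it takes: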
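// Outside spark-shell these imports are needed; `spark` is assumed to be the active SparkSession.
import org.apache.spark.sql.functions.{count, lit}
import spark.implicits._
val someDF5 = Seq(
  ("202003101750", "202003101700", 122),
  ("202003101800", "202003101700", 12),
  ("202003101750", "202003101700", 42),
  ("202003101810", "202003101700", 2)
).toDF("number", "word", "value")
// Count rows per (number, word), then attach the total row count as a constant column.
val DF1 = someDF5.groupBy("number", "word").agg(count("*").alias("num_records"))
  .withColumn("total_records", lit(someDF5.count))
DF1.show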
Result:
+------------+------------+-----------+-------------+
|      number|        word|num_records|total_records|
+------------+------------+-----------+-------------+
|202003101750|202003101700|          2|            4|
|202003101800|202003101700|          1|            4|
|202003101810|202003101700|          1|            4|
+------------+------------+-----------+-------------+
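As an alternative sketch, the total can be derived from num_records itself with a window function, rather than from a separate count on the source DataFrame. The object name WindowTotalSketch and the helper addTotal are hypothetical, and a local SparkSession is assumed. An empty partitionBy() puts every row into a single window partition, so this is only reasonable when the aggregated result is small.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, sum}

object WindowTotalSketch {
  // Hypothetical helper: adds total_records as the sum of num_records over all rows.
  def addTotal(counts: DataFrame): DataFrame =
    counts.withColumn("total_records", sum("num_records").over(Window.partitionBy()))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("window-total-sketch")
      .getOrCreate()
    import spark.implicits._

    val someDF5 = Seq(
      ("202003101750", "202003101700", 122),
      ("202003101800", "202003101700", 12),
      ("202003101750", "202003101700", 42),
      ("202003101810", "202003101700", 2)
    ).toDF("number", "word", "value")

    // Same per-group count as in the answer above.
    val counts = someDF5.groupBy("number", "word").agg(count("*").alias("num_records"))

    // Every row carries the same total (4 for this data).
    addTotal(counts).show()

    spark.stop()
  }
}
```

Because the total is computed from the aggregated column, it stays consistent with num_records even if rows are filtered before the aggregation.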
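The total follows the data when the code is run again. Note that lit(someDF5.count) is evaluated once, at the moment DF1 is defined, so DF1 has to be rebuilt after records are added. For example, with one more record: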
val someDF5 = Seq(
("202003101750", "202003101700", 122),
("202003101800", "202003101700", 12),
("202003101750", "202003101700", 42),
("202003101810", "202003101700", 2),
("202003101810", "22222222", 222)
).toDF("number", "word", "value")
val DF1 = someDF5.groupBy("number", "word").agg(count("*").alias("num_records"))
  .withColumn("total_records", lit(someDF5.count))
DF1.show
Result:
+------------+------------+-----------+-------------+
|      number|        word|num_records|total_records|
+------------+------------+-----------+-------------+
|202003101750|202003101700|          2|            5|
|202003101800|202003101700|          1|            5|
|202003101810|202003101700|          1|            5|
|202003101810|    22222222|          1|            5|
+------------+------------+-----------+-------------+
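Since someDF5.count is an action that runs immediately and bakes its result into the plan as a constant, one possible variation is to compute the total as a one-row aggregate and cross join it onto the counts, which keeps the total inside the lazy query plan. This is a sketch only: the object name CrossJoinTotalSketch and the helper withTotalRecords are hypothetical, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{count, sum}

object CrossJoinTotalSketch {
  // Hypothetical helper: attaches the sum of num_records to every row via a cross join.
  def withTotalRecords(counts: DataFrame): DataFrame = {
    // One-row DataFrame holding the grand total.
    val total = counts.agg(sum("num_records").alias("total_records"))
    counts.crossJoin(total)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cross-join-total-sketch")
      .getOrCreate()
    import spark.implicits._

    val someDF5 = Seq(
      ("202003101750", "202003101700", 122),
      ("202003101800", "202003101700", 12),
      ("202003101750", "202003101700", 42),
      ("202003101810", "202003101700", 2),
      ("202003101810", "22222222", 222)
    ).toDF("number", "word", "value")

    val counts = someDF5.groupBy("number", "word").agg(count("*").alias("num_records"))

    // total_records is 5 on every row for this five-record input.
    withTotalRecords(counts).show()

    spark.stop()
  }
}
```

Nothing is computed until show() is called, so the total is always evaluated against the same data as the per-group counts.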