
Problem Description

I have the following dataframe:

+---+-----+---+
| ID|Month|val|
+---+-----+---+
| M1|  Jan| 10|
| M1|  Feb| 20|
| M1|  Mar| 30|
| M2|  Jan| 22|
| M2|  Feb| 33|
| M3|  Jan| 30|
| M3|  Feb| 30|
| M4|  Jan| 40|
| M4|  Feb| 39|
+---+-----+---+

which was created with:

// In spark-shell the implicits are already in scope; in an application you need:
// import spark.implicits._
val df1 = Seq(
  ("M1", "Jan", "10"),
  ("M1", "Feb", "20"),
  ("M1", "Mar", "30"),
  ("M2", "Jan", "22"),
  ("M2", "Feb", "33"),
  ("M3", "Jan", "30"),
  ("M3", "Feb", "30"),
  ("M4", "Jan", "40"),
  ("M4", "Feb", "39")
).toDF("ID", "Month", "val")
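Note that val is created as a string column here. For this particular data, string and numeric comparison happen to agree, but if the values can differ in digit count (e.g. "9" vs "10"), a lexicographic comparison gives the wrong answer. A minimal sketch of casting the column first, assuming numeric comparison is what's intended:

// Cast the string column "val" to an integer so that later comparisons
// are numeric rather than lexicographic (as strings, "9" > "10" is true).
val dfTyped = df1.withColumn("val", $"val".cast("int"))
dfTyped.printSchema()
// root
//  |-- ID: string (nullable = true)
//  |-- Month: string (nullable = true)
//  |-- val: integer (nullable = true)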

I have to find the rows where val is less than or equal to the previous month's val. In the expected result below, those rows get Res = false; every other row, including each ID's first month, gets Res = true.

Expected result:

+---+-----+---+-----+
| ID|Month|val|  Res|
+---+-----+---+-----+
| M1|  Jan| 10| true|
| M1|  Feb| 20| true|
| M1|  Mar| 30| true|
| M2|  Jan| 22| true|
| M2|  Feb| 33| true|
| M3|  Jan| 30| true|
| M3|  Feb| 30|false|
| M4|  Jan| 40| true|
| M4|  Feb| 39|false|
+---+-----+---+-----+

Tags: scala, apache-spark, apache-spark-sql

Solution


You can use lag over a window partitioned by ID and ordered by the month converted to a date:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df2 = df1.withColumn(
    "Res",
    // Compare each val to the previous month's val within the same ID.
    // lag returns null for the first month of each ID, so coalesce
    // defaults those rows to true.
    coalesce(
        $"val" > lag($"val", 1).over(Window.partitionBy("ID").orderBy(to_date($"Month", "MMM"))),
        lit(true)
    )
).orderBy($"ID", to_date($"Month", "MMM"))
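If parsing a bare "MMM" pattern with to_date misbehaves on your Spark version (the datetime parser changed between Spark 2.x and 3.x), ordering the window by an explicit month number gives the same result. A sketch, assuming the imports from the snippet above:

// Map the three-letter month abbreviation to its number so the window
// can order chronologically without any date parsing.
val monthNum = map(
  lit("Jan"), lit(1), lit("Feb"), lit(2), lit("Mar"), lit(3),
  lit("Apr"), lit(4), lit("May"), lit(5), lit("Jun"), lit(6),
  lit("Jul"), lit(7), lit("Aug"), lit(8), lit("Sep"), lit(9),
  lit("Oct"), lit(10), lit("Nov"), lit(11), lit("Dec"), lit(12)
)($"Month")

val w = Window.partitionBy("ID").orderBy(monthNum)
val df2b = df1.withColumn("Res", coalesce($"val" > lag($"val", 1).over(w), lit(true)))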

df2.show
+---+-----+---+-----+
| ID|Month|val|  Res|
+---+-----+---+-----+
| M1|  Jan| 10| true|
| M1|  Feb| 20| true|
| M1|  Mar| 30| true|
| M2|  Jan| 22| true|
| M2|  Feb| 33| true|
| M3|  Jan| 30| true|
| M3|  Feb| 30|false|
| M4|  Jan| 40| true|
| M4|  Feb| 39|false|
+---+-----+---+-----+
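Since the question asks for the rows whose val did not increase, you can also filter on the computed flag directly:

// Keep only the rows where val is less than or equal to the previous month's val
df2.filter(!$"Res").show
// +---+-----+---+-----+
// | ID|Month|val|  Res|
// +---+-----+---+-----+
// | M3|  Feb| 30|false|
// | M4|  Feb| 39|false|
// +---+-----+---+-----+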
