scala - How to get row_number() based on a condition in Spark using Scala
Question
I have the following DataFrame:
+----+-----+---+
| val|count| id|
+----+-----+---+
| a| 10| m1|
| b| 20| m1|
|null| 30| m1|
| b| 30| m2|
| c| 40| m2|
|null| 50| m2|
+----+-----+---+
It was created with:
val df1=Seq(
("a","10","m1"),
("b","20","m1"),
(null,"30","m1"),
("b","30","m2"),
("c","40","m2"),
(null,"50","m2")
).toDF("val","count","id")
I am trying to rank the rows with row_number() and a window function, as shown below.
df1.withColumn("rannk_num", row_number() over Window.partitionBy("id").orderBy("count")).show
+----+-----+---+---------+
| val|count| id|rannk_num|
+----+-----+---+---------+
| a| 10| m1| 1|
| b| 20| m1| 2|
|null| 30| m1| 3|
| b| 30| m2| 1|
| c| 40| m2| 2|
|null| 50| m2| 3|
+----+-----+---+---------+
However, rows whose val column is null must be left out of the ranking.
Expected output:
+----+-----+---+---------+
| val|count| id|rannk_num|
+----+-----+---+---------+
| a| 10| m1| 1|
| b| 20| m1| 2|
|null| 30| m1| NULL|
| b| 30| m2| 1|
| c| 40| m2| 2|
|null| 50| m2| NULL|
+----+-----+---+---------+
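Can this be achieved with minimal changes? The val and count columns can hold any number of distinct values.
Solution
Filter out the rows where val is null, assign them a null row number, and union them with the ranked non-null rows.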
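import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

val df1 = Seq(
  ("a", "10", "m1"),
  ("b", "20", "m1"),
  (null, "30", "m1"),
  ("b", "30", "m2"),
  ("c", "40", "m2"),
  (null, "50", "m2")
).toDF("val", "count", "id")

// Rank only the rows whose val is not null.
df1.filter("val is not null").withColumn(
  "rannk_num", row_number() over Window.partitionBy("id").orderBy("count")
).union(
  // Rows with a null val get a null rank and are appended after the ranked rows.
  df1.filter("val is null").withColumn("rannk_num", lit(null))
).show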
+----+-----+---+---------+
| val|count| id|rannk_num|
+----+-----+---+---------+
| a| 10| m1| 1|
| b| 20| m1| 2|
| b| 30| m2| 1|
| c| 40| m2| 2|
|null| 30| m1| null|
|null| 50| m2| null|
+----+-----+---+---------+
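As a possible alternative to the filter-and-union approach, the same result can probably be reached in a single pass by adding the null check to the window partition and wrapping row_number() in when(...). This is a minimal sketch, not part of the original answer: the local SparkSession, the app name, and the names w and ranked are assumptions made for illustration. In spark-shell, the session already exists and the builder line can be skipped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, when}

// Assumption: a local session for the sketch; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("conditional-row-number")
  .getOrCreate()
import spark.implicits._

val df1 = Seq(
  ("a", "10", "m1"),
  ("b", "20", "m1"),
  (null, "30", "m1"),
  ("b", "30", "m2"),
  ("c", "40", "m2"),
  (null, "50", "m2")
).toDF("val", "count", "id")

// Partition by id AND by whether val is null, so that null rows are numbered
// in their own group and never shift the ranks of the non-null rows.
val w = Window.partitionBy(col("id"), col("val").isNull).orderBy(col("count"))

// when(...) without otherwise(...) yields null where the condition is false,
// so rows with a null val end up with a null rannk_num.
val ranked = df1.withColumn(
  "rannk_num",
  when(col("val").isNotNull, row_number().over(w))
)

// Sort only to make the printed order stable.
ranked.orderBy("id", "count").show()
```

Note that count is a string column in this example, so orderBy sorts it lexicographically; if the real data is numeric, ordering by col("count").cast("int") would likely be safer.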