首页 > 解决方案 > Spark(Scala)成对减去数据框中的所有行

问题描述

我有一个数据框,看起来类似于:

+-----+-----+------+-----+
|col1 |col2 |col3  |col4 |
+-----+-----+------+-----+
|1.1  |2.3  |10.0  |1    |
|2.2  |1.5  |5.0   |1    |
|3.3  |1.3  |1.5   |1    |
|4.4  |0.5  |7.0   |1    |
|5.5  |1.2  |8.1   |2    |
|6.6  |2.3  |8.2   |2    |
|7.7  |4.5  |10.3  |2    |
+-----+-----+------+-----+

我想从上面的行中减去每一行,但前提是它们在 col4 中有相同的条目,所以 2-1、3-2 但不是 5-4。col4 也不应该改变,所以结果是

+-----+-----+------+------+
|col1 |col2 |col3  |col4  |
+-----+-----+------+------+
|1.1  |-0.8 |-5.0  |1     |
|1.1  |-0.2 |-3.5  |1     |
|1.1  |-0.8 |5.5   |1     |
|1.1  |1.1  |0.1   |2     |
|1.1  |2.2  |2.1   |2     |
+-----+-----+------+------+

这听起来很简单,但我似乎无法弄清楚

标签: scalaapache-spark

解决方案


您可以使用spark-sql完成此操作,即使用您的数据框创建一个临时视图并应用以下 sql。它使用窗口函数减去由 排序和分区LAG的前一行值。每个分组中的第一行值由使用和过滤标识。col1col4col4row_number

df.createOrReplaceTempView('my_temp_view')

results = sparkSession.sql('<insert sql below here>')
SELECT
     col1,
     col2,
     col3,
     col4
FROM (
SELECT
    (col1 - (LAG(col1,1,0) OVER (PARTITION BY col4 ORDER BY col1) )) as col1,
    (col2 - (LAG(col2,1,0) OVER (PARTITION BY col4 ORDER BY col1) )) as col2,
    (col3 - (LAG(col3,1,0) OVER (PARTITION BY col4 ORDER BY col1) )) as col3,
    col4,
    ROW_NUMBER() OVER (PARTITION BY col4 ORDER BY col1) rn
FROM
    my_temp_view
) t 
WHERE rn <> 1

db小提琴


推荐阅读