首页 > 解决方案 > 如何在 Pyspark windows 函数中使用 lag 和 Rangebetween

问题描述

我的数据如下所示。

+------------------+--------------------+----------------+
|               out|           timestamp|        Sequence|
+------------------+--------------------+----------------+
|0.5202757120132446|2019-11-07 00:00:...|               1|
|              null|2019-11-07 00:00:...|               2|
|              null|2019-11-07 00:00:...|               3|
|              null|2019-11-07 00:00:...|               4|
|0.5220348834991455|2019-11-07 00:00:...|               5|
| 0.724998414516449|2019-11-07 00:00:...|               6|
|              null|2019-11-07 00:00:...|               7|
|              null|2019-11-07 00:00:...|               8|
|0.7322611212730408|2019-11-07 00:00:...|               9|
|              null|2019-11-07 00:00:...|              10|
|              null|2019-11-07 00:00:...|              11|

现在我想用以前的序列值替换空值。我正在使用 Windows 功能来实现这一点,但出现以下错误

'Window Frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW must match the required frame ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING;'

我的代码:

window1 =Window.partitionBy('timestamp').orderBy('Sequence').rangeBetween(Window.unboundedPreceding,0)
df = df.withColumn('out',F.when(F.col('out').isNull(),F.lag('out').over(window1)).otherwise(F.col('out')))

标签: apache-sparkpysparkwindow-functions

解决方案


import sys
import pyspark.sql.functions as f
df.withColumn("newout", f.last('out', True).over(Window.partitionBy('timestamp').orderBy('sequence').rowsBetween(-sys.maxsize, 0))).show()

+------------------+--------------------+--------+------------------+
|               out|           timestamp|sequence|            newout|
+------------------+--------------------+--------+------------------+
|0.5202757120132446|2019-11-07 00:00:...|       1|0.5202757120132446|
|              null|2019-11-07 00:00:...|       2|0.5202757120132446|
|              null|2019-11-07 00:00:...|       3|0.5202757120132446|
|              null|2019-11-07 00:00:...|       4|0.5202757120132446|
|0.5220348834991455|2019-11-07 00:00:...|       5|0.5220348834991455|
| 0.724998414516449|2019-11-07 00:00:...|       6| 0.724998414516449|
|              null|2019-11-07 00:00:...|       7| 0.724998414516449|
|              null|2019-11-07 00:00:...|       8| 0.724998414516449|
|0.7322611212730408|2019-11-07 00:00:...|       9|0.7322611212730408|
|              null|2019-11-07 00:00:...|      10|0.7322611212730408|
|              null|2019-11-07 00:00:...|      11|0.7322611212730408|
+------------------+--------------------+--------+------------------+

推荐阅读