apache-spark - 如何在 Pyspark windows 函数中使用 lag 和 Rangebetween
问题描述
我的数据如下所示。
+------------------+--------------------+----------------+
| out| timestamp| Sequence|
+------------------+--------------------+----------------+
|0.5202757120132446|2019-11-07 00:00:...| 1|
| null|2019-11-07 00:00:...| 2|
| null|2019-11-07 00:00:...| 3|
| null|2019-11-07 00:00:...| 4|
|0.5220348834991455|2019-11-07 00:00:...| 5|
| 0.724998414516449|2019-11-07 00:00:...| 6|
| null|2019-11-07 00:00:...| 7|
| null|2019-11-07 00:00:...| 8|
|0.7322611212730408|2019-11-07 00:00:...| 9|
| null|2019-11-07 00:00:...| 10|
| null|2019-11-07 00:00:...| 11|
现在我想用以前的序列值替换空值。我正在使用 Windows 功能来实现这一点,但出现以下错误
'Window Frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW must match the required frame ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING;'
我的代码:
window1 =Window.partitionBy('timestamp').orderBy('Sequence').rangeBetween(Window.unboundedPreceding,0)
df = df.withColumn('out',F.when(F.col('out').isNull(),F.lag('out').over(window1)).otherwise(F.col('out')))
解决方案
import sys
import pyspark.sql.functions as f
df.withColumn("newout", f.last('out', True).over(Window.partitionBy('timestamp').orderBy('sequence').rowsBetween(-sys.maxsize, 0))).show()
+------------------+--------------------+--------+------------------+
| out| timestamp|sequence| newout|
+------------------+--------------------+--------+------------------+
|0.5202757120132446|2019-11-07 00:00:...| 1|0.5202757120132446|
| null|2019-11-07 00:00:...| 2|0.5202757120132446|
| null|2019-11-07 00:00:...| 3|0.5202757120132446|
| null|2019-11-07 00:00:...| 4|0.5202757120132446|
|0.5220348834991455|2019-11-07 00:00:...| 5|0.5220348834991455|
| 0.724998414516449|2019-11-07 00:00:...| 6| 0.724998414516449|
| null|2019-11-07 00:00:...| 7| 0.724998414516449|
| null|2019-11-07 00:00:...| 8| 0.724998414516449|
|0.7322611212730408|2019-11-07 00:00:...| 9|0.7322611212730408|
| null|2019-11-07 00:00:...| 10|0.7322611212730408|
| null|2019-11-07 00:00:...| 11|0.7322611212730408|
+------------------+--------------------+--------+------------------+
推荐阅读
- svelte - 在 Svelte 组件中单击焦点切换
- ios - Int Variable 不允许我从 UI 文本字段中添加数据
- python - 根据条件将值插入数据框列
- swift - 带有字符串和 UIImages 的 Swift Codable 协议
- c++ - WINAPI 事件对象意外行为
- asp.net-mvc - 即使我可以看到它在浏览器中进行检查,图像也不会显示
- python - Python - 从字典中更改原始数据框对象?
- sql - 屏蔽列时从表中选择
- python - ecdsa 签名密钥格式
- python - 如何将 Keras MNIST 数据集与我自己的 MNIST 图像结合起来?