PySpark: forward fill only at a specific hour and minute of each date

Problem description

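How can I forward fill only the rows whose timestamp is 00:00:00?

Every date has a row at 00:00:00 whose value is null, because the sensor is not working properly at that time. Nulls also occur at other times, and those must be kept as they are.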

+---+-------------------+-----+
| id|               date|value|
+---+-------------------+-----+
| A1|2016-09-30 23:00:00|    3|
| A1|2016-10-01 00:00:00| Null|
| A1|2016-10-01 01:00:00|    1|
| A1|2016-10-01 02:30:30|    3|
| A9|2016-10-05 23:00:00|    3|
| A9|2016-10-06 00:00:00| Null|
| A9|2016-10-06 02:20:00|    4|
| A9|2016-10-06 03:20:00| Null|
+---+-------------------+-----+

Desired DataFrame:

+---+-------------------+-----+
| id|               date|value|
+---+-------------------+-----+
| A1|2016-09-30 23:00:00|    3|
| A1|2016-10-01 00:00:00|    3|
| A1|2016-10-01 01:00:00|    1|
| A1|2016-10-01 02:30:30|    3|
| A9|2016-10-05 23:00:00|    3|
| A9|2016-10-06 00:00:00|    3|
| A9|2016-10-06 02:20:00|    4|
| A9|2016-10-06 03:20:00| Null|
+---+-------------------+-----+
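
To pin the requirement down, the rule can be written as a small plain-Python reference (not Spark code). This is only a sketch: the function name reference_fill and the tuple layout are made up for illustration, and it assumes the rows are already sorted by id and then by date.

```python
from datetime import datetime, time


def reference_fill(rows):
    """Fill a null at 00:00:00 with the previous row's value for the same id.

    rows: list of (id, datetime, value) tuples, sorted by id and then datetime.
    Nulls at any other time of day are returned unchanged.
    """
    filled = []
    prev_id, prev_value = None, None
    for row_id, ts, value in rows:
        is_midnight = ts.time() == time(0, 0, 0)
        if value is None and is_midnight and row_id == prev_id:
            filled.append((row_id, ts, prev_value))
        else:
            filled.append((row_id, ts, value))
        # Remember the raw previous value (not the filled one) for the next row.
        prev_id, prev_value = row_id, value
    return filled


rows = [
    ("A1", datetime(2016, 9, 30, 23, 0, 0), 3),
    ("A1", datetime(2016, 10, 1, 0, 0, 0), None),
    ("A1", datetime(2016, 10, 1, 1, 0, 0), 1),
    ("A1", datetime(2016, 10, 1, 2, 30, 30), 3),
    ("A9", datetime(2016, 10, 5, 23, 0, 0), 3),
    ("A9", datetime(2016, 10, 6, 0, 0, 0), None),
    ("A9", datetime(2016, 10, 6, 2, 20, 0), 4),
    ("A9", datetime(2016, 10, 6, 3, 20, 0), None),
]

for row in reference_fill(rows):
    print(row)
```

The two 00:00:00 rows pick up the value 3 from the row before them, while the null at 03:20:00 is left alone.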

Tags: python, apache-spark, pyspark

Solution


You can use the lag window function:

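from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Order each id's rows by time so lag() returns the previous reading.
w = Window.partitionBy("id").orderBy("date")

# Fill only the null readings stamped 00:00:00; leave all other rows as they are.
is_midnight = F.col("date").cast("string").like("%00:00:00")
df.withColumn(
    "value",
    F.when(is_midnight & F.col("value").isNull(), F.lag("value").over(w))
     .otherwise(F.col("value")),
).show()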

+---+-------------------+-----+
| id|               date|value|
+---+-------------------+-----+
| A1|2016-09-30 23:00:00|    3|
| A1|2016-10-01 00:00:00|    3|
| A1|2016-10-01 01:00:00|    1|
| A1|2016-10-01 02:30:30|    3|
| A9|2016-10-05 23:00:00|    3|
| A9|2016-10-06 00:00:00|    3|
| A9|2016-10-06 02:20:00|    4|
| A9|2016-10-06 03:20:00| null|
+---+-------------------+-----+
