Generate a row for every day and forward-fill a column

Problem description

I have a dataset with a few sparse dates and values:

  date   | value
12/01/20 |   1
12/04/20 |   2
12/08/20 |   3

I want to create a row for every date in between, forward-filling the last known value, for example:

  date   | value
12/01/20 |   1
12/02/20 |   1
12/03/20 |   1
12/04/20 |   2
12/05/20 |   2
12/06/20 |   2
12/07/20 |   2
12/08/20 |   3

Thanks!

Tags: python, pyspark

Solution

The following code should be close to what you are looking for.

from pyspark.sql import functions as F
from pyspark.sql import Window
import datetime

df_all = spark.createDataFrame([
  {"date": datetime.date(2020, 12, 1), "value": 1},
  {"date": datetime.date(2020, 12, 4), "value": 2},
  {"date": datetime.date(2020, 12, 8), "value": 3}
])
df_all.show()
"""
+----------+-----+
|      date|value|
+----------+-----+
|2020-12-01|    1|
|2020-12-04|    2|
|2020-12-08|    3|
+----------+-----+
"""

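# No partitionBy here, so Spark moves all rows to a single partition to order them by date.
window = Window.orderBy("date")

# Attach the previous row's date and value to each row.
df_with_previous_date = df_all.withColumn("previous_date", F.lag("date", 1).over(window))
df_with_previous_value = df_with_previous_date.withColumn("previous_value", F.lag("value", 1).over(window))
# datediff(end, start) returns end - start, so passing previous_date first gives a negative gap.
# Adding 1 stops the offsets one day short of the previous date, which already has its own row.
# The first row has no previous date, so coalesce falls back to 0.
df_with_days_between = df_with_previous_value.withColumn(
  "days_between",
  F.coalesce(
    F.datediff("previous_date", "date") + 1,
    F.lit(0)
  )
)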

df_with_days_between.show()
"""
+----------+-----+-------------+--------------+------------+
|      date|value|previous_date|previous_value|days_between|
+----------+-----+-------------+--------------+------------+
|2020-12-01|    1|         null|          null|           0|
|2020-12-04|    2|   2020-12-01|             1|          -2|
|2020-12-08|    3|   2020-12-04|             2|          -3|
+----------+-----+-------------+--------------+------------+
"""


# With a negative end value, sequence counts down by 1: [0, -1, ..., days_between].
df_with_sequence = df_with_days_between.withColumn("day_offset_sequence", F.sequence(F.lit(0), "days_between"))
df_with_sequence.show()
"""
+----------+-----+-------------+--------------+------------+-------------------+
|      date|value|previous_date|previous_value|days_between|day_offset_sequence|
+----------+-----+-------------+--------------+------------+-------------------+
|2020-12-01|    1|         null|          null|           0|                [0]|
|2020-12-04|    2|   2020-12-01|             1|          -2|        [0, -1, -2]|
|2020-12-08|    3|   2020-12-04|             2|          -3|    [0, -1, -2, -3]|
+----------+-----+-------------+--------------+------------+-------------------+
"""


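# explode yields one row per offset; adding the (non-positive) offset to the date steps back toward the previous date.
df_exploded = df_with_sequence.withColumn("day_offset", F.explode("day_offset_sequence"))
df_range = df_exploded.withColumn("date_index", F.col("date") + F.col("day_offset"))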
df_range.show()

"""
+----------+-----+-------------+--------------+------------+-------------------+----------+----------+
|      date|value|previous_date|previous_value|days_between|day_offset_sequence|day_offset|date_index|
+----------+-----+-------------+--------------+------------+-------------------+----------+----------+
|2020-12-01|    1|         null|          null|           0|                [0]|         0|2020-12-01|
|2020-12-04|    2|   2020-12-01|             1|          -2|        [0, -1, -2]|         0|2020-12-04|
|2020-12-04|    2|   2020-12-01|             1|          -2|        [0, -1, -2]|        -1|2020-12-03|
|2020-12-04|    2|   2020-12-01|             1|          -2|        [0, -1, -2]|        -2|2020-12-02|
|2020-12-08|    3|   2020-12-04|             2|          -3|    [0, -1, -2, -3]|         0|2020-12-08|
|2020-12-08|    3|   2020-12-04|             2|          -3|    [0, -1, -2, -3]|        -1|2020-12-07|
|2020-12-08|    3|   2020-12-04|             2|          -3|    [0, -1, -2, -3]|        -2|2020-12-06|
|2020-12-08|    3|   2020-12-04|             2|          -3|    [0, -1, -2, -3]|        -3|2020-12-05|
+----------+-----+-------------+--------------+------------+-------------------+----------+----------+
"""

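# Offset 0 is the original row and keeps its own value; the generated rows take the previous row's value.
df_true_value = df_range.withColumn(
  "true_value",
  F.when(
    F.col("day_offset") == F.lit(0),
    F.col("value")
  ).otherwise(
    F.col("previous_value")
  )
)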
df = df_true_value.select(
  F.col("date_index").alias("date"),
  F.col("true_value").alias("value")
).orderBy("date")
df.show()
"""
+----------+-----+
|      date|value|
+----------+-----+
|2020-12-01|    1|
|2020-12-02|    1|
|2020-12-03|    1|
|2020-12-04|    2|
|2020-12-05|    2|
|2020-12-06|    2|
|2020-12-07|    2|
|2020-12-08|    3|
+----------+-----+
"""
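
To sanity-check the result on a small input, the same idea can be sketched in plain Python without Spark. This is only an illustration and not part of the answer above: `forward_fill_daily` is a hypothetical helper name, and the sketch assumes every input row has a distinct date.

```python
import datetime


def forward_fill_daily(rows):
    """Return one (date, value) pair per day, forward-filling the last value.

    Hypothetical helper: the days strictly between two consecutive rows
    are filled with the earlier row's value. Assumes distinct dates.
    """
    filled = []
    previous = None
    for date, value in sorted(rows):
        if previous is not None:
            prev_date, prev_value = previous
            gap = (date - prev_date).days
            # Emit the missing days between the previous row and this one.
            for offset in range(1, gap):
                missing_day = prev_date + datetime.timedelta(days=offset)
                filled.append((missing_day, prev_value))
        filled.append((date, value))
        previous = (date, value)
    return filled


sparse = [
    (datetime.date(2020, 12, 1), 1),
    (datetime.date(2020, 12, 4), 2),
    (datetime.date(2020, 12, 8), 3),
]
for day, value in forward_fill_daily(sparse):
    print(day, value)
```

For the three sample rows this yields eight rows, one per day from 2020-12-01 to 2020-12-08, which can be compared against the Spark output.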
