首页 > 解决方案 > Python Spark - 阈值后删除数据 - Pyspark

问题描述

如何在最后一次TP == 1使用 48 小时的缓冲区后删除数据?

例如ID = A9,最后一个TP == 1是 on 2020-05-06 13:00。我想将该组 ID 的所有数据保留到2020-05-06 13:00最后TP == 1加上接下来 2 天的位置?

+---++--------+----------------+
| id|       TP|            Date|
+---+---------+----------------+
| A1|     Null|2010-01-01 12:00|
| A1|     Null|2010-01-01 13:00|
| A1|        1|2010-01-02 01:00|
| A1|     Null|2010-01-02 02:00|
| A9|     Null|2010-05-05 12:00|
| A9|        1|2010-05-05 13:00|
| A9|        1|2010-05-06 13:00|
| A9|     Null|2010-05-09 13:00|
+---+---------+----------------+

所需的数据框

+---++--------+----------------+
| id|       TP|            Date|
+---+---------+----------------+
| A1|     Null|2010-01-01 12:00|
| A1|     Null|2010-01-01 13:00|
| A1|        1|2010-01-02 01:00|
| A1|     Null|2010-01-02 02:00|
| A9|     Null|2010-05-05 12:00|
| A9|        1|2010-05-05 13:00|
| A9|        1|2010-05-06 13:00|
+---+---------+----------------+

这就是我在 Pandas 中所做的,但对于 15M+ 的观察来说效率不高

main_pd = main.toPandas()

bigdf = pd.DataFrame()

for i in main_pd.ID.unique():
  df = main_pd[main_pd.ID == i]
  TPdate = df[df.TP == 1]['Date'].max()+pd.Timedelta('3 days 0 hours')
  df = df[(df.Date <= TPdate)]
  bigdf = bigdf.append(df)

标签: pythonapache-sparkpyspark

解决方案


IIUC,您可以使用 Window 函数查找max(IF(TP=1, Date, NULL))每个id然后按此阈值过滤:

from pyspark.sql import Window, functions as F
w1 = Window.partitionBy('id')

df_new = df.withColumn('Date', F.to_timestamp('Date', 'yyyy-MM-dd HH:mm')) \
    .withColumn('threshhold_date', F.expr("max(IF(TP=1, Date, NULL))").over(w1)) \
    .filter('Date <= threshhold_date + interval 2 days') 
df_new.show()
+---+----+-------------------+-------------------+
| id|  TP|               Date|    threshhold_date|
+---+----+-------------------+-------------------+
| A9|Null|2010-05-05 12:00:00|2010-05-06 13:00:00|
| A9|   1|2010-05-05 13:00:00|2010-05-06 13:00:00|
| A9|   1|2010-05-06 13:00:00|2010-05-06 13:00:00|
| A1|Null|2010-01-01 12:00:00|2010-01-02 01:00:00|
| A1|Null|2010-01-01 13:00:00|2010-01-02 01:00:00|
| A1|   1|2010-01-02 01:00:00|2010-01-02 01:00:00|
| A1|Null|2010-01-02 02:00:00|2010-01-02 01:00:00|
+---+----+-------------------+-------------------+

推荐阅读