apache-spark - PySpark:如何计算满足最后一个条件之间的天数(正面与负面)
问题描述
当前 DF(通过单个 userId 过滤,当 loss > 0 时 flag 为 1,当 <=0 时为 -1):
display(df):
+------+----------+---------+----+
| user|Date |RealLoss |flag|
+------+----------+---------+----+
|100364|2019-02-01| -16.5| 1|
|100364|2019-02-02| 73.5| -1|
|100364|2019-02-03| 31| -1|
|100364|2019-02-09| -5.2| 1|
|100364|2019-02-10| -34.5| 1|
|100364|2019-02-13| -8.1| 1|
|100364|2019-02-18| 5.68| -1|
|100364|2019-02-19| 5.76| -1|
|100364|2019-02-20| 9.12| -1|
|100364|2019-02-26| 9.4| -1|
|100364|2019-02-27| -30.6| 1|
+----------+------+---------+----+
期望的结果 df 应该显示自 lastwin ('RecencyLastWin') 和 lastloss ('RecencyLastLoss') 以来的天数
display(df):
+------+----------+---------+----+--------------+---------------+
| user|Date |RealLoss |flag|RecencyLastWin|RecencyLastLoss|
+------+----------+---------+----+--------------+---------------+
|100364|2019-02-01| -16.5| 1| null| null|
|100364|2019-02-02| 73.5| -1| 1| null|
|100364|2019-02-03| 31| -1| 2| 1|
|100364|2019-02-09| -5.2| 1| 8| 6|
|100364|2019-02-10| -34.5| 1| 1| 7|
|100364|2019-02-13| -8.1| 1| 1| 10|
|100364|2019-02-18| 5.68| -1| 5| 15|
|100364|2019-02-19| 5.76| -1| 6| 1|
|100364|2019-02-20| 9.12| -1| 7| 1|
|100364|2019-02-26| 9.4| -1| 13| 6|
|100364|2019-02-27| -30.6| 1| 14| 1|
+----------+------+---------+----+--------------+---------------+
我的方法如下:
from pyspark.sql.window import Window
w = Window.partitionBy("userId", 'PlayerSiteCode').orderBy("EventDate")
last_positive = check.filter('flag = "1"').withColumn('last_positive_day' , F.lag('EventDate').over(w))
last_negative = check.filter('flag = "-1"').withColumn('last_negative_day' , F.lag('EventDate').over(w))
finalcheck = check.join(last_positive.select('userId', 'PlayerSiteCode', 'EventDate', 'last_positive_day'), ['userId', 'PlayerSiteCode', 'EventDate'], how = 'left')\
.join(last_negative.select('userId', 'PlayerSiteCode', 'EventDate', 'last_negative_day'), ['userId', 'PlayerSiteCode', 'EventDate'], how = 'left')\
.withColumn('previous_date_played' , F.lag('EventDate').over(w))\
.withColumn('last_positive_day_count', F.datediff(F.col('EventDate'), F.col('last_positive_day')))\
.withColumn('last_negative_day_count', F.datediff(F.col('EventDate'), F.col('last_negative_day')))
然后我尝试添加(多次尝试..)但未能“完美”返回我想要的。
finalcheck = finalcheck.withColumn('previous_last_pos' , F.last('last_positive_day_count', True).over(w2))\
.withColumn('previous_last_neg' , F.last('last_negative_day_count', True).over(w2))\
.withColumn('previous_last_pos_date' , F.last('last_positive_day', True).over(w2))\
.withColumn('previous_last_neg_date' , F.last('last_negative_day', True).over(w2))\
.withColumn('recency_last_positive' , F.datediff(F.col('EventDate'), F.col('previous_last_pos_date')))\
.withColumn('day_since_last_negative_v1' , F.datediff(F.col('EventDate'), F.col('previous_last_neg_date')))\
.withColumn('days_off' , F.datediff(F.col('EventDate'), F.col('previous_date_played')))\
.withColumn('recency_last_negative' , F.when((F.col('day_since_last_negative_v1').isNull()), F.col('days_off')).otherwise(F.col('day_since_last_negative_v1')))\
.withColumn('recency_last_negative_v2' , F.when((F.col('last_negative_day').isNull()), F.col('days_off')).otherwise(F.col('day_since_last_negative_v1')))\
.withColumn('recency_last_positive_v2' , F.when((F.col('last_positive_day').isNull()), F.col('days_off')).otherwise(F.col('recency_last_positive')))
有什么建议/提示吗?(我发现了一个类似的问题,但没有弄清楚如何在我的具体情况下申请): 如何计算满足最后一个条件的天数?
解决方案
这是我的尝试。
有两个部分来计算这个。第一个是当输赢继续时,则应将日期差相加。为此,我将连续输赢标记为 1,并通过累积求和将它们分成分区组,直到标记的当前行。然后,我可以计算从上一次输赢到连续输赢结束的累计天数。
第二个是当输赢发生变化时,简单地获取上一场比赛和本场比赛的日期差。可以很容易地通过当前日期和上一个日期的差异来获得。
最后,将这些结果合并到一个列中。
from pyspark.sql.functions import lag, col, sum
from pyspark.sql import Window
w1 = Window.orderBy('Date')
w2 = Window.partitionBy('groupLossCheck').orderBy('Date')
w3 = Window.partitionBy('groupWinCheck').orderBy('Date')
df2 = df.withColumn('lastFlag', lag('flag', 1).over(w1)) \
.withColumn('lastDate', lag('Date', 1).over(w1)) \
.withColumn('dateDiff', expr('datediff(Date, lastDate)')) \
.withColumn('consecutiveLoss', expr('if(flag = 1 or lastFlag = 1, 0, 1)')) \
.withColumn('consecutiveWin' , expr('if(flag = -1 or lastFlag = -1, 0, 1)')) \
.withColumn('groupLossCheck', sum('consecutiveLoss').over(w1)) \
.withColumn('groupWinCheck' , sum('consecutiveWin' ).over(w1)) \
.withColumn('daysLastLoss', sum(when((col('consecutiveLoss') == 0) & (col('groupLossCheck') != 0), col('dateDiff'))).over(w2)) \
.withColumn('daysLastwin' , sum(when((col('consecutiveWin' ) == 0) & (col('groupWinCheck' ) != 0), col('dateDiff'))).over(w3)) \
.withColumn('lastLoss', expr('if(lastFlag = -1, datediff, null)')) \
.withColumn('lastWin' , expr('if(lastFlag = 1, dateDiff, null)')) \
.withColumn('RecencyLastLoss', coalesce('lastLoss', 'daysLastLoss')) \
.withColumn('RecencyLastWin', coalesce('lastWin' , 'daysLastwin' )) \
.orderBy('Date')
df2.show(11, False)
+------+----------+--------+----+--------+----------+--------+---------------+--------------+--------------+-------------+------------+-----------+--------+-------+---------------+--------------+
|user |Date |RealLoss|flag|lastFlag|lastDate |dateDiff|consecutiveLoss|consecutiveWin|groupLossCheck|groupWinCheck|daysLastLoss|daysLastwin|lastLoss|lastWin|RecencyLastLoss|RecencyLastWin|
+------+----------+--------+----+--------+----------+--------+---------------+--------------+--------------+-------------+------------+-----------+--------+-------+---------------+--------------+
|100364|2019-02-01|-16.5 |1 |null |null |null |0 |1 |0 |1 |null |null |null |null |null |null |
|100364|2019-02-02|73.5 |-1 |1 |2019-02-01|1 |0 |0 |0 |1 |null |1 |null |1 |null |1 |
|100364|2019-02-03|31.0 |-1 |-1 |2019-02-02|1 |1 |0 |1 |1 |null |2 |1 |null |1 |2 |
|100364|2019-02-09|-5.2 |1 |-1 |2019-02-03|6 |0 |0 |1 |1 |6 |8 |6 |null |6 |8 |
|100364|2019-02-10|-34.5 |1 |1 |2019-02-09|1 |0 |1 |1 |2 |7 |null |null |1 |7 |1 |
|100364|2019-02-13|-8.1 |1 |1 |2019-02-10|3 |0 |1 |1 |3 |10 |null |null |3 |10 |3 |
|100364|2019-02-18|5.68 |-1 |1 |2019-02-13|5 |0 |0 |1 |3 |15 |5 |null |5 |15 |5 |
|100364|2019-02-19|5.76 |-1 |-1 |2019-02-18|1 |1 |0 |2 |3 |null |6 |1 |null |1 |6 |
|100364|2019-02-20|9.12 |-1 |-1 |2019-02-19|1 |1 |0 |3 |3 |null |7 |1 |null |1 |7 |
|100364|2019-02-26|9.4 |-1 |-1 |2019-02-20|6 |1 |0 |4 |3 |null |13 |6 |null |6 |13 |
|100364|2019-02-27|-30.6 |1 |-1 |2019-02-26|1 |0 |0 |4 |3 |1 |14 |1 |null |1 |14 |
+------+----------+--------+----+--------+----------+--------+---------------+--------------+--------------+-------------+------------+-----------+--------+-------+---------------+--------------+
df2.select(*df.columns, 'RecencyLastLoss', 'RecencyLastWin').show(11, False)
+------+----------+--------+----+---------------+--------------+
|user |Date |RealLoss|flag|RecencyLastLoss|RecencyLastWin|
+------+----------+--------+----+---------------+--------------+
|100364|2019-02-01|-16.5 |1 |null |null |
|100364|2019-02-02|73.5 |-1 |null |1 |
|100364|2019-02-03|31.0 |-1 |1 |2 |
|100364|2019-02-09|-5.2 |1 |6 |8 |
|100364|2019-02-10|-34.5 |1 |7 |1 |
|100364|2019-02-13|-8.1 |1 |10 |3 |
|100364|2019-02-18|5.68 |-1 |15 |5 |
|100364|2019-02-19|5.76 |-1 |1 |6 |
|100364|2019-02-20|9.12 |-1 |1 |7 |
|100364|2019-02-26|9.4 |-1 |6 |13 |
|100364|2019-02-27|-30.6 |1 |1 |14 |
+------+----------+--------+----+---------------+--------------+
推荐阅读
- selenium-chromedriver - 获取 ChromeDriver 安装失败“未找到中央目录记录签名的结尾”
- java - java.sql.SQLNonTransientException: [Amazon][JDBC](10900) 并非所有参数都已填充
- javascript - 为 React 应用程序配置 Firebase - 未定义的配置
- javascript - 如何处理 promise 和箭头函数返回响应
- spring - 如何使用 bean 方法名称注入依赖项?
- haskell - 元组+自定义数据类型理解列表
- python - Heroku 上的 Django 连接到 Azure 数据库
- python - 如何在 Python 中创建这个矩阵?
- excel - ODBC 连接在网络上失败,但在不在网络上的计算机上工作正常
- reactjs - 如何限制 React 中字符串数据的行数?