python - How to find the total time an employee spent in the office using PySpark, excluding the gaps between OUT and IN events
Problem description
Sample data
Eid, TS, Event
1,2020-12-30T09:00:00, IN
1,2020-12-30T13:00:00, OUT
1,2020-12-30T14:00:00, IN
1,2020-12-30T17:00:00, OUT
1,2020-12-30T17:30:00, IN
1,2020-12-30T20:00:00, OUT
2,2020-12-30T10:30:00, IN
2,2020-12-30T15:00:00, OUT
2,2020-12-30T16:30:00, IN
2,2020-12-30T21:30:00, IN
2,2020-12-30T22:30:00, OUT
1,2020-12-31T09:00:00, IN
1,2020-12-31T13:00:00, OUT
1,2020-12-31T14:00:00, IN
1,2020-12-31T17:00:00, OUT
1,2020-12-31T17:30:00, IN
1,2020-12-31T20:00:00, OUT
2,2020-12-31T10:30:00, IN
2,2020-12-31T15:00:00, OUT
2,2020-12-31T16:30:00, IN
2,2020-12-31T21:30:00, IN
2,2020-12-31T23:30:00, OUT
My approach
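# In[1]:
# Imports used below; `spark` is the notebook's existing SparkSession
from pyspark.sql.functions import col, lag, lead, sum, when
from pyspark.sql.types import DateType
from pyspark.sql.window import Window

# Read the file
df = spark.read.csv('lag_lead.csv', inferSchema=True, header=True)
df.show(100, False)
# In[2]:
# r_Date: calendar day of the event; new_ts: timestamp as epoch seconds
df = df.withColumn('r_Date', col('ts').cast(DateType()))\
    .withColumn('new_ts', col('ts').cast('long'))
# In[3]:
# Seconds from each event to the next event of the same employee and day
w = Window.partitionBy('eid', 'r_Date').orderBy(col('ts'))
df = df.withColumn('leading', lead('new_ts').over(w) - col('new_ts'))
df.show(100, False)
# Keep only the time spent outside (OUT until the next event), negated.
# IN rows contribute 0; otherwise('leading') would pass a string literal, not the column.
df = df.withColumn('leading', when(col('event') == 'OUT', col('leading') * -1).otherwise(0))
# In[4]:
# Convert the per-event value to hours
df = df.groupBy('eid', 'event', 'ts', 'r_Date')\
    .agg(sum('leading').alias('sum_leading'))\
    .select('eid', 'event', 'ts', 'r_date', (col('sum_leading') / 3600).alias('sum_leading'))
# Hours elapsed since the previous event of the same employee and day
df = df.withColumn('find_total', (col('ts').cast('long') - lag(col('ts').cast('long'))
    .over(Window.partitionBy('eid', 'r_date').orderBy('ts'))) / 3600)\
    .fillna(0, subset=('find_total', 'sum_leading'))
# In[5]:
# Elapsed hours minus hours spent outside, summed per employee and day
df = df.withColumn('final_total_hrs', col('find_total') + col('sum_leading'))
df.groupBy('eid', 'r_date').agg(sum('final_total_hrs').alias('spent_hrs')).show()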
Output
+---+----------+---------+
|eid|    r_date|spent_hrs|
+---+----------+---------+
|  2|2020-12-31|     11.5|
|  1|2020-12-31|      9.5|
|  2|2020-12-30|     10.5|
|  1|2020-12-30|      9.5|
+---+----------+---------+
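The figures in this output can be checked by hand: take the span between the first and last event of the day and subtract the time that follows each OUT. The plain-Python sketch below does this for employee 1 on 2020-12-30, using rows copied from the sample data. The helper name `spent_hours` is only illustrative and is not part of the original code.

```python
from datetime import datetime

# Events for employee 1 on 2020-12-30, copied from the sample data.
events = [
    ("2020-12-30T09:00:00", "IN"),
    ("2020-12-30T13:00:00", "OUT"),
    ("2020-12-30T14:00:00", "IN"),
    ("2020-12-30T17:00:00", "OUT"),
    ("2020-12-30T17:30:00", "IN"),
    ("2020-12-30T20:00:00", "OUT"),
]


def spent_hours(events):
    """Span from first to last event, minus the time after each OUT, in hours."""
    times = [datetime.fromisoformat(ts) for ts, _ in events]
    span = (times[-1] - times[0]).total_seconds()
    away = sum(
        (times[i + 1] - times[i]).total_seconds()
        for i in range(len(events) - 1)
        if events[i][1] == "OUT"
    )
    return (span - away) / 3600


print(spent_hours(events))  # 11 h span - 1.5 h away = 9.5
```

The same rule gives 10.5 for employee 2 on 2020-12-30, where the two consecutive IN events simply count as continuous presence.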
Question
Is there a more optimized way to implement the solution above?
Approach
I used lead to find the difference between consecutive events,
converting the timestamps to long first because the difference of two timestamps is an
interval type, and then summed the values.
Can anyone help me optimize this so that it needs fewer lines of code?
Solution