How to find total time of employee in the office using Pyspark considering and removing out and in time difference

Problem description

Sample data

    Eid, TS, Event
    1,2020-12-30T09:00:00, IN
    1,2020-12-30T13:00:00, OUT
    1,2020-12-30T14:00:00, IN
    1,2020-12-30T17:00:00, OUT
    1,2020-12-30T17:30:00, IN
    1,2020-12-30T20:00:00, OUT
    2,2020-12-30T10:30:00, IN
    2,2020-12-30T15:00:00, OUT
    2,2020-12-30T16:30:00, IN
    2,2020-12-30T21:30:00, IN
    2,2020-12-30T22:30:00, OUT
    1,2020-12-31T09:00:00, IN
    1,2020-12-31T13:00:00, OUT
    1,2020-12-31T14:00:00, IN
    1,2020-12-31T17:00:00, OUT
    1,2020-12-31T17:30:00, IN
    1,2020-12-31T20:00:00, OUT
    2,2020-12-31T10:30:00, IN
    2,2020-12-31T15:00:00, OUT
    2,2020-12-31T16:30:00, IN
    2,2020-12-31T21:30:00, IN
    2,2020-12-31T23:30:00, OUT

My approach

# In[1]:
# read the clock-in/clock-out file
from pyspark.sql.functions import col, lead, lag, when, sum   # this sum shadows the Python builtin
from pyspark.sql.types import DateType

df = spark.read.csv('lag_lead.csv', inferSchema=True, header=True)
df.show(100, False)

# In[2]:
# add a calendar date column and the timestamp as epoch seconds
df = df.withColumn('r_Date', col('ts').cast(DateType()))\
       .withColumn('new_ts', col('ts').cast('long'))

# In[3]:
from pyspark.sql.window import Window

# seconds until the next event of the same employee on the same day
w = Window.partitionBy('eid', 'r_Date').orderBy(col('ts'))
df = df.withColumn('leading', lead('new_ts').over(w) - col('new_ts'))
df.show(100, False)

# keep only the gap that follows an OUT, negated (time spent outside the office);
# all other rows become null and are treated as 0 later on
df = df.withColumn('leading', when(col('event') == 'OUT', col('leading') * -1))
# In[4]:
# sum the signed gap per event row and convert it to hours
df = df.groupBy('eid', 'event', 'ts', 'r_Date')\
       .agg(sum('leading').alias('sum_leading'))\
       .select('eid', 'event', 'ts', 'r_date', (col('sum_leading')/3600).alias('sum_leading'))

# hours elapsed since the previous event of the same employee on the same day
df = df.withColumn('find_total', (col('ts').cast('long') - lag(col('ts').cast('long'))
                                 .over(Window.partitionBy('eid', 'r_date').orderBy('ts')))/3600)\
       .fillna(0, subset=('find_total', 'sum_leading'))
# In[5]:
# add the two measures and total the hours per employee per day
df = df.withColumn('final_total_hrs', col('find_total') + col('sum_leading'))
df.groupBy('eid', 'r_date').agg(sum('final_total_hrs').alias('spent_hrs')).show()


Output

    eid|  r_date  |spent_hrs
      2|2020-12-31|     11.5
      1|2020-12-31|      9.5
      2|2020-12-30|     10.5
      1|2020-12-30|      9.5

Question: What is an optimized way of doing the above? Approach: I used lead to find the difference between consecutive events, and cast the timestamps to long because the difference of two timestamps is an interval type, then took the sum of the values. Can anyone help me optimize the code into fewer lines?

Tags: python, dataframe, apache-spark, pyspark, hive

Solution
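
One possible way to shorten this is to pair each IN directly with the OUT that follows it, instead of combining lead and lag results. The sketch below makes the same assumptions as the code above (the columns resolve to eid, ts and event, ts is inferred as a timestamp, and the file is lag_lead.csv); the intermediate column names (prev_event, next_ts, hrs) are only illustrative. It also drops repeated events such as the back-to-back INs for employee 2, keeping the first of each run, so that 16:30-22:30 counts as one interval.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.read.csv('lag_lead.csv', inferSchema=True, header=True)

# one window per employee and calendar day, ordered by event time
w = Window.partitionBy('eid', F.to_date('ts')).orderBy('ts')

spent = (df
    # keep only the first event of a run (IN, IN -> keep the first IN)
    .withColumn('prev_event', F.lag('event').over(w))
    .filter(F.col('prev_event').isNull() | (F.col('prev_event') != F.col('event')))
    # pair every IN with the OUT that follows it and convert the gap to hours
    .withColumn('next_ts', F.lead('ts').over(w))
    .filter(F.col('event') == 'IN')
    .withColumn('hrs', (F.col('next_ts').cast('long') - F.col('ts').cast('long')) / 3600)
    # total per employee and day
    .groupBy('eid', F.to_date('ts').alias('r_date'))
    .agg(F.sum('hrs').alias('spent_hrs')))

spent.show()

This uses a single window definition and no negation or fillna bookkeeping; for the sample data it should reproduce the totals shown above (9.5 h on both days for employee 1, 10.5 h and 11.5 h for employee 2).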

