Pandas groupby with unevenly spaced times: get the last value from n minutes earlier

Problem description

I have a DataFrame like the following:

CreatedDate              |    ID     |         Target 

2018-07-03 19:10:19          id1             Available 
2018-07-03 19:10:20          id1             Available
2018-07-03 19:12:33          id1             Available 
2018-07-03 19:12:34          id1           Not Available
2018-07-03 19:15:24          id1             Available

2018-07-03 21:23:19          id2             Available
2018-07-03 21:23:20          id2           Not Available
2018-07-03 21:56:33          id2             Available
2018-07-03 22:01:34          id2           Not Available
2018-07-03 22:15:24          id2             Available
2018-07-03 22:16:24          id2             Available      
2018-07-03 22:17:23          id2             Available      
2018-07-03 22:17:24          id2             Available
2018-07-03 22:19:24          id2             Available      

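The idea is to create, within each ID group, a column holding the previous availability. The previous availability should be the "Target" value of the row closest to the current CreatedDate minus 2 minutes.

In practice, the result should look like this: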

CreatedDate              |    ID     |        Target       |  Previous Availability

2018-07-03 19:10:19          id1             Available           NaN
2018-07-03 19:10:20          id1             Available           NaN
2018-07-03 19:12:33          id1             Available        Available
2018-07-03 19:12:34          id1           Not Available      Available
2018-07-03 19:15:24          id1             Available      Not Available

2018-07-03 21:23:19          id2             Available           NaN
2018-07-03 21:23:20          id2           Not Available         NaN
2018-07-03 21:56:33          id2             Available      Not Available
2018-07-03 22:01:34          id2           Not Available      Available
2018-07-03 22:15:24          id2             Available      Not Available
2018-07-03 22:16:24          id2             Available      Not Available
2018-07-03 22:17:23          id2             Available      Not Available
2018-07-03 22:17:24          id2             Available      Not Available
2018-07-03 22:19:24          id2             Available        Available
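
To make the rule concrete, here is a minimal sketch that rebuilds the sample data and works through one row by hand. The variable names `rows`, `current`, `cutoff` and `older` are only illustrative. It assumes the rows are in chronological order, as in the table, and it uses a strict comparison because the expected output for the id2 row at 22:17:24 is "Not Available" even though an "Available" row exists exactly two minutes earlier.

```python
import pandas as pd

# Sample data copied from the question.
rows = [
    ('2018-07-03 19:10:19', 'id1', 'Available'),
    ('2018-07-03 19:10:20', 'id1', 'Available'),
    ('2018-07-03 19:12:33', 'id1', 'Available'),
    ('2018-07-03 19:12:34', 'id1', 'Not Available'),
    ('2018-07-03 19:15:24', 'id1', 'Available'),
    ('2018-07-03 21:23:19', 'id2', 'Available'),
    ('2018-07-03 21:23:20', 'id2', 'Not Available'),
    ('2018-07-03 21:56:33', 'id2', 'Available'),
    ('2018-07-03 22:01:34', 'id2', 'Not Available'),
    ('2018-07-03 22:15:24', 'id2', 'Available'),
    ('2018-07-03 22:16:24', 'id2', 'Available'),
    ('2018-07-03 22:17:23', 'id2', 'Available'),
    ('2018-07-03 22:17:24', 'id2', 'Available'),
    ('2018-07-03 22:19:24', 'id2', 'Available'),
]
df = pd.DataFrame(rows, columns=['CreatedDate', 'ID', 'Target'])
df['CreatedDate'] = pd.to_datetime(df['CreatedDate'])

# Worked example for the id2 row created at 22:17:24.
current = pd.Timestamp('2018-07-03 22:17:24')
cutoff = current - pd.Timedelta(minutes=2)  # 22:15:24

# Rows of the same ID created strictly before the cutoff.
older = df[(df['ID'] == 'id2') & (df['CreatedDate'] < cutoff)]

# The last of those rows is the one from 22:01:34.
print(older['Target'].iloc[-1])  # Not Available
```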


Tags: pandas, group-by, pandas-groupby

Solution


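You could define a custom function, although this is not efficient.

The main idea is to have each row look up the older availabilities of the same ID (more than two minutes older) and return the last one.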

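import numpy as np
import pandas as pd

def check_previous(row):
    # Rows of the same ID created more than two minutes before this row.
    # Reads the module-level df, which must be sorted by CreatedDate.
    mask = (df.ID == row.ID) & (df.CreatedDate < row.CreatedDate - pd.Timedelta(minutes=2))
    previous = df.loc[mask, 'Target']
    # No older row means there is no previous availability.
    return previous.iloc[-1] if len(previous) else np.nan

df['Previous Availability'] = df.apply(check_previous, axis=1)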
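
As a possible alternative to the row-by-row lookup, the same rule can be sketched with `pd.merge_asof`. This is a different technique from the custom function above, not the answerer's method. The helper name `previous_availability` and the temporary columns `Cutoff` and `SeenAt` are made up for this illustration, and the sketch assumes `CreatedDate` already has a datetime dtype.

```python
import pandas as pd

def previous_availability(df, minutes=2):
    """Return a copy of df, sorted by CreatedDate, with a 'Previous Availability' column."""
    # merge_asof needs both sides sorted by their time key.
    left = df.sort_values('CreatedDate', kind='stable').reset_index(drop=True)
    # Each row searches for the latest same-ID row strictly before this cutoff.
    left['Cutoff'] = left['CreatedDate'] - pd.Timedelta(minutes=minutes)

    # Lookup table: when each Target value was seen, per ID.
    right = left[['CreatedDate', 'ID', 'Target']].rename(
        columns={'CreatedDate': 'SeenAt', 'Target': 'Previous Availability'}
    )

    merged = pd.merge_asof(
        left,
        right,
        left_on='Cutoff',
        right_on='SeenAt',
        by='ID',                    # only match rows of the same ID
        direction='backward',       # take the last row before the cutoff
        allow_exact_matches=False,  # strict "<", like the custom function
    )
    return merged.drop(columns=['Cutoff', 'SeenAt'])
```

It would be called as `df = previous_availability(df)`. Note that the returned frame is sorted by `CreatedDate` and has a fresh index, so the row order may differ from the input.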

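Edit:

In fact, this code does not scale well in memory, because you have to store ever larger masks and apply them to a large DataFrame.

Computation time, on the other hand, grows almost linearly with the number of rows: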

def create_and_apply(n_rows):
    # check_previous reads the module-level df, so bind the generated frame to that name.
    global df
    # Random gaps of 0 to 299 seconds between consecutive rows keep CreatedDate sorted.
    delays = np.random.randint(300, size=n_rows)
    dates = pd.Timestamp('2018-07-03 19:10:19') + pd.to_timedelta(np.cumsum(delays), unit='s')
    ids = np.random.choice(['id1', 'id2'], size=n_rows, replace=True)
    targets = np.random.choice(['Available', 'Not Available'], size=n_rows, replace=True)
    df = pd.DataFrame({'CreatedDate': dates, 'ID': ids, 'Target': targets})
    df['Previous Availability'] = df.apply(check_previous, axis=1)

    return df


%timeit create_and_apply(10)
    12.4 ms ± 974 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit create_and_apply(100)
    178 ms ± 49.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit create_and_apply(1000)
    1.25 s ± 92.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit create_and_apply(10000)
    11.1 s ± 573 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

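One way to work around this is to process slices of the DataFrame. For example, you can split by day, provided you don't care about what happens in the period around midnight.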

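df['Previous Availability'] = pd.Series(np.nan, index=df.index, dtype=object)
# Use the calendar date: dt.day (day of month) would merge e.g. 3 July with 3 August.
df['day'] = df.CreatedDate.dt.date

def previous_in_chunk(chunk):
    # Same lookup as check_previous, but restricted to one chunk so masks stay chunk-sized.
    def lookup(row):
        older = chunk.loc[chunk.CreatedDate < row.CreatedDate - pd.Timedelta(minutes=2), 'Target']
        return older.iloc[-1] if len(older) else np.nan
    return chunk.apply(lookup, axis=1)

# groupby yields only the (ID, day) combinations that actually occur.
for _, chunk in df.groupby(['ID', 'day']):
    df.loc[chunk.index, 'Previous Availability'] = previous_in_chunk(chunk)

df.drop(columns='day', inplace=True)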

Processing smaller parts of the DataFrame at a time is easier on your RAM.
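
The caveat about midnight can be illustrated with a small sketch, assuming each day's chunk only looks back within itself. The two-row frame and the helper `last_older_target` are invented for this example, and the sample contains a single ID, so the helper does not filter by ID.

```python
import numpy as np
import pandas as pd

# Two rows of the same ID, 2.5 minutes apart, on either side of midnight.
df = pd.DataFrame({
    'CreatedDate': pd.to_datetime(['2018-07-03 23:59:00', '2018-07-04 00:01:30']),
    'ID': ['id1', 'id1'],
    'Target': ['Not Available', 'Available'],
})

def last_older_target(frame, row, minutes=2):
    # Last Target in `frame` created more than `minutes` before `row`.
    cutoff = row['CreatedDate'] - pd.Timedelta(minutes=minutes)
    older = frame.loc[frame['CreatedDate'] < cutoff, 'Target']
    return older.iloc[-1] if len(older) else np.nan

# Lookup over the whole frame: the second row sees the row from before midnight.
whole = df.apply(lambda row: last_older_target(df, row), axis=1)

# Lookup restricted to each (ID, calendar day) chunk: the second row is alone in its chunk.
per_day = pd.concat([
    chunk.apply(lambda row, chunk=chunk: last_older_target(chunk, row), axis=1)
    for _, chunk in df.groupby(['ID', df['CreatedDate'].dt.date])
]).sort_index()

print(whole.tolist())    # [nan, 'Not Available']
print(per_day.tolist())  # [nan, nan]
```

Rows created shortly after midnight lose their previous availability when the data is split by day, which is the trade-off for the smaller masks.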

