How to slice a pd.DataFrame multiple times without using a for loop?

Problem description

Question

Example

import pandas as pd
from datetime import datetime

customer_df = pd.DataFrame({'customer': ['A','B','C','A','D','E','K','A','D','F','P','J'],
                            'location': ['NY','TX','NY','UT','MA','NV','NY','TX','NY','UT','MA','NV']},
                           index = [datetime(2020,5,1,9), datetime(2020,5,1,11), datetime(2020,5,1,12),
                                    datetime(2020,5,1,18), datetime(2020,5,2,5), datetime(2020,5,2,10), 
                                    datetime(2020,5,2,19), datetime(2020,5,3,2), datetime(2020,5,3,10),
                                    datetime(2020,5,3,18), datetime(2020,5,4,20), datetime(2020,5,4,22)])

start_time_df = pd.DataFrame({'start_time':[datetime(2020,5,1,8), datetime(2020,5,2,8), datetime(2020,5,3,5)]})

end_time_df = pd.DataFrame({'end_time':[datetime(2020,5,1,17), datetime(2020,5,2,17), datetime(2020,5,3,20)]})

customer_df
>>>                  customer  location
2020-05-01 09:00:00      A        NY
2020-05-01 11:00:00      B        TX
2020-05-01 12:00:00      C        NY
2020-05-01 18:00:00      A        UT
2020-05-02 05:00:00      D        MA
2020-05-02 10:00:00      E        NV
2020-05-02 19:00:00      K        NY
2020-05-03 02:00:00      A        TX
2020-05-03 10:00:00      D        NY
2020-05-03 18:00:00      F        UT
2020-05-04 20:00:00      P        MA
2020-05-04 22:00:00      J        NV
sliced_df_list = []  # to store slices

start_time_series = start_time_df.loc[:,'start_time']
end_time_series = end_time_df.loc[:,'end_time']

for start_time, end_time in zip(start_time_series, end_time_series):
    sliced_df_list.append(customer_df.loc[start_time:end_time,:])
    
pd.concat(sliced_df_list)
>>>                  customer   location
2020-05-01 09:00:00      A         NY
2020-05-01 11:00:00      B         TX
2020-05-01 12:00:00      C         NY
2020-05-02 10:00:00      E         NV
2020-05-03 10:00:00      D         NY
2020-05-03 18:00:00      F         UT

Tags: python, pandas, dataframe

Solution


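To begin with, your solution does not handle the case where a timestamp in customer_df is not covered by any employee's working hours, so my suggestion covers that case as well.

Here is my suggestion:

(1) Put start_time and end_time in a single DataFrame named start_end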

start_end = start_time_df.copy()  # copy so that start_time_df itself is not modified
start_end['end_time'] = end_time_df['end_time']

(2) Initialize the DataFrame sliced_df_list and add a date_time column for later use:

sliced_df_list = customer_df.copy()  # copy so that customer_df itself is not modified
sliced_df_list['date_time'] = sliced_df_list.index

(3) Add a column emp listing the employees whose working hours cover each timestamp.

sliced_df_list['emp']= sliced_df_list['date_time'].apply(lambda x: start_end[(start_end['start_time']<=x) & (start_end['end_time']>=x)].index)

(4) Keep only the first employee in the list; if no employee's working hours cover the timestamp, store NaN

sliced_df_list['emp']=sliced_df_list['emp'].apply(lambda x: int(x[0]) if len(x)>0 else None)

Finally, print the result

print(sliced_df_list)
                    customer location           date_time  emp
2020-05-01 09:00:00        A       NY 2020-05-01 09:00:00  0.0
2020-05-01 11:00:00        B       TX 2020-05-01 11:00:00  0.0
2020-05-01 12:00:00        C       NY 2020-05-01 12:00:00  0.0
2020-05-01 18:00:00        A       UT 2020-05-01 18:00:00  NaN
2020-05-02 05:00:00        D       MA 2020-05-02 05:00:00  NaN
2020-05-02 10:00:00        E       NV 2020-05-02 10:00:00  1.0
2020-05-02 19:00:00        K       NY 2020-05-02 19:00:00  NaN
2020-05-03 02:00:00        A       TX 2020-05-03 02:00:00  NaN
2020-05-03 10:00:00        D       NY 2020-05-03 10:00:00  2.0
2020-05-03 18:00:00        F       UT 2020-05-03 18:00:00  2.0
2020-05-04 20:00:00        P       MA 2020-05-04 20:00:00  NaN
2020-05-04 22:00:00        J       NV 2020-05-04 22:00:00  NaN
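
Steps (3) and (4) can also be sketched with a pd.IntervalIndex instead of a row-wise apply. This is an alternative, not the method used in this answer, and it assumes the start/end intervals do not overlap, because IntervalIndex.get_indexer raises an error for overlapping intervals. The names intervals, pos and labelled are made up for illustration.

```python
import pandas as pd
from datetime import datetime

customer_df = pd.DataFrame(
    {'customer': ['A', 'B', 'C', 'A', 'D', 'E', 'K', 'A', 'D', 'F', 'P', 'J'],
     'location': ['NY', 'TX', 'NY', 'UT', 'MA', 'NV', 'NY', 'TX', 'NY', 'UT', 'MA', 'NV']},
    index=[datetime(2020, 5, 1, 9), datetime(2020, 5, 1, 11), datetime(2020, 5, 1, 12),
           datetime(2020, 5, 1, 18), datetime(2020, 5, 2, 5), datetime(2020, 5, 2, 10),
           datetime(2020, 5, 2, 19), datetime(2020, 5, 3, 2), datetime(2020, 5, 3, 10),
           datetime(2020, 5, 3, 18), datetime(2020, 5, 4, 20), datetime(2020, 5, 4, 22)])

start_end = pd.DataFrame(
    {'start_time': [datetime(2020, 5, 1, 8), datetime(2020, 5, 2, 8), datetime(2020, 5, 3, 5)],
     'end_time': [datetime(2020, 5, 1, 17), datetime(2020, 5, 2, 17), datetime(2020, 5, 3, 20)]})

# Closed on both sides to match the <= / >= comparisons used in step (3).
intervals = pd.IntervalIndex.from_arrays(
    start_end['start_time'], start_end['end_time'], closed='both')

# Position of the interval containing each timestamp, or -1 when none does.
pos = intervals.get_indexer(customer_df.index)

labelled = customer_df.copy()
# Turn the -1 "no match" marker into NaN, as in step (4).
labelled['emp'] = pd.Series(pos, index=labelled.index).where(pos >= 0)
print(labelled)
```

With intervals that may overlap, the apply-based steps above (which keep the first matching employee) remain the safer choice.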

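Now sliced_df_list stores the employee number in the emp column, so you can process this DataFrame without loops (sorting, grouping, and so on) and still account for timestamps that no employee's working hours cover.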
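
As a small illustration of loop-free processing on the emp column, the sketch below counts customers per employee and keeps the uncovered rows as their own group. The frame labelled is a hypothetical, shortened stand-in for sliced_df_list, and the dropna=False argument of groupby assumes pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of sliced_df_list: emp is NaN where nobody was on shift.
labelled = pd.DataFrame({'customer': ['A', 'B', 'C', 'A', 'E'],
                         'emp': [0.0, 0.0, 0.0, np.nan, 1.0]})

# Customers handled per employee; dropna=False keeps the NaN group visible.
counts = labelled.groupby('emp', dropna=False)['customer'].count()
print(counts)

# Sorting works the same way; uncovered rows can be placed first.
by_emp = labelled.sort_values('emp', na_position='first')
print(by_emp)
```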

Modifications and further clarification will be given in the comments below

Based on the comments

To get the same result as your code

sliced_df_list.dropna(inplace=True) # drop all rows that contain NaN (rows outside every slice)
a=sliced_df_list.drop(['emp','date_time'],axis=1)
print(a)
                    customer location
2020-05-01 09:00:00        A       NY
2020-05-01 11:00:00        B       TX
2020-05-01 12:00:00        C       NY
2020-05-02 10:00:00        E       NV
2020-05-03 10:00:00        D       NY
2020-05-03 18:00:00        F       UT
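
If only the filtered rows are needed, the same result as the loop in the question can be sketched with a single boolean mask built by NumPy broadcasting, without adding helper columns. This is an alternative sketch rather than the answer's method; the names ts, starts, ends and mask are illustrative.

```python
import pandas as pd
from datetime import datetime

customer_df = pd.DataFrame(
    {'customer': ['A', 'B', 'C', 'A', 'D', 'E', 'K', 'A', 'D', 'F', 'P', 'J'],
     'location': ['NY', 'TX', 'NY', 'UT', 'MA', 'NV', 'NY', 'TX', 'NY', 'UT', 'MA', 'NV']},
    index=[datetime(2020, 5, 1, 9), datetime(2020, 5, 1, 11), datetime(2020, 5, 1, 12),
           datetime(2020, 5, 1, 18), datetime(2020, 5, 2, 5), datetime(2020, 5, 2, 10),
           datetime(2020, 5, 2, 19), datetime(2020, 5, 3, 2), datetime(2020, 5, 3, 10),
           datetime(2020, 5, 3, 18), datetime(2020, 5, 4, 20), datetime(2020, 5, 4, 22)])
start_time_df = pd.DataFrame({'start_time': [datetime(2020, 5, 1, 8), datetime(2020, 5, 2, 8),
                                             datetime(2020, 5, 3, 5)]})
end_time_df = pd.DataFrame({'end_time': [datetime(2020, 5, 1, 17), datetime(2020, 5, 2, 17),
                                         datetime(2020, 5, 3, 20)]})

ts = customer_df.index.to_numpy()[:, None]        # shape (n_rows, 1)
starts = start_time_df['start_time'].to_numpy()   # shape (n_slices,)
ends = end_time_df['end_time'].to_numpy()         # shape (n_slices,)

# Broadcasting compares every timestamp with every interval: shape (n_rows, n_slices).
# A row is kept when it falls inside at least one interval (both ends inclusive).
mask = ((ts >= starts) & (ts <= ends)).any(axis=1)

result = customer_df[mask]
print(result)
```

The intermediate array has n_rows x n_slices entries, so memory use grows with both sizes.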

In addition, if we want a separate slice for each employee, we can use groupby

sliced_df_list=list(sliced_df_list.groupby('emp')) # slicing based on emp

print(sliced_df_list)

[(0.0,
                      customer location           date_time  emp
  2020-05-01 09:00:00        A       NY 2020-05-01 09:00:00  0.0
  2020-05-01 11:00:00        B       TX 2020-05-01 11:00:00  0.0
  2020-05-01 12:00:00        C       NY 2020-05-01 12:00:00  0.0),
 (1.0,
                      customer location           date_time  emp
  2020-05-02 10:00:00        E       NV 2020-05-02 10:00:00  1.0),
 (2.0,
                      customer location           date_time  emp
  2020-05-03 10:00:00        D       NY 2020-05-03 10:00:00  2.0
  2020-05-03 18:00:00        F       UT 2020-05-03 18:00:00  2.0)]
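
A dictionary keyed by employee number may be more convenient to look up than a list of (key, frame) tuples. The sketch below uses a hypothetical, shortened frame named labelled in place of sliced_df_list; groupby skips rows whose emp is NaN by default.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of sliced_df_list with one uncovered row (emp is NaN).
labelled = pd.DataFrame({'customer': ['A', 'B', 'C', 'A', 'E'],
                         'emp': [0.0, 0.0, 0.0, np.nan, 1.0]})

# One sub-frame per employee; the helper column is dropped from each slice.
slices = {int(emp): grp.drop(columns='emp') for emp, grp in labelled.groupby('emp')}

print(slices[0])
```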

I would be interested to know how the performance compares with the loop approach.
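
One way to measure this is the standard-library timeit module. The sketch below times the loop from the question against a broadcast boolean mask, which is an alternative technique and not the apply-based steps of this answer. The function names loop_slices and mask_slices are illustrative, and on data this small the timings say little about how either approach scales.

```python
import timeit
from datetime import datetime

import pandas as pd

customer_df = pd.DataFrame(
    {'customer': ['A', 'B', 'C', 'A', 'D', 'E', 'K', 'A', 'D', 'F', 'P', 'J'],
     'location': ['NY', 'TX', 'NY', 'UT', 'MA', 'NV', 'NY', 'TX', 'NY', 'UT', 'MA', 'NV']},
    index=[datetime(2020, 5, 1, 9), datetime(2020, 5, 1, 11), datetime(2020, 5, 1, 12),
           datetime(2020, 5, 1, 18), datetime(2020, 5, 2, 5), datetime(2020, 5, 2, 10),
           datetime(2020, 5, 2, 19), datetime(2020, 5, 3, 2), datetime(2020, 5, 3, 10),
           datetime(2020, 5, 3, 18), datetime(2020, 5, 4, 20), datetime(2020, 5, 4, 22)])
starts = pd.Series([datetime(2020, 5, 1, 8), datetime(2020, 5, 2, 8), datetime(2020, 5, 3, 5)])
ends = pd.Series([datetime(2020, 5, 1, 17), datetime(2020, 5, 2, 17), datetime(2020, 5, 3, 20)])


def loop_slices():
    # The approach from the question: one label-based slice per interval.
    return pd.concat([customer_df.loc[s:e, :] for s, e in zip(starts, ends)])


def mask_slices():
    # Loop-free alternative: compare all timestamps with all intervals at once.
    ts = customer_df.index.to_numpy()[:, None]
    mask = ((ts >= starts.to_numpy()) & (ts <= ends.to_numpy())).any(axis=1)
    return customer_df[mask]


loop_time = timeit.timeit(loop_slices, number=100)
mask_time = timeit.timeit(mask_slices, number=100)
print(f"loop: {loop_time:.4f}s, mask: {mask_time:.4f}s for 100 runs each")
```

For a meaningful comparison, rerun this with frames and interval counts close to the real workload.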

Good luck

