Alternative to pandas apply for row-wise operations on DataFrame columns involving multiple DataFrames

Problem description

I have two DataFrames with multiple columns, as shown below.

head(df)

    SCHEDULING_DC_NBR       COMMODITY_CODE          Unload_Start_Time   DOW     Dlry
0   6042.0                    SCGR                      15:15           SUN      5
1   6042.0                    SCGR                      15:30           SUN      6
2   6042.0                    SCGR                      15:45           SUN      7
3   6042.0                    SCGR                      16:15           SUN      8
4   6042.0                    SCGR                      18:30           SUN      9

head(config_df)

Node      Window               APPLICABLE_DAYS   COMMODITY_CODE  Window_start_time  
7023.0  03:15 AM to 03:16 AM            MON         SCPR                03:15   
7023.0  03:15 AM to 03:16 AM            THUR        SCPR                03:15
7023.0  03:15 AM to 03:16 AM            FRI         SCPR                03:15
6042.0  06:00 PM to 06:05 PM            SUN         SCPR                18:00   
6042.0  03:00 PM to 03:05 PM            SUN         SCGR                15:00
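
For anyone who wants to experiment, the two samples above can be rebuilt roughly as follows. This is only a sketch: it assumes the time columns are stored as 'HH:MM' strings and the node numbers as floats, which the printed output suggests but does not confirm.

```python
import pandas as pd

# First five rows of df, as printed above (dtypes are an assumption)
df = pd.DataFrame({
    'SCHEDULING_DC_NBR': [6042.0] * 5,
    'COMMODITY_CODE': ['SCGR'] * 5,
    'Unload_Start_Time': ['15:15', '15:30', '15:45', '16:15', '18:30'],
    'DOW': ['SUN'] * 5,
    'Dlry': [5, 6, 7, 8, 9],
})

# First five rows of config_df, as printed above (dtypes are an assumption)
config_df = pd.DataFrame({
    'Node': [7023.0, 7023.0, 7023.0, 6042.0, 6042.0],
    'Window': ['03:15 AM to 03:16 AM'] * 3
              + ['06:00 PM to 06:05 PM', '03:00 PM to 03:05 PM'],
    'APPLICABLE_DAYS': ['MON', 'THUR', 'FRI', 'SUN', 'SUN'],
    'COMMODITY_CODE': ['SCPR', 'SCPR', 'SCPR', 'SCPR', 'SCGR'],
    'Window_start_time': ['03:15', '03:15', '03:15', '18:00', '15:00'],
})
```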

I want to apply a row-wise operation to df that finds the appropriate window in config_df, using logic like the following together with the apply function.

import pandas as pd

def window_mapping(row):
    # row is one record of df (apply with axis=1); config_df is read from the enclosing scope
    unload_time = pd.to_datetime(row['Unload_Start_Time'])
    same_node_commodity = ((config_df['Node'] == row['SCHEDULING_DC_NBR'])
                           & (config_df['COMMODITY_CODE'] == row['COMMODITY_CODE']))

    # Start times of the windows configured for this node, commodity and day of week
    window_times = pd.to_datetime(
        config_df.loc[same_node_commodity
                      & config_df['APPLICABLE_DAYS'].str.contains(row['DOW'], case=False),
                      'Window_start_time'].values)

    if len(window_times) == 0:
        return '', ''

    if unload_time <= min(window_times):
        chosen = min(window_times)
    elif unload_time >= max(window_times):
        chosen = max(window_times)
    else:
        # Find the difference between the unload time and every window start and take
        # the smallest positive one. Timedelta.seconds is never negative, so windows
        # that start after the unload time get a large value and are not picked.
        chosen = min(window_times, key=lambda t: (unload_time - t).seconds)

    return config_df.loc[same_node_commodity
                         & (config_df['Window_start_time'] == chosen.strftime('%H:%M')),
                         'Window'].values[0], row['DOW']
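
The last branch of the function depends on a pandas detail that is easy to miss: the `.seconds` attribute of a Timedelta is never negative, so a window that starts after the unload time ends up with a very large "difference" instead of a negative one. A small illustration, using made-up timestamps rather than values taken from the data:

```python
import pandas as pd

# Hypothetical timestamps, chosen only to illustrate the behaviour
unload = pd.Timestamp('2000-01-01 16:15')
earlier = pd.Timestamp('2000-01-01 15:00')   # window that starts before the unload time
later = pd.Timestamp('2000-01-01 18:00')     # window that starts after the unload time

gap_to_earlier = (unload - earlier).seconds / 60   # 75.0 minutes
gap_to_later = (unload - later).seconds / 60       # 1335.0, not -105.0

# Taking the minimum therefore selects the closest window that has already started
closest_preceding = min([earlier, later], key=lambda t: (unload - t).seconds)
```

In other words, the branch picks the latest window that starts at or before the unload time, not the window nearest in absolute terms.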

Below is the apply call used to run the function above.

a['Processed_window'],a['Processed_DOW'] = zip(*a.apply(window_mapping,axis=1))

The final output after applying the window_mapping function looks like this:

SCHEDULING_DC_NBR   COMMODITY_CODE Unload_Start_Time   DOW   Processed_window      Processed_DOW
6042.0                   SCGR            15:15         SUN    03:00 PM to 03:05 PM    SUN
6042.0                   SCGR            15:30         SUN    03:00 PM to 03:05 PM    SUN
6042.0                   SCGR            15:45         SUN    03:00 PM to 03:05 PM    SUN
6042.0                   SCGR            16:15         SUN    03:00 PM to 03:05 PM    SUN
6042.0                   SCGR            18:30         SUN    06:00 PM to 06:05 PM    SUN

With apply, processing just 1,000 records took the time shown below, and my original DataFrame contains more than 120,000 records.

**14.9 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)**

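Is there a better way to do this? I know np.where and np.select can be used for conditional checks. However, what I am doing is more than a conditional check: the function builds a list and a dictionary and uses them in the comparison. A detailed solution or approach would really help.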

Tags: python, pandas, numpy, apply, numba

Solution
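
One possible way to avoid the per-row apply, offered as a sketch under assumptions rather than a drop-in replacement, is to express the lookup as two `pd.merge_asof` calls. The logic in the question amounts to "take the latest window that starts at or before the unload time, and if the unload time is earlier than every window, take the earliest window". The helper name `map_windows` and the temporary columns `_pos`, `_day` and `_t` are made up for this example. It assumes the times are 'HH:MM' strings and that APPLICABLE_DAYS holds a single day code equal to DOW apart from letter case (the original uses a substring match, so a value such as THUR would need normalising first).

```python
import pandas as pd

def map_windows(df, config_df):
    """Hypothetical helper: return a copy of df with Processed_window / Processed_DOW."""
    keys = ['SCHEDULING_DC_NBR', 'COMMODITY_CODE', '_day']

    left = df.copy()
    left['_pos'] = range(len(left))                  # remember the original row order
    left['_day'] = left['DOW'].str.upper()
    left['_t'] = pd.to_datetime(left['Unload_Start_Time'], format='%H:%M')
    left = left.sort_values('_t')                    # merge_asof needs sorted keys

    right = config_df.rename(columns={'Node': 'SCHEDULING_DC_NBR'}).copy()
    right['_day'] = right['APPLICABLE_DAYS'].str.upper()
    right['_t'] = pd.to_datetime(right['Window_start_time'], format='%H:%M')
    right = right[keys + ['_t', 'Window']].sort_values('_t')

    # Latest window that starts at or before the unload time
    back = pd.merge_asof(left, right, on='_t', by=keys, direction='backward')
    # Fallback for rows earlier than every window: the earliest window after them
    fwd = pd.merge_asof(left, right, on='_t', by=keys, direction='forward')

    window = back.set_index('_pos')['Window'].fillna(fwd.set_index('_pos')['Window'])
    window = window.sort_index()                     # back to the original row order

    out = df.copy()
    out['Processed_window'] = window.fillna('').to_numpy()
    # Rows without any matching window get empty strings, as in the original function
    out['Processed_DOW'] = out['DOW'].where(window.notna().to_numpy(), '')
    return out
```

It would be called as `a = map_windows(a, config_df)`. Unlike the original function, this version takes the Window value from the row that matched on day as well as on node and commodity, so results can differ where several days share the same start time. It is worth checking it against the apply version on a sample before relying on it.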

