Joining DataFrames on datetime windows

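Problem description

I want to match two DataFrames: each time in one DataFrame should be paired with the row of the other DataFrame whose time window contains it.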

Production DataFrame

Production Time      Product  Value  Worker_ID
2020-01-24 08:13:59  Prod4    5.9    402
2020-01-24 08:15:38  Prod5    5.7    402
2020-01-24 08:17:17  Prod4    5.1    402
2020-01-25 22:13:59  Prod4    5.9    402
2020-01-25 21:15:38  Prod7    5.7    402
2020-01-26 02:17:17  Prod2    5.1    402
2020-01-24 09:17:17  Prod4    5.1    403
2020-01-25 21:13:59  Prod5    5.9    403

Location DataFrame

Location  window_start         window_stop          Worker_ID
Loc16     2020-01-24 05:00:00  2020-01-24 21:00:00  402
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  402
Loc61     2020-01-24 05:00:00  2020-01-24 21:00:00  403
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  403

The result should look like this:

Location  window_start         window_stop          Worker_ID  Production Time      Product  Value
Loc16     2020-01-24 05:00:00  2020-01-24 21:00:00  402        2020-01-24 08:13:59  Prod4    5.9
Loc16     2020-01-24 05:00:00  2020-01-24 21:00:00  402        2020-01-24 08:15:38  Prod5    5.7
Loc16     2020-01-24 05:00:00  2020-01-24 21:00:00  402        2020-01-24 08:17:17  Prod4    5.1
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  402        2020-01-25 22:13:59  Prod4    5.9
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  402        2020-01-25 21:15:38  Prod7    5.7
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  402        2020-01-26 02:17:17  Prod2    5.1
Loc61     2020-01-24 05:00:00  2020-01-24 21:00:00  403        2020-01-24 09:17:17  Prod4    5.1
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  403        2020-01-25 21:13:59  Prod5    5.9

Rows are matched first on Worker_ID, and then on the location whose datetime window contains the production datetime.

This code works:

possible_matches = location_df.merge(production_df,on='Worker_ID',how='left')
build_df = possible_matches[(possible_matches['Production Time'] >= possible_matches['window_start']) & 
                            (possible_matches['Production Time'] <= possible_matches['window_stop'])]

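But it fails when the production DataFrame has millions of rows and the location DataFrame has thousands.

I am looking for a more efficient way to perform this join, one that actually works on large datasets with many more workers and locations.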
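One way to see why the merge-then-filter approach stops working at scale is to count the rows the intermediate merge creates before any filtering happens. The sketch below is an illustration and not part of the original question. `merge_row_count` is a hypothetical helper name. It counts the rows of an inner merge on the key; the `how='left'` merge used above also keeps left rows that have no match.

```python
import pandas as pd

def merge_row_count(left: pd.DataFrame, right: pd.DataFrame, key: str) -> int:
    """Number of rows that left.merge(right, on=key) (inner join) would produce."""
    left_counts = left[key].value_counts()
    right_counts = right[key].value_counts()
    # mul aligns on the key values; a key missing from one side counts as 0 there
    return int(left_counts.mul(right_counts, fill_value=0).sum())

# Only the key column matters for the count (worker IDs from the sample data)
location_df = pd.DataFrame({"Worker_ID": [402, 402, 403, 403]})
production_df = pd.DataFrame({"Worker_ID": [402] * 6 + [403] * 2})

estimated_rows = merge_row_count(location_df, production_df, "Worker_ID")
print(estimated_rows)  # 2*6 + 2*2 = 16
```

For each worker the merge produces (number of windows) x (number of production rows) rows, so millions of production rows and thousands of windows can yield an intermediate frame far larger than either input.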

Tags: pandas

Solution


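To avoid the crash, you may have to check the datetimes before merging:

I generated two DataFrames to test this, with 10,000 location records and 5,000,000 production records.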

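import numpy as np
import pandas as pd

# Hourly timestamps covering 2020. On pandas < 1.4 use closed='left' instead of inclusive='left';
# 'h' is the current spelling of the hourly alias 'H'.
dti = pd.date_range('2020-01-01', '2021-01-01', freq='h', inclusive='left')

# 10,000 location windows spread over worker IDs 100 to 499
df2 = pd.DataFrame({'Worker_ID': np.random.randint(100, 500, 10000)})
df2['window_start'] = np.random.choice(dti, len(df2))
# randint is called once here, so every window gets the same random length (4 to 16 hours)
df2['window_stop'] = df2['window_start'] + pd.DateOffset(hours=np.random.randint(4, 17))

# 5,000,000 production rows with random times inside the overall window range
df1 = pd.DataFrame({'Worker_ID': np.random.randint(100, 500, 5000000)})
df1['Production Time'] = pd.to_datetime(1e9 * np.random.randint(df2['window_start'].min().timestamp(), df2['window_stop'].max().timestamp(), len(df1)))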
>>> df1
         Worker_ID     Production Time
0              263 2020-12-31 11:28:31
1              194 2020-09-19 04:57:17
2              139 2020-06-14 00:27:07
3              105 2020-04-14 02:45:05
4              484 2020-12-07 22:36:56
...            ...                 ...
4999995        338 2020-05-29 18:30:39
4999996        455 2020-03-03 20:51:27
4999997        228 2020-12-19 01:43:12
4999998        197 2020-04-07 07:32:13
4999999        304 2020-07-06 14:51:39

[5000000 rows x 2 columns]

>>> df2
      Worker_ID        window_start         window_stop
0           309 2020-10-07 18:00:00 2020-10-08 08:00:00
1           486 2020-01-24 19:00:00 2020-01-25 09:00:00
2           120 2020-11-05 10:00:00 2020-11-06 00:00:00
3           224 2020-04-08 15:00:00 2020-04-09 05:00:00
4           208 2020-01-08 23:00:00 2020-01-09 13:00:00
...         ...                 ...                 ...
9995        218 2020-01-10 00:00:00 2020-01-10 14:00:00
9996        358 2020-10-12 03:00:00 2020-10-12 17:00:00
9997        474 2020-12-25 03:00:00 2020-12-25 17:00:00
9998        416 2020-10-26 20:00:00 2020-10-27 10:00:00
9999        443 2020-03-31 09:00:00 2020-03-31 23:00:00

[10000 rows x 3 columns]

# from tqdm import tqdm  # optional progress bar

# Convert datetimes to arrays of int64 (nanoseconds since the epoch)
ptime = df1['Production Time'].astype('int64').values
wtime = df2[['window_start', 'window_stop']].astype('int64').values

data = []
# for wid in tqdm(df2['Worker_ID'].unique()):
for wid in df2['Worker_ID'].unique():
    # Rows of each frame for this worker; index labels equal positions (default RangeIndex)
    i = df1.loc[df1['Worker_ID'] == wid]
    j = df2.loc[df2['Worker_ID'] == wid]

    # Compare every production time (rows) with every window (columns) of this worker
    p = ptime[i.index][:, None]
    m = np.where((wtime[j.index, 0] <= p) & (p <= wtime[j.index, 1]))

    # m[0] holds positions in i (production), m[1] positions in j (windows)
    df = pd.concat([j.iloc[m[1]].reset_index(drop=True),
                    i.iloc[m[0]].reset_index(drop=True)], axis='columns')

    data.append(df)
df = pd.concat(data)
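The per-worker idea above can be checked on the small sample from the question. The following is a minimal sketch under stated assumptions, not the answer's exact code: `match_windows` is a hypothetical function name, the column names are the English ones used in the question's code (`Production Time`, `window_start`, `window_stop`, `Worker_ID`), and both window bounds are treated as inclusive.

```python
import numpy as np
import pandas as pd

def match_windows(production: pd.DataFrame, location: pd.DataFrame) -> pd.DataFrame:
    """Pair production rows with the windows of the same worker that contain them."""
    pieces = []
    for wid, loc in location.groupby("Worker_ID"):
        prod = production[production["Worker_ID"] == wid]
        p = prod["Production Time"].to_numpy()[:, None]  # shape (n_prod, 1)
        start = loc["window_start"].to_numpy()           # shape (n_loc,)
        stop = loc["window_stop"].to_numpy()
        # Boolean matrix (n_prod, n_loc); nonzero gives the matching position pairs
        prod_pos, loc_pos = np.nonzero((start <= p) & (p <= stop))
        pieces.append(pd.concat(
            [loc.iloc[loc_pos].reset_index(drop=True),
             prod.iloc[prod_pos].drop(columns="Worker_ID").reset_index(drop=True)],
            axis="columns"))
    return pd.concat(pieces, ignore_index=True)

# Sample data from the question
production_df = pd.DataFrame({
    "Production Time": pd.to_datetime([
        "2020-01-24 08:13:59", "2020-01-24 08:15:38", "2020-01-24 08:17:17",
        "2020-01-25 22:13:59", "2020-01-25 21:15:38", "2020-01-26 02:17:17",
        "2020-01-24 09:17:17", "2020-01-25 21:13:59"]),
    "Product": ["Prod4", "Prod5", "Prod4", "Prod4", "Prod7", "Prod2", "Prod4", "Prod5"],
    "Value": [5.9, 5.7, 5.1, 5.9, 5.7, 5.1, 5.1, 5.9],
    "Worker_ID": [402] * 6 + [403] * 2,
})
location_df = pd.DataFrame({
    "Location": ["Loc16", "Loc27", "Loc61", "Loc27"],
    "window_start": pd.to_datetime(["2020-01-24 05:00:00", "2020-01-25 21:00:00"] * 2),
    "window_stop": pd.to_datetime(["2020-01-24 21:00:00", "2020-01-26 05:00:00"] * 2),
    "Worker_ID": [402, 402, 403, 403],
})

result = match_windows(production_df, location_df)
print(result)
```

Only one worker's comparison matrix is held in memory at a time, which is what keeps the peak memory use well below that of a single merge over all workers.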

Old answer

Create an interval index to bind each production time to its corresponding window, then merge on Worker_ID and the interval:

ii = pd.IntervalIndex.from_tuples(list(zip(dfl['window_start'], dfl['window_stop'])),
                                  closed='left')  # left means >= and <

dfp['interval'] = pd.cut(dfp['Production Time'], bins=ii)
dfl['interval'] = ii
>>> pd.merge(dfl, dfp, on=['Worker_ID', 'interval'], how='left') \
      .drop(columns='interval')

  Location        window_start         window_stop  Worker_ID     Production Time Product  Value
0    Loc16 2020-01-24 05:00:00 2020-01-24 21:00:00        402 2020-01-24 08:13:59   Prod4    5.9
1    Loc16 2020-01-24 05:00:00 2020-01-24 21:00:00        402 2020-01-24 08:15:38   Prod5    5.7
2    Loc16 2020-01-24 05:00:00 2020-01-24 21:00:00        402 2020-01-24 08:17:17   Prod4    5.1
3    Loc27 2020-01-25 21:00:00 2020-01-26 05:00:00        402 2020-01-25 22:13:59   Prod4    5.9
4    Loc27 2020-01-25 21:00:00 2020-01-26 05:00:00        402 2020-01-25 21:15:38   Prod7    5.7
5    Loc27 2020-01-25 21:00:00 2020-01-26 05:00:00        402 2020-01-26 02:17:17   Prod2    5.1
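Like the old answer, the sketch below binds each production time to at most one window. It swaps in a different technique, `pd.merge_asof`, which is not used anywhere in the original answer. It assumes that the windows of a given worker never overlap, and it treats both window bounds as inclusive. The sample data is taken from the question.

```python
import pandas as pd

production_df = pd.DataFrame({
    "Production Time": pd.to_datetime([
        "2020-01-24 08:13:59", "2020-01-24 08:15:38", "2020-01-24 08:17:17",
        "2020-01-25 22:13:59", "2020-01-25 21:15:38", "2020-01-26 02:17:17",
        "2020-01-24 09:17:17", "2020-01-25 21:13:59"]),
    "Product": ["Prod4", "Prod5", "Prod4", "Prod4", "Prod7", "Prod2", "Prod4", "Prod5"],
    "Value": [5.9, 5.7, 5.1, 5.9, 5.7, 5.1, 5.1, 5.9],
    "Worker_ID": [402] * 6 + [403] * 2,
})
location_df = pd.DataFrame({
    "Location": ["Loc16", "Loc27", "Loc61", "Loc27"],
    "window_start": pd.to_datetime(["2020-01-24 05:00:00", "2020-01-25 21:00:00"] * 2),
    "window_stop": pd.to_datetime(["2020-01-24 21:00:00", "2020-01-26 05:00:00"] * 2),
    "Worker_ID": [402, 402, 403, 403],
})

# For every production row, find the latest window of the same worker
# that starts at or before the production time (both inputs must be sorted).
matched = pd.merge_asof(
    production_df.sort_values("Production Time"),
    location_df.sort_values("window_start"),
    left_on="Production Time",
    right_on="window_start",
    by="Worker_ID",
)

# Keep only rows that also fall before the end of that window;
# rows without any candidate window have NaT here and are dropped as well.
matched = matched[matched["Production Time"] <= matched["window_stop"]]
print(matched)
```

Because `merge_asof` returns at most one window per production row, it never builds the per-worker cross product. If a worker's windows can overlap, it would silently miss matches, so the per-worker comparison is the safer choice in that case.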
