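pandas - Joining dataframes on datetime windows
Problem description
I want to match two dataframes by finding the times in one dataframe that fall within the time windows of the other.
Production dataframe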
Production Time | Product | Value | Worker_ID |
---|---|---|---|
2020-01-24 08:13:59 | Prod4 | 5.9 | 402 |
2020-01-24 08:15:38 | Prod5 | 5.7 | 402 |
2020-01-24 08:17:17 | Prod4 | 5.1 | 402 |
2020-01-25 22:13:59 | Prod4 | 5.9 | 402 |
2020-01-25 21:15:38 | Prod7 | 5.7 | 402 |
2020-01-26 02:17:17 | Prod2 | 5.1 | 402 |
2020-01-24 09:17:17 | Prod4 | 5.1 | 403 |
2020-01-25 21:13:59 | Prod5 | 5.9 | 403 |
Location dataframe
Location | window_start | window_stop | Worker_ID |
---|---|---|---|
Loc16 | 2020-01-24 05:00:00 | 2020-01-24 21:00:00 | 402 |
Loc27 | 2020-01-25 21:00:00 | 2020-01-26 05:00:00 | 402 |
Loc61 | 2020-01-24 05:00:00 | 2020-01-24 21:00:00 | 403 |
Loc27 | 2020-01-25 21:00:00 | 2020-01-26 05:00:00 | 403 |
The result should look like this:
Location | window_start | window_stop | Worker_ID | Production Time | Product | Value |
---|---|---|---|---|---|---|
Loc16 | 2020-01-24 05:00:00 | 2020-01-24 21:00:00 | 402 | 2020-01-24 08:13:59 | Prod4 | 5.9 |
Loc16 | 2020-01-24 05:00:00 | 2020-01-24 21:00:00 | 402 | 2020-01-24 08:15:38 | Prod5 | 5.7 |
Loc16 | 2020-01-24 05:00:00 | 2020-01-24 21:00:00 | 402 | 2020-01-24 08:17:17 | Prod4 | 5.1 |
Loc27 | 2020-01-25 21:00:00 | 2020-01-26 05:00:00 | 402 | 2020-01-25 22:13:59 | Prod4 | 5.9 |
Loc27 | 2020-01-25 21:00:00 | 2020-01-26 05:00:00 | 402 | 2020-01-25 21:15:38 | Prod7 | 5.7 |
Loc27 | 2020-01-25 21:00:00 | 2020-01-26 05:00:00 | 402 | 2020-01-26 02:17:17 | Prod2 | 5.1 |
Loc61 | 2020-01-24 05:00:00 | 2020-01-24 21:00:00 | 403 | 2020-01-24 09:17:17 | Prod4 | 5.1 |
Loc27 | 2020-01-25 21:00:00 | 2020-01-26 05:00:00 | 403 | 2020-01-25 21:13:59 | Prod5 | 5.9 |
The match is made first on Worker_ID, and then on the production datetime falling within the location's datetime window.
This code works:
possible_matches = location_df.merge(production_df, on='Worker_ID', how='left')
build_df = possible_matches[(possible_matches['Production Time'] >= possible_matches['window_start']) &
                            (possible_matches['Production Time'] <= possible_matches['window_stop'])]
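However, it breaks down when the production dataframe has millions of rows and the location dataframe has thousands.
I am looking for a more efficient way to perform this join, one that actually works on large datasets with many more workers and locations.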
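The merge-then-filter approach struggles at scale because the merge first materializes every pair of rows that share a Worker_ID, and only afterwards discards the pairs outside the window. A rough sketch of how the size of that intermediate result could be estimated before running the merge; the helper name `estimate_merge_rows` is hypothetical, and the count is for an inner merge (a left merge adds one more row for each location row that has no matching worker):

```python
import pandas as pd

def estimate_merge_rows(location_df, production_df, key="Worker_ID"):
    """Rows produced by an inner merge on `key` (hypothetical helper)."""
    left = location_df[key].value_counts()
    right = production_df[key].value_counts()
    # Each key contributes (left count x right count) rows; a key missing
    # on either side is filled with 0 and contributes nothing.
    return int(left.mul(right, fill_value=0).sum())

# Worker_ID columns taken from the sample data in the question
location_df = pd.DataFrame({"Worker_ID": [402, 402, 403, 403]})
production_df = pd.DataFrame({"Worker_ID": [402] * 6 + [403] * 2})

n_rows = estimate_merge_rows(location_df, production_df)
print(n_rows)
```

With many production rows and many windows per worker, this product grows far faster than either input, which is where the memory goes.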
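Solution
To avoid the crash, you may have to check the datetimes before merging, rather than merging everything first and filtering afterwards:
I tested this by generating two dataframes with 10,000 location records and 5,000,000 production records.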
import numpy as np
import pandas as pd

# Hourly timestamps for 2020 (`inclusive` needs pandas >= 1.4; older versions used `closed`)
dti = pd.date_range('2020-01-01', '2021-01-01', freq='h', inclusive='left')
df2 = pd.DataFrame({'Worker_ID': np.random.randint(100, 500, 10000)})
df2['window_start'] = np.random.choice(dti, len(df2))
# A single random duration (4 to 16 hours) is drawn and applied to every window
df2['window_stop'] = df2['window_start'] + pd.Timedelta(hours=int(np.random.randint(4, 17)))
df1 = pd.DataFrame({'Worker_ID': np.random.randint(100, 500, 5000000)})
lo, hi = int(df2['window_start'].min().timestamp()), int(df2['window_stop'].max().timestamp())
df1['Production Time'] = pd.to_datetime(np.random.randint(lo, hi, len(df1)), unit='s')
>>> df1
Worker_ID Production Time
0 263 2020-12-31 11:28:31
1 194 2020-09-19 04:57:17
2 139 2020-06-14 00:27:07
3 105 2020-04-14 02:45:05
4 484 2020-12-07 22:36:56
... ... ...
4999995 338 2020-05-29 18:30:39
4999996 455 2020-03-03 20:51:27
4999997 228 2020-12-19 01:43:12
4999998 197 2020-04-07 07:32:13
4999999 304 2020-07-06 14:51:39
[5000000 rows x 2 columns]
>>> df2
Worker_ID window_start window_stop
0 309 2020-10-07 18:00:00 2020-10-08 08:00:00
1 486 2020-01-24 19:00:00 2020-01-25 09:00:00
2 120 2020-11-05 10:00:00 2020-11-06 00:00:00
3 224 2020-04-08 15:00:00 2020-04-09 05:00:00
4 208 2020-01-08 23:00:00 2020-01-09 13:00:00
... ... ... ...
9995 218 2020-01-10 00:00:00 2020-01-10 14:00:00
9996 358 2020-10-12 03:00:00 2020-10-12 17:00:00
9997 474 2020-12-25 03:00:00 2020-12-25 17:00:00
9998 416 2020-10-26 20:00:00 2020-10-27 10:00:00
9999 443 2020-03-31 09:00:00 2020-03-31 23:00:00
[10000 rows x 3 columns]
# from tqdm import tqdm  # optional progress bar

# Convert the datetimes to int64 nanoseconds for fast NumPy comparisons
ptime = df1['Production Time'].astype('int64').values
wtime = df2[['window_start', 'window_stop']].astype('int64').values

data = []
# for wid in tqdm(df2['Worker_ID'].unique()):
for wid in df2['Worker_ID'].unique():
    i = df1.loc[df1['Worker_ID'] == wid]  # production rows for this worker
    j = df2.loc[df2['Worker_ID'] == wid]  # windows for this worker
    # Positional lookup: relies on both frames having the default RangeIndex
    p = ptime[i.index][:, None]
    w = wtime[j.index]
    # Boolean matrix (production rows x windows), True where the time is inside the window
    m = np.nonzero((w[:, 0] <= p) & (p <= w[:, 1]))
    df = pd.concat([j.iloc[m[1]].reset_index(drop=True),
                    i.iloc[m[0]].reset_index(drop=True)], axis='columns')
    data.append(df)
df = pd.concat(data)
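To check the per-worker approach against the expected result from the question, the loop can be wrapped in a function and run on the sample data. This is a minimal sketch: the name `window_join` is hypothetical, it compares datetime64 values directly instead of converting to int64, and it drops the duplicate Worker_ID column from the production side.

```python
import numpy as np
import pandas as pd

def window_join(location_df, production_df):
    """Per-worker window join (hypothetical helper, not part of pandas)."""
    loc = location_df.reset_index(drop=True)
    prod = production_df.reset_index(drop=True)
    ptime = prod["Production Time"].to_numpy()
    start = loc["window_start"].to_numpy()
    stop = loc["window_stop"].to_numpy()
    prod_ids = prod["Worker_ID"].to_numpy()
    loc_ids = loc["Worker_ID"].to_numpy()
    parts = []
    for wid in np.unique(loc_ids):
        pi = np.flatnonzero(prod_ids == wid)  # positions of this worker's production rows
        wi = np.flatnonzero(loc_ids == wid)   # positions of this worker's windows
        p = ptime[pi][:, None]                # column vector for broadcasting
        # Boolean matrix (production rows x windows), inclusive on both ends
        rows, cols = np.nonzero((start[wi] <= p) & (p <= stop[wi]))
        parts.append(pd.concat(
            [loc.iloc[wi[cols]].reset_index(drop=True),
             prod.iloc[pi[rows]].drop(columns="Worker_ID").reset_index(drop=True)],
            axis="columns"))
    return pd.concat(parts, ignore_index=True)

# Sample data from the question
production_df = pd.DataFrame({
    "Production Time": pd.to_datetime([
        "2020-01-24 08:13:59", "2020-01-24 08:15:38", "2020-01-24 08:17:17",
        "2020-01-25 22:13:59", "2020-01-25 21:15:38", "2020-01-26 02:17:17",
        "2020-01-24 09:17:17", "2020-01-25 21:13:59"]),
    "Product": ["Prod4", "Prod5", "Prod4", "Prod4", "Prod7", "Prod2", "Prod4", "Prod5"],
    "Value": [5.9, 5.7, 5.1, 5.9, 5.7, 5.1, 5.1, 5.9],
    "Worker_ID": [402, 402, 402, 402, 402, 402, 403, 403],
})
location_df = pd.DataFrame({
    "Location": ["Loc16", "Loc27", "Loc61", "Loc27"],
    "window_start": pd.to_datetime(["2020-01-24 05:00:00", "2020-01-25 21:00:00",
                                    "2020-01-24 05:00:00", "2020-01-25 21:00:00"]),
    "window_stop": pd.to_datetime(["2020-01-24 21:00:00", "2020-01-26 05:00:00",
                                   "2020-01-24 21:00:00", "2020-01-26 05:00:00"]),
    "Worker_ID": [402, 402, 403, 403],
})

result = window_join(location_df, production_df)
print(result)
```

Only one worker's comparison matrix is held in memory at a time, so peak memory depends on the largest single worker rather than on the full cross product.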
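Old answer
Create an interval index to bind each production time to its corresponding window, then merge on Worker_ID and the interval. Note that pd.cut only accepts a non-overlapping IntervalIndex, so this applies only when the windows do not overlap (the output below covers worker 402 only):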
# dfl is the location dataframe, dfp is the production dataframe
ii = pd.IntervalIndex.from_tuples(list(zip(dfl['window_start'], dfl['window_stop'])),
                                  closed='left')  # left means >= and <
dfp['interval'] = pd.cut(dfp['Production Time'], bins=ii)
dfl['interval'] = ii
>>> pd.merge(dfl, dfp, on=['Worker_ID', 'interval'], how='left') \
...     .drop(columns='interval')
Location window_start window_stop Worker_ID Production Time Product Value
0 Loc16 2020-01-24 05:00:00 2020-01-24 21:00:00 402 2020-01-24 08:13:59 Prod4 5.9
1 Loc16 2020-01-24 05:00:00 2020-01-24 21:00:00 402 2020-01-24 08:15:38 Prod5 5.7
2 Loc16 2020-01-24 05:00:00 2020-01-24 21:00:00 402 2020-01-24 08:17:17 Prod4 5.1
3 Loc27 2020-01-25 21:00:00 2020-01-26 05:00:00 402 2020-01-25 22:13:59 Prod4 5.9
4 Loc27 2020-01-25 21:00:00 2020-01-26 05:00:00 402 2020-01-25 21:15:38 Prod7 5.7
5 Loc27 2020-01-25 21:00:00 2020-01-26 05:00:00 402 2020-01-26 02:17:17 Prod2 5.1
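Another option, not part of the original answer, is `pd.merge_asof` with `by='Worker_ID'`, which avoids both the cross product and the Python loop. The sketch below assumes that the windows of any single worker never overlap each other (different workers may share the same window); the function name `window_join_asof` is hypothetical.

```python
import pandas as pd

def window_join_asof(location_df, production_df):
    """Window join via merge_asof; assumes a worker's windows never overlap."""
    # merge_asof requires both frames to be sorted on their time keys
    prod = production_df.sort_values("Production Time")
    loc = location_df.sort_values("window_start")
    # For each production row, pick the latest window of the same worker
    # that starts at or before the production time.
    out = pd.merge_asof(prod, loc, left_on="Production Time",
                        right_on="window_start", by="Worker_ID",
                        direction="backward")
    # Keep rows that also fall before the window closes; rows with no
    # candidate window have NaT in window_stop and are dropped here too.
    return out[out["Production Time"] <= out["window_stop"]].reset_index(drop=True)

# Sample data from the question
production_df = pd.DataFrame({
    "Production Time": pd.to_datetime([
        "2020-01-24 08:13:59", "2020-01-24 08:15:38", "2020-01-24 08:17:17",
        "2020-01-25 22:13:59", "2020-01-25 21:15:38", "2020-01-26 02:17:17",
        "2020-01-24 09:17:17", "2020-01-25 21:13:59"]),
    "Product": ["Prod4", "Prod5", "Prod4", "Prod4", "Prod7", "Prod2", "Prod4", "Prod5"],
    "Value": [5.9, 5.7, 5.1, 5.9, 5.7, 5.1, 5.1, 5.9],
    "Worker_ID": [402, 402, 402, 402, 402, 402, 403, 403],
})
location_df = pd.DataFrame({
    "Location": ["Loc16", "Loc27", "Loc61", "Loc27"],
    "window_start": pd.to_datetime(["2020-01-24 05:00:00", "2020-01-25 21:00:00",
                                    "2020-01-24 05:00:00", "2020-01-25 21:00:00"]),
    "window_stop": pd.to_datetime(["2020-01-24 21:00:00", "2020-01-26 05:00:00",
                                   "2020-01-24 21:00:00", "2020-01-26 05:00:00"]),
    "Worker_ID": [402, 402, 403, 403],
})

result = window_join_asof(location_df, production_df)
print(result)
```

The result is ordered by production time rather than by location. If one worker can have overlapping windows, a production row may belong to several of them and this approach would report only one, so the per-worker comparison is the safer choice in that case.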