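sql - How to join two dataframes with overlapping time windows and matching IDs in pandas
Problem description
Suppose I have the following dataframes that track the start and end times of tests: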
import pandas as pd
from datetime import datetime
dfA = pd.DataFrame({'test_id': [1,2],
    'start_time': [datetime.strptime("2019-06-01 04:00:00", "%Y-%m-%d %H:%M:%S")
        , datetime.strptime("2019-06-03 13:12:00", "%Y-%m-%d %H:%M:%S")],
    'end_time': [datetime.strptime("2019-06-01 06:00:00", "%Y-%m-%d %H:%M:%S")
        , datetime.strptime("2019-06-03 15:29:00", "%Y-%m-%d %H:%M:%S")]})
dfB = pd.DataFrame({'test_id': [1,3],
    'start_time': [datetime.strptime("2019-06-01 02:00:00", "%Y-%m-%d %H:%M:%S")
        , datetime.strptime("2019-06-01 00:00:00", "%Y-%m-%d %H:%M:%S")],
    'end_time': [datetime.strptime("2019-06-01 05:00:00", "%Y-%m-%d %H:%M:%S")
        , datetime.strptime("2019-06-01 02:00:00", "%Y-%m-%d %H:%M:%S")]})
I want to perform the pandas equivalent of this SQL:
select * from A
inner join B
on (A.start_time between B.start_time AND B.end_time
OR A.end_time between B.start_time AND B.end_time
OR B.start_time between A.start_time AND A.end_time
OR B.end_time between A.start_time AND A.end_time)
AND A.test_id = B.test_id
From this post, I understand that pandas does not support this type of join and that I will have to use numpy.where, like this:
import numpy as np

# get the start and end times for both dataframes
Astart_time = dfA.start_time.values # a
Aend_time = dfA.end_time.values # b
Bstart_time = dfB.start_time.values # c
Bend_time = dfB.end_time.values # d
# We need to JOIN both pandas dataframes where there are overlapping
# timeframes. We check for these overlaps:
# (c <= a < d) OR (c <= b < d) OR (a <= c < b) OR (a <= d < b)
# sql equivalent of an INNER JOIN ON BETWEEN a range of values
A_records, B_records = np.where(
    ((Astart_time[:, None] >= Bstart_time) & (Astart_time[:, None] < Bend_time))
    | ((Aend_time[:, None] >= Bstart_time) & (Aend_time[:, None] < Bend_time))
    | ((Astart_time[:, None] <= Bstart_time) & (Aend_time[:, None] > Bstart_time))
    | ((Astart_time[:, None] <= Bend_time) & (Aend_time[:, None] > Bend_time)))
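But I can't figure out how to add the A.test_id == B.test_id condition to the numpy where clause. I only want the records with test_id == 1 from dataframes A and B to be joined. The reason I want to add this extra condition inside the np.where clause is that my dataframes each contain several million records, and I don't want the join to blow up my machine's memory.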
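For reference, one way the ID condition could be folded into the NumPy approach is to broadcast the test_id arrays the same way as the timestamps and AND the result with the overlap mask. The sketch below rebuilds the question's dfA and dfB, and it assumes a simplified closed-interval overlap test (two intervals overlap when each starts no later than the other ends) in place of the four-way check above; the names same_id, overlaps and joined are illustrative only. It still materializes a full len(dfA) x len(dfB) boolean matrix, so on its own it does not address the memory concern for millions of rows.

```python
import numpy as np
import pandas as pd

dfA = pd.DataFrame({
    "test_id": [1, 2],
    "start_time": pd.to_datetime(["2019-06-01 04:00:00", "2019-06-03 13:12:00"]),
    "end_time": pd.to_datetime(["2019-06-01 06:00:00", "2019-06-03 15:29:00"]),
})
dfB = pd.DataFrame({
    "test_id": [1, 3],
    "start_time": pd.to_datetime(["2019-06-01 02:00:00", "2019-06-01 00:00:00"]),
    "end_time": pd.to_datetime(["2019-06-01 05:00:00", "2019-06-01 02:00:00"]),
})

# Column vectors for A (shape (n, 1)) broadcast against row vectors for B (shape (m,)).
a_id = dfA["test_id"].to_numpy()[:, None]
a_start = dfA["start_time"].to_numpy()[:, None]
a_end = dfA["end_time"].to_numpy()[:, None]
b_id = dfB["test_id"].to_numpy()
b_start = dfB["start_time"].to_numpy()
b_end = dfB["end_time"].to_numpy()

# Assumption: closed intervals with start_time <= end_time in every row.
same_id = a_id == b_id
overlaps = (a_start <= b_end) & (b_start <= a_end)

# Row positions in dfA and dfB for every pair satisfying both conditions.
a_rows, b_rows = np.where(same_id & overlaps)

# Stitch the matching rows side by side, suffixing columns by origin.
joined = pd.concat(
    [
        dfA.iloc[a_rows].reset_index(drop=True).add_suffix("_a"),
        dfB.iloc[b_rows].reset_index(drop=True).add_suffix("_b"),
    ],
    axis=1,
)
```

With the sample data, only the pair with test_id == 1 survives both masks.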
Solution
One option is query, applied after merging on test_id:
(dfA.merge(dfB, on='test_id', suffixes=['_a', '_b'])
.query('start_time_b <= start_time_a <= end_time_b | ' +
'start_time_b <= end_time_a <= end_time_b | ' +
'start_time_a <= start_time_b <= end_time_a | ' +
'start_time_a <= end_time_b <= end_time_a'
)
)
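The same idea can be tried end to end with the sample data from the question. The sketch below is a self-contained variant, not the answerer's exact code: it assumes every row has start_time <= end_time, in which case the four-way check should reduce to the two-comparison overlap test used here. Merging on test_id first means only rows with equal IDs are ever paired, which keeps the intermediate frame far smaller than a full cross join, although it can still grow large if a single ID has many rows on both sides.

```python
import pandas as pd

dfA = pd.DataFrame({
    "test_id": [1, 2],
    "start_time": pd.to_datetime(["2019-06-01 04:00:00", "2019-06-03 13:12:00"]),
    "end_time": pd.to_datetime(["2019-06-01 06:00:00", "2019-06-03 15:29:00"]),
})
dfB = pd.DataFrame({
    "test_id": [1, 3],
    "start_time": pd.to_datetime(["2019-06-01 02:00:00", "2019-06-01 00:00:00"]),
    "end_time": pd.to_datetime(["2019-06-01 05:00:00", "2019-06-01 02:00:00"]),
})

# Inner join on the ID only; overlapping column names get _a / _b suffixes.
merged = dfA.merge(dfB, on="test_id", suffixes=("_a", "_b"))

# Assumption: intervals are closed and well-formed (start <= end),
# so two intervals overlap iff each one starts before the other ends.
result = merged.query(
    "start_time_a <= end_time_b and start_time_b <= end_time_a"
)

print(result)
```

For the sample frames this keeps the single test_id == 1 pair, since 04:00 to 06:00 overlaps 02:00 to 05:00.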