首页 > 解决方案 > 如何在熊猫中加入具有重叠时间窗口和匹配 ID 的两个数据框

问题描述

假设我有以下数据框来跟踪测试时间的开始和结束时间:

import pandas as pd
from datetime import datetime
dfA = pd.DataFframe({'test_id': [1,2], 
                    'start_time': [datetime.strptime("2019-06-01 04:00:00", "%Y-%m-%d %H:%M:%S")
                                 , datetime.strptime("2019-06-03 13:12:00", "%Y-%m-%d %H:%M:%S")],
                     'end_time': [datetime.strptime("2019-06-01 06:00:00", "%Y-%m-%d %H:%M:%S")
                                 , datetime.strptime("2019-06-03 15:29:00", "%Y-%m-%d %H:%M:%S")]})
dfB = pd.DataFframe({'test_id': [1,3], 
                    'start_time': [datetime.strptime("2019-06-01 02:00:00", "%Y-%m-%d %H:%M:%S")
                                 , datetime.strptime("2019-06-01 00:00:00", "%Y-%m-%d %H:%M:%S")],
                     'end_time': [datetime.strptime("2019-06-01 05:00:00", "%Y-%m-%d %H:%M:%S")
                                 , datetime.strptime("2019-06-01 02:00:00", "%Y-%m-%d %H:%M:%S")]})

我想执行等效的 SQL

select * from A
inner join B
on (A.start_time between B.start_time AND B.start_time
   OR A.end_time between B.start_time AND B.start_time
   OR B.start_time between A.start_time AND A.start_time
   OR B.end_time between A.start_time AND A.start_time)
  AND A.id = B.id

在熊猫。从这篇文章中,我了解到 pandas 不支持这种类型的连接,我将不得不numpy.where像这样使用:

# get the start and end times for both dataframes
Astart_time = dfA.start_time.values # a
Aend_time = dfA.end_time.values # b

Bstart_time = dfB.start_time.values # c
Bend_time = dfB.end_time.values # d

# We need to JOIN both pandas dataframe where there are overlapping
# timeframes. We check for these overlaps:
# (c <= a < d) OR (c <= b < d) OR (a <= c < b) OR (a <= d < b)
# sql equivalent of a INNER JOIN ON BETWEEN a range of values
A_records, B_records = np.where(((Astart_time[:, None] >= Bstart_time) & (Astart_time[:, None] < Bend_time))\
                              | ((Aend_time[:, None] >= Bstart_time) & (Aend_time[:, None] < Bend_time))\
                              | ((Astart_time[:, None] <= Bstart_time) & (Astart_time[:, None] > Bend_time))\
                              | ((Astart_time[:, None] <= Bend_time) & (Aend_time[:, None] > Bend_time)))

但是我无法弄清楚如何A.test_id == B.test_id在 numpy where 子句中添加条件。我只希望 test_id == 1 的记录从数据帧 A 和 B 中加入。我想在np.where子句中添加这个额外条件的原因是因为我的数据帧每个都包含几百万条记录,我不希望它们爆炸执行连接时增加我机器的内存。

标签: sqlpython-3.xpandasnumpyjoin

解决方案


有一个query

(dfA.merge(dfB, on='test_id', suffixes=['_a', '_b'])
   .query('start_time_b <= start_time_a <= end_time_b | ' +
          'start_time_b <= end_time_a <= end_time_b   | ' +
          'start_time_a <= start_time_b <= end_time_a | ' +
          'start_time_a <= end_time_b <= end_time_a'
   )
)

推荐阅读