Pandas: DatetimeIndex and IntervalIndex intersection

Problem description

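I have very long time series in which I need to set values to np.nan within an interval around certain events. measures is a datetime-indexed DataFrame, and events is a separate DatetimeIndex with unique values.

measures looks like this: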

| index               | measure  |
|---------------------|----------|
| 1970-01-01 00:00:15 | 0.471331 |
| 1970-01-01 00:02:37 | 0.069177 |
| 1970-01-01 00:03:59 | 0.955357 |
| 1970-01-01 00:06:17 | 0.107815 |
| 1970-01-01 00:06:24 | 0.046558 |
| 1970-01-01 00:06:25 | 0.056558 |
| 1970-01-01 00:08:12 | 0.837405 |

For example, if there is a single event at timestamp 1970-01-01 00:06:21 and the interval for removing values is +/- 5 seconds, the output would be:

| index               | measure  |
|---------------------|----------|
| 1970-01-01 00:00:15 | 0.471331 |
| 1970-01-01 00:02:37 | 0.069177 |
| 1970-01-01 00:03:59 | 0.955357 |
| 1970-01-01 00:06:17 | np.nan   |
| 1970-01-01 00:06:24 | np.nan   |
| 1970-01-01 00:06:25 | np.nan   |
| 1970-01-01 00:08:12 | 0.837405 |

Currently I am iterating over the events using .loc:

for i in range(events.shape[0]):
    measures.loc[events[i] - pd.Timedelta("4min"):\
                 events[i] + pd.Timedelta("1min") \
        ] = np.nan

This works, but both objects are large (events: 10k rows, measures: 1.5m rows). For the same reason I cannot construct a boolean index like this:

measure_index = measures.index.to_numpy()
left_bounds = (events - pd.Timedelta("4min")).to_numpy()
right_bounds = (events + pd.Timedelta("1min")).to_numpy()
# The following product wouldn't fit in memory even with boolean dtype.
left_bool_array = measure_index >= left_bounds.reshape((-1, 1))
right_bool_array = measure_index <= right_bounds.reshape((-1, 1))
# Both arrays have shape (n_events, n_measures), so no transpose is needed.
mask = np.any(left_bool_array & right_bool_array, axis=0)
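
To make the memory claim concrete, here is a rough back-of-the-envelope calculation using the sizes quoted in this question (10,000 events and 1,525,229 measures). It assumes NumPy stores one byte per boolean element, which is the case for its bool dtype.

```python
# Rough size of one (n_events, n_measures) boolean array.
n_events = 10_000
n_measures = 1_525_229
bytes_per_bool = 1  # NumPy's bool dtype uses one byte per element

n_bytes = n_events * n_measures * bytes_per_bool
print(f"{n_bytes / 2**30:.1f} GiB per boolean array")
```

That is roughly 14 GiB for each of the two intermediate arrays, before they are even combined.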

Left-joining the events onto the measures, or reindexing the events, is not an option either, because both take too long.

Then I came across pd.IntervalIndex:

left_bound = events - pd.Timedelta("4min")
right_bound = events + pd.Timedelta("1min")
interval_index=pd.IntervalIndex.from_arrays(left_bound,right_bound)

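IntervalIndex has a .contains() method that takes a scalar and returns "a boolean mask whether the value is contained in the Intervals". For my use case, however, I would have to loop over the measures frame and reduce the boolean array for every row. I am looking for a method like this:

pandas.IntervalIndex.intersect(input: array_like) -> boolean_array (same shape as the input)

Each element of the output indicates whether the corresponding input value lies within any of the intervals.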
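
To pin down the semantics being asked for, here is a minimal sketch of the per-row loop over .contains() described above. The helper name in_any_interval_naive is hypothetical (it is not a pandas API), and the example uses the +/- 5 second window and the sample data from this question. It is meant to illustrate the desired output, not to be fast.

```python
import numpy as np
import pandas as pd


def in_any_interval_naive(interval_index, timestamps):
    """Hypothetical helper: True where a timestamp lies in at least one interval."""
    # One .contains() call per timestamp, so this is O(len(timestamps) * len(intervals)).
    return np.array([interval_index.contains(ts).any() for ts in timestamps])


events = pd.DatetimeIndex(["1970-01-01 00:06:21"])
delta = pd.Timedelta("5s")
interval_index = pd.IntervalIndex.from_arrays(
    events - delta, events + delta, closed="both"
)

timestamps = pd.to_datetime([
    "1970-01-01 00:00:15", "1970-01-01 00:02:37", "1970-01-01 00:03:59",
    "1970-01-01 00:06:17", "1970-01-01 00:06:24", "1970-01-01 00:06:25",
    "1970-01-01 00:08:12",
])
mask = in_any_interval_naive(interval_index, timestamps)
```

Here mask is True only for the three timestamps within 5 seconds of the event.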

Similar but different questions:

Edit: performance of the options discussed in the answers below:

len(events) = 10000, len(measures) = 1525229

from bisect import bisect  # assumed import, needed by option 3
import staircase as sc     # assumed: sc refers to the staircase package

# Option 1: .loc slicing per event
for _ in range(10):
    left_bound = dilution_copy.index - pd.Timedelta("4min")
    right_bound = dilution_copy.index + pd.Timedelta("1min")

    for left, right in zip(left_bound, right_bound):
        measure_copy.loc[left:right] = np.nan

# Option 2: staircase
for _ in range(10):
    sf = sc.Stairs(start=measure_copy.index, end = measure_copy.index[1:], value=measure_copy.values)
    mask = sc.Stairs(start=dilution_copy.index-pd.Timedelta('4 min'), end=dilution_copy.index+pd.Timedelta('1 min'))
    masked = sf.mask(mask)
    result = masked.sample(measure_copy.index, include_index=True)

# Option 3: bisect + .iloc slicing per event
for _ in range(10):
    left_bound = dilution_copy.index - pd.Timedelta("4min")
    right_bound = dilution_copy.index + pd.Timedelta("1min")

    for left, right in zip(left_bound, right_bound):
        measure_copy.iloc[bisect(measure_copy.index, left):bisect(measure_copy.index, right)] = np.nan

Tags: python, pandas, performance, numpy, datetimeindex

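Solution

If your measures data is already sorted (or sorting it once is not too expensive), you could consider using bisect.

Here is a rough but more complete solution: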

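  • For each element of events, find where it could be "inserted" into measures
  • Check whether the timestamps on either side of this "insertion point" are within 5 seconds of the event
  • If so, set them to nan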

import bisect

def bisect_loop(window=pd.Timedelta(seconds=5)):
    index = measures.index  # must be sorted
    n = len(index)
    for event in events:
        bisect_point = bisect.bisect(index, event)
        # Walk left from the insertion point while timestamps are within the window.
        lower = bisect_point - 1
        while lower >= 0 and event - index[lower] < window:
            measures.iloc[lower] = np.nan
            lower -= 1
        # Walk right from the insertion point while timestamps are within the window.
        # Comparing full Timedeltas avoids the wrap-around of .seconds on negative differences.
        higher = bisect_point
        while higher < n and index[higher] - event < window:
            measures.iloc[higher] = np.nan
            higher += 1
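
The same sorted-lookup idea can also be vectorized with np.searchsorted, which would give the "is this timestamp in any interval" mask the question asks for without a Python loop. This is a sketch of a swapped-in technique, not the method above: the helper name in_any_interval is hypothetical, intervals are treated as closed on both ends, and the timestamps are assumed to be timezone-naive. A timestamp is inside at least one interval exactly when more intervals have started at or before it than have ended strictly before it.

```python
import numpy as np
import pandas as pd


def in_any_interval(timestamps, left_bounds, right_bounds):
    """Hypothetical helper: True where a timestamp lies in any closed interval [left, right]."""
    ts = np.asarray(timestamps, dtype="datetime64[ns]")
    # The two bound arrays are sorted independently; only counts are needed.
    left = np.sort(np.asarray(left_bounds, dtype="datetime64[ns]"))
    right = np.sort(np.asarray(right_bounds, dtype="datetime64[ns]"))
    started = np.searchsorted(left, ts, side="right")  # intervals with left <= t
    ended = np.searchsorted(right, ts, side="left")    # intervals with right < t
    return started > ended


# Usage on the sample data from the question, with a +/- 5 second window.
measures = pd.DataFrame(
    {"measure": [0.471331, 0.069177, 0.955357, 0.107815, 0.046558, 0.056558, 0.837405]},
    index=pd.to_datetime([
        "1970-01-01 00:00:15", "1970-01-01 00:02:37", "1970-01-01 00:03:59",
        "1970-01-01 00:06:17", "1970-01-01 00:06:24", "1970-01-01 00:06:25",
        "1970-01-01 00:08:12",
    ]),
)
events = pd.DatetimeIndex(["1970-01-01 00:06:21"])
delta = pd.Timedelta("5s")

mask = in_any_interval(measures.index, events - delta, events + delta)
measures.loc[mask, "measure"] = np.nan
```

This costs two binary searches per timestamp and allocates only arrays of length len(measures), so it should avoid the (n_events, n_measures) product entirely. Overlapping intervals need no special handling, since only the two counts are compared.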

Here are some timings on a dummy dataset with 150 measures and 10 events:

df = pd.DataFrame({'year': [2000]*150, 'month': [2]*150, 'day': [12]*150, 'hour': np.random.choice(range(1), 150), 'minute': np.random.choice(range(60), 150), 'second': np.random.choice(range(60), 150)})

timestamps = pd.to_datetime(df)
measures = pd.concat([timestamps, pd.Series(np.random.rand(150))], axis=1)
measures = measures.set_index(0)
measures = measures.sort_index()

df = pd.DataFrame({'year': [2000]*150, 'month': [2]*150, 'day': [12]*150, 'hour': np.random.choice(range(24), 150), 'minute': np.random.choice(range(60), 150), 'second': np.random.choice(range(60), 150)})
events = pd.to_datetime(df).sample(10).reset_index(drop=True)

%timeit op_loop() # This is your loc based approach that is working
8.74 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%timeit bisect_loop()
3.22 ms ± 45.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
