Collapsing multiple timestamp rows into one

Problem description

I have data like this (a single-column DataFrame):

import pandas as pd

s = pd.DataFrame({'ts': [1, 2, 3, 6, 7, 11, 12, 13]})
s

    ts
0   1
1   2
2   3
3   6
4   7
5   11
6   12
7   13

I want to collapse consecutive rows whose difference is less than MAX_DIFF (2). That means the desired output must be:

[{'ts_from': 1, 'ts_to': 3},
 {'ts_from': 6, 'ts_to': 7},
 {'ts_from': 11, 'ts_to': 13}]

I wrote some code:

MAX_DIFF = 2

# A row closes a group when the gap to the next row reaches MAX_DIFF
s['close'] = s['ts'].diff().shift(-1) >= MAX_DIFF
# The last row always closes a group
s.loc[s.index[-1], 'close'] = True

parts = []
ts_from = None

for _, row in s.iterrows():
    # Open a new group if none is open (compare to None so that ts == 0 works)
    if ts_from is None:
        ts_from = row['ts']

    if row['close']:
        parts.append({'ts_from': ts_from, 'ts_to': row['ts']})
        ts_from = None

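This works, but it does not seem optimal because of iterrows(). I thought about using ranks, but could not figure out how to implement them so that I could then group by rank.

Is there a way to optimize the algorithm?

Tags: python, pandas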

Solution


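You can create the groups by checking where the difference between consecutive rows is greater than or equal to the threshold and taking the cumulative sum of those flags. Then aggregate with whatever you need; 'first' and 'last' in this case.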

gp = s['ts'].diff().abs().ge(2).cumsum().rename(None)
res = s.groupby(gp).agg(ts_from=('ts', 'first'),
                        ts_to=('ts', 'last'))
#   ts_from  ts_to
#0        1      3
#1        6      7
#2       11     13
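
To see why the cumulative sum produces group labels, it may help to look at each intermediate step separately. The following is a minimal sketch on the same data; the variable names gaps, is_break, and labels are introduced here for illustration and do not appear in the answer.

```python
import pandas as pd

s = pd.DataFrame({'ts': [1, 2, 3, 6, 7, 11, 12, 13]})

# Distance from each row to the previous one (NaN for the first row)
gaps = s['ts'].diff().abs()

# True wherever a new group starts; the NaN in the first row compares as False
is_break = gaps.ge(2)

# Each True increments the running total, so rows share a label until the next break
labels = is_break.cumsum()

print(labels.tolist())
# [0, 0, 0, 1, 1, 2, 2, 2]
```

Because the first row never counts as a break, the first group is always labelled 0.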

If you want a list of dicts, then:

res.to_dict('records')
#[{'ts_from': 1, 'ts_to': 3},
# {'ts_from': 6, 'ts_to': 7},
# {'ts_from': 11, 'ts_to': 13}]
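
If this is needed in more than one place, the steps can be wrapped in a small helper. This is only a sketch: the function name collapse_runs and its max_diff parameter are hypothetical, and it assumes a non-empty collection of numeric timestamps. It sorts the input first, which the answer does not do, so the absolute value is no longer needed.

```python
import pandas as pd

def collapse_runs(values, max_diff=2):
    """Collapse values whose consecutive gap is below max_diff into ranges.

    Hypothetical helper; assumes a non-empty iterable of numbers.
    """
    # Sort so that gaps are measured between neighbouring timestamps
    df = pd.DataFrame({'ts': sorted(values)})
    # A gap of max_diff or more starts a new group
    group = df['ts'].diff().ge(max_diff).cumsum().rename(None)
    res = df.groupby(group).agg(ts_from=('ts', 'first'),
                                ts_to=('ts', 'last'))
    return res.to_dict('records')

parts = collapse_runs([1, 2, 3, 6, 7, 11, 12, 13])
print(parts)
```

Raising max_diff merges more rows: with max_diff=4, the gap of 3 between 3 and 6 no longer splits the data, so only two ranges remain.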

For completeness, here is how the grouper aligns with the DataFrame:

s['gp'] = gp
print(s)

   ts  gp
0   1   0     # `1` becomes ts_from for group 0
1   2   0
2   3   0     # `3` becomes ts_to for group 0
3   6   1     # `6` becomes ts_from for group 1
4   7   1     # `7` becomes ts_to for group 1
5  11   2     # `11` becomes ts_from for group 2
6  12   2
7  13   2     # `13` becomes ts_to for group 2
