Python - Comparing timestamps across 2 columns and 2 rows

Problem description

I have a DataFrame with several engines, a start and end datetime for each row, and an info code. (A sample is shown here.)

   engine               start                 end  duration info  energy
20      a 2020-04-16 09:40:00 2020-04-17 00:00:00       860    1    1982
21      a 2020-04-17 00:01:00 2020-04-18 00:00:00      1439    1    3254
22      a 2020-04-18 00:01:00 2020-04-19 00:00:00      1439    1    3258
23      a 2020-04-19 00:01:00 2020-04-20 00:00:00      1439    1    3353
24      a 2020-04-20 00:01:00 2020-04-21 00:00:00      1439    1    3253
25      a 2020-04-21 00:01:00 2020-04-22 00:00:00      1439    1    3053
26      a 2020-04-22 00:01:00 2020-04-22 06:27:00       386    1     820
27      a 2020-04-23 15:26:00 2020-04-24 00:00:00       514    1    1836
28      a 2020-04-24 00:01:00 2020-04-25 00:00:00      1439    1    4797
29      a 2020-04-25 00:01:00 2020-04-26 00:00:00      1439    1    4265

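I want to group my df so that I can see, for each engine, where a break in the info code or in the dates occurs.

I would like to compare consecutive rows like this: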

if dt.loc[i+1, 'start'] - dt.loc[i, 'end'] <= day:
    dt['diff'] = False
else:
    dt['diff'] = True

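Then I could take the cumsum of df['diff'].

For example, here I would like to get

the sum of energy and the info code for the engine from 2020-04-16 9:40 to 2020-04-22 6:27, and the same for the engine from 2020-04-23 15:26 to 2020-04-26 0:00.

I tried this, but it doesn't work: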

df = df.sort_values(['engine','start','code'])
dt = df[['start','end']]
diff = df['codea']
day = pd.Timedelta('1M')

breaks = dt.start.diff().apply(lambda x: False if x <= day else True)
breaks2 = diff.diff() != 0

groups = breaks.cumsum()
groups2 = breaks2.cumsum()
df['follow'] = groups
df['follow2'] = groups2

dfend = (df.groupby(['engine','follow','follow2','code'])
           .agg(
               begin=('start','first'),
               end=('end','last'),
               ensum=('energy','sum')
           )
        )

Thanks a lot.

Tags: python, pandas, dataframe

Solution

You can do this with a nested groupby, as follows:

import pandas as pd
import numpy as np
df = pd.DataFrame({"engine": ["a"]*10,
                    "start": pd.to_datetime(pd.Series(["2020-04-16 09:40:00", "2020-04-17 00:01:00",
                                                       "2020-04-18 00:01:00", "2020-04-19 00:01:00",
                                                       "2020-04-20 00:01:00", "2020-04-21 00:01:00",
                                                       "2020-04-22 00:01:00", "2020-04-23 15:26:00",
                                                       "2020-04-24 00:01:00", "2020-04-25 00:01:00"])),
                   "end": pd.to_datetime(pd.Series(["2020-04-17 00:00:00", "2020-04-18 00:00:00",
                                                    "2020-04-19 00:00:00", "2020-04-20 00:00:00",
                                                    "2020-04-21 00:00:00", "2020-04-22 00:00:00",
                                                    "2020-04-22 06:27:00", "2020-04-24 00:00:00",
                                                    "2020-04-25 00:00:00", "2020-04-26 00:00:00"])),
                   "duration": [860, 1439, 1439, 1439, 1439, 1439, 386, 514, 1439, 1439],
                   "info": [1]*10,
                   "energy": [1982, 3254, 3258, 3353, 3253, 3053, 820, 1838, 4797, 4265]})


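# Largest gap between one row's end and the next row's start
# that still counts as continuous.
minute = pd.Timedelta(minutes=1)

# Per (engine, info): start a new block whenever the gap to the previous
# row's end exceeds one minute, then aggregate each block.
df.groupby(["engine", "info"])\
   .apply(lambda x:
          x.groupby(((x["start"] - x["end"].shift(1)) > minute).cumsum())
           .agg({"start": "min", "end": "max", "energy": "sum"}))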

The output is:

                            start                 end  energy
engine info                                                  
a      1    0 2020-04-16 09:40:00 2020-04-22 06:27:00   18973
            1 2020-04-23 15:26:00 2020-04-26 00:00:00   10900
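
The same idea can also be written without the nested apply, by building the block key as a regular column first. The following is only a sketch under a few assumptions: the shortened four-row frame, the `max_gap` tolerance of one minute, and the names `block`, `prev_end`, `prev_info` and `summary` are illustrative and not part of the original answer. It also treats a change of the info code as a break, which is how the question describes the desired grouping.

```python
import pandas as pd

# Shortened frame with the same columns as the example above.
df = pd.DataFrame({
    "engine": ["a", "a", "a", "a"],
    "start": pd.to_datetime(["2020-04-21 00:01", "2020-04-22 00:01",
                             "2020-04-23 15:26", "2020-04-24 00:01"]),
    "end": pd.to_datetime(["2020-04-22 00:00", "2020-04-22 06:27",
                           "2020-04-24 00:00", "2020-04-25 00:00"]),
    "info": [1, 1, 1, 1],
    "energy": [3053, 820, 1838, 4797],
})

max_gap = pd.Timedelta(minutes=1)  # assumed tolerance between consecutive rows

df = df.sort_values(["engine", "start"]).reset_index(drop=True)

# Previous row's end and info code within the same engine
# (missing for the first row of each engine).
prev_end = df.groupby("engine")["end"].shift()
prev_info = df.groupby("engine")["info"].shift()

# A new block starts after a time gap or when the info code changes.
time_break = (df["start"] - prev_end) > max_gap
code_break = prev_info.notna() & (df["info"] != prev_info)
df["block"] = (time_break | code_break).cumsum()

# One row per continuous block, using named aggregation.
summary = (
    df.groupby(["engine", "info", "block"], as_index=False)
      .agg(begin=("start", "min"), end=("end", "max"), energy=("energy", "sum"))
)
print(summary)
```

Because `engine` is part of the final grouping key, it does not matter that the `block` counter keeps running across engines. Keeping the key as a column also makes it easy to inspect which rows ended up in which block.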
