Comparing two time-series DataFrames based on certain conditions in pandas

Problem description

I have two time-series DataFrames, df1 and df2:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date_1':['10/11/2017 0:00','10/11/2017 03:00','10/11/2017 06:00','10/11/2017 09:00'],
                  'value_1':[5000,1500,np.nan,2000]})

# Parse the date strings and use them as a DatetimeIndex.
df1['date_1'] = pd.to_datetime(df1.date_1.astype(str), format='%m/%d/%Y %H:%M', errors='coerce')
df1.index = pd.DatetimeIndex(df1.date_1)
df1.drop('date_1', axis=1, inplace=True)

and

df2 = pd.DataFrame({'date_2': ['2017-10-11 00:00:00', '2017-10-11 00:30:00','2017-10-11 00:50:00', '2017-10-11 01:20:00',
                             '2017-10-11 01:40:00','2017-10-11 02:20:00','2017-10-11 02:50:00', '2017-10-11 03:00:00',
                             '2017-10-11 03:20:00', '2017-10-11 03:50:00','2017-10-11 04:20:00', '2017-10-11 04:50:00',
                             '2017-10-11 05:20:00', '2017-10-11 05:50:00','2017-10-11 06:00:00', '2017-10-11 06:20:00',
                             '2017-10-11 06:50:00', '2017-10-11 07:20:00','2017-10-11 07:50:00', '2017-10-11 08:20:00',
                             '2017-10-11 08:50:00', '2017-10-11 09:20:00','2017-10-11 09:50:00', '2017-10-11 10:20:00'],

                  'value_2':[1500.0, 2050.0,  np.nan,  2400.0, 
                           2500.0,  2550.0,  2900.0,  np.nan,
                           3200.0,  3500.0,  np.nan,  3600.0,
                           2600.0,  2500.0,  2350.0,  2200.0,
                           np.nan,  2100.0,  np.nan,  2400.0,
                           2600.0,  np.nan,  8000.0,  9000.0]
                    })
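# The strings include seconds, so the format does too; with a shorter format,
# errors='coerce' can silently turn every value into NaT.
df2['date_2'] = pd.to_datetime(df2.date_2.astype(str), format='%Y-%m-%d %H:%M:%S', errors='coerce')
df2.index = pd.DatetimeIndex(df2.date_2)
df2.drop('date_2', axis=1, inplace=True)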

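Both DataFrames hold observations from the same day, but at different time resolutions: df1 has a resolution of 3 hours, while df2 has a resolution of 30 minutes or less. I want to create a new DataFrame dfx, with the columns count and duration, by comparing the two DataFrames above against some conditions.

For the duration column in dfx: where count is 1, dfx['duration'] = df2.index[i+1] - df2.index[i]. On an edge timestamp, e.g. 01:20:00, dfx['duration'] = (df1.index[i] + 01:30) - df2.index[i], where df1.index[i] is the df1 timestamp that is being compared with df2.

Expected output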
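As a worked example of the duration rule, the sketch below computes the durations for the first four df2 timestamps, assuming the window around the first df1 timestamp ends at df1.index[0] + 01:30. The names df2_index, df1_ts and edge are illustrative only and do not appear in the question.

```python
import pandas as pd

# First four df2 timestamps from the question.
df2_index = pd.DatetimeIndex([
    "2017-10-11 00:00", "2017-10-11 00:30",
    "2017-10-11 00:50", "2017-10-11 01:20",
])

# The df1 timestamp these rows are compared with (assumed: the 00:00 reading).
df1_ts = pd.Timestamp("2017-10-11 00:00")
edge = df1_ts + pd.Timedelta(hours=1, minutes=30)

# Regular rows: gap to the next df2 timestamp.
durations = list(df2_index[1:] - df2_index[:-1])

# Edge row (01:20): gap to the end of the df1 window instead.
durations.append(edge - df2_index[-1])

for ts, d in zip(df2_index, durations):
    print(ts, d)
```

The four values are 30, 20, 30 and 10 minutes, which matches the first four duration entries in the expected output.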

dfx = pd.DataFrame({'date_2': ['2017-10-11 00:00:00', '2017-10-11 00:30:00','2017-10-11 00:50:00', '2017-10-11 01:20:00',
                             '2017-10-11 01:40:00','2017-10-11 02:20:00','2017-10-11 02:50:00', '2017-10-11 03:00:00',
                             '2017-10-11 03:20:00', '2017-10-11 03:50:00','2017-10-11 04:20:00', '2017-10-11 04:50:00',
                             '2017-10-11 05:20:00', '2017-10-11 05:50:00','2017-10-11 06:00:00', '2017-10-11 06:20:00',
                             '2017-10-11 06:50:00', '2017-10-11 07:20:00','2017-10-11 07:50:00', '2017-10-11 08:20:00',
                             '2017-10-11 08:50:00', '2017-10-11 09:20:00','2017-10-11 09:50:00', '2017-10-11 10:20:00'],

                  'count':[1, 1,  1,  1, 
                           0,  0,  0, 0,
                           0,  0,  0,  0,
                           0,  0,  0,  0,
                           0,  0,  1,  1,
                           1,  0,  0,  0],
                    
                    'duration':['00:30','00:20','00:30','00:10',
                                '00:00', '00:00', '00:00', '00:00',
                                '00:00', '00:00', '00:00', '00:00',
                                '00:00', '00:00', '00:00', '00:00',
                                '00:00', '00:00', '00:30', '00:30',
                                '00:10', '00:00', '00:00', '00:00']})
                        
# The strings include seconds, so the format does too.
dfx['date_2'] = pd.to_datetime(dfx.date_2.astype(str), format='%Y-%m-%d %H:%M:%S', errors='coerce')
dfx.index = pd.DatetimeIndex(dfx.date_2)
dfx.drop('date_2', axis=1, inplace=True)

I tried to keep this short, but my question has still become rather long. Please bear with me. I would really appreciate your help.

Thanks!

Tags: python, pandas, dataframe, datetime, time-series

Solution


Input data:

>>> df1
                     value_1
date_1
2017-10-11 00:00:00   5000.0
2017-10-11 03:00:00   1500.0
2017-10-11 06:00:00      NaN
2017-10-11 09:00:00   2000.0

>>> df2
                     value_2
date_2
2017-10-11 00:00:00   1500.0
2017-10-11 00:30:00   2050.0
2017-10-11 00:50:00      NaN
2017-10-11 01:20:00   2400.0
2017-10-11 01:40:00   2500.0
...
2017-10-11 08:20:00   2400.0
2017-10-11 08:50:00   2600.0
2017-10-11 09:20:00      NaN
2017-10-11 09:50:00   8000.0
2017-10-11 10:20:00   9000.0
  1. Fill the NaN values in df2 by linear interpolation between the values at t-1 and t+1:
df2['value_2'] = df2['value_2'].interpolate()
  2. Create one interval per df1 timestamp, according to your rules:
ii = pd.IntervalIndex.from_tuples(
         list(zip(df1.index - pd.DateOffset(hours=1, minutes=29),
                  df1.index + pd.DateOffset(hours=1, minutes=30)))
     )
  3. Bin the timestamps of both DataFrames into these discrete intervals:
df1['interval'] = pd.cut(df1.index, bins=ii)
df2['interval'] = pd.cut(df2.index, bins=ii)
  4. Merge the two DataFrames on interval and compute count:
dfx = pd.merge(df2, df1, on='interval', how='left').set_index('interval')
dfx = (dfx['value_2'].lt(2800) & dfx['value_1'].gt(1600)) \
          .astype(int).to_frame('count').set_index(df2.index)
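  5. Append the index of df1, resampled at a frequency of 90 minutes, to the index of df2:
# '90min' is used instead of '90T', which recent pandas versions deprecate.
dti = df2.index.append(
          pd.DatetimeIndex(df1.index.to_series().resample('90min').groups.keys())
      ).sort_values().drop_duplicates()
dfx = dfx.reindex(dti).ffill().astype(int)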
  6. Compute duration from count, then reindex back to the index of df2:
dfx['duration'] = dfx.index.to_series().diff(-1).abs() \
                     .fillna(pd.Timedelta(0)).dt.components \
                     .apply(lambda x: f"{x['hours']:02}:{x['minutes']:02}",
                            axis='columns')

dfx.loc[dfx['count'] == 0, 'duration'] = '00:00'
dfx = dfx.reindex(df2.index)
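To make the interval and count steps concrete, here is a minimal sketch on five df2 rows that straddle the window boundaries. It is not the code above: it looks up the window with IntervalIndex.get_indexer instead of pd.cut followed by a merge, and the names windows and pos are illustrative. The value_2 figures are the already interpolated ones, and the thresholds 2800 and 1600 are copied from the merge step.

```python
import numpy as np
import pandas as pd

# The four df1 readings from the question, one every 3 hours.
df1 = pd.DataFrame(
    {"value_1": [5000, 1500, np.nan, 2000]},
    index=pd.to_datetime([
        "2017-10-11 00:00", "2017-10-11 03:00",
        "2017-10-11 06:00", "2017-10-11 09:00",
    ]),
)

# A few df2 rows (values after interpolation) near the window edges.
df2 = pd.DataFrame(
    {"value_2": [2400.0, 2500.0, 2100.0, 2250.0, 5300.0]},
    index=pd.to_datetime([
        "2017-10-11 01:20", "2017-10-11 01:40", "2017-10-11 07:20",
        "2017-10-11 07:50", "2017-10-11 09:20",
    ]),
)

# One right-closed window (t - 1h29, t + 1h30] per df1 timestamp.
windows = pd.IntervalIndex.from_arrays(
    df1.index - pd.Timedelta(hours=1, minutes=29),
    df1.index + pd.Timedelta(hours=1, minutes=30),
    closed="right",
)

# Position of the window that contains each df2 timestamp (-1 if none).
pos = windows.get_indexer(df2.index)
value_1 = df1["value_1"].to_numpy()[pos]

# A comparison with NaN is False, so the 06:00 window never counts.
count = ((df2["value_2"].to_numpy() < 2800) & (value_1 > 1600)).astype(int)

print(pos.tolist())    # [0, 1, 2, 3, 3]
print(count.tolist())  # [1, 0, 0, 1, 0]
```

The counts agree with the rows for 01:20, 01:40, 07:20, 07:50 and 09:20 in the expected output. The lookup assumes every df2 timestamp falls inside some window; a -1 in pos would need separate handling.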

Output:

>>> dfx
                     count duration
date_2
2017-10-11 00:00:00      1    00:30
2017-10-11 00:30:00      1    00:20
2017-10-11 00:50:00      1    00:30
2017-10-11 01:20:00      1    00:10
2017-10-11 01:40:00      0    00:00
2017-10-11 02:20:00      0    00:00
2017-10-11 02:50:00      0    00:00
2017-10-11 03:00:00      0    00:00
2017-10-11 03:20:00      0    00:00
2017-10-11 03:50:00      0    00:00
2017-10-11 04:20:00      0    00:00
2017-10-11 04:50:00      0    00:00
2017-10-11 05:20:00      0    00:00
2017-10-11 05:50:00      0    00:00
2017-10-11 06:00:00      0    00:00
2017-10-11 06:20:00      0    00:00
2017-10-11 06:50:00      0    00:00
2017-10-11 07:20:00      0    00:00
2017-10-11 07:50:00      1    00:30
2017-10-11 08:20:00      1    00:30
2017-10-11 08:50:00      1    00:10
2017-10-11 09:20:00      0    00:00
2017-10-11 09:50:00      0    00:00
2017-10-11 10:20:00      0    00:00
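The 00:10 durations at 01:20 and 08:50 in the table come from the 90-minute boundaries that were appended to the index. The sketch below reproduces that mechanism for three rows around the 01:30 boundary. It is a simplified illustration, not the answer's code: the names counts, boundary, full and gap are made up here, and durations are kept as Timedelta values rather than formatted strings.

```python
import pandas as pd

# Three df2 timestamps around the first 90-minute boundary, with their counts.
counts = pd.Series(
    [1, 1, 0],
    index=pd.to_datetime([
        "2017-10-11 00:50", "2017-10-11 01:20", "2017-10-11 01:40",
    ]),
)
boundary = pd.DatetimeIndex(["2017-10-11 01:30"])

# Add the boundary as a helper row; it inherits the count of the row before it.
full = counts.reindex(counts.index.union(boundary)).ffill().astype(int)

# Gap from each row to the next one; the last row has no successor.
gap = full.index.to_series().diff(-1).abs().fillna(pd.Timedelta(0))

# Rows with count 0 get a zero duration.
gap[full == 0] = pd.Timedelta(0)

# Drop the helper row again.
duration = gap.reindex(counts.index)
print(duration)
```

Without the helper row at 01:30, the gap at 01:20 would run to 01:40 and give 20 minutes instead of the 10 minutes shown in the table.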
