Identifying activations in a time series in Python

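Problem description

The plot shows water temperature against time. When there is an activation, the temperature rises. When the activation ends, the temperature starts to fall (although sometimes with a time delay).

I want to count the number of events (each blue circle represents one activation). Sometimes there is random noise (the red circles mark random temperature changes; you can see there is only an increase or only a decrease, not both, which suggests it is not a proper event).

The temperature record is updated whenever the temperature changes by 0.5°C, regardless of time.

I tried using 1) the temperature difference and 2) the gradient of the temperature change between adjacent data points to identify the start and end timestamps of an event and count that as one event. But this was not very accurate.

I was told I should use only the temperature difference and treat the pattern (increase, maximum temperature, decrease) as one event. Any ideas on a suitable way to count the total number of activations?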


Update 1:

Sample data:

        id      timestamp               temperature 
27581   27822   2020-01-02 07:53:05.173 19.5    
27582   27823   2020-01-02 07:53:05.273 20.0    
27647   27888   2020-01-02 10:01:46.380 20.5    
27648   27889   2020-01-02 10:01:46.480 21.0    
27649   27890   2020-01-02 10:01:48.463 21.5    
27650   27891   2020-01-02 10:01:48.563 22.0    
27711   27952   2020-01-02 10:32:19.897 21.5    
27712   27953   2020-01-02 10:32:19.997 21.0
27861   28102   2020-01-02 11:34:41.940 21.5    
...

Update 2:

What I tried:

df['timestamp'] = pd.to_datetime(df['timestamp'])
df['Date'] = [datetime.datetime.date(d) for d in df['timestamp']] 
df['Date'] = pd.to_datetime(df['Date'])   
df = df[df['Date'] == '2020-01-02']

# one does not need duplicate temperature values, 
# because the task is to find changing values
df2 = df.loc[df['temperature'].shift() != df['temperature']]

# ye good olde forward difference
der = np.diff(df2['temperature'])
# to have the same length as index
der = np.insert(der,len(der),np.NaN)
# make it column
df2['sig'] = np.sign(der)

# temporary array
evts = np.zeros(len(der))
# we find that points, where the signum is changing from 1 to -1, i.e. crosses zero
evts[(df2['sig'].shift() != df2['sig'])&(0 > df2['sig'])] = 1.0
# make it column for plotting
df2['events'] = evts

# preparing plot
fig,ax = plt.subplots(figsize=(20,20))
ax.xaxis_date()
ax.xaxis.set_major_locator(plticker.MaxNLocator(20))

# temperature itself
ax.plot(df2['temperature'],'-xk')
ax2=ax.twinx()

# 'events'
ax2.plot(df2['events'],'-xg')

## uncomment next two lines for plotting of signum
# ax3=ax.twinx()
# ax3.plot(df2['sig'],'-m')

# x-axis tweaking
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
minLim = '2020-01-02 00:07:00'
maxLim = '2020-01-02 23:59:00'
plt.xlim(mdates.date2num(pd.Timestamp(minLim)),
          mdates.date2num(pd.Timestamp(maxLim)))
plt.show()

This produced a blank plot along with the following messages:

/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:31: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:38: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Update 3:

I wrote a for loop to generate a plot for each day:

df['timestamp'] = pd.to_datetime(df['timestamp'])   
df['Date'] = df['timestamp'].dt.date     
df.set_index(df['timestamp'], inplace=True)

start_date = pd.to_datetime('2020-01-01 00:00:00')
end_date = pd.to_datetime('2020-02-01 00:00:00')
df = df.loc[(df.index >= start_date) & (df.index <= end_date)]

for date in df['Date'].unique():   
  df_date = df[df['Date'] == date]

# one does not need duplicate temperature values, 
# because the task is to find changing values
  df2 = pd.DataFrame.copy(df_date.loc[df_date['temperature'].shift() != df_date['temperature']])

# ye good olde forward difference
  der = np.sign(np.diff(df2['temperature']))
# to have the same length as index
  der = np.insert(der,len(der),np.NaN)
# make it column
  df2['sig'] = der

# temporary array
  evts = np.zeros(len(der))
# we find that points, where the signum is changing from 1 to -1, i.e. crosses zero
  evts[(df2['sig'].shift() != df2['sig'])&(0 > df2['sig'])] = 1.0
# make it column for plotting
  df2['events'] = evts

# preparing plot
  fig,ax = plt.subplots(figsize=(30,10))

  ax.xaxis_date()
# df2['timestamp'] = pd.to_datetime(df2['timestamp'])
  ax.xaxis.set_major_locator(plticker.MaxNLocator(20)) 

# temperature itself
  ax.plot(df2['temperature'],'-xk')
  ax2=ax.twinx()

# 'events'
  g= ax2.plot(df2['events'],'-xg')

# x-axis tweaking
  ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
  minLim = '2020-01-02 00:07:00'
  maxLim = '2020-01-02 23:59:00'
  plt.xlim(mdates.date2num(pd.Timestamp(minLim)),
          mdates.date2num(pd.Timestamp(maxLim)))

  ax.autoscale()     
  plt.title(date)
  print(np.count_nonzero(df2['events'][minLim:maxLim]))
  plt.show(g)

The plot works, but the count does not.


Update 4:

It looks like some of the plots (e.g. 2020-01-01, 2020-01-04, 2020-01-05) cover only random fragments of time (possibly because they fall on weekends). Is there a way to drop those days?

Tags: python, pandas, algorithm, numpy, time-series

Solution


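First, I would suggest increasing the number of data points, I mean in the experimental setup itself.
Nevertheless, it looks like one can extract the "events" from the data provided. The idea is simple: we need to find the "peaks", which are characterized by a rise-then-fall pattern. To find rises and falls it is natural to use the first derivative, and since we are only interested in its sign (plus for an increasing function, minus for a decreasing one), I simply used the sign of the first forward difference. Since we assume there are no spontaneously appearing peaks, we need to find the points where the sign of the forward difference changes. In effect this is a substitute for the second derivative; in fact, I got almost the same result using a plain second forward difference, but it was less convenient.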
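As a minimal sketch of that idea on a bare list of temperatures (the helper name `count_rise_fall_peaks` is hypothetical, and the values are the sample rows from the question): it takes the sign of the forward difference and counts places where a rising step is directly followed by a falling step. This is a slightly stricter variant of the sign-change test, since a fall that is not preceded by a rise is not counted.

```python
import numpy as np
import pandas as pd

def count_rise_fall_peaks(temps):
    """Count points where the forward-difference sign flips from + to -.

    Hypothetical helper for illustration; not part of the original routine.
    """
    s = pd.Series(temps, dtype=float)
    # Drop consecutive duplicates so every remaining step is a real change.
    s = s[s.shift() != s]
    # Sign of the first forward difference: +1 rising, -1 falling.
    sig = np.sign(np.diff(s.to_numpy()))
    # A peak is a rising step immediately followed by a falling step.
    return int(np.sum((sig[:-1] > 0) & (sig[1:] < 0)))

# Temperature column of the sample data in the question.
temps = [19.5, 20.0, 20.5, 21.0, 21.5, 22.0, 21.5, 21.0, 21.5]
print(count_rise_fall_peaks(temps))
```

A purely rising or purely falling run (the "red circle" noise in the question) produces no sign flip and therefore contributes nothing to the count.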


I used the following routine:

# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as plticker
# endimports

# path to csv
path = r'JanuaryData.csv'
# reading the csv
df = pd.read_csv(path,usecols=['timestamp','temperature'],parse_dates=True, index_col='timestamp')

# selecting the part for the analysis
startDate = '2020-01-01 00:00:00'
endDate = '2020-01-03 23:59:00'
df = df.loc[startDate:endDate]

# one does not need duplicate temperature values, 
# because the task is to find changing values
df2 = df.loc[df['temperature'].shift() != df['temperature']]

# ye good olde forward difference
der = np.diff(df2['temperature'])
# to have the same length as index
der = np.insert(der,len(der),np.nan)
# make it column
df2['sig'] = np.sign(der)

# temporary array
evts = np.zeros(len(der))
# we find that points, where the signum is changing from 1 to -1, i.e. crosses zero
evts[(df2['sig'].shift() != df2['sig'])&(0 > df2['sig'])] = 1.0
# make it column for plotting
df2['events'] = evts

# preparing plot
fig,ax = plt.subplots(figsize=(20,20))
ax.xaxis_date()
ax.xaxis.set_major_locator(plticker.MaxNLocator(20))

# temperature itself
ax.plot(df2['temperature'],'-xk')
ax2=ax.twinx()

# 'events'
ax2.plot(df2['events'],'-xg')

## uncomment next two lines for plotting of signum
# ax3=ax.twinx()
# ax3.plot(df2['sig'],'-m')

# x-axis tweaking
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
minLim = '2020-01-02 00:07:00'
maxLim = '2020-01-02 23:59:00'
plt.xlim(mdates.date2num(pd.Timestamp(minLim)),
          mdates.date2num(pd.Timestamp(maxLim)))
plt.show()

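In the image produced by the code, the peaks of the green curve mark the start of the corresponding temperature peaks; apologies for the not very intuitive representation. I tried analysing other data in the .csv, and the algorithm seems to work well.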


Edit #1: replace the line

df2 = df.loc[df['temperature'].shift() != df['temperature']]

with

df2 = pd.DataFrame.copy(df.loc[df['temperature'].shift() != df['temperature']])

to get rid of the SettingWithCopyWarning.

Also rewrite the lines with the forward difference from

# ye good olde forward difference
der = np.diff(df2['temperature'])
# to have the same length as index
der = np.insert(der,len(der),np.nan)
# make it column
df2['sig'] = np.sign(der)

to

# ye good olde forward difference
der = np.sign(np.diff(df2['temperature']))
# to have the same length as index
der = np.insert(der,len(der),np.nan)
# make it column
df2['sig'] = der

to prevent np.sign() warnings about NaN values.


Edit #2: to print the number of events in the range, use

print(np.count_nonzero(df2['events'][minLim:maxLim]))

For the limits used above it prints 6; for the whole dataset it gives 174.
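The per-day loop in the question slices every day with the same fixed 2020-01-02 limits, which is one likely reason its count looks wrong. A rough sketch of a per-day count that avoids fixed limits follows. It assumes `df2` has a DatetimeIndex and the 0/1 `events` column built above; the helper name `events_per_day` and the `min_rows` threshold for skipping sparsely recorded days are hypothetical, and the demo frame is made up.

```python
import pandas as pd

def events_per_day(df2, min_rows=0):
    """Sum the 0/1 'events' column per calendar day.

    Hypothetical helper: days with fewer than min_rows records are dropped,
    as a crude way to skip days that only contain fragments of data.
    """
    # Midnight of each row's day serves as the grouping key.
    days = df2.index.normalize()
    grouped = df2.groupby(days)['events']
    counts = grouped.sum().astype(int)
    rows = grouped.size()
    return counts[rows >= min_rows]

# Made-up demo frame: one sparse day and one day with a single event.
idx = pd.to_datetime(['2020-01-01 08:00', '2020-01-02 07:53',
                      '2020-01-02 10:01', '2020-01-02 10:32',
                      '2020-01-02 11:34'])
demo = pd.DataFrame({'events': [0.0, 0.0, 0.0, 1.0, 0.0]}, index=idx)
print(events_per_day(demo, min_rows=2))
```

A suitable `min_rows` value depends on how densely a normal day is recorded, so it would have to be chosen by looking at the actual data.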

