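python - Identifying events in a time series in Python
Problem description
The graph shows water temperature against time. When there is an activation, the temperature rises. When the activation ends, the temperature starts to fall (although sometimes with a time delay).
I want to count the number of events (each blue circle marks one activation). Sometimes there is random noise (red circles mark random temperature changes; there you see only an increase or only a decrease, not both, which suggests it is not a proper event).
The temperature record is updated every time the temperature changes by 0.5°C, regardless of time.
I have tried using 1) the temperature difference and 2) the gradient of the temperature change between adjacent data points to identify the start and end timestamps of an event, and counting that as one event. But this is not very accurate.
I was told that I should use only the temperature difference and treat the pattern (increase - maximum temperature - decrease) as one event. Any ideas on a suitable way to count the total number of activations?
Update 1:
Sample data: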
id timestamp temperature
27581 27822 2020-01-02 07:53:05.173 19.5
27582 27823 2020-01-02 07:53:05.273 20.0
27647 27888 2020-01-02 10:01:46.380 20.5
27648 27889 2020-01-02 10:01:46.480 21.0
27649 27890 2020-01-02 10:01:48.463 21.5
27650 27891 2020-01-02 10:01:48.563 22.0
27711 27952 2020-01-02 10:32:19.897 21.5
27712 27953 2020-01-02 10:32:19.997 21.0
27861 28102 2020-01-02 11:34:41.940 21.5
...
Update 2:
I tried:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['Date'] = [datetime.datetime.date(d) for d in df['timestamp']]
df['Date'] = pd.to_datetime(df['Date'])
df = df[df['Date'] == '2020-01-02']
# one does not need duplicate temperature values,
# because the task is to find changing values
df2 = df.loc[df['temperature'].shift() != df['temperature']]
# ye good olde forward difference
der = np.diff(df2['temperature'])
# to have the same length as index
der = np.insert(der,len(der),np.NaN)
# make it column
df2['sig'] = np.sign(der)
# temporary array
evts = np.zeros(len(der))
# we find that points, where the signum is changing from 1 to -1, i.e. crosses zero
evts[(df2['sig'].shift() != df2['sig'])&(0 > df2['sig'])] = 1.0
# make it column for plotting
df2['events'] = evts
# preparing plot
fig,ax = plt.subplots(figsize=(20,20))
ax.xaxis_date()
ax.xaxis.set_major_locator(plticker.MaxNLocator(20))
# temperature itself
ax.plot(df2['temperature'],'-xk')
ax2=ax.twinx()
# 'events'
ax2.plot(df2['events'],'-xg')
## uncomment next two lines for plotting of signum
# ax3=ax.twinx()
# ax3.plot(df2['sig'],'-m')
# x-axis tweaking
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
minLim = '2020-01-02 00:07:00'
maxLim = '2020-01-02 23:59:00'
plt.xlim(mdates.date2num(pd.Timestamp(minLim)),
mdates.date2num(pd.Timestamp(maxLim)))
plt.show()
and it produced a blank chart along with these messages:
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:31: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:38: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Update 3:
I wrote a for loop to generate a chart for each day:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['Date'] = df['timestamp'].dt.date
df.set_index(df['timestamp'], inplace=True)
start_date = pd.to_datetime('2020-01-01 00:00:00')
end_date = pd.to_datetime('2020-02-01 00:00:00')
df = df.loc[(df.index >= start_date) & (df.index <= end_date)]
for date in df['Date'].unique():
    df_date = df[df['Date'] == date]
    # one does not need duplicate temperature values,
    # because the task is to find changing values
    df2 = pd.DataFrame.copy(df_date.loc[df_date['temperature'].shift() != df_date['temperature']])
    # ye good olde forward difference
    der = np.sign(np.diff(df2['temperature']))
    # to have the same length as index
    der = np.insert(der,len(der),np.NaN)
    # make it column
    df2['sig'] = der
    # temporary array
    evts = np.zeros(len(der))
    # we find that points, where the signum is changing from 1 to -1, i.e. crosses zero
    evts[(df2['sig'].shift() != df2['sig'])&(0 > df2['sig'])] = 1.0
    # make it column for plotting
    df2['events'] = evts
    # preparing plot
    fig,ax = plt.subplots(figsize=(30,10))
    ax.xaxis_date()
    # df2['timestamp'] = pd.to_datetime(df2['timestamp'])
    ax.xaxis.set_major_locator(plticker.MaxNLocator(20))
    # temperature itself
    ax.plot(df2['temperature'],'-xk')
    ax2=ax.twinx()
    # 'events'
    g= ax2.plot(df2['events'],'-xg')
    # x-axis tweaking
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
    minLim = '2020-01-02 00:07:00'
    maxLim = '2020-01-02 23:59:00'
    plt.xlim(mdates.date2num(pd.Timestamp(minLim)),
             mdates.date2num(pd.Timestamp(maxLim)))
    ax.autoscale()
    plt.title(date)
    print(np.count_nonzero(df2['events'][minLim:maxLim]))
    plt.show(g)
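The plot works, but the count does not.
Update 4:
It looks like some of the charts (e.g. 2020-01-01, 2020-01-04, 2020-01-05) cover only random fragments of time (possibly because they fall on weekends). Is there a way to remove those days?
Solution
First, I would suggest increasing the number of data points, by which I mean in the experimental setup itself.
Nevertheless, it looks like one can extract "events" from the data provided. The idea is simple: we need to find "peaks" characterized by a rise-fall pattern. To find rises and falls it is natural to use the first derivative, and since we are only interested in its sign (plus for an increasing function, minus for a decreasing one), I simply used the sign of the first forward difference. Since we assume that no peaks appear spontaneously, we need to find the points where the forward difference changes sign. In effect this is a substitute for the second derivative; indeed, I got almost the same results with a plain second forward difference, but it was less convenient.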
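The rise-fall idea can be illustrated on a small made-up series before applying it to the real data. This is only a sketch: the function name `count_rise_fall_peaks` and the `toy` values are hypothetical, and it counts strict plus-to-minus flips, so unlike the routine in this answer it never counts a series that starts with a fall.

```python
import numpy as np

def count_rise_fall_peaks(temps):
    """Count points where the forward difference flips from + to -."""
    temps = np.asarray(temps, dtype=float)
    if temps.size < 3:
        return 0  # a rise followed by a fall needs at least three samples
    # keep only samples that differ from their predecessor
    keep = np.insert(temps[1:] != temps[:-1], 0, True)
    changed = temps[keep]
    # sign of the first forward difference: +1 rising, -1 falling
    sig = np.sign(np.diff(changed))
    # a peak is a rising step immediately followed by a falling step
    # (equivalently, np.diff(sig) == -2)
    return int(np.count_nonzero((sig[:-1] > 0) & (sig[1:] < 0)))

# made-up readings in 0.5 degree steps: two rise-fall peaks, then a lone rise
toy = [19.5, 20.0, 20.5, 21.0, 20.5, 20.0, 20.0, 20.5, 21.0, 21.5, 21.0, 21.5]
print(count_rise_fall_peaks(toy))  # prints 2
```

A run that only rises (or only falls), like the red-circle noise in the question, produces no sign flip and is therefore not counted.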
I used the following routine:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as plticker
# endimports
# path to csv
path = r'JanuaryData.csv'
# reading the csv
df = pd.read_csv(path,usecols=['timestamp','temperature'],parse_dates=True, index_col='timestamp')
# selecting the part for the analysis
startDate = '2020-01-01 00:00:00'
endDate = '2020-01-03 23:59:00'
df = df.loc[startDate:endDate]
# one does not need duplicate temperature values,
# because the task is to find changing values
df2 = df.loc[df['temperature'].shift() != df['temperature']]
# ye good olde forward difference
der = np.diff(df2['temperature'])
# to have the same length as index
der = np.insert(der,len(der),np.nan)
# make it column
df2['sig'] = np.sign(der)
# temporary array
evts = np.zeros(len(der))
# we find that points, where the signum is changing from 1 to -1, i.e. crosses zero
evts[(df2['sig'].shift() != df2['sig'])*(0 > df2['sig'])] = 1.0
# make it column for plotting
df2['events'] = evts
# preparing plot
fig,ax = plt.subplots(figsize=(20,20))
ax.xaxis_date()
ax.xaxis.set_major_locator(plticker.MaxNLocator(20))
# temperature itself
ax.plot(df2['temperature'],'-xk')
ax2=ax.twinx()
# 'events'
ax2.plot(df2['events'],'-xg')
## uncomment next two lines for plotting of signum
# ax3=ax.twinx()
# ax3.plot(df2['sig'],'-m')
# x-axis tweaking
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
minLim = '2020-01-02 00:07:00'
maxLim = '2020-01-02 23:59:00'
plt.xlim(mdates.date2num(pd.Timestamp(minLim)),
mdates.date2num(pd.Timestamp(maxLim)))
plt.show()
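Image produced by the code: the peaks of the green curve mark the start of the corresponding temperature peaks; I apologize for the not very intuitive presentation. I also tried analyzing other data in the .csv, and the algorithm appears to work well.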
Edit #1: replace the line
df2 = df.loc[df['temperature'].shift() != df['temperature']]
with
df2 = pd.DataFrame.copy(df.loc[df['temperature'].shift() != df['temperature']])
to get rid of the SettingWithCopyWarning.
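The same effect can be had with the `.copy()` method on the filtered frame. The snippet below is a minimal sketch on a made-up five-row frame (the values are hypothetical), showing that new columns then land on an independent DataFrame rather than on a slice of the original.

```python
import numpy as np
import pandas as pd

# made-up frame standing in for the real data
df = pd.DataFrame({"temperature": [19.5, 19.5, 20.0, 20.5, 20.0]})

# rows whose temperature differs from the previous row
mask = df["temperature"].shift() != df["temperature"]

# explicit copy: df2 is an independent DataFrame, not a slice of df
df2 = df.loc[mask].copy()

# adding a column to df2 now leaves df untouched
der = np.sign(np.diff(df2["temperature"]))
df2["sig"] = np.append(der, np.nan)
print(df2)
```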
Also rewrite the forward-difference lines from
# ye good olde forward difference
der = np.diff(df2['temperature'])
# to have the same length as index
der = np.insert(der,len(der),np.nan)
# make it column
df2['sig'] = np.sign(der)
to
# ye good olde forward difference
der = np.sign(np.diff(df2['temperature']))
# to have the same length as index
der = np.insert(der,len(der),np.nan)
# make it column
df2['sig'] = der
to prevent the np.sign() warning about NaN values.
Edit #2: to print the number of events in a given range, use
print(np.count_nonzero(df2['events'][minLim:maxLim]))
For the limits used above it prints 6; for the whole dataset it gives 174.
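For per-day counts, as asked in Updates 3 and 4 of the question, one option is to group the event flags by calendar day instead of slicing with fixed limits. The following is a sketch under assumptions: `events_per_day`, its `drop_weekends` flag, and the toy series are hypothetical names and data, and the detection runs over the whole series at once rather than day by day, so an event straddling midnight is attributed to the day of its peak.

```python
import numpy as np
import pandas as pd

def events_per_day(temperature, drop_weekends=False):
    """temperature: Series with a DatetimeIndex; returns events per day."""
    # keep only rows where the temperature actually changed
    changed = temperature[temperature.shift() != temperature]
    # sign of the forward difference; the last entry is NaN
    sig = np.sign(changed.diff().shift(-1))
    # an event is flagged where the sign switches to negative
    events = (sig.shift() != sig) & (sig < 0)
    # sum the flags per calendar day
    counts = events.groupby(events.index.normalize()).sum().astype(int)
    if drop_weekends:
        # dayofweek: Monday=0 ... Sunday=6
        counts = counts[counts.index.dayofweek < 5]
    return counts

# made-up readings: Thursday, Friday and Saturday
stamps = pd.to_datetime([
    "2020-01-02 07:00", "2020-01-02 08:00", "2020-01-02 09:00",
    "2020-01-02 10:00", "2020-01-02 11:00", "2020-01-02 12:00",
    "2020-01-02 13:00", "2020-01-03 07:00", "2020-01-03 08:00",
    "2020-01-03 09:00", "2020-01-04 08:00", "2020-01-04 09:00",
])
toy = pd.Series([19.5, 20.0, 20.5, 20.0, 19.5, 20.0, 19.5,
                 19.5, 20.0, 19.5, 20.0, 19.5], index=stamps)

print(events_per_day(toy).tolist())                      # prints [2, 1, 1]
print(events_per_day(toy, drop_weekends=True).tolist())  # prints [2, 1]
```

A weekday holiday such as 2020-01-01 is not removed by the weekend filter; dropping days with too few recorded changes would be another possible criterion.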