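python - Pandas: how to use iterrows() and itertuples() to access a subset of a DataFrame, index it, and find peaks and trend changes
Problem description
Peak
Use the states of X and Y to find anomalies where the value of X has peaked.
Get a subset of the DataFrame around each anomaly, for example the 5 rows before and the 5 rows after it.
An anomaly can also be the start of a local trend within the global trend. Basically, take a subsequence of the time series from the DataFrame and look at this local trend for more information, in particular to confirm there is no signal that the local trend has reversed.
Identifying and validating the local trend is done by confirming that X is at its highest point (X is an oscillating value). This is also similar to the center value of a histogram. We need to confirm the X peak by checking that the values before and after it are all lower than the peak. Ideally, we want to confirm this against several values on each side.
Sample data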
import pandas as pd
df = pd.DataFrame({
'X': [-0.27, -0.28, -0.33, -0.37, -0.60, -0.90, -0.99, -0.94, -0.85, -0.75, -0.64, -0.51, -0.35, -0.21, 1.78, 1.98, 2.08, 2.42, 2.56, 2.51, 2.57, 2.53, 2.37, 2.24, 2.11, 2.01, 1.82, 1.64, ],
'X_State': ['3', '3', '3', '3', '5', '5', '5', '5', '5', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '6', '6', '6', '6', '6', ],
'Y_State': ['23', '23', '23', '23', '24', '24', '24', '24', '24', '23', '23', '23', '22', '22', '18', '18', '18', '17', '17', '18', '17', '17', '18', '18', '18', '18', '18', '19', ],
})
df2 = pd.DataFrame() #create new empty dataframe
The second DataFrame is used to store the subset of data we find.
Code
Label = []
# Get Previous
df['X_STATE_Previous_Value'] = df.X_State.shift(1)
df['Y_STATE_Previous_Value'] = df.Y_State.shift(1)
df['Y_STATE_Change'] = (df.Y_State.ne(df.Y_State.shift())).astype(int)
for index, row in df.iterrows():
    if (row['Y_State'] == '17' and row['Y_STATE_Previous_Value'] == '18'):
        Label.append('Index Position: ' + str(index))
        # Select 5 rows before and after
        df2 = df2.append(df.iloc[index-5:index+5])
        # Find where X peaked
        for i, row2 in df2.iterrows():
            # get index position of the first instance of the largest value
            peak = df2.X.idxmax()
            # Go back and label where X peaked
            df.loc[peak, 'Label'] = 'Top of Peak'
    else:
        Label.append('...')
df['Label'] = Label
df2['Max_Label'] = peak
print(df)
print(df2)
#del df2
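Help needed
First, the 'Top of Peak' label does not update df, even though df is what is being referenced. It is updating df2, and df2 is ultimately only temporary, there to help us find the peak.
Second, I am looking for a better way to confirm the Top of Peak. Using the max value of the subset does not actually confirm that the values before and after it are both lower.
Solution
If I understand correctly, this is how I would do what you are looking for: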
import pandas as pd
df = pd.DataFrame({
'X': [-0.27, -0.28, -0.33, -0.37, -0.60, -0.90, -0.99, -0.94, -0.85, -0.75, -0.64, -0.51, -0.35, -0.21, 1.78, 1.98, 2.08, 2.42, 2.56, 2.51, 2.57, 2.53, 2.37, 2.24, 2.11, 2.01, 1.82, 1.64, ],
'X_State': ['3', '3', '3', '3', '5', '5', '5', '5', '5', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '6', '6', '6', '6', '6', ],
'Y_State': ['23', '23', '23', '23', '24', '24', '24', '24', '24', '23', '23', '23', '22', '22', '18', '18', '18', '17', '17', '18', '17', '17', '18', '18', '18', '18', '18', '19', ],
})
df['X_STATE_Previous_Value'] = df.X_State.shift(1)
df['Y_STATE_Previous_Value'] = df.Y_State.shift(1)
df['Y_STATE_Change'] = (df.Y_State.ne(df.Y_State.shift())).astype(int)
df['Label'] = '' #or '...' if you like better
# get a list of indexes where abnormality:
abnormal_idx = df[(df['Y_State'] == '17') & (df['Y_STATE_Previous_Value'] == '18')].index
# write it in column Label:
df.loc[abnormal_idx ,'Label'] = 'abnormality'
# get a subset of +/- 5 rows around abnormalities
df2 = df[min(abnormal_idx )-5:max(abnormal_idx )+5]
# and the max of X on this subset
peak_idx = df2.X.idxmax()
# you don't really need df2, you can do it directly: peak_idx = df[min(abnormal_idx )-5:max(abnormal_idx )+5].X.idxmax()
# add this number in a column, not sure why?
df['Max_Label'] = peak_idx
Let me know if this fits your needs.
Edit: to get the max of the subset around each abnormality, you can do the following:
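On the first point in the question: in the original loop, `df.loc[peak, 'Label'] = 'Top of Peak'` is written inside the loop, but the later `df['Label'] = Label` replaces the whole column, so the peak label is lost. A minimal sketch of one way around that, assuming it is acceptable to create the column first and write the peak label last. The data is rows 15 to 22 of the question's sample, re-indexed from 0, and the name `is_anomaly` is only illustrative.

```python
import pandas as pd

# Rows 15 to 22 of the question's sample data, re-indexed from 0.
df = pd.DataFrame({
    'X': [1.98, 2.08, 2.42, 2.56, 2.51, 2.57, 2.53, 2.37],
    'Y_State': ['18', '18', '17', '17', '18', '17', '17', '18'],
})

# Create the Label column once, up front, so later writes are not replaced.
df['Label'] = '...'

# Anomaly: Y_State is '17' and the previous row's Y_State was '18'.
is_anomaly = (df['Y_State'] == '17') & (df['Y_State'].shift(1) == '18')
df.loc[is_anomaly, 'Label'] = 'abnormality'

# Label the row where X is largest; done last so nothing overwrites it.
peak_idx = df['X'].idxmax()
df.loc[peak_idx, 'Label'] = 'Top of Peak'

print(df)
```

Note that in this sketch the peak row is also an anomaly row, so 'Top of Peak' replaces 'abnormality' there. If both pieces of information are needed, a separate column for the peak is probably the simpler choice.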
df['subset_max'] = ''
for idx in abnormal_idx:
    # note the +6 instead of +5: the upper bound of a slice is excluded
    idx_max = df[idx-5:idx+6].X.idxmax()
    if idx == idx_max:
        df.loc[idx, 'subset_max'] = 'max of the subset'
    else:
        df.loc[idx, 'subset_max'] = 'subset max at %s' % idx_max
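The second point in the question, confirming that the values before and after the peak are actually lower, is not covered by taking the max of a window. A rough sketch of one possible check, assuming a "confirmed" peak means strictly greater than the `n` values on each side. The helper `is_confirmed_peak` and the parameter `n` are hypothetical names, not part of pandas, and the data is the X column from rows 15 to 24 of the question's sample, re-indexed from 0.

```python
import pandas as pd


def is_confirmed_peak(x: pd.Series, pos: int, n: int = 2) -> bool:
    """Return True if x at position pos is strictly greater than the
    n values before it and the n values after it.

    Hypothetical helper. Returns False when fewer than n values exist
    on either side, since the peak cannot be confirmed there.
    """
    if pos - n < 0 or pos + n >= len(x):
        return False
    before = x.iloc[pos - n:pos]
    after = x.iloc[pos + 1:pos + n + 1]
    peak = x.iloc[pos]
    return bool((before < peak).all() and (after < peak).all())


# X values from rows 15 to 24 of the question's data, re-indexed from 0.
x = pd.Series([1.98, 2.08, 2.42, 2.56, 2.51, 2.57, 2.53, 2.37, 2.24, 2.11])

# Positions confirmed by two lower neighbours on each side.
confirmed = [pos for pos in range(len(x)) if is_confirmed_peak(x, pos, n=2)]
print(confirmed)
```

With `n=2` only position 5 (the value 2.57) passes. With `n=1` position 3 (the value 2.56) would pass as well, which illustrates why checking several values on each side, as the question asks, filters out small bumps.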