Replace elements of a column with the mean of the values above and below in pandas (if consecutive values differ by more than 10)

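Question

I have a DataFrame with a temperature column. In some rows of that column, the difference between consecutive values exceeds 10, and I want to clean up my dataset by replacing such a value with the mean of the value above and the value below.

I tried a conditional replacement, but it does not do what I want: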

df.loc[df['Temperature1'] > 50, 'Temperature'] = 23

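This changes every element above 50 to 23. What I want instead is to compare two consecutive rows and replace a value only when the difference between them is greater than 10.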

Tags: python, python-3.x, pandas, dataframe

Answer


EDIT: added an example using a rolling window (see also: window functions).


You can use shift() to place the values from the row above and the row below next to the value in the middle row.

import pandas as pd

df = pd.DataFrame({'Temperature': [10,30,20,40,50]})

df['upper_row'] = df['Temperature'].shift()
df['lower_row'] = df['Temperature'].shift(-1)

print(df)

Result

   Temperature  upper_row  lower_row
0           10        NaN       30.0
1           30       10.0       20.0
2           20       30.0       40.0
3           40       20.0       50.0
4           50       40.0        NaN

Now each row holds all three values, so you can subtract them, compute the mean, compare them, and so on.

df['difference'] = (df['Temperature'] - df['upper_row']).abs()
df['mean'] = (df['upper_row'] + df['lower_row'])/2

print(df)

Result

   Temperature  upper_row  lower_row  difference  mean
0           10        NaN       30.0         NaN   NaN
1           30       10.0       20.0        20.0  15.0
2           20       30.0       40.0        10.0  35.0
3           40       20.0       50.0        20.0  35.0
4           50       40.0        NaN        10.0   NaN
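
As a small aside, the difference to the previous row can also be computed without the helper column by using Series.diff(). The sketch below compares the two approaches on the same sample data; the variable names via_shift and via_diff are only illustrative and do not appear in the answer.

```python
import pandas as pd

df = pd.DataFrame({'Temperature': [10, 30, 20, 40, 50]})

# Absolute difference to the previous row, computed two ways
via_shift = (df['Temperature'] - df['Temperature'].shift()).abs()
via_diff = df['Temperature'].diff().abs()

# Both have NaN in the first row, which has no previous row to compare with
print(via_shift.equals(via_diff))
```

Either form works for the comparison step; diff() simply saves the intermediate upper_row column when it is not needed elsewhere.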

Then you can replace the values in Temperature wherever the difference is greater than 10:

# use a single .loc call; chained indexing (df['Temperature'][mask] = ...) may not update df
df.loc[df['difference'] > 10, 'Temperature'] = df['mean']

print(df)

Result

   Temperature  upper_row  lower_row  difference  mean
0           10        NaN       30.0         NaN   NaN
1           15       10.0       20.0        20.0  15.0
2           20       30.0       40.0        10.0  35.0
3           35       20.0       50.0        20.0  35.0
4           50       40.0        NaN        10.0   NaN

Full example:

import pandas as pd

df = pd.DataFrame({'Temperature': [10,30,20,40,50]})

df['upper_row'] = df['Temperature'].shift()
df['lower_row'] = df['Temperature'].shift(-1)
print(df)

df['difference'] = (df['Temperature'] - df['upper_row']).abs()
df['mean'] = (df['upper_row'] + df['lower_row'])/2
print(df)

# use a single .loc call; chained indexing (df['Temperature'][mask] = ...) may not update df
df.loc[df['difference'] > 10, 'Temperature'] = df['mean']
print(df)
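
If the helper columns are not needed afterwards, the same steps could be wrapped in one function. This is only a minimal sketch under a few assumptions: the function name smooth_jumps and its threshold parameter are hypothetical, the column is converted to float so that fractional means fit, and rows without both neighbours (the first and last row) are left unchanged rather than set to NaN.

```python
import pandas as pd


def smooth_jumps(temps: pd.Series, threshold: float = 10) -> pd.Series:
    """Hypothetical helper: replace values that differ from the previous row
    by more than `threshold` with the mean of the rows above and below."""
    temps = temps.astype(float)            # means may be fractional
    jump = temps.diff().abs() > threshold  # NaN > threshold is False, so row 0 is kept
    neighbour_mean = (temps.shift() + temps.shift(-1)) / 2
    # Replace only where a neighbour mean exists (the last row has no row below it)
    return temps.mask(jump & neighbour_mean.notna(), neighbour_mean)


df = pd.DataFrame({'Temperature': [10, 30, 20, 40, 50]})
df['Temperature'] = smooth_jumps(df['Temperature'])
print(df)
```

As in the answer, every comparison uses the original values, so a replacement in one row does not influence the check for the next row.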

EDIT: you can also use a rolling window to work with two or three consecutive rows. See the comments in the code.

import pandas as pd

df = pd.DataFrame({'Temperature': [10,30,20,40,50]})

# work on two consecutive rows; the result is assigned to the last row of the window
rw2 = df['Temperature'].rolling(2)
df['difference'] = rw2.apply(lambda rows:abs(rows[1] - rows[0]), raw=True)

# work on three consecutive rows; the result is assigned to the middle (center) row
rw3 = df['Temperature'].rolling(3, center=True)
df['mean'] = rw3.apply(lambda rows:(rows[0] + rows[2])/2, raw=True)

print(df)

# use a single .loc call; chained indexing (df['Temperature'][mask] = ...) may not update df
df.loc[df['difference'] > 10, 'Temperature'] = df['mean']
print(df)
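
To see that the rolling-window version computes the same columns as the shift() version, the two can be compared directly. The following is a rough check on the sample data only; the variable names are illustrative.

```python
import pandas as pd

temps = pd.Series([10, 30, 20, 40, 50], name='Temperature')

# Rolling-window versions, as in the example above
diff_rolling = temps.rolling(2).apply(lambda rows: abs(rows[1] - rows[0]), raw=True)
mean_rolling = temps.rolling(3, center=True).apply(lambda rows: (rows[0] + rows[2]) / 2, raw=True)

# shift()-based versions
diff_shift = (temps - temps.shift()).abs()
mean_shift = (temps.shift() + temps.shift(-1)) / 2

# Incomplete windows at the edges give NaN, matching the NaN produced by shift()
print(diff_rolling.equals(diff_shift))
print(mean_rolling.equals(mean_shift))
```

The rolling form becomes more convenient when the window is wider than three rows, while shift() is usually easier to read for the two-neighbour case.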
