Pandas fillna with ffill adds noise

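Problem description

I am trying to remove outliers from a column in a pandas DataFrame.

This is what my variable looked like originally (with obvious outliers):

[image]

Then I decided to drop anything that changes by more than +/-3, because I know a change that large is not possible.

That works, and gives me NaN in place of the spikes:

[image]

But whenever I try to replace the now-missing values with the previous observation, I somehow end up with spikes again!

[image]

Does anyone know what I am doing wrong?

Here is the whole code (inside an indefinite while loop):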

import numpy as np
import pandas as pd

df = pd.DataFrame({'soc': [38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 127.0, 127.0, 66.48, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 127.0, 55.8, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0, 38.0]})
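# Repeat until no step between consecutive samples exceeds 3
while (df['soc'].diff().abs() > 3).any():
    # Blank out every sample that jumps by more than 3 from its predecessor
    df['soc'] = np.where(df['soc'].diff().abs() > 3, np.nan, df['soc'])
    # Carry the last valid observation forward into the gaps
    df['soc'] = df['soc'].ffill()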

Tags: python, pandas, outliers, fillna

Solution


I believe you are not actually removing all values that deviate by more than 3, because in the second plot I can still see a dot that should not be there. You may also be assigning to the wrong column. Here is a generic, working example of what you intend to do:

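import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [100, 110, 105, 104, 103, 102, 101]})
# NaN wherever a value differs from the previous one by more than 3
df['A'] = np.where(df['A'].diff().abs() > 3, np.nan, df['A'])
df['A'] = df['A'].ffill()  # fill each NaN with the last valid value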

In this example, 110 and 105 are removed because each differs from the preceding value by more than 3, and both are replaced with 100. The output:

       A
0  100.0
1  100.0
2  100.0
3  104.0
4  103.0
5  102.0
6  101.0
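
One case the example does not cover is a spike that lasts more than one sample. A minimal sketch, assuming a shortened, made-up series shaped like the first spike in the question (38, 127, 127, 66.48, 38): because `diff()` compares each sample only with its immediate predecessor, the second 127 has a difference of 0 and is kept, and the forward fill then copies it into the gaps that follow.

```python
import numpy as np
import pandas as pd

# Hypothetical short series shaped like the first spike in the question
s = pd.Series([38.0, 38.0, 127.0, 127.0, 66.48, 38.0, 38.0])

# One pass of the diff-based mask: True where a sample jumps by more than 3
jump = s.diff().abs() > 3
masked = s.mask(jump)      # NaN at positions 2, 4 and 5; position 3 survives
one_pass = masked.ffill()  # the surviving 127 is carried into positions 4 and 5

print(one_pass.tolist())
# [38.0, 38.0, 38.0, 127.0, 127.0, 127.0, 38.0]
```

After one pass the spike has not been removed, only moved one sample to the right, which may be why peaks reappear after the fill in the question.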

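If the spikes can span several samples, one alternative (not part of the answer above) is to compare each sample with a local baseline rather than with its predecessor. The sketch below assumes a centered rolling median is an acceptable baseline and that a window of 7 samples is more than twice the width of the widest spike; the series, the window size, and the names `baseline` and `cleaned` are illustrative.

```python
import pandas as pd

# Hypothetical series: flat at 38 with a three-sample spike, as in the question
s = pd.Series([38.0] * 10 + [127.0, 127.0, 66.48] + [38.0] * 10)

# Local baseline: centered rolling median (the window size is an assumption)
baseline = s.rolling(window=7, center=True, min_periods=1).median()

# Blank out samples more than 3 away from the baseline, then forward-fill
cleaned = s.mask((s - baseline).abs() > 3).ffill()

print(cleaned.eq(38.0).all())
# True
```

The median only stays on the baseline while outliers are a minority of each window, so the window would need to grow if longer spikes are expected.
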