首页 > 解决方案 > 计算 Pandas Dataframe 中排除异常值的行均值

问题描述

我有一个包含多行和多列的熊猫数据框。我计算了每一行的平均值并将其添加为一列。我想为每行的平均值创建另一列,不包括异常值(与该行平均值相差 1 个标准差的值)。

这就是我的数据框的a样子:

1 2 3 4 平均值
0 1034.3704 1068.1960 1243.7016 1057.1152 1026.4972
1 1042.5352 1071.6952 1052.1580 1020.6652 1534.9536
2 1013.0836 1819.7904 1022.1232 1030.5796 1036.4116
3 1043.4100 1069.9456 1015.1248 1026.2056 1034.9536

我尝试了以下方法:

def reject_outliers(data):
    m = 2 
    u = np.mean(data.T[0:5]) 
    s = np.std(data.T[0:5]) 
    filtered = [e for e in data.T[0:5] if (u - 2 * s < e < u + 2 * s)] 
    return filtered 

然而,这给我带来了错误:

ValueError: the truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

标签: pythonpandasdataframe

解决方案


我将通过计算逐行平均值和标准差来接近它,然后将值设置为从平均值到 NaN 的 1 个标准差之外,并计算非 NaN 值的平均值:

import numpy as np
import pandas as pd

# Create an example dataframe
df = pd.DataFrame(np.random.randint(0, 100, size=(10, 5)), columns=[0,1,2,3,4])

# Calculate mean and std columns
df['mean'] = df.mean(axis=1)
df['std'] = df.loc[:,[0,1,2,3,4]].std(axis=1)

# Create boolean array of outliers
outliers = (df.loc[:,[0,1,2,3,4]] > np.tile(df['mean']+df['std'], (5,1)).T) | (df.loc[:,[0,1,2,3,4]] < np.tile(df['mean']-df['std'], (5,1)).T)
# Set outliers to NaN
df[outliers] = pd.NA
# Calculate new mean, ignoring NaNs
df['mean-no-outliers'] = df.loc[:,[0,1,2,3,4]].mean(axis=1, skipna=True)

这给了我一个数据框,df像这样(但你随机启动的数组将包含不同的值):

      0     1     2     3     4  mean        std  mean-no-outliers
0  47.0  38.0   NaN   6.0   NaN  36.2  36.182869         30.333333
1   NaN  77.0  87.0   NaN  97.0  53.8  46.056487         87.000000
2  72.0   NaN  59.0   NaN  60.0  63.8   8.729261         63.666667
3  27.0   NaN  34.0  13.0  22.0  32.4  20.280533         24.000000
4  43.0  78.0  97.0  89.0   NaN  66.6  30.664312         76.750000
5   6.0   NaN   5.0  26.0   6.0  15.2  13.292855         10.750000
6  36.0   NaN   NaN  33.0   NaN  47.2  25.302174         34.500000
7  11.0   NaN   8.0   8.0   8.0  16.2  16.709279          8.750000
8   NaN  28.0  41.0   NaN  40.0  37.2  16.754104         36.333333
9  55.0  67.0  64.0   NaN  34.0  48.6  19.269146         55.000000

推荐阅读