Remove outliers based on column variables or a MultiIndex in a DataFrame

Problem description

This is another IQR outlier question. I have a DataFrame that looks like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
df

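I want to find and remove the outliers for each condition (i.e. Spring Placebo, Spring Drug, etc.). Not the whole row, just the cell. And I would like to do this for each of the 'red', 'yellow', and 'green' columns.

Is there a way to do this without breaking the DataFrame into a bunch of sub-DataFrames, with every condition split out separately? I'm not sure whether this would be easier with 'Season' and 'Treatment' handled as columns or as indices. I'm fine with either way.

I've tried a few things with .iloc and .loc, but I can't seem to get it to work.

Tags: python-3.x, pandas, dataframe, multi-index, outliers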

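Solution

If you need to replace the outliers with missing values, use GroupBy.transform with DataFrame.quantile to get the per-group lower and upper quantiles, compare against them with DataFrame.lt and DataFrame.gt, chain the two masks with | (bitwise OR), and set the missing values with DataFrame.mask. Its default replacement is NaN, so no replacement value is specified: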

np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]

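# Group by condition; Season and Treatment become the grouping keys
g = df.groupby(['Season','Treatment'])
# Per-group 5th and 95th percentiles, broadcast back to the shape of the value columns
df1 = g.transform('quantile', 0.05)
df2 = g.transform('quantile', 0.95)

# Value columns only (everything except the grouping keys)
c = df.columns.difference(['Season','Treatment'])
# True where a cell lies below the lower or above the upper bound of its group
mask = df[c].lt(df1) | df[c].gt(df2)
# Replace the flagged cells with NaN, leaving the rest of each row intact
df[c] = df[c].mask(mask)

print(df)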
    Season Treatment   red  yellow  green
0   Spring   Placebo   NaN     NaN   67.0
1   Spring   Placebo  67.0    91.0    3.0
2   Spring   Placebo  71.0    56.0   29.0
3   Spring   Placebo  48.0    32.0   24.0
4   Spring   Placebo  74.0     9.0   51.0
..     ...       ...   ...     ...    ...
95    Fall      Drug  90.0    35.0   55.0
96    Fall      Drug  40.0    55.0   90.0
97    Fall      Drug   NaN    54.0    NaN
98    Fall      Drug  28.0    50.0   74.0
99    Fall      Drug   NaN    73.0   11.0

[100 rows x 5 columns]
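
The answer above trims at the 5th and 95th percentiles, while the question asks about the IQR. The same groupby/transform pattern could be adapted to the usual IQR fences (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR), roughly as in the sketch below. The helper name mask_iqr_outliers, its parameters, the multiplier k=1.5, and the small demo frame are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd


def mask_iqr_outliers(df, group_cols, value_cols, k=1.5):
    """Return a copy of df where per-group IQR outliers in value_cols are NaN.

    Hypothetical helper: the name, signature, and default k are illustrative.
    """
    g = df.groupby(group_cols)[value_cols]
    # Per-group quartiles, broadcast back to one row per original row
    q1 = g.transform("quantile", 0.25)
    q3 = g.transform("quantile", 0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr

    out = df.copy()
    vals = out[value_cols]
    # Flag cells outside their own group's fences and blank only those cells
    out[value_cols] = vals.mask(vals.lt(lower) | vals.gt(upper))
    return out


# Made-up demo data: 100 is an outlier in group A but an ordinary value in group B
demo = pd.DataFrame({
    "grp": ["A"] * 5 + ["B"] * 5,
    "x": [10, 11, 12, 13, 100, 100, 101, 102, 103, 0],
})
cleaned = mask_iqr_outliers(demo, ["grp"], ["x"])
print(cleaned)
```

For the DataFrame in the question, the call would be mask_iqr_outliers(df, ['Season', 'Treatment'], ['red', 'yellow', 'green']). If Season and Treatment were index levels instead of columns, passing the level names to groupby should behave the same way, since groupby accepts index level names as keys.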
