Can't find the outliers of my dataset (more specifically, the IQR)

Problem description

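I'm trying to find the outliers of an Excel sheet using pandas in Python. I can find the first and third quartiles, but I can't subtract one from the other without getting NaN back.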

Here is the basic code:

absent = pd.read_excel('Absenteeism_at_work.xls')

print("\nOUTLIERS:")
# q1 = (absent.loc[:741, ['Distance from Residence to Work']].quantile([0.25]))
# q3 = (absent.loc[:741, ['Distance from Residence to Work']].quantile([0.75]))

#print(absent.loc[:741, 'Distance from Residence to Work'].quantile([0.25])) #quartile

#print(q1)
# q1, q3 = absent.loc[:741, ['Distance from Residence to Work', 'Transportation expense', 'Month of absence',
  #                       'Social smoker', 'Social drinker', 'Education']].quantile([0.25 - 0.75])

print(absent.loc[:741, ['Distance from Residence to Work', 'Transportation expense', 'Month of absence',
                      'Social smoker', 'Social drinker', 'Education']].quantile([0.75])
   - absent.loc[:741, ['Distance from Residence to Work', 'Transportation expense', 'Month of absence',
                    'Social smoker', 'Social drinker', 'Education']].quantile([0.25]))

Output:

OUTLIERS:
      Distance from Residence to Work  Transportation expense  \
0.25                              NaN                     NaN   
0.75                              NaN                     NaN   

      Month of absence  Social smoker  Social drinker  Education  
0.25               NaN            NaN             NaN        NaN  
0.75               NaN            NaN             NaN        NaN  

Tags: python, pandas, statistics

Solution


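  1. Your code is just a simple interquartile range calculation. If that works for you, fine. If you need real outlier detection, which is more involved than a quartile-based rule, especially for multivariate data, you can turn to Python packages such as sklearn or pyod.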

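  2. To use the quantile function, you need to clean the raw data so that it contains only numbers. This matters in particular because you are importing an Excel file as the data source.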

  3. Check this by inspecting the data:

    tmp_df = absent.iloc[:741]

    cols = ['Distance from Residence to Work', 'Transportation expense', 'Month of absence', 'Social smoker', 'Social drinker', 'Education']

    print(tmp_df[cols].quantile([0.25, 0.75]))

    print(tmp_df[cols].describe(include='all'))
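The output in the question, which has both a 0.25 row and a 0.75 row filled with NaN, is consistent with pandas aligning the two operands on their index labels: `quantile([0.75])` yields a frame indexed by 0.75 and `quantile([0.25])` one indexed by 0.25, so no labels overlap when they are subtracted. The following is a minimal sketch of one way around that, combined with the numeric cleanup from point 2 and the usual 1.5 * IQR rule. The small `absent` frame and its values are hypothetical stand-ins for the Excel file, and the 1.5 multiplier is a common convention rather than something stated in the question.

```python
import pandas as pd

# Hypothetical stand-in for the Excel data; only the column names come from the question.
absent = pd.DataFrame({
    "Distance from Residence to Work": [5, 10, 12, 14, 16, 52],
    "Transportation expense": [118, 179, 225, 235, 260, "n/a"],
})
cols = list(absent.columns)

# Coerce every column to numeric; text that cannot be parsed becomes NaN.
numeric = absent[cols].apply(pd.to_numeric, errors="coerce")

# List-valued q keeps the quantile as the row label, so the labels 0.75 and 0.25
# do not match during subtraction and every cell comes out as NaN.
misaligned = numeric.quantile([0.75]) - numeric.quantile([0.25])

# A scalar q returns a Series indexed by column name, so the operands align.
q1 = numeric.quantile(0.25)
q3 = numeric.quantile(0.75)
iqr = q3 - q1

# Alternative that keeps the list form: select the single row before subtracting.
iqr_alt = numeric.quantile([0.75]).iloc[0] - numeric.quantile([0.25]).iloc[0]

# Flag values lying more than 1.5 * IQR outside the quartiles, column by column.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = numeric.lt(lower) | numeric.gt(upper)

print(iqr)
print(numeric[outliers.any(axis=1)])
```

With the real data, `absent` would come from `pd.read_excel` as in the question and `cols` would be the six columns listed there; the rest should carry over unchanged, provided the columns can be coerced to numbers.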

Good luck.

怀俄明

