Flagging outliers in a Pandas df using IQR

Problem description

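For each store_id, I want to identify outliers in daily_visitors, meaning values below the 25th percentile or above the 75th percentile of that store's values, and flag them in a new column where 1 == outlier and 0 == not an outlier.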

Main DF

date         store_id     store_category   daily_visitors
2020-01-01   1            small            190
2020-01-02   1            small            187
2020-01-03   1            small            145
2020-01-04   1            small            156
2020-01-05   1            small            134343
2020-01-01   2            large            4635
2020-01-02   2            large            4321
2020-01-03   2            large            4534
2020-01-04   2            large            4242
2020-01-05   2            large            21 

Output DF

date         store_id     store_category   daily_visitors  outlier 
2020-01-01   1            small            190             0
2020-01-02   1            small            187             0
2020-01-03   1            small            145             0
2020-01-04   1            small            156             0
2020-01-05   1            small            134343          1
2020-01-01   2            large            4635            0
2020-01-02   2            large            4321            0
2020-01-03   2            large            4534            0
2020-01-04   2            large            4242            0
2020-01-05   2            large            21              1

Tags: python, pandas, dataframe

Solution


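You can use np.select (with NumPy imported as np). The two conditions compare each row against the 75th and 25th percentiles of its own store_id group, and any row matching either condition is set to 1; all other rows get the default value 0: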

In [2272]: conditions = [df.daily_visitors > df.groupby('store_id')['daily_visitors'].transform('quantile', 0.75), df.daily_visitors < df.groupby('store_id')['daily_visitors'].transform('quantile', 0.25)]

In [2273]: choices = [1,1]

In [2276]: df['outlier'] = np.select(conditions, choices)

In [2277]: df
Out[2277]: 
         date  store_id store_category  daily_visitors  outlier
0  2020-01-01         1          small             190        0
1  2020-01-02         1          small             187        0
2  2020-01-03         1          small             145        1
3  2020-01-04         1          small             156        0
4  2020-01-05         1          small          134343        1
5  2020-01-01         2          large            4635        1
6  2020-01-02         2          large            4321        0
7  2020-01-03         2          large            4534        0
8  2020-01-04         2          large            4242        0
9  2020-01-05         2          large              21        1
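
Note that the result above flags every value outside the 25th to 75th percentile range, so 145 (store 1) and 4635 (store 2) are marked as well, which differs from the output requested in the question. If the intent is the conventional IQR rule, one possible sketch is shown below. It assumes the common 1.5 × IQR fences (Tukey's rule); the multiplier 1.5 and the helper names q1, q3, iqr, lower and upper are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Rebuild the sample data from the question.
df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05"] * 2,
    "store_id": [1] * 5 + [2] * 5,
    "store_category": ["small"] * 5 + ["large"] * 5,
    "daily_visitors": [190, 187, 145, 156, 134343, 4635, 4321, 4534, 4242, 21],
})

# Per-store quartiles, broadcast back to the original row order.
grouped = df.groupby("store_id")["daily_visitors"]
q1 = grouped.transform("quantile", 0.25)
q3 = grouped.transform("quantile", 0.75)

# Assumption: the usual 1.5 * IQR fences; adjust the multiplier as needed.
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# 1 == outlier, 0 == not an outlier.
is_outlier = (df["daily_visitors"] < lower) | (df["daily_visitors"] > upper)
df["outlier"] = is_outlier.astype(int)

print(df)
```

On the sample data this should flag only 134343 for store 1 and 21 for store 2. A plain boolean mask with astype(int) is used here instead of np.select because both conditions map to the same value.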
