How to identify and highlight outliers in each row of a pandas DataFrame

Problem description

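I want to do the following with my DataFrame:

  1. For each row, identify the outliers/anomalies
  2. Highlight/color the cells of the identified outliers (preferably 'red')
  3. Count the number of identified outliers in each row (stored in an 'Anomaly_Count' column)
  4. Export the output as an xlsx file

See the sample data below: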

import numpy as np
import pandas as pd

np.random.seed([5, 1591])
df = pd.DataFrame(
    np.random.normal(size=(16,5)),
    columns=list('ABCDE')
)

df

        A            B          C         D            E
0   -1.685112   -0.432143   0.876200    1.626578    1.512677
1   0.401134    0.439393    1.027222    0.036267    -0.655949
2   -0.074890   0.312793    -0.236165   0.660909    0.074468
3   0.842169    2.759467    0.223652    0.432631    -0.484871
4   -0.619873   -1.738938   -0.054074   0.337663    0.358380
5   0.083653    0.792835    -0.643204   1.182606    -1.207692
6   -1.168773   -1.456870   -0.707450   -0.439400   0.319728
7   2.316974    -0.177750   1.289067    -2.472729   -1.310188
8   2.354769    1.099483    -0.653342   -0.532208   0.269307
9   0.431649    0.666982    0.361765    0.419482    0.531072
10  -0.124268   -0.170720   -0.979012   -0.410861   1.000371
11  -0.392863   0.933516    -0.502608   -0.759474   -1.364289
12  1.405442    -0.297977   0.477609    -0.046791   -0.126504
13  -0.711799   -1.042558   -0.970183   -1.672715   -0.524283
14  0.029966    -0.579152   0.648176    0.833141    -0.942752
15  0.824767    0.974580    0.363170    0.428062    -0.232174

The expected result should look like this:


## I want to ONLY identify the outliers NOT remove or substitute them. I only used NaN to depict the outlier value. Ideally, the outlier values cell should be colored/highlighted 'red'.

## Please note:  the outliers NaN in the sample are randomly assigned.

 
        A            B          C         D            E      Anomaly_Count
0     NaN     -0.432143     0.876200         NaN  1.512677        2
1   0.401134    0.439393    1.027222    0.036267    -0.655949     0
2   -0.074890   0.312793    -0.236165   0.660909    0.074468      0
3   0.842169    NaN         0.223652    0.432631    -0.484871     1
4   -0.619873   -1.738938   -0.054074   0.337663    0.358380      0
5   0.083653    0.792835    -0.643204    NaN           NaN        2
6   -1.168773   -1.456870   -0.707450   -0.439400   0.319728      0
7   2.316974    -0.177750   1.289067    -2.472729   -1.310188     0
8   2.354769    1.099483    -0.653342   -0.532208   0.269307      0
9   0.431649    0.666982    0.361765    0.419482    0.531072      0
10  -0.124268   -0.170720   -0.979012   -0.410861   NaN           1
11  -0.392863   0.933516    -0.502608   -0.759474   -1.364289     0
12  1.405442    -0.297977   0.477609    -0.046791   -0.126504     0
13  -0.711799   -1.042558   -0.970183   -1.672715   -0.524283     0
14  0.029966    -0.579152   0.648176    0.833141    -0.942752     0
15  0.824767    NaN        0.363170     0.428062    -0.232174     1


See my attempt below. I am open to other approaches.

import numpy as np
from scipy import stats

def outlier_detection (data):
      # step I: identify the outliers in each row
      df[(np.abs(stats.zscore(df)) < 3).all(axis = 0)]  # unfortunately this removes the outliers which I dont want
      # step II: color/highlight the outlier cell 
      df = df.style.highlight_null('red')
      # Step III: count the number of outliers in each row 
      df['Anomaly_count'] = df.isnull().sum(axis=1)
      # step IV: export as xlsx file
      df.to_excel(r'Path to store the exported excel file\File Name.xlsx', sheet_name='Your sheet name', index = False)
      


outlier_detection(df)


Thank you for your time.

Tags: python, pandas, dataframe, export-to-excel

Solution


This works for me:

import numpy as np
import pandas as pd
from scipy import stats

np.random.seed([5, 1591])
df = pd.DataFrame(
    np.random.normal(size=(16, 5)),
    columns=list('ABCDE')
)


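# Boolean mask: True where the absolute z-score exceeds the threshold (1 here).
# Note: stats.zscore defaults to axis=0, so the scores are computed per column;
# pass axis=1 to score each value against its own row instead.
mask = pd.DataFrame(abs(stats.zscore(df)) > 1, columns=df.columns)
# Count the flagged cells in each row.
df["Count"] = mask.sum(axis=1)
# The "Count" column itself is never highlighted.
mask["Count"] = False
# Translate the mask into one CSS string per cell
# (pandas 2.1+ prefers DataFrame.map over the deprecated applymap).
style_df = mask.applymap(lambda x: "background-color: red" if x else "")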

sheet_name = "Values"
with pd.ExcelWriter("score_test.xlsx", engine="openpyxl") as writer:
    df.style.apply(lambda x: style_df, axis=None).to_excel(writer, 
                                                           sheet_name=sheet_name,
                                                           index=False)

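Here, mask is a boolean DataFrame that is True wherever the absolute z-score exceeds the threshold. From this mask I build a DataFrame of strings, style_df, which holds "background-color: red" in the deviating cells and an empty string everywhere else. The last statement applies the values of style_df as the style of df, and the styled frame is written to the Excel file.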
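Since the question asks for outliers within each row, here is a minimal sketch of a row-wise variant that uses NumPy only. The helper name rowwise_outlier_mask, the demo frame, and the threshold of 1.5 are assumptions made for illustration and are not part of the answer above. The standard deviation uses ddof=0, which matches the default of scipy.stats.zscore.

```python
import numpy as np
import pandas as pd


def rowwise_outlier_mask(frame: pd.DataFrame, threshold: float = 1.0) -> pd.DataFrame:
    """True where a cell's absolute z-score, computed within its own row, exceeds threshold.

    Hypothetical helper for illustration; not part of the original answer.
    """
    values = frame.to_numpy(dtype=float)
    row_mean = values.mean(axis=1, keepdims=True)
    row_std = values.std(axis=1, keepdims=True)  # population std (ddof=0)
    # A constant row has std 0; divide by 1 instead so its z-scores are all 0.
    safe_std = np.where(row_std == 0, 1.0, row_std)
    z = (values - row_mean) / safe_std
    return pd.DataFrame(np.abs(z) > threshold, index=frame.index, columns=frame.columns)


# Small made-up frame: only the 10.0 in row 0 stands out from its row.
demo = pd.DataFrame({
    "A": [1.0, 2.0],
    "B": [1.0, 2.0],
    "C": [1.0, 2.0],
    "D": [10.0, 2.0],
})

demo_mask = rowwise_outlier_mask(demo, threshold=1.5)
demo_counts = demo_mask.sum(axis=1)  # number of flagged cells per row

# Same idea as style_df above: one CSS string per cell, built without applymap.
demo_styles = pd.DataFrame(
    np.where(demo_mask.to_numpy(), "background-color: red", ""),
    index=demo.index,
    columns=demo.columns,
)
```

The resulting style frame can be passed to the Styler in the same way as style_df, that is, through a function that returns it with axis=None. One thing to keep in mind when choosing the threshold: with ddof=0 and n values in a row, an absolute z-score can never exceed sqrt(n - 1), so for five columns the maximum is 2 and a cutoff of 3 would never flag any cell row-wise.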

The resulting Excel file now looks like this: [image: contents of the Excel file]

