Outlier removal on a variable where several rows contain NaN (I need to keep the NaNs, and their positions also matter)

Problem description

I need to remove outliers from a variable that contains several NaNs. It looks like this:

    X-velocity

1   0.0345
2   0.0222
3   0.0034
4   0.5604
5   0.4326
6   NaN
7   0.0333
8   0.3635
9   0.3345
10  0.3468
11  0.4573
12  0.7985
13  0.9359
14  NaN
15  0.4635
16  0.6857
17  0.4239
18  NaN
19  0.3849
20  0.3726
21  0.4637
22  0.3647
23  NaN
24  0.2938
25  0.5227

I need to remove the outliers from the variable without deleting the NaNs or changing their values or positions. I don't mean that the NaNs are the outliers; I mean outliers among the numeric values. For example, I want to remove all the numbers that fall outside the range of mean +/- 3 * standard deviation. While detecting and removing the outliers, I don't want to affect the NaNs; I want them to stay where they are (since I need to perform other operations based on the NaNs later).

Is there any way to do this? I appreciate any help.

Tags: python, outliers

Solution


If you have a method to determine whether something is an outlier or not (I imagine you have some threshold) you can create a new column that stores this flag.

For example:

# [True or False] is this more than 3 standard deviations away from the mean?
# mean() and std() skip NaN by default, and NaN rows compare as False here.
df['is_outlier'] = (df['X-velocity'] - df['X-velocity'].mean()).abs() / df['X-velocity'].std() > 3

You can then select rows with this outlier flag, keeping every row that is either not an outlier or is null:

# Select rows that contain non-outliers or null values 
filtered = df[(~df.is_outlier) | df['X-velocity'].isnull()]
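For anyone who wants to try the two steps above end to end, here is a minimal, self-contained sketch. The helper name drop_outliers_keep_nan and the sample data (including the injected outlier 25.0) are made up for illustration and are not from the question. It assumes pandas' default behaviour, where mean() and std() skip NaN and std() is the sample standard deviation.

```python
import numpy as np
import pandas as pd


def drop_outliers_keep_nan(df, column, n_std=3.0):
    """Hypothetical helper: drop rows whose value lies more than n_std
    standard deviations from the mean, but keep every NaN row."""
    values = df[column]
    # mean() and std() skip NaN by default, so the NaNs do not distort them
    z_score = (values - values.mean()).abs() / values.std()
    # Comparisons against NaN evaluate to False, so NaN rows are never flagged
    is_outlier = z_score > n_std
    # Boolean indexing keeps the original index labels of the surviving rows
    return df[~is_outlier | values.isna()]


# Made-up data: 20 ordinary readings, two NaNs, and one injected outlier (25.0)
data = [0.35, 0.42, np.nan, 0.38, 0.41, 0.36, 0.44, 0.39, 0.40, 0.37,
        0.43, 25.0, 0.38, 0.41, np.nan, 0.36, 0.42, 0.39, 0.40, 0.37,
        0.43, 0.38, 0.41]
df = pd.DataFrame({"X-velocity": data})

filtered = drop_outliers_keep_nan(df, "X-velocity")
print(filtered)
```

The outlier row is dropped, while the NaN rows stay under their original index labels, so later operations can still locate them by index. Note that with very few data points a 3-standard-deviation threshold may never trigger, because a single extreme value also inflates the standard deviation.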
