Removing outliers from a DataFrame in Python

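Question

For an assignment, I have to remove the outliers from a CSV file using several different methods.

After loading the CSV into a pandas DataFrame, I tried to work with its 'height' variable, but my code either raises errors or leaves the outliers untouched. All of this is an attempt to use the KNN method in Python.

The code I wrote is as follows: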

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_blobs


df = pd.read_csv("data.csv")

print(df.describe())
print(df.columns)

df['height'].plot(kind='hist')
print(df['height'].value_counts())

data= pd.DataFrame(df['height'],df['active'])

k=1
knn = NearestNeighbors(n_neighbors=k)
knn.fit([df['height']])
neighbors_and_distances = knn.kneighbors([df['height']])
knn_distances = neighbors_and_distances[0]
tnn_distance = np.mean(knn_distances, axis=1)
print(knn_distances)
PCM = df.plot(kind='scatter', x='x', y='y', c=tnn_distance, colormap='viridis')
plt.show()

The data looks like this:

id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,18857,1,50,64.0,130,70,3,1,0,0,0,1
3,17623,2,250,82.0,150,100,1,1,0,0,1,1

I don't know what I'm missing or doing wrong.

Tags: python, pandas, dataframe

Solution


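import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

df = pd.read_csv("data.csv")
# Copy the two columns so a new column can be added later without touching df
X = df[['height', 'weight']].copy()
X.plot(kind='scatter', x='weight', y='height', colormap='viridis')
plt.show()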

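# Ask for 2 neighbours: the first is the point itself, the second is the nearest other point
knn = NearestNeighbors(n_neighbors=2).fit(X)
distances, indices = knn.kneighbors(X)
# Keep the distance to the nearest other point as the outlier score
X['distances'] = distances[:,1]
X.distances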
0       1.000000
1       1.000000
2       1.000000
3       3.000000
4       1.000000
5       1.000000
6     133.958949
7     100.344407
       ...
X.plot(kind='scatter', x='weight', y='height', c='distances', colormap='viridis')
plt.show()
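
The reason for taking column 1 of `distances` rather than column 0 can be seen on a tiny example. This is only a sketch on hypothetical toy points (the `points` array is made up, not taken from data.csv): when the same data is used for fitting and querying, each point's nearest neighbour is itself at distance 0, so the second column holds the distance to the closest other point.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data: three points close together and one far away
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]])

knn = NearestNeighbors(n_neighbors=2).fit(points)
distances, indices = knn.kneighbors(points)

self_distances = distances[:, 0]   # each point matched with itself
nearest_other = distances[:, 1]    # distance to the closest other point

print(self_distances)
print(nearest_other)
```

Note that exact duplicate rows are at distance 0 from each other, so with this score they are never flagged as outliers.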

MAX_DIST = 10
X.loc[X.distances < MAX_DIST, ['height', 'weight']]
    height  weight
0   162 78.0
1   162 78.0
2   151 76.0
3   151 76.0
4   171 84.0
...

Finally, filter out all the outliers:

MAX_DIST = 10
X = X[X.distances < MAX_DIST]
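
The same steps can be wrapped in a small helper, which also covers the single 'height' column from the question. This is a minimal sketch, not a definitive implementation: the function name `drop_knn_outliers`, its parameters, and the sample values are all hypothetical. The main difference from `knn.fit([df['height']])` in the question is that the features are passed as a 2-D selection with one row per sample; wrapping the Series in a list makes scikit-learn see a single sample with many features.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors


def drop_knn_outliers(df, columns, k=1, max_dist=10.0):
    """Keep the rows of df whose mean distance to their k nearest
    other rows, measured on `columns`, is below max_dist."""
    # `columns` must be a list, so the selection is 2-D: (n_rows, n_columns)
    features = df[columns].to_numpy(dtype=float)
    # Ask for k + 1 neighbours because each row is its own nearest neighbour
    knn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    distances, _ = knn.kneighbors(features)
    # Skip column 0 (the row itself) and average the remaining k distances
    score = distances[:, 1:].mean(axis=1)
    return df[score < max_dist]


# Hypothetical sample values, not the real data.csv
sample = pd.DataFrame({
    "height": [168, 156, 165, 170, 160, 50, 250],
    "weight": [62.0, 85.0, 64.0, 82.0, 70.0, 64.0, 82.0],
})

cleaned = drop_knn_outliers(sample, ["height"], k=1, max_dist=10)
print(cleaned)
```

The threshold is in the units of the selected columns (centimetres for 'height' here), so a suitable `max_dist` depends on the data and is best chosen after looking at the distance plot.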
