Local Outlier Factor is only calculated for some points (scikit-learn)

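Problem description

I have a large CSV file with 2 columns that represent the result of a k-means clustering. I computed 11 centroids, and the CSV file contains, for each point, the closest centroid and the distance from the point to that centroid.

The entries look like this: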

K11-closest,K11-distance
0,31544.821603570384
0,31494.23348984612
0,31766.471900874752
0,31710.896696452823

I then want to compute and plot the LOF using a script I found on scikit-learn.org:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
dataset = pd.read_csv('0.csv')

clf = LocalOutlierFactor(n_neighbors=20)
# use fit_predict to compute the predicted labels of the training samples
# (when LOF is used for outlier detection, the estimator has no predict,
# decision_function and score_samples methods).
y_pred = clf.fit_predict(dataset)

X_scores = clf.negative_outlier_factor_

plt.title("Local Outlier Factor (LOF)")
plt.scatter(dataset.iloc[:, 0], dataset.iloc[:, 1], color='k', s=3., label='Data points')
# plot circles with radius proportional to the outlier scores
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())
plt.scatter(dataset.iloc[:, 0].values, dataset.iloc[:, 1].values, s=50 * radius, edgecolors='r',
            facecolors='none', label='Outlier scores')
plt.show()

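However, in the resulting plot only some of the points have a visible red circle. The black dots are the data points, and the red circles show how strongly each point is an outlier.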

So I assume the LOF is not computed for every point. But why? And how can I compute it for every point and make it visible in the plot?

Tags: python-3.x, machine-learning, scikit-learn, data-science

Solution


Normalizing the data will help you produce a more readable plot. Also, your code uses a multiplier of 50 for the radius, whereas I have used 1000.
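As a small illustration of what the normalization step does, the sketch below applies StandardScaler to the four sample rows from the question. It only assumes that the two columns are treated as plain numeric features, which is what the code in this answer does.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The four sample rows from the question: (closest centroid, distance).
X = np.array([
    [0, 31544.821603570384],
    [0, 31494.23348984612],
    [0, 31766.471900874752],
    [0, 31710.896696452823],
])

X_scaled = StandardScaler().fit_transform(X)

# Each column is centred on its mean and divided by its standard deviation.
# The first column is constant in this sample, so it is mapped to zeros.
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(X_scaled)
print(col_means, col_stds)
```

After scaling, the distance column no longer dominates the neighbor search simply because its raw values are in the tens of thousands.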

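As the plot shows, not every data point gets a visible red circle. LOF itself assigns a score to every sample (negative_outlier_factor_ has one entry per row), but the radius is min-max normalized, so the least outlying point gets a radius of 0 and points with similar scores get very small circles. The scores also depend on the number of nearest neighbors (n_neighbors) that the algorithm considers.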
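To check this on your own data, a quick sketch like the following may help. The data here is made up (100 random points plus one distant point), so the exact numbers are only illustrative; the point is that there is one score per sample and that the min-max normalization pushes most radii close to 0.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
# Hypothetical data: 100 inliers around the origin plus one distant point.
X = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0]]])

clf = LocalOutlierFactor(n_neighbors=20)
clf.fit_predict(X)
scores = clf.negative_outlier_factor_

# Same normalization as in the plotting code.
radius = (scores.max() - scores) / (scores.max() - scores.min())

print(scores.shape)                  # one score per sample
print(radius.min(), radius.max())    # smallest and largest normalized radius
print((radius < 0.1).sum(), "points get a circle under 10% of the maximum size")
```

If a few samples have extreme scores, almost all other circles shrink towards zero, which can look as if LOF had skipped those points.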

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

dataset = pd.DataFrame(
    data=[[0, 31544.821603570384], [0, 31494.23348984612],
          [0, 31766.471900874752], [0, 31710.896696452823]],
    columns=["K11-closest", "K11-distance"])

dataset = scaler.fit_transform(dataset)

clf = LocalOutlierFactor(n_neighbors=3)

y_pred = clf.fit_predict(dataset)

X_scores = clf.negative_outlier_factor_

plt.title("Local Outlier Factor (LOF)")
plt.scatter(dataset[:, 0], dataset[:, 1], color='k', s=3., label='Data points')
# plot circles with radius proportional to the outlier scores
radius = (X_scores.max() - X_scores) / (X_scores.max() - X_scores.min())
plt.scatter(dataset[:, 0], dataset[:, 1], s=1000 * radius, edgecolors='r',
            facecolors='none', label='Outlier scores')


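legend = plt.legend(loc='upper left')
# Use fixed marker sizes in the legend. Newer matplotlib versions expose the
# handles as `legend_handles`; older ones only have `legendHandles`.
handles = getattr(legend, 'legend_handles', None) or legend.legendHandles
handles[0].set_sizes([10])
handles[1].set_sizes([20])
plt.show()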

Result of the code
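If you want every point to show a circle, one possible variation (not part of the scikit-learn example) is to add a minimum marker size to the normalized radius. The helper name circle_sizes and its min_size and max_size parameters below are hypothetical, chosen only for this sketch.

```python
import numpy as np

def circle_sizes(scores, min_size=20.0, max_size=1000.0):
    """Map LOF scores to marker areas, keeping every circle at least min_size."""
    scores = np.asarray(scores, dtype=float)
    spread = scores.max() - scores.min()
    if spread == 0:
        # All points are equally outlying: draw them all at the minimum size.
        return np.full(scores.shape, min_size)
    # 0 for the least outlying point, 1 for the strongest outlier.
    radius = (scores.max() - scores) / spread
    return min_size + (max_size - min_size) * radius

# Example with made-up negative_outlier_factor_ values.
example_scores = np.array([-1.0, -1.1, -1.05, -4.0])
sizes = circle_sizes(example_scores)
print(sizes)
```

The returned array can be passed as the s= argument of the second plt.scatter call in place of 1000 * radius, with clf.negative_outlier_factor_ as the input.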

