Data imputation with KNN and SoftImpute

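Question

I want to compare the values imputed by MICE, KNN and SoftImpute from the fancyimpute package. However, when I run my code, KNN and SoftImpute only impute 0 for my missing values, whereas MICE imputes values that make more sense.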

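from fancyimpute import MICE, KNN, SoftImpute

# Keep only the numeric 'Age' column and convert it to a NumPy matrix
imputed_numerical = train[['Age']].select_dtypes(include='number').as_matrix()

# Impute the missing ages with each method
Age_MICE = MICE().complete(imputed_numerical)
Age_KNN = KNN(k=3).complete(imputed_numerical)
Age_SoftImpute = SoftImpute().complete(imputed_numerical)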

I put the results in a data frame that looks like this:

Not_Imputed  MICE    KNN     SoftImpute
22.0         [22.0]  [22.0]  [22.0]
38.0         [38.0]  [38.0]  [38.0]
26.0         [26.0]  [26.0]  [26.0]
35.0         [35.0]  [35.0]  [35.0]
35.0         [35.0]  [35.0]  [35.0]
NaN          [29]    [0.0]   [0.0]
54.0         [54.0]  [54.0]  [54.0]
2.0          [2.0]   [2.0]   [2.0]
27.0         [27.0]  [27.0]  [27.0]
14.0         [14.0]  [14.0]  [14.0]
4.0          [4.0]   [4.0]   [4.0]
58.0         [58.0]  [58.0]  [58.0]
20.0         [20.0]  [20.0]  [20.0]
39.0         [39.0]  [39.0]  [39.0]
14.0         [14.0]  [14.0]  [14.0]
55.0         [55.0]  [55.0]  [55.0]
2.0          [2.0]   [2.0]   [2.0]
NaN          [27.6]  [0.0]   [0.0]
31.0         [31.0]  [31.0]  [31.0]
NaN          [30]    [0.0]   [0.0]

Question: why do KNN and SoftImpute fill in only 0 for the missing values?

Tags: python, pandas, fancyimpute

Answer


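The problem is that these are multivariate procedures, but you are giving them only one variable (column). MICE performs multiple regression. KNN averages the values of the N neighbours that lie closest to the row with the missing value in a multidimensional space, where each dimension is a variable. With Age as the only column, a row whose Age is missing has no observed value left from which a distance to any other row could be computed, so no neighbours can be found. I am not sure about SoftImpute, but it is probably a multivariate procedure as well.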
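To make the point about neighbours concrete, here is a minimal NumPy sketch of KNN-style imputation. It is not fancyimpute's implementation: the function knn_impute_sketch, its fill_value fallback and the second "Fare" column are all hypothetical and chosen only for illustration. It assumes distances are computed from the features observed in both rows, and that a cell with no usable neighbour falls back to 0, which is what the warning message suggests.

```python
import numpy as np


def knn_impute_sketch(X, k=3, fill_value=0.0):
    """Toy KNN imputation (illustrative only, not fancyimpute's code).

    Distances between two rows use only the features observed in both.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    result = X.copy()
    # Visit every missing cell (row i, column j)
    for i, j in zip(*np.where(~observed)):
        candidates = []
        for other in range(X.shape[0]):
            # A donor row must be a different row with column j observed
            if other == i or not observed[other, j]:
                continue
            shared = observed[i] & observed[other]
            if not shared.any():
                continue  # no feature in common, so no distance exists
            diff = X[i, shared] - X[other, shared]
            dist = np.sqrt(np.mean(diff ** 2))
            candidates.append((dist, X[other, j]))
        if candidates:
            candidates.sort(key=lambda pair: pair[0])
            result[i, j] = np.mean([value for _, value in candidates[:k]])
        else:
            # Assumed fallback, mirroring "replacing with 0" in the warning
            result[i, j] = fill_value
    return result


# One column only: the row with a missing Age has nothing to compare on
age_only = np.array([[22.0], [38.0], [np.nan], [54.0]])
one_column = knn_impute_sketch(age_only)

# Hypothetical second column ("Fare") gives the rows something to compare on
age_and_fare = np.array([[22.0, 7.0], [38.0, 71.0], [np.nan, 8.0], [54.0, 52.0]])
two_columns = knn_impute_sketch(age_and_fare, k=1)

print(one_column[2, 0])   # falls back to the fill value
print(two_columns[2, 0])  # Age of the row with the closest Fare
```

In this sketch the single-column input always ends up with the fallback value, while adding even one more observed column lets the neighbour search work. The practical fix is the same: pass several numeric columns to the imputer rather than Age alone.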

For example, look at the warning message from the KNN procedure:

[KNN] Warning: 3/20 still missing after imputation, replacing with 0

Or the warning from SoftImpute:

RuntimeWarning: invalid value encountered in double_scalars
  return (np.sqrt(ssd) / old_norm) < self.convergence_threshold
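
One plausible reading of the SoftImpute warning is a 0 / 0 division in the convergence check shown in the traceback. The sketch below only reproduces that arithmetic with NumPy scalars. It assumes that both ssd and old_norm are zero (as would happen if the missing entries start at 0 and stay at 0), and the threshold value is made up. It does not call fancyimpute.

```python
import warnings

import numpy as np

# Assumed values: missing entries were 0 before and after the update
ssd = np.float64(0.0)        # sum of squared differences between iterations
old_norm = np.float64(0.0)   # norm of the previous missing-value estimates
convergence_threshold = 0.001  # hypothetical threshold for illustration

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ratio = np.sqrt(ssd) / old_norm            # 0 / 0 produces nan
    converged = ratio < convergence_threshold  # any comparison with nan is False

print(ratio, converged)
print([str(w.message) for w in caught])
```

Under these assumptions the ratio is nan and the check never reports convergence, which is consistent with an "invalid value encountered" RuntimeWarning followed by imputed values that remain 0.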
