Results look incorrect after binarizing a column of my dataset with sklearn. What is wrong with my code?

Problem description

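I am preprocessing a dataset and have binarized one of its columns. After binarization, I think the values are incorrect. The data has 303 observations (rows) and 14 features (columns). The column I want to binarize is the last one.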

Here is part of my code:

    import pandas as pd
    import numpy as np

    #importing the dataset
    header_names = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
    dataset = pd.read_csv('E:/HCU proj doc/EHR dataset/cleveland_data.csv', names= header_names)


    array = dataset.values

    # binarize num
    from sklearn.preprocessing import Binarizer
    x = array[:,13:]
    binarize = Binarizer(threshold=0.0).fit(x)
    transform_binarize = binarize.transform(x)

    array[:,13:]=transform_binarize
    print(transform_binarize)

This is what the original data column looks like:

     0,2,1,0,0.........1,0,3,1,1,2

This is the output of the code above:

[[0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]]

I think the last value is incorrect. I don't understand why this happens.

Tags: python, machine-learning, scikit-learn, data-science

Solution


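Assuming this is the heart disease dataset from the UCI repository, and that the CSV file is the one I think it is, these are the correct values for the Binarizer. The original data column has a 0 in the last row, which I think you missed. Try this code: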

# Python 3 print syntax; the output below came from Python 2, hence the L suffixes
for idx in range(len(x)):
    print(idx, x[idx], transform_binarize[idx])

Output

278 [1L] [1.]
279 [0L] [0.]
280 [2L] [1.]
281 [0L] [0.]
282 [3L] [1.]
283 [0L] [0.]
284 [2L] [1.]
285 [4L] [1.]
286 [2L] [1.]
287 [0L] [0.]
288 [0L] [0.]
289 [0L] [0.]
290 [1L] [1.]
291 [0L] [0.]
292 [2L] [1.]
293 [2L] [1.]
294 [1L] [1.]
295 [0L] [0.]
296 [3L] [1.]
297 [1L] [1.]
298 [1L] [1.]
299 [2L] [1.]
300 [3L] [1.]
301 [1L] [1.]
302 [0L] [0.]     #<--- I think you missed this row while reading your dataset

If you try this code, you will see that the Binarizer works exactly as it should.
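A quicker way to confirm this, rather than reading 303 rows by eye, is to compare the Binarizer output with a plain NumPy comparison. The sketch below is only an illustration: the `num` array is a small hypothetical sample with values like those in the output above, not the real CSV column, and the names `expected` and `mismatches` are made up for this example. With `threshold=0.0`, every value greater than 0 should become 1 and every 0 should stay 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical sample of 'num' values (assumption: stands in for the real column).
num = np.array([1, 0, 2, 0, 3, 0, 2, 4, 2, 0], dtype=float).reshape(-1, 1)

# Binarizer maps values strictly greater than the threshold to 1, the rest to 0.
binarized = Binarizer(threshold=0.0).fit_transform(num)

# The same rule written directly in NumPy, used here as a reference.
expected = (num > 0).astype(float)

# Indices of rows where the two results disagree (should be empty).
mismatches = np.flatnonzero(binarized.ravel() != expected.ravel())
print("mismatching rows:", mismatches)
```

If `mismatches` comes back empty on your own column as well, the transformation is fine and the surprise lies in how the original values were read.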

