How to prevent LabelEncoder from sorting label values?

Question

scikit-learn's LabelEncoder is showing some puzzling behavior in my Jupyter Notebook:

from sklearn.preprocessing import LabelEncoder
le2 = LabelEncoder()
le2.fit(['zero', 'one'])
print(le2.inverse_transform([0, 0, 0, 1, 1, 1]))

prints ['one' 'one' 'one' 'zero' 'zero' 'zero']. This is odd; shouldn't it print ['zero' 'zero' 'zero' 'one' 'one' 'one']? Then I tried

le3 = LabelEncoder()
le3.fit(['one', 'zero'])
print(le3.inverse_transform([0, 0, 0, 1, 1, 1]))

which also prints ['one' 'one' 'one' 'zero' 'zero' 'zero']. Perhaps there was an alphabetization thing happening? Next, I tried

le4 = LabelEncoder()
le4.fit(['nil', 'one'])
print(le4.inverse_transform([0, 0, 0, 1, 1, 1]))

which prints ['nil' 'nil' 'nil' 'one' 'one' 'one']!

I've spent several hours on this. FWIW, the example in the documentation works as expected so I suspect there is a flaw in how I expect inverse_transform to work. Part of my research included this and this.

In case it is relevant, I'm using IPython 7.7.0, numpy 1.17.3 and scikit-learn 0.21.3.

Tags: python, scikit-learn

Answer


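The thing is that LabelEncoder.fit() always stores the labels in sorted order, because it uses np.unique (here's the source code). Sorted alphabetically, 'one' comes before 'zero', so 'one' is encoded as 0 regardless of the order you pass the labels in. 'nil' comes before 'one', which is why your third example happened to look right.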
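To see the sorting in isolation, here is a minimal sketch using only NumPy. It is not the scikit-learn source, just an illustration of the np.unique behavior described above; the variable names are made up for the example.

```python
import numpy as np

# Labels in the order the question passed them to fit().
labels = np.array(['zero', 'one'])

# np.unique returns the distinct values in sorted order,
# so 'one' ends up at index 0 and 'zero' at index 1.
classes = np.unique(labels)
print(classes)  # ['one' 'zero']

# Decoding integer codes is then just indexing into that sorted array.
decoded = classes[[0, 0, 0, 1, 1, 1]]
print(decoded)  # ['one' 'one' 'one' 'zero' 'zero' 'zero']
```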

I think the only way to do what you want is to write your own fit method and override the original one from LabelEncoder.

You just need to reuse the existing code given in the link. Here is an example:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import column_or_1d

class MyLabelEncoder(LabelEncoder):

    def fit(self, y):
        # Same input validation as the stock fit(): flatten y to a 1-D array.
        y = column_or_1d(y, warn=True)
        # Series.unique() keeps order of first appearance; np.unique() would sort.
        self.classes_ = pd.Series(y).unique()
        return self

le2 = MyLabelEncoder()
le2.fit(['zero', 'one'])
print(le2.inverse_transform([0, 0, 0, 1, 1, 1]))

which gives you:

['zero' 'zero' 'zero' 'one' 'one' 'one']
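One thing worth verifying before relying on the subclass: it only overrides fit. In at least some scikit-learn releases, transform looks labels up in a way that assumes classes_ is sorted, so it is a good idea to check transform as well as inverse_transform on your version. If that turns out to be a problem, a rough alternative is to keep the mapping explicit and skip LabelEncoder altogether. The sketch below is only an illustration; fit_in_order, encode and decode are hypothetical helper names, not part of scikit-learn.

```python
import numpy as np

def fit_in_order(labels):
    """Return (classes, index), keeping labels in order of first appearance."""
    # dict.fromkeys drops duplicates and preserves insertion order (Python 3.7+).
    classes = np.array(list(dict.fromkeys(labels)))
    index = {label: i for i, label in enumerate(classes.tolist())}
    return classes, index

def encode(labels, index):
    """Map labels to integer codes with a plain dictionary lookup."""
    return np.array([index[label] for label in labels])

def decode(codes, classes):
    """Map integer codes back to labels by indexing into classes."""
    return classes[np.asarray(codes)]

classes, index = fit_in_order(['zero', 'one', 'zero'])
print(encode(['one', 'zero'], index))        # [1 0]
print(decode([0, 0, 0, 1, 1, 1], classes))   # ['zero' 'zero' 'zero' 'one' 'one' 'one']
```

Because the lookup is a dictionary, nothing here depends on the labels being sorted. An unseen label raises KeyError in encode.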
