How do I convert a 2D numpy array to one-hot encoding?

Problem description

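I am trying to apply one-hot encoding to the data below, but the output confuses me. Before encoding, the data has shape (5, 10); after encoding, it has shape (5, 20). Each letter should be encoded as a 4-element vector, so I expected the encoded shape to be (5, 40) rather than (5, 20). How do I fix this?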

import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = [['A', 'G', 'T', 'G', 'T', 'C', 'T', 'A', 'A', 'C'],
     ['A', 'G', 'T', 'G', 'T', 'C', 'T', 'A', 'A', 'C'],
     ['G', 'C', 'C', 'A', 'C', 'T', 'C', 'G', 'G', 'T'],
     ['G', 'C', 'C', 'A', 'C', 'T', 'C', 'G', 'G', 'T'],
     ['G', 'C', 'C', 'A', 'C', 'T', 'C', 'G', 'G', 'T']]
Y = np.array(X)
print('Shape of numpy array', Y.shape)

# one hot encoding

# Note: scikit-learn 1.2 renamed this argument to sparse_output; sparse was removed in 1.4.
onehot_encoder = OneHotEncoder(sparse=False)
onehot_encoded = onehot_encoder.fit_transform(Y)
print(onehot_encoded)
print('Shape of one hot encoding', onehot_encoded.shape)


Output:

Shape of numpy array (5, 10)
[[1. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 1. 0.]
 [1. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 1. 0.]
 [0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0. 1.]
 [0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0. 1.]
 [0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0. 1.]]
Shape of one hot encoding (5, 20)

Tags: python, scikit-learn, numpy-ndarray, one-hot-encoding

Solution


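OneHotEncoder infers the categories of each column separately, and every column in this data contains only two distinct letters, so the result has 10 x 2 = 20 columns. To get 4 new columns for each column of the ndarray, one-hot encode every column against the full set of classes: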

import numpy as np

# X is the nested list of letters from the question.
X = np.array(X)

# Get unique classes.
classes = np.unique(X)

# Replace classes with integers.
X = np.searchsorted(classes, X)

# Get an identity matrix.
eye = np.eye(classes.shape[0])

# Iterate over all columns
# and get one-hot encoding for each column.
X = np.concatenate([eye[i] for i in X.T], axis=1)

X.shape
# (5, 40)
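
As an alternative sketch (not part of the original answer, and assuming scikit-learn is installed), the same (5, 40) shape can probably be obtained from OneHotEncoder itself by passing the full alphabet for every column through its `categories` parameter. The variable names below are illustrative, and `.toarray()` is used on the default sparse result so the snippet does not depend on the `sparse` / `sparse_output` argument name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same data as in the question, written compactly.
X = [list("AGTGTCTAAC")] * 2 + [list("GCCACTCGGT")] * 3
Y = np.array(X)

# Default behaviour: categories are inferred per column.
# Each column here holds only two distinct letters, hence 10 x 2 = 20 columns.
default_encoder = OneHotEncoder().fit(Y)
per_column_counts = [len(c) for c in default_encoder.categories_]
print(per_column_counts)

# Shared alphabet: give every column the same list of classes.
classes = np.unique(Y)  # ['A' 'C' 'G' 'T']
encoder = OneHotEncoder(categories=[classes] * Y.shape[1])
onehot = encoder.fit_transform(Y).toarray()
print(onehot.shape)  # (5, 40)
```

The column layout matches the numpy approach above: four columns (A, C, G, T) for input column 0, then four for input column 1, and so on.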

Consider the following example:

[['A', 'G'],
 ['C', 'C'],
 ['T', 'A']]

The one-hot encoded ndarray will have 8 (2 x 4) columns:

  Column 0      Column 1         
 A  C  G  T    A  C  G  T

 1  0  0  0    0  0  1  0
 0  1  0  0    0  1  0  0
 0  0  0  1    1  0  0  0
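
The small example above can be reproduced with the same steps as the solution. This is a minimal sketch; the names `example`, `codes`, and `vectorized` are illustrative, and the reshape-based variant at the end is an added equivalent rather than something from the original answer.

```python
import numpy as np

example = np.array([["A", "G"],
                    ["C", "C"],
                    ["T", "A"]])

# Sorted unique letters across the whole array: ['A' 'C' 'G' 'T'].
classes = np.unique(example)

# Map each letter to its index in classes.
codes = np.searchsorted(classes, example)

# Row i of the identity matrix is the one-hot vector for class i.
eye = np.eye(len(classes), dtype=int)

# Encode each input column, then place the blocks side by side.
onehot = np.concatenate([eye[col] for col in codes.T], axis=1)
print(onehot)
# [[1 0 0 0 0 0 1 0]
#  [0 1 0 0 0 1 0 0]
#  [0 0 0 1 1 0 0 0]]

# Equivalent without the Python loop: index once, then flatten per row.
vectorized = eye[codes].reshape(codes.shape[0], -1)
```

Note that `np.unique` only sees letters that occur in the data. If a letter could be missing from a given dataset, passing a fixed alphabet such as `np.array(["A", "C", "G", "T"])` instead keeps the output width stable.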
