Incompatible row dimensions

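Problem description

The task is to encode all the text and categorical features and then combine them again to form the data matrix, but I get an "incompatible row dimensions" error.

My work so far:

Encoding the categorical feature Round using LabelEncoder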

from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()

enc.fit(x_train[' Round'])

round_train_le = enc.transform(x_train[' Round'])
round_test_le = enc.transform(x_test[' Round'])

Encoding the text feature Category using TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer1 = TfidfVectorizer(max_features=500)

vectorizer1.fit(x_train[' Category'])

category_train_enc = vectorizer1.transform(x_train[' Category'])
category_test_enc = vectorizer1.transform(x_test[' Category'])

print(category_train_enc.shape)
print(category_test_enc.shape)

Encoding the text feature Question using TfidfVectorizer

vectorizer2 = TfidfVectorizer(max_features=5000)

vectorizer2.fit(x_train[' Question'])

question_train_enc = vectorizer2.transform(x_train[' Question'])
question_test_enc = vectorizer2.transform(x_test[' Question'])

print(question_train_enc.shape)
print(question_test_enc.shape)

Encoding the text feature Answer using TfidfVectorizer

vectorizer3 = TfidfVectorizer(max_features=1000)

vectorizer3.fit(x_train[' Answer'])

answer_train_enc = vectorizer3.transform(x_train[' Answer'])
answer_test_enc = vectorizer3.transform(x_test[' Answer'])

print(answer_train_enc.shape)
print(answer_test_enc.shape)

Combining the encoded features:

from scipy.sparse import hstack
x_tr = hstack((round_train_le, category_train_enc, question_train_enc, answer_train_enc))
x_te = hstack((round_test_le, category_test_enc, question_test_enc, answer_test_enc))

print("Final Data matrix")
print(x_tr.shape, y_train.shape)
print(x_te.shape, y_test.shape)

Then I get the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-60-12e131ba4df4> in <module>
      1 # merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
      2 from scipy.sparse import hstack
----> 3 x_tr = hstack((round_train_le, category_train_enc, question_train_enc, answer_train_enc))
      4 x_te = hstack((round_test_le, category_test_enc, question_test_enc, answer_test_enc))
      5 

~\anaconda3\lib\site-packages\scipy\sparse\construct.py in hstack(blocks, format, dtype)
    463 
    464     """
--> 465     return bmat([blocks], format=format, dtype=dtype)
    466 
    467 

~\anaconda3\lib\site-packages\scipy\sparse\construct.py in bmat(blocks, format, dtype)
    584                                                     exp=brow_lengths[i],
    585                                                     got=A.shape[0]))
--> 586                     raise ValueError(msg)
    587 
    588                 if bcol_lengths[j] == 0:

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 145341, expected 1.

Please suggest what changes I need to make in the code to resolve the error.

Tags: python, numpy, scikit-learn, scipy

Solution


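When using scipy.sparse.hstack(), you must make sure that all the blocks you are stacking have the same size along axis 0, i.e. the same number of rows. Note that a 1-D array is treated as a single row. See the following example: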

import numpy as np
from scipy.sparse import hstack

a = np.array([1, 2, 3, 4, 5])
b = np.array([1, 2, 3, 5])

c = hstack([a, b])
print(c)

Output:

  (0, 0)    1
  (0, 1)    2
  (0, 2)    3
  (0, 3)    4
  (0, 4)    5
  (0, 5)    1
  (0, 6)    2
  (0, 7)    3
  (0, 8)    5

On the other hand, when the number of rows does not match, it raises the error you received:

import numpy as np
from scipy.sparse import hstack

a = np.array([1, 2, 3, 4, 5, 6])
b = np.array([[1, 2, 3], [4, 5, 6]])

c = hstack([a, b])
print(c)

Output:

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 2, expected 1.
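
To see which block is responsible, it can help to print the 2-D shape each input takes once it is converted to a sparse matrix. The sketch below reuses the two arrays from the failing example; block_shapes is a hypothetical helper name introduced here for illustration, not part of SciPy.

```python
import numpy as np
from scipy.sparse import coo_matrix, issparse


def block_shapes(blocks):
    """Return the 2-D shape of each block (hypothetical helper, not a SciPy function)."""
    # Sparse inputs already have a 2-D shape; dense inputs are converted first,
    # which turns a 1-D array of length n into a single row of shape (1, n).
    return [b.shape if issparse(b) else coo_matrix(b).shape for b in blocks]


a = np.array([1, 2, 3, 4, 5, 6])
b = np.array([[1, 2, 3], [4, 5, 6]])

shapes = block_shapes([a, b])
print(shapes)  # [(1, 6), (2, 3)]
```

The first entry of each shape is the row count, so any block whose first entry differs from the others is the one to fix.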

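Therefore, you should check that all the blocks have the same number of rows before stacking them horizontally. In your traceback, the first block (round_train_le) is a 1-D array, so it counts as a single row, while the next block has 145341 rows. That is why the message says "expected 1".

Cheers.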
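
For the code in the question, one possible fix is to reshape the label-encoded array into a single column before stacking. The sketch below is only an illustration: the small arrays are made-up stand-ins for round_train_le and category_train_enc, and round_train_col is a name introduced here.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical stand-ins for the question's data (4 samples).
round_train_le = np.array([0, 1, 1, 2])        # like LabelEncoder output, shape (4,)
category_train_enc = csr_matrix(np.eye(4, 3))  # plays the role of a TF-IDF matrix, shape (4, 3)

# reshape(-1, 1) turns the 1-D array into one column with one row per sample;
# wrapping it in csr_matrix keeps every block sparse.
round_train_col = csr_matrix(round_train_le.reshape(-1, 1))  # shape (4, 1)

# Every block now has 4 rows, so the horizontal stack succeeds.
x_tr = hstack((round_train_col, category_train_enc)).tocsr()
print(x_tr.shape)  # (4, 4)
```

The same reshape would apply to round_test_le before building x_te.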

