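python - Incompatible row dimensions
Problem description
The task is to encode all the text and categorical features and then combine them again to form the data matrix, but I get an "incompatible row dimensions" error.
My work so far:
Encoding the categorical feature with LabelEncoder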
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
enc.fit(x_train[' Round'])
round_train_le = enc.transform(x_train[' Round'])
round_test_le = enc.transform(x_test[' Round'])
Encoding the text feature Category with TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer1 = TfidfVectorizer(max_features=500)
vectorizer1.fit(x_train[' Category'])
category_train_enc = vectorizer1.transform(x_train[' Category'])
category_test_enc = vectorizer1.transform(x_test[' Category'])
print(category_train_enc.shape)
print(category_test_enc.shape)
Encoding the text feature Question with TfidfVectorizer
vectorizer2 = TfidfVectorizer(max_features=5000)
vectorizer2.fit(x_train[' Question'])
question_train_enc = vectorizer2.transform(x_train[' Question'])
question_test_enc = vectorizer2.transform(x_test[' Question'])
print(question_train_enc.shape)
print(question_test_enc.shape)
Encoding the text feature Answer with TfidfVectorizer
vectorizer3 = TfidfVectorizer(max_features=1000)
vectorizer3.fit(x_train[' Answer'])
answer_train_enc = vectorizer3.transform(x_train[' Answer'])
answer_test_enc = vectorizer3.transform(x_test[' Answer'])
print(answer_train_enc.shape)
print(answer_test_enc.shape)
Combining the encoded features:
from scipy.sparse import hstack
x_tr = hstack((round_train_le, category_train_enc, question_train_enc, answer_train_enc))
x_te = hstack((round_test_le, category_test_enc, question_test_enc, answer_test_enc))
print("Final Data matrix")
print(x_tr.shape, y_train.shape)
print(x_te.shape, y_test.shape)
Then I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-60-12e131ba4df4> in <module>
1 # merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
2 from scipy.sparse import hstack
----> 3 x_tr = hstack((round_train_le, category_train_enc, question_train_enc, answer_train_enc))
4 x_te = hstack((round_test_le, category_test_enc, question_test_enc, answer_test_enc))
5
~\anaconda3\lib\site-packages\scipy\sparse\construct.py in hstack(blocks, format, dtype)
463
464 """
--> 465 return bmat([blocks], format=format, dtype=dtype)
466
467
~\anaconda3\lib\site-packages\scipy\sparse\construct.py in bmat(blocks, format, dtype)
584 exp=brow_lengths[i],
585 got=A.shape[0]))
--> 586 raise ValueError(msg)
587
588 if bcol_lengths[j] == 0:
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 145341, expected 1.
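Please suggest what changes I need to make in the code to resolve the error.
Solution
When using scipy.sparse.hstack(), you must make sure that all the elements you are stacking have the same size along dimension 0, i.e. the same number of rows. Note that a 1-D array is treated as a single row, so the two 1-D arrays in the following example both count as having one row and can be stacked: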
import numpy as np
from scipy.sparse import hstack
a = np.array([1, 2, 3, 4, 5])
b = np.array([1, 2, 3, 5])
c = hstack([a, b])
print(c)
Output:
(0, 0) 1
(0, 1) 2
(0, 2) 3
(0, 3) 4
(0, 4) 5
(0, 5) 1
(0, 6) 2
(0, 7) 3
(0, 8) 5
On the other hand, when the number of rows does not match, you get the error you received:
import numpy as np
from scipy.sparse import hstack
a = np.array([1, 2, 3, 4, 5, 6])
b = np.array([[1, 2, 3], [4, 5, 6]])
c = hstack([a, b])
print(c)
Output:
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 2, expected 1.
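To see where the "expected 1" comes from, it can help to look at the shape each block has once it is converted to a sparse matrix. The sketch below assumes hstack converts its blocks in the same way coo_matrix does, which is what the traceback in the question suggests for that SciPy version:

```python
import numpy as np
from scipy.sparse import coo_matrix

a = np.array([1, 2, 3, 4, 5, 6])      # 1-D array, shape (6,)
b = np.array([[1, 2, 3], [4, 5, 6]])  # 2-D array, shape (2, 3)

# coo_matrix promotes a 1-D array to a single row,
# so a contributes 1 row while b contributes 2 rows.
shape_a = coo_matrix(a).shape
shape_b = coo_matrix(b).shape

print(shape_a)  # (1, 6)
print(shape_b)  # (2, 3)
```

The first block sets the expected row count (1), and the second block is rejected because it has 2 rows.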
So you should check that all items have the same number of rows before stacking them side by side.
Cheers.
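Applied to the question, the likely culprit is round_train_le: LabelEncoder.transform returns a 1-D array, which hstack treats as one row with 145341 columns. A minimal sketch of one possible fix, using small made-up stand-ins for the encoded features (the data and the name round_train_col are illustrative assumptions, not taken from the original post):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

n_rows = 4  # stand-in for the 145341 training rows

# Stand-in for the LabelEncoder.transform output: a 1-D integer array.
round_train_le = np.array([0, 1, 1, 0])

# Stand-ins for the TfidfVectorizer outputs: sparse matrices with n_rows rows.
category_train_enc = csr_matrix(np.ones((n_rows, 3)))
question_train_enc = csr_matrix(np.ones((n_rows, 5)))

# Reshape the 1-D labels into an (n_rows, 1) column so that
# every block has the same number of rows.
round_train_col = csr_matrix(round_train_le.reshape(-1, 1))

x_tr = hstack((round_train_col, category_train_enc, question_train_enc))
print(x_tr.shape)  # (4, 9)
```

The same reshape would presumably be needed for round_test_le before building x_te.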