Multi-class text classification: TypeError: Input must be a SparseTensor

Problem description

I am trying to build a deep learning model for text classification. However, when I run the script below, I get this error:

InvalidArgumentError: indices[2] = [0,398] is out of order. Many sparse ops require sorted indices. Use `tf.sparse.reorder` to create a correctly ordered copy.

However, when I try to use tf.sparse.reorder, I get a different error: TypeError: Input must be a SparseTensor.

These are the dimensions of the inputs:

X_train_cv1.shape, y_train.shape, X_validation_cv1.shape, y_validation.shape
((13435, 675), (13435, 3), (3359, 675), (3359, 3))

Is there a way to fix this?

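# Imports the script relies on (not shown in the original post; assumed here,
# np_utils is only available in older Keras releases)
import tensorflow as tf
from keras import layers
from keras.models import Sequential
from keras.utils import np_utils
from sklearn.preprocessing import LabelEncoder

# Split the data into training and validation sets
from sklearn.model_selection import train_test_split
X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.2, random_state=42)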

# encode class values as integers
encoder = LabelEncoder()
encoder.fit(y_train)
encoded_y_train = encoder.transform(y_train)
# convert integers to dummy variables (i.e. one hot encoded)
y_train= np_utils.to_categorical(encoded_y_train)

# encode the validation labels with the encoder fitted on the training labels,
# so both splits share the same class-to-integer mapping
encoded_y_validation = encoder.transform(y_validation)
# convert integers to dummy variables (i.e. one hot encoded)
y_validation = np_utils.to_categorical(encoded_y_validation)

# The first document-term matrix holds counts of character bigrams
# (CountVectorizer with analyzer='char' and ngram_range=(2, 2))
from sklearn.feature_extraction.text import CountVectorizer

cv1 = CountVectorizer(analyzer='char',ngram_range=(2, 2))

X_train_cv1 = cv1.fit_transform(X_train)
X_validation_cv1  = cv1.transform(X_validation)

input_dim = X_train_cv1.shape[1]  # Number of features
model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(3, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

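# These calls raise "TypeError: Input must be a SparseTensor":
# tf.sparse.reorder only accepts tf.SparseTensor, but X_train_cv1 and
# X_validation_cv1 are SciPy sparse matrices and y_train / y_validation
# are dense NumPy arrays.
X_train_cv1 = tf.sparse.reorder(X_train_cv1)
y_train = tf.sparse.reorder(y_train)
X_validation_cv1 = tf.sparse.reorder(X_validation_cv1)
y_validation = tf.sparse.reorder(y_validation)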

history = model.fit(X_train_cv1, y_train, epochs=100, verbose=True,
                    validation_data=(X_validation_cv1, y_validation), batch_size=10)

This is my dataset: [image not included]

Tags: python-3.x, tensorflow, scikit-learn, deep-learning, tensorflow2.0

Solution


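OK, I managed to find the answer. Apparently Keras does not handle the sparse matrices returned by CountVectorizer well, so I only needed to add .toarray() to these two lines to turn them into dense arrays. With dense arrays, the tf.sparse.reorder calls are no longer needed and can be removed.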

X_train_cv1 = cv1.fit_transform(X_train).toarray()
X_validation_cv1  = cv1.transform(X_validation).toarray()
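
To see what the fix above changes, here is a minimal sketch on a made-up toy corpus (the texts and the names train_texts / validation_texts are hypothetical stand-ins for X_train and X_validation, not the asker's data). It only illustrates that CountVectorizer returns a SciPy sparse matrix, which is not a tf.SparseTensor, and that .toarray() yields a dense NumPy array with the same shape and counts.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for X_train / X_validation.
train_texts = ["hello world", "deep learning", "text data"]
validation_texts = ["hello data"]

# Same vectorizer settings as in the question: character bigram counts.
cv1 = CountVectorizer(analyzer='char', ngram_range=(2, 2))

# fit_transform / transform return SciPy sparse matrices,
# which are a different type from tf.SparseTensor.
X_train_sparse = cv1.fit_transform(train_texts)
X_validation_sparse = cv1.transform(validation_texts)

# .toarray() converts them to dense NumPy arrays of the same shape.
X_train_dense = X_train_sparse.toarray()
X_validation_dense = X_validation_sparse.toarray()

print(type(X_train_sparse).__name__, sparse.issparse(X_train_sparse))
print(type(X_train_dense).__name__, X_train_dense.shape)
```

Note that a dense array stores every entry, zeros included. For the shapes in the question, (13435, 675), that is about 9 million values, which should fit in memory comfortably; with a much larger vocabulary the dense conversion may become too expensive.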
