Deep learning model throws an error after the first epoch

Problem description

I'm trying to train a binary classification model that does sentiment analysis on tweets, but the model throws an error after epoch 1. It must be the size of one of the inputs, but I can't pin down exactly which input is causing the problem. Any help is greatly appreciated.

Thanks so much!

I've already tried many different sizes, but the problem still persists.

import pandas as pd
import os
import numpy as np
from sklearn.model_selection import train_test_split
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense


# Load the tweets and drop the ID column
df = pd.read_csv('twitter-sentiment-analysis2/train.csv', encoding='latin-1')
df.drop(['ItemID'], axis=1, inplace=True)
label = list(df.Sentiment)
text = list(df.SentimentText)

# Fit the tokenizer on the whole corpus; word_index maps every distinct
# word to an integer (indices start at 1, 0 is reserved for padding)
tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')
tokenizer.fit_on_texts(text)
vocab = tokenizer.word_index
X_train, X_test, y_train, y_test = train_test_split(text, label, test_size=0.1, random_state=42)

# Convert texts to integer sequences and pad/truncate them to length 50
X_train_word_ids = tokenizer.texts_to_sequences(X_train)
X_test_word_ids = tokenizer.texts_to_sequences(X_test)
x_train = pad_sequences(X_train_word_ids, maxlen=50)
x_test = pad_sequences(X_test_word_ids, maxlen=50)

# Read the pre-trained GloVe vectors into a {word: vector} dict
glove_dir = 'glove6b100dtxt/'
embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))


embedding_dim = 100  # matches my glove.6B.100d vectors
max_words = 50
maxlen = 50

# Copy each word's GloVe vector into row i of the embedding matrix
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in vocab.items():
    embedding_vector = embeddings_index.get(word)
    if i < max_words:
        if embedding_vector is not None:
            # Words not found in the embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()

# Load the pre-trained vectors into the Embedding layer and freeze it
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.1, shuffle=True)
model.save_weights('pre_trained_glove_model.h5')

Can anyone give me some advice on where to look? Thanks again!

Here is the error:

File "HM3.py", line 58, in <module>
    history = model.fit(x_train, y_train,epochs=10,batch_size=32,validation_split=0.1,shuffle=True)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py", line 199, in fit_loop
    outs = f(ins_batch)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[26,39] = 31202 is not in [0, 50)
     [[{{node embedding_1/embedding_lookup}} = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_1/embeddings/read, embedding_1/Cast, embedding_1/embedding_lookup/axis)]]

Tags: numpy, keras, deep-learning

Solution


max_words=50
...
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))

You created an Embedding layer that can only hold 50 distinct words, but your tokenizer indexed every word that occurs in the training data. The error is telling you that a word with index 31202 cannot be looked up in an embedding of size [0, 50).
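
You can confirm this by inspecting what the tokenizer produced. A quick diagnostic sketch, reusing the vocab and X_train_word_ids variables from the question's code:

# The vocabulary is far larger than the Embedding's input_dim of 50
print('distinct words:', len(vocab))
print('largest index in the training data:',
      max(max(seq) for seq in X_train_word_ids if seq))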

One solution is to enlarge the Embedding's input dimension so it covers every word that appears in the training set. Another is to keep index 0, whose embedding row is all zeros, and remap every word index >= 50 to that zero index. Both options are sketched below.
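
A minimal sketch of the first option, reusing the names from the question (vocab, embeddings_index, embedding_dim, maxlen). Keras's Tokenizer starts indexing at 1 and reserves 0 for padding, so the Embedding needs len(vocab) + 1 rows:

# Option 1: size the Embedding to the full vocabulary (plus the padding row 0)
max_words = len(vocab) + 1

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in vocab.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector  # words without a GloVe vector stay zero

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
# ... rest of the model unchanged

And a sketch of the second option, which keeps max_words = 50 and clips every out-of-range index to 0, applied to the padded arrays from the question:

# Option 2: remap all indices >= max_words to the zero (padding) index,
# whose embedding row is all zeros
x_train = np.where(x_train < max_words, x_train, 0)
x_test = np.where(x_test < max_words, x_test, 0)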
