Sentence-level encoding with Keras

Problem description

My task is to build a supervised model for text similarity with Keras. My input is a pair of long-form texts, and the target is 0 or 1. I am trying to encode the words within each sentence of every text, with the aim of capturing information from different levels of the document structure. To do this, I first split each text into a list of sentences and then tokenize. I also fix max_sentence_length (the maximum number of sentences in a text) and max_sequence_length (the maximum number of words in a sentence). Here is my preprocessing code:

input_text1 = [sen.split('.') for sen in input_text1]
input_text2 = [sen.split('.') for sen in input_text2]

max_words = 20000  # vocab size
max_sequence_length = 500
max_sentence_length = 100
emb_dim = 50
n_classes = 2
latent_dim = 128
lr = 0.001
epochs = 20
batch_size = 128

tokenizer = Tokenizer(num_words=max_words, filters='!"#$%&()*+,-.:;=?@[\\]^_`{|}~\t\n', lower=True, split=' ', oov_token="UNK")
encoder = LabelEncoder()
tr_sent1, te_sent1, tr_sent2, te_sent2, tr_rel, te_rel = train_test_split(input_text1, input_text2, similarity, test_size=0.2, stratify=similarity)
input_text_sen1 = tr_sent1[:2]
input_text_sen2 = tr_sent2[:2]
tr_rel = tr_rel[:2]
for sen in input_text_sen1 + input_text_sen2:
    tokenizer.fit_on_texts(sen)
encoder.fit(tr_rel)
tokenizer.word_index = {e: i for e, i in tokenizer.word_index.items() if i <= max_words}
tokenizer.word_index[tokenizer.oov_token] = max_words + 1
seqs1 = []
for sen in input_text_sen1:
    tmp = tokenizer.texts_to_sequences(sen)
    seqs1.append(tmp)

# to fix the number of sentences

seqs1_fixed = []
for sen in seqs1:
    sen = sen[:max_sentence_length]
    seqs1_fixed.append(sen)
seqs1_fixed = [pad_sequences(sen, maxlen=max_sequence_length,
                             value=0, padding='post', truncating='post') for sen in seqs1_fixed]

seqs2 = []
for sen in input_text_sen2:
    tmp = tokenizer.texts_to_sequences(sen)
    seqs2.append(tmp)

seqs2_fixed = []
for sen in seqs2:
    sen = sen[:max_sentence_length]
    seqs2_fixed.append(sen)

seqs2_fixed = [pad_sequences(sen, maxlen=max_sequence_length,
                             value=0, padding='post', truncating='post') for sen in seqs2_fixed]
categorical_y = encoder.transform(tr_rel) 
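
A practical note on the result of this preprocessing: each entry of seqs1_fixed / seqs2_fixed is an array of shape (num_sentences, max_sequence_length) whose first dimension still varies from text to text, so the sentence axis also needs zero-padding before the texts can be stacked into a single batch. Below is a minimal sketch of that step; the stack_texts helper is a hypothetical name, not part of the question's code:

import numpy as np

def stack_texts(fixed_seqs, max_sentences, max_words_per_sentence):
    # Zero-pad the sentence axis so every text becomes a fixed
    # (max_sentences, max_words_per_sentence) block, then stack all texts.
    out = np.zeros((len(fixed_seqs), max_sentences, max_words_per_sentence), dtype='int32')
    for i, text in enumerate(fixed_seqs):
        out[i, :len(text), :] = text
    return out

X1 = stack_texts(seqs1_fixed, max_sentence_length, max_sequence_length)
X2 = stack_texts(seqs2_fixed, max_sentence_length, max_sequence_length)
# X1.shape == (num_texts, max_sentence_length, max_sequence_length)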

Here is the code that builds the model:

bilstm = Bidirectional(LSTM(units=latent_dim, return_sequences=True, dropout=0.2, recurrent_dropout=0.2))

encoder_input1 = [Input(shape=(max_sentence_length,), name='desc_word_' + str(i + 1)) for i in range(max_sequence_length)] # list with each shape=(None, 100)
text_embedding_input1 = Embedding(input_dim=max_words+2, output_dim=emb_dim, input_length=max_sentence_length)  # list with each shape=(None, 100, 50)
embedding_input1 = [text_embedding_input1(inp) for inp in encoder_input1]
words_concat1 = concatenate(embedding_input1, axis=-1) # shape=(None, 100, 25000)
bilstm_out1 = bilstm(words_concat1) # shape=(None, 100, 256)

encoder_input2 = [Input(shape=(max_sentence_length, ), name='desc_word_' + str(i + 1)) for i in range(max_sequence_length)]
text_embedding_input2 = Embedding(input_dim=max_words+2, output_dim=emb_dim, input_length=max_sentence_length)
embedding_input2 = [text_embedding_input2(inp) for inp in encoder_input2]
words_concat2 = concatenate(embedding_input2, axis=-1)
bilstm_out2 = bilstm(words_concat2)

x1 = attention()(bilstm_out1) # shape=(None, 256)
x1 = Dropout(0.2)(x1)
x2 = attention()(bilstm_out2) # shape=(None, 256)
x2 = Dropout(0.2)(x2)
x = concatenate([x1, x2]) # shape=(None, 512)
out = Dense(units=n_classes, activation="softmax", kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01))(x) # shape=(None, 2)
model = Model([encoder_input1 + encoder_input2], out)

Now, when I run the model, I get the following error:

ValueError: Input tensors to a Model must come from `keras.layers.Input`. Received: [<tf.Tensor 'desc_word_1_16:0' shape=(None, 100) dtype=float32>, <tf.Tensor 'desc_word_2_16:0' shape=(None, 100) dtype=float32>, <tf.Tensor 'desc_word_3_16:0'....

Can anyone tell me what is causing this error? I have been stuck on this for a long time. Alternative suggestions for solving the problem are also welcome.

Tags: keras, deep-learning, python-3.6, lstm, attention-model

Solution


The problem is in this line:

model = Model([encoder_input1 + encoder_input2], out)

encoder_input1 + encoder_input2 already concatenates the two lists of Input tensors, and the extra pair of brackets wraps that result in another list, so Keras receives a list nested inside a list rather than Input tensors, which is exactly what the ValueError complains about. Change it to:

model = Model(inputs=[encoder_input1] + [encoder_input2], outputs=out)
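
For completeness, here is a minimal self-contained sketch of the same pattern with small dummy dimensions. It uses the fully flat form, Model(inputs=encoder_input1 + encoder_input2, outputs=out), which avoids any nesting and is accepted by every Keras version; the layer names and sizes below are illustrative only. Note, too, that the question builds both input groups with the same desc_word_ name prefix; Keras requires layer names within a model to be unique, so in practice the second group needs its own prefix:

from keras.layers import Input, Dense, concatenate
from keras.models import Model

# Two groups of inputs, mirroring encoder_input1 / encoder_input2
# (note the distinct name prefixes).
group1 = [Input(shape=(4,), name='g1_word_%d' % i) for i in range(3)]
group2 = [Input(shape=(4,), name='g2_word_%d' % i) for i in range(3)]

x = concatenate(group1 + group2)         # (None, 24)
out = Dense(2, activation='softmax')(x)  # (None, 2)

# A flat list of Input tensors, with no extra nesting:
model = Model(inputs=group1 + group2, outputs=out)
model.summary()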


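A final side note on the question's code: the attention() layer it calls is never defined in the post. For readers who want to reproduce the setup, below is a minimal sketch of an additive attention pooling layer consistent with the shape comments (it reduces (None, 100, 256) to (None, 256)); the class name and internals are illustrative assumptions, not the asker's actual layer:

from keras import backend as K
from keras.layers import Layer

class attention(Layer):
    # Additive attention pooling: (batch, timesteps, features) -> (batch, features).
    def build(self, input_shape):
        self.w = self.add_weight(name='att_w', shape=(input_shape[-1], 1),
                                 initializer='glorot_uniform', trainable=True)
        self.b = self.add_weight(name='att_b', shape=(1,),
                                 initializer='zeros', trainable=True)
        super(attention, self).build(input_shape)

    def call(self, x):
        e = K.tanh(K.dot(x, self.w) + self.b)  # score per timestep: (batch, timesteps, 1)
        a = K.softmax(e, axis=1)               # normalise scores over the time axis
        return K.sum(x * a, axis=1)            # attention-weighted sum: (batch, features)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[-1])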