Loss not decreasing in a Keras Seq2seq bidirectional LSTM with attention

Problem description

Can anyone see why the loss in this model is not decreasing?

I am trying to integrate a bidirectional LSTM with the attention model from the end of Andrew Ng's Deep Learning Specialization (https://www.coursera.org/learn/nlp-sequence-models/notebook/npjGi/neural-machine-translation-with-attention), but for some reason the model does not seem to converge.

I am running it on Google Colab.

The network takes two tensors of the following shapes as input:

encoder_input_data[m, 262, 28]
decoder_target_data[m, 28, 28]

The output is a list of 27 one-hot vectors.

Each one-hot vector has length 28:

(26 characters of the alphabet + endkey + startkey)
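
To make this concrete, here is a minimal sketch of the kind of one-hot encoding I mean (the vocabulary and helper below are illustrative only, not the actual returnData(), which is not shown here):

import numpy as np
import string

#assumed vocabulary: 26 lowercase letters + start and end markers = 28 tokens
TOKENS = list(string.ascii_lowercase) + ['<start>', '<end>']
TOKEN_TO_ID = {tok: i for i, tok in enumerate(TOKENS)}

def one_hot_sequence(tokens, max_len):
    #build a (max_len, 28) matrix with one one-hot row per token; unused rows stay zero (padding)
    x = np.zeros((max_len, len(TOKENS)), dtype='float32')
    for t, tok in enumerate(tokens[:max_len]):
        x[t, TOKEN_TO_ID[tok]] = 1.0
    return x

#one encoder sample of shape (262, 28)
sample = one_hot_sequence(list('hello') + ['<end>'], max_len=262)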

*The overall structure is:*

0) Input [262, 28] ->

1) Encoder: bidirectional LSTM ->

2) Backward and forward outputs concatenated into encoder_outputs ->

3) Decoder LSTM + attention (see the shape trace below) ->

___* concatenate the previously decoded state s with each a(t) from the encoder

___* pass that through two Dense layers and compute the alphas

___* sum up each alpha(a(t))*a(t)

4) Softmax layer and get the result
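
To make the tensor shapes concrete, this is how one attention step flows through the code below, with latent_dim = 128 (so the bidirectional encoder output has 2*128 = 256 units) and m being the batch size:

#a        = encoder_outputs              -> (m, 262, 256)
#s_prev   = repeator(s)                  -> (m, 262, 256)   RepeatVector(262) copies s across the timesteps
#concat   = concatenator([a, s_prev])    -> (m, 262, 512)
#e        = densor1(concat)              -> (m, 262, 10)
#energies = densor2(e)                   -> (m, 262, 1)
#alphas   = activator(energies)          -> (m, 262, 1)     softmax over the 262 timesteps
#context  = dotor([alphas, a])           -> (m, 1, 256)     weighted sum of the encoder outputs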

from keras.models import Model
from keras.layers import Input, LSTM, Dense, Bidirectional, concatenate, Concatenate
from keras.layers import RepeatVector, Activation, Permute, Dot, Multiply
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.activations import softmax
from textwrap import wrap
import re
import random
import string
import numpy as np
import copy
from google.colab import files
from google.colab import drive

#drive.mount('/content/drive')
#files.upload()

#returnData() creates 3 vectors:
#encoder_input_data[m, 262, 28]
#decoder_input_data[m, 28, 28] <- not used for now
#decoder_target_data[m, 28, 28]

#special softmax needed for the attention layer
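#axis=1 is the timestep axis of the (m, 262, 1) energies tensor,
#so the attention weights for each sample sum to 1 across the 262 encoder timesteps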
def softMaxAxis1(x):
    return softmax(x,axis=1)

#layers needed for the attention
repeator = RepeatVector(262)
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation = "tanh")
densor2 = Dense(1, activation = "relu")
activator = Activation(softMaxAxis1, name='attention_weights')
dotor = Dot(axes = 1)

#compute one timestep of attention
#repeat s(t-1) for all the a(t) so far and concatenate them so that
#the algorithm can select the old a(t) based on current s
#let a dense layer compute the energies and a softmax decide
def one_step_attention(a, s_prev):
    s_prev = repeator(s_prev)
    concat = concatenator([a, s_prev])
    e = densor1(concat)
    energies = densor2(e)
    alphas = activator(energies)
    context = dotor([alphas, a])    
    return context

#variables needed for the model
encoder_input_data, decoder_input_data, decoder_target_data = returnData()
batch_size = 64
epochs = 50
latent_dim = 128
num_samples = 1000
num_tokens = 28
Tx = 262

#encoder part with a bi-LSTM with dropout
encoder_inputs = Input(shape=(Tx, num_tokens))
encoder = Bidirectional(LSTM(latent_dim, return_sequences=True ,dropout = .7))
encoder_outputs = encoder(encoder_inputs)
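#encoder_outputs has shape (m, 262, 2*latent_dim) = (m, 262, 256): Bidirectional concatenates the forward and backward sequences by default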

#decoder part with a regular LSTM
decoder_lstm = LSTM(latent_dim*2, return_state=True)
decoder_dense = Dense(num_tokens, activation='softmax')

#initialize the parameters needed for computing attention alphas 
s0 = Input(shape=(latent_dim*2,))
c0 = Input(shape=(latent_dim*2,))
s = s0
c = c0
outputs=[]

#run one attention step per target timestep (Ty = num_tokens-1 = 27 outputs)
for t in range(num_tokens-1):
    context = one_step_attention(encoder_outputs, s)
    s, _, c = decoder_lstm(context, initial_state=[s, c])
    out = decoder_dense(s)
    outputs.append(out)

#define the model and connect the graph
model = Model([encoder_inputs, s0, c0], outputs)

#select optimizer, loss, early_stopping
model.compile(optimizer='adam', loss='categorical_crossentropy')
keras_callbacks = [EarlyStopping(monitor='val_loss', patience=30)]

#prepare empty arrays for s0 and c0 and put the target data in the same form as the model outputs
s0 = np.zeros((encoder_input_data.shape[0], latent_dim*2))
c0 = np.zeros((encoder_input_data.shape[0], latent_dim*2))
outputs = list(decoder_target_data.swapaxes(0,1))
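#swapaxes(0, 1) turns decoder_target_data (m, 28, 28) into a list of 28 arrays of shape (m, 28),
#one per decoder timestep (this list has to line up with the model's output heads)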

#fit the model with the expected dimensions of input/output
model.fit(
    [encoder_input_data, s0, c0],
    outputs,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.1,
    callbacks=keras_callbacks
)

#save and download the model
model.save('s2s.h5')
files.download("s2s.h5")

When training, I get the following output:

Train on 11671 samples, validate on 1297 samples
Epoch 1/50
11671/11671 [==============================] - 719s 62ms/step - loss: 86.7096 - dense_77_loss: 1.0157 - val_loss: 85.6579 - val_dense_77_loss: 0.5682
Epoch 2/50
11671/11671 [==============================] - 672s 58ms/step - loss: 87.6775 - dense_77_loss: 2.0322 - val_loss: 88.3077 - val_dense_77_loss: 2.4503
Epoch 3/50
11671/11671 [==============================] - 670s 57ms/step - loss: 86.1718 - dense_77_loss: 0.6686 - val_loss: 85.1008 - val_dense_77_loss: 0.1771
Epoch 4/50
11671/11671 [==============================] - 666s 57ms/step - loss: 85.1310 - dense_77_loss: 0.1196 - val_loss: 84.8357 - val_dense_77_loss: 0.0205
Epoch 5/50
11671/11671 [==============================] - 666s 57ms/step - loss: 84.7977 - dense_77_loss: 0.0173 - val_loss: 84.7414 - val_dense_77_loss: 0.0072
Epoch 6/50
11671/11671 [==============================] - 655s 56ms/step - loss: 87.8612 - dense_77_loss: 2.4636 - val_loss: 87.3005 - val_dense_77_loss: 1.3145
Epoch 7/50
11671/11671 [==============================] - 662s 57ms/step - loss: 88.1340 - dense_77_loss: 2.5091 - val_loss: 89.6831 - val_dense_77_loss: 4.6627
Epoch 8/50
11671/11671 [==============================] - 666s 57ms/step - loss: 88.2948 - dense_77_loss: 2.6113 - val_loss: 86.4465 - val_dense_77_loss: 0.1490
Epoch 9/50
11671/11671 [==============================] - 666s 57ms/step - loss: 87.3295 - dense_77_loss: 1.8405 - val_loss: 85.1743 - val_dense_77_loss: 0.1448
Epoch 10/50
11671/11671 [==============================] - 661s 57ms/step - loss: 85.0535 - dense_77_loss: 0.1180 - val_loss: 84.8204 - val_dense_77_loss: 0.0236
Epoch 11/50
11671/11671 [==============================] - 662s 57ms/step - loss: 84.7884 - dense_77_loss: 0.0179 - val_loss: 84.7479 - val_dense_77_loss: 0.0050
Epoch 12/50
11671/11671 [==============================] - 665s 57ms/step - loss: 87.0466 - dense_77_loss: 1.9977 - val_loss: 89.4181 - val_dense_77_loss: 4.4239
Epoch 13/50
 1216/11671 [==>...........................] - ETA: 9:34 - loss: 89.7864 - dense_77_loss: 4.8242

Any help would be greatly appreciated!

Tags: python, keras, lstm, bidirectional, seq2seq

Solution

