Keras: image-captioning network outputs a constant caption

Problem description

I am trying to build an image-captioning network in Keras, but after training the network outputs the same caption for every image.

Here is my model:

from keras.layers import Input, Dense, Embedding, RNN
from keras.models import Model
from keras.optimizers import Adam

# embedding_dim, vocab_size and units are defined elsewhere in the full code
input1 = Input(shape=(64, 2048))   # InceptionV3 region features
input2 = Input(shape=(40,))        # pre-padded caption prefix
encoder = Dense(embedding_dim, activation='relu')(input1)
emb = Embedding(vocab_size, embedding_dim, input_length=40)(input2)
cell = AttentionDecoderCell(units)
decoder = RNN(cell, return_sequences=False)(emb, constants=encoder)

model = Model(inputs=[input1, input2], outputs=decoder)
model.compile(optimizer=Adam(), loss='sparse_categorical_crossentropy')

Input1 is a feature vector extracted by InceptionV3 pretrained on ImageNet. Input2 is a pre-padded sequence. The network is trying to predict the next word.
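For concreteness, the (pre-padded prefix, next word) training pairs described above can be built roughly like this. This is only a sketch; the token ids and the helper `make_pairs` are made up for illustration, with the fixed length of 40 matching `Input2`'s shape:

```python
def make_pairs(seq, max_len=40):
    """Expand one tokenized caption into (pre-padded input, next-word target) pairs."""
    pairs = []
    for i in range(1, len(seq)):
        prefix = seq[:i]
        # pad from the front ("pre" padding) up to the fixed length
        padded = [0] * (max_len - len(prefix)) + prefix
        pairs.append((padded, seq[i]))
    return pairs

# toy caption: <start>=1, "a"=5, "dog"=9, <end>=2
pairs = make_pairs([1, 5, 9, 2])
```

Each pair then supplies one (Input2, target) example, while Input1 repeats the same image feature vector for every prefix of that caption.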

Most of the work happens in a custom cell:

from keras import backend as K
import keras


class AttentionDecoderCell(keras.layers.GRUCell):
    """GRU cell with Bahdanau-style attention over the image features."""

    def __init__(self, units, **kwargs):
        super(AttentionDecoderCell, self).__init__(units, **kwargs)

    def build(self, input_shape):
        # input_shape is a list: [caption-embedding shape, constants (features) shape]
        cap_shape = input_shape[0]
        features_shape = input_shape[1]
        self.a_w1 = self.add_weight(
            shape=(self.units, self.units), initializer="glorot_uniform",
            trainable=True, name="attention_w1"
        )
        self.a_w2 = self.add_weight(
            shape=(features_shape[2], self.units), initializer="glorot_uniform",
            trainable=True, name="attention_w2"
        )
        self.a_v = self.add_weight(
            shape=(self.units, 1), initializer="glorot_uniform",
            trainable=True, name="attention_v"
        )
        self.out_w = self.add_weight(
            shape=(self.units, vocab_size), initializer="glorot_uniform",
            trainable=True, name="cell_output_weights"
        )
        self.out_b = self.add_weight(
            shape=(vocab_size,), initializer="zeros",
            trainable=True, name="cell_output_bias"
        )
        super(AttentionDecoderCell, self).build(input_shape[0])

    def call(self, inputs, states, constants, training=None):
        state = states[0]          # previous GRU state, (batch, units)
        constant = constants[0]    # encoded image features, (batch, 64, feat_dim)

        # Bahdanau attention: score each image region against the current state
        state_with_time = K.expand_dims(state, 1)
        attention_hidden = K.tanh(K.dot(state_with_time, self.a_w1)
                                  + K.dot(constant, self.a_w2))
        score = K.dot(attention_hidden, self.a_v)
        # squeeze the trailing singleton dim; safer than K.reshape with symbolic dims
        score = K.squeeze(score, axis=-1)
        attention_weights = K.softmax(score)
        attention_weights = K.expand_dims(attention_weights, 2)
        context_vector = K.sum(attention_weights * constant, axis=1)

        # feed the word embedding concatenated with the context into the GRU
        gru_in = K.concatenate([inputs, context_vector], axis=-1)
        output, new_states = super(AttentionDecoderCell, self).call(
            gru_in, states, training=training)

        # project the GRU output to a distribution over the vocabulary
        output = K.softmax(K.dot(output, self.out_w) + self.out_b)
        return output, new_states

    def get_config(self):
        return super(AttentionDecoderCell, self).get_config()
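The attention arithmetic inside `call` can be sanity-checked in isolation. A pure-NumPy sketch of the same Bahdanau-style computation (toy dimensions and random weights, purely to verify the shapes; `feat_dim` stands for the projected feature dimension):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

batch, n_regions, feat_dim, units = 2, 64, 128, 32
rng = np.random.default_rng(0)

state = rng.normal(size=(batch, units))                   # decoder hidden state
features = rng.normal(size=(batch, n_regions, feat_dim))  # per-region image features
w1 = rng.normal(size=(units, units))                      # a_w1
w2 = rng.normal(size=(feat_dim, units))                   # a_w2
v = rng.normal(size=(units, 1))                           # a_v

hidden = np.tanh(state[:, None, :] @ w1 + features @ w2)  # (batch, 64, units)
score = (hidden @ v).squeeze(-1)                          # (batch, 64)
weights = softmax(score)                                  # sums to 1 over the regions
context = (weights[:, :, None] * features).sum(axis=1)    # (batch, feat_dim)
```

If the weights collapse to near-uniform for every image, the context vector carries little image-specific signal, which is one thing worth checking here.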

My training loss looks like this: [training loss plot]

When I try to test the network, it outputs the same caption for every image; in other words, the output does not seem to depend on the input image.
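For reference, test-time captioning is a greedy loop that feeds each predicted word back into the prefix. A minimal sketch of that loop (with a stand-in `predict_next` in place of the real model, and made-up start/end token ids, purely to show the mechanics):

```python
START, END, MAX_LEN = 1, 2, 40

def generate_caption(predict_next, max_len=MAX_LEN):
    """Greedy decoding: repeatedly append the argmax next-word prediction."""
    seq = [START]
    for _ in range(max_len - 1):
        nxt = predict_next(seq)
        seq.append(nxt)
        if nxt == END:
            break
    return seq

# stand-in predictor: always emits word 7, then the end token
def fake_predict(seq):
    return 7 if len(seq) < 3 else END

caption = generate_caption(fake_predict)
```

If this loop produces the same caption regardless of the image features passed in, the predictions are being dominated by the language prior rather than the image input.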

I haven't pasted the complete code here because it is quite long, but you can find it here: full code

After spending a lot of time on this, I'm no longer sure which parts might be unclear, so please ask if you'd like me to explain anything in more detail.

Please help!

Tags: python, machine-learning, keras, deep-learning

Solution

