TimeDistributed and LSTM in a keyword spotter

Problem description

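I am working on a keyword spotter that processes audio input and returns the audio class based on a list of speech commands, similar to the one shown here: https://www.tensorflow.org/tutorials/audio/simple_audio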

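Instead of processing only a single 1-second audio clip as input, I would like to process multiple audio frames, say 5 time steps with a 10 ms stride, and feed them into the machine learning model.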

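Essentially, this amounts to adding a TimeDistributed layer on top of my network. The second thing I want to do is add an LSTM layer before the dense layer that maps my hidden layer to the output classes.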

My question: how can I efficiently change the code below to add a TimeDistributed layer that takes multiple time steps, plus an LSTM layer?

Starting code:

model = models.Sequential([
    layers.Input(shape=input_shape),
    preprocessing.Resizing(32, 32), 
    norm_layer,
    layers.Conv2D(32, 3, activation='relu'),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_labels),
])

Model summary:

Input shape: (124, 129, 1)
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
resizing (Resizing)          (None, 32, 32, 1)         0         
_________________________________________________________________
normalization (Normalization (None, 32, 32, 1)         3         
_________________________________________________________________
conv2d (Conv2D)              (None, 30, 30, 32)        320       
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 28, 28, 64)        18496     
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 14, 14, 64)        0         
_________________________________________________________________
dropout (Dropout)            (None, 14, 14, 64)        0         
_________________________________________________________________
flatten (Flatten)            (None, 12544)             0         
_________________________________________________________________
dense (Dense)                (None, 128)               1605760   
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 8)                 1032      
=================================================================
Total params: 1,625,611
Trainable params: 1,625,608
Non-trainable params: 3
_________________________________________________________________

Attempt 1: adding an LSTM layer

model = models.Sequential([
    layers.Input(shape=input_shape),
    preprocessing.Resizing(32, 32), 
    norm_layer,
    layers.Conv2D(32, 3, activation='relu'),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Flatten(),
    layers.LSTM(32, activation='relu', input_shape=(1,128,98)),
    layers.Dense(num_labels),
])

Error: ValueError: Input 0 of layer lstm_5 is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: [None, 128]

Attempt 2: adding TimeDistributed layers

model = models.Sequential([
    layers.Input(shape=input_shape),
    preprocessing.Resizing(32, 32), 
    norm_layer,
    TimeDistributed(layers.Conv2D(32, 3, activation='relu'), input_shape=(None, 32, 32, 1)),
    TimeDistributed(layers.Conv2D(64, 3, activation='relu'), input_shape=(None, 30, 30, 1)),
    TimeDistributed(layers.MaxPooling2D()),
    TimeDistributed(layers.Dropout(0.25)),
    TimeDistributed(layers.Flatten()),
    TimeDistributed(layers.Dense(128, activation='relu')),
    TimeDistributed(layers.Dropout(0.5)),
    TimeDistributed(layers.Flatten()),
    layers.Dense(num_labels),
])

Error: ValueError: Input 0 of layer conv2d_43 is incompatible with the layer: : expected min_ndim=4, found ndim=3. Full shape received: [None, 32, 1]

I know there is a problem with my dimensions, but I am not sure how to proceed.

Tags: python, tensorflow, keras, lstm, speech-recognition

Solution


The LSTM layer expects its input to be a 3D tensor of shape [batch, timesteps, feature]. Example code snippet:

import tensorflow as tf
inputs = tf.random.normal([32, 10, 8])
lstm = tf.keras.layers.LSTM(4)
output = lstm(inputs)
print(output.shape)
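
To connect this with the error in Attempt 1, here is a minimal sketch, assuming the tensor reaching the LSTM has the same [batch, 128] shape as the output of Dense(128) in the question. The batch size of 4 and the random values are placeholders. It adds a timesteps axis before calling the LSTM and also shows what return_sequences=True does to the output shape.

```python
import tensorflow as tf

# A 2D tensor [batch, features], shaped like the Dense(128) output in Attempt 1.
# The batch size of 4 is an arbitrary placeholder.
features = tf.random.normal([4, 128])

lstm = tf.keras.layers.LSTM(32)

# Calling lstm(features) directly would fail with the ndim error,
# so insert a timesteps axis of length 1: [batch, 1, features].
sequence = tf.expand_dims(features, axis=1)
output = lstm(sequence)
print(output.shape)  # (4, 32)

# With several timesteps and return_sequences=True, the LSTM returns
# one output vector per timestep instead of only the last one.
seq_lstm = tf.keras.layers.LSTM(32, return_sequences=True)
per_step = seq_lstm(tf.random.normal([4, 5, 128]))
print(per_step.shape)  # (4, 5, 32)
```

A single timestep makes the shapes line up but gives the LSTM no temporal context to work with, so in practice the time axis should probably come from the input data itself.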

tf.keras.layers.TimeDistributed expects an input tensor of shape (batch, time, ...).

Working example code

inputs = tf.keras.Input(shape=(10, 128, 128, 3))
conv_2d_layer = tf.keras.layers.Conv2D(64, (3, 3))
outputs = tf.keras.layers.TimeDistributed(conv_2d_layer)(inputs)
print(outputs.shape)  # (None, 10, 126, 126, 64)
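
Putting the two pieces together for the model in the question: the Attempt 2 error most likely comes from the Input layer having no time axis, so TimeDistributed treats the first spatial dimension (32) as time and hands Conv2D a 3D tensor of shape [None, 32, 1]. The following is a sketch rather than a definitive fix. It assumes 5 frames per example, each with the (124, 129, 1) spectrogram shape from the model summary, and 8 output labels. The adapted norm_layer is left out to keep the example self-contained, and tf.keras.layers.Resizing assumes a TensorFlow version where that layer is no longer under the experimental preprocessing module.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

timesteps = 5                # assumed number of frames per example
frame_shape = (124, 129, 1)  # spectrogram shape from the model summary
num_labels = 8               # matches dense_1 in the model summary

model = models.Sequential([
    # The input now carries an explicit time axis: (time, height, width, channels).
    layers.Input(shape=(timesteps, *frame_shape)),
    # Each wrapped layer is applied independently to every frame.
    layers.TimeDistributed(layers.Resizing(32, 32)),
    layers.TimeDistributed(layers.Conv2D(32, 3, activation='relu')),
    layers.TimeDistributed(layers.Conv2D(64, 3, activation='relu')),
    layers.TimeDistributed(layers.MaxPooling2D()),
    layers.TimeDistributed(layers.Dropout(0.25)),
    layers.TimeDistributed(layers.Flatten()),
    layers.TimeDistributed(layers.Dense(128, activation='relu')),
    layers.Dropout(0.5),
    # The tensor here is [batch, timesteps, 128], which is what LSTM expects.
    layers.LSTM(32),
    layers.Dense(num_labels),
])

# Random placeholder batch of 2 examples, only to check the shapes.
batch = tf.random.normal([2, timesteps, *frame_shape])
logits = model(batch)
print(logits.shape)  # (2, 8)
```

The LSTM keeps its default return_sequences=False here, so the model emits one set of logits per 5-frame example. If a prediction per frame is wanted instead, return_sequences=True would keep the time axis through to the final Dense layer.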
