OutputProjectionWrapper vs fully connected layer on top of RNN

Problem description

I'm reading the 14th chapter of Hands-On Machine Learning with Scikit-Learn and TensorFlow. It says:

Although using an OutputProjectionWrapper is the simplest solution to reduce the dimensionality of the RNN’s output sequences down to just one value per time step (per instance), it is not the most efficient. There is a trickier but more efficient solution: you can reshape the RNN outputs, then apply a single fully connected layer with the appropriate output size. [...] This can provide a significant speed boost since there is just one fully connected layer instead of one per time step.
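For context, the reshape-then-project pattern the book describes looks roughly like this in the TF 1.x API (a minimal sketch; n_steps, n_inputs, n_neurons and n_outputs are placeholder sizes, not values from the book):

import tensorflow as tf  # TF 1.x graph API

n_steps, n_inputs, n_neurons, n_outputs = 20, 1, 100, 1  # placeholder sizes

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])

cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
rnn_outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)

# Flatten all time steps into one big batch, apply a single dense layer, reshape back.
stacked_rnn_outputs = tf.reshape(rnn_outputs, [-1, n_neurons])      # (batch * n_steps, n_neurons)
stacked_outputs = tf.layers.dense(stacked_rnn_outputs, n_outputs)   # (batch * n_steps, n_outputs)
outputs = tf.reshape(stacked_outputs, [-1, n_steps, n_outputs])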

This makes no sense to me. In the case of OutputProjectionWrapper we need to perform two operations per time step:

  1. Calculate the new hidden state based on the previous hidden state and the input.
  2. Calculate the output by applying the dense layer to the new hidden state.

Of course, when we use a plain BasicRNNCell + dense layer on top, we only need to do one operation at each time step (the first one), but then we need to pipe each output tensor through our dense layer. So we need to perform the very same number of operations in both cases.

Also, I can't understand the following part:

This can provide a significant speed boost since there is just one fully connected layer instead of one per time step.

Don't we have only one fully connected layer in both cases? As far as I understand, OutputProjectionWrapper uses the same shared layer at each time step. I don't even see how it could create a different layer for every time step, because OutputProjectionWrapper has no information about the number of time steps we will be using.

I will be very grateful if someone can explain the difference between these approaches.

UPD Here is pseudocode for the question. Am I missing something?

# 2 time steps, x1 and x2 - inputs, h1 and h2 - hidden states, y1 and y2 - outputs.

# OutputProjectionWrapper
h1 = calc_hidden(x1, 0)
y1 = dense(h1)
h2 = calc_hidden(x2, h1)
y2 = dense(h2)

# BasicRNNCell + dense layer on top of all time steps
h1 = calc_hidden(x1, 0)
y1 = h1
h2 = calc_hidden(x2, h1)
y2 = h2

y1 = dense(y1)
y2 = dense(y2)

UPD 2 I've created two small code snippets (one with OutputProjectionWrapper and another with BasicRNNCell and tf.layers.dense on top) - both created 14 variables with the same shapes. So there is definitely no memory difference between these approaches.
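For reference, the OutputProjectionWrapper variant is built along these lines (again a TF 1.x sketch with placeholder sizes, to be compared with the reshape + tf.layers.dense sketch above):

import tensorflow as tf  # TF 1.x graph API

n_steps, n_inputs, n_neurons, n_outputs = 20, 1, 100, 1  # placeholder sizes

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])

# The wrapper adds one shared projection (dense layer) that is applied at every time step.
cell = tf.contrib.rnn.OutputProjectionWrapper(
    tf.contrib.rnn.BasicRNNCell(num_units=n_neurons),
    output_size=n_outputs)
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)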

Tags: tensorflow, machine-learning, deep-learning, recurrent-neural-network

Solution


My guess is that, because of matrix multiplication optimizations, applying one layer to a tensor of shape (x, n) is faster than applying the same layer n times to a tensor of shape (x).
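As a rough illustration of that claim (a standalone NumPy sketch, not code from the question): one matrix multiplication over all time steps at once tends to be faster than looping over the steps, even though the total number of multiply-adds is identical.

import time
import numpy as np

batch, n_steps, n_neurons, n_outputs = 64, 50, 100, 1  # arbitrary sizes

rnn_outputs = np.random.rand(batch, n_steps, n_neurons).astype(np.float32)
W = np.random.rand(n_neurons, n_outputs).astype(np.float32)  # shared projection weights

# One projection applied to all time steps at once (reshape + dense style).
t0 = time.time()
stacked = rnn_outputs.reshape(-1, n_neurons)             # (batch * n_steps, n_neurons)
once = (stacked @ W).reshape(batch, n_steps, n_outputs)
t_once = time.time() - t0

# The same projection applied separately at every time step (wrapper style).
t0 = time.time()
per_step = np.stack([rnn_outputs[:, t, :] @ W for t in range(n_steps)], axis=1)
t_step = time.time() - t0

assert np.allclose(once, per_step)   # identical results, different execution pattern
print(t_once, t_step)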

