Some Concepts of Differentiation and Gradient Descent in Neural Networks

wevolf 2019-05-03 20:34

Theory for f : \(\mathbb{R}^{n} \mapsto \mathbb{R}\)

First, define a piece of notation: the scalar product \(\langle a | b\rangle=\sum_{i=1}^{n} a_{i} b_{i}\).

We can then define the derivative via the formula:

\[f(x+h)=f(x)+\mathrm{d}_{x} f(h)+o_{h \rightarrow 0}(h) \]

Here \(o_{h \rightarrow 0}(h)\) denotes a remainder of the form \(\|h\|\, \epsilon(h)\) with \(\lim _{h \rightarrow 0} \epsilon(h)=0\).

The differential \(\mathrm{d}_{x} f: \mathbb{R}^{n} \mapsto \mathbb{R}\) is a linear map. For example, take:

\[f\left(\left( \begin{array}{l}{x_{1}} \\ {x_{2}}\end{array}\right)\right)=3 x_{1}+x_{2}^{2} \]

For \(\left( \begin{array}{l}{a} \\ {b}\end{array}\right) \in \mathbb{R}^{2}\) and \(h=\left( \begin{array}{l}{h_{1}} \\ {h_{2}}\end{array}\right) \in \mathbb{R}^{2}\), we have

\[\begin{aligned} f\left(\left( \begin{array}{c}{a+h_{1}} \\ {b+h_{2}}\end{array}\right)\right) &=3\left(a+h_{1}\right)+\left(b+h_{2}\right)^{2} \\ &=3 a+3 h_{1}+b^{2}+2 b h_{2}+h_{2}^{2} \\ &=3 a+b^{2}+3 h_{1}+2 b h_{2}+h_{2}^{2} \\ &=f(a, b)+3 h_{1}+2 b h_{2}+o(h) \end{aligned} \]

That is, \(\mathrm{d}_{\left( \begin{array}{l}{a} \\ {b}\end{array}\right)} f\left(\left( \begin{array}{l}{h_{1}} \\ {h_{2}}\end{array}\right)\right)=3 h_{1}+2 b h_{2}\).
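To make the definition concrete, here is a minimal numerical check, a sketch assuming numpy (the helper names `f` and `df` are just for illustration), that the remainder \(f(x+h)-f(x)-\mathrm{d}_{x} f(h)\) vanishes faster than \(\|h\|\):

```python
import numpy as np

def f(x):
    # The example above: f(x1, x2) = 3*x1 + x2**2
    return 3 * x[0] + x[1] ** 2

def df(x, h):
    # The differential derived above: d_x f(h) = 3*h1 + 2*b*h2, with b = x[1]
    return 3 * h[0] + 2 * x[1] * h[1]

x = np.array([1.0, 2.0])      # the point (a, b)
h = np.array([0.3, -0.2])     # a fixed direction

for t in [1e-1, 1e-2, 1e-3]:
    # The remainder should be o(t*h): even after dividing by t*||h|| it tends to 0
    remainder = f(x + t * h) - f(x) - df(x, t * h)
    print(t, remainder / (t * np.linalg.norm(h)))
```

The printed ratios shrink roughly linearly in \(t\), which is exactly what the \(o_{h \rightarrow 0}(h)\) term requires.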

Gradient Descent in Neural Networks

Vectorized Gradients

We can view a function \(\boldsymbol{f}: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}\) (in the linear case, simply a matrix) componentwise as \(\boldsymbol{f}(\boldsymbol{x})=\left[f_{1}\left(x_{1}, \ldots, x_{n}\right), f_{2}\left(x_{1}, \ldots, x_{n}\right), \ldots, f_{m}\left(x_{1}, \ldots, x_{n}\right)\right]\), where the vector \(\boldsymbol{x}=\left(x_{1}, \dots, x_{n}\right)\). The derivative of this vector-valued function with respect to the vector \(\boldsymbol{x}\) is:

\[\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}}=\left[ \begin{array}{ccc}{\frac{\partial f_{1}}{\partial x_{1}}} & {\cdots} & {\frac{\partial f_{1}}{\partial x_{n}}} \\ {\vdots} & {\ddots} & {\vdots} \\ {\frac{\partial f_{m}}{\partial x_{1}}} & {\cdots} & {\frac{\partial f_{m}}{\partial x_{n}}}\end{array}\right] \]

The matrix of partial derivatives above is called the Jacobian matrix. Its entries can be written as:

\[\left(\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}}\right)_{i j}=\frac{\partial f_{i}}{\partial x_{j}} \]
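As a quick illustration of the definition, the following sketch (assuming numpy; `numerical_jacobian` and the example map are invented for illustration, not library functions) approximates the Jacobian of a small \(\mathbb{R}^{3} \rightarrow \mathbb{R}^{2}\) map by central differences:

```python
import numpy as np

def f(x):
    # An example map from R^3 to R^2
    return np.array([x[0] * x[1], np.sin(x[2]) + x[0]])

def numerical_jacobian(f, x, eps=1e-6):
    # (df/dx)_{ij} = d f_i / d x_j, approximated by central differences
    m, n = f(x).size, x.size
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

x = np.array([1.0, 2.0, 0.5])
print(numerical_jacobian(f, x))
# Analytic Jacobian for comparison: [[x2, x1, 0], [1, 0, cos(x3)]]
```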

This matrix is very useful. A neural network can be understood as a function of a vector, \(\boldsymbol{f}(\boldsymbol{x})\), whose core building blocks are linear transformations of vectors, i.e., matrices. Backpropagation requires taking partial derivatives with respect to the parameters, and with multiple layers this becomes a chain of derivatives; the Jacobian matrix is exactly what the chain rule multiplies together for such vector-valued functions.

Consider the following example: \(f(x)=\left[f_{1}(x), f_{2}(x)\right]\) maps a scalar to a vector of size 2, and \(g(\boldsymbol{y})=\left[g_{1}\left(y_{1}, y_{2}\right), g_{2}\left(y_{1}, y_{2}\right)\right]\) maps a vector of size 2 to a vector of size 2. The composition of \(f\) and \(g\) is \(g \circ f\), that is, \(g(x)=\left[g_{1}\left(f_{1}(x), f_{2}(x)\right), g_{2}\left(f_{1}(x), f_{2}(x)\right)\right]\). Differentiating the composition gives:

\[\frac{\partial \boldsymbol{g}}{\partial x}=\left[ \begin{array}{c}{\frac{\partial}{\partial x} g_{1}\left(f_{1}(x), f_{2}(x)\right)} \\ {\frac{\partial}{\partial x} g_{2}\left(f_{1}(x), f_{2}(x)\right)}\end{array}\right]=\left[ \begin{array}{c}{\frac{\partial g_{1}}{\partial f_{1}} \frac{\partial f_{1}}{\partial x}+\frac{\partial g_{1}}{\partial f_{2}} \frac{\partial f_{2}}{\partial x}} \\ {\frac{\partial g_{2}}{\partial f_{1}} \frac{\partial f_{1}}{\partial x}+\frac{\partial g_{2}}{\partial f_{2}} \frac{\partial f_{2}}{\partial x}}\end{array}\right] \]

In essence this is the same as differentiating an ordinary composite function of one variable: the result equals the matrix product of the two Jacobian matrices.

\[\frac{\partial g}{\partial x}=\frac{\partial g}{\partial f} \frac{\partial f}{\partial x}=\left[ \begin{array}{ll}{\frac{\partial g_{1}}{\partial f_{1}}} & {\frac{\partial g_{1}}{\partial f_{2}}} \\ {\frac{\partial g_{2}}{\partial f_{1}}} & {\frac{\partial g_{2}}{\partial f_{2}}}\end{array}\right] \left[ \begin{array}{c}{\frac{\partial f_{1}}{\partial x}} \\ {\frac{\partial f_{2}}{\partial x}}\end{array}\right] \]
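To see the chain rule at work, here is a minimal sketch (assuming numpy; the particular `f` and `g` are invented for illustration) comparing the product of the two Jacobians with a finite-difference derivative of the composition:

```python
import numpy as np

def f(x):
    # f: R -> R^2
    return np.array([np.sin(x), x ** 2])

def g(y):
    # g: R^2 -> R^2
    return np.array([y[0] * y[1], y[0] + y[1]])

def jac_f(x):
    # df/dx, a 2x1 Jacobian
    return np.array([[np.cos(x)], [2 * x]])

def jac_g(y):
    # dg/dy, a 2x2 Jacobian
    return np.array([[y[1], y[0]], [1.0, 1.0]])

x = 0.7
chain = jac_g(f(x)) @ jac_f(x)                       # (dg/df)(df/dx)

# Central-difference check of d(g∘f)/dx
eps = 1e-6
numeric = (g(f(x + eps)) - g(f(x - eps))) / (2 * eps)
print(chain.ravel(), numeric)                        # the two should agree closely
```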

Useful Identities

For a general matrix

We view a general matrix \(\boldsymbol{W} \in \mathbb{R}^{n \times m}\) as transforming an \(m\)-dimensional vector into an \(n\)-dimensional vector, which can be written \(\boldsymbol{z}=\boldsymbol{W} \boldsymbol{x}\), where:

\[z_{i}=\sum_{k=1}^{m} W_{i k} x_{k} \]

So each entry of the derivative is easy to compute:

\[\left(\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{x}}\right)_{i j}=\frac{\partial z_{i}}{\partial x_{j}}=\frac{\partial}{\partial x_{j}} \sum_{k=1}^{m} W_{i k} x_{k}=\sum_{k=1}^{m} W_{i k} \frac{\partial}{\partial x_{j}} x_{k}=W_{i j} \]

Therefore:

\[\frac{\partial z}{\partial x}=W \]

A general matrix can also be used in the form \(\boldsymbol{z}=\boldsymbol{x} \boldsymbol{W}\), where \(\boldsymbol{x}\) is a row vector (of dimension \(n\) for \(\boldsymbol{W} \in \mathbb{R}^{n \times m}\)); each entry of \(\boldsymbol{z}\) is then the inner product of \(\boldsymbol{x}\) with a column of \(\boldsymbol{W}\), i.e., \(\boldsymbol{z}\) is a linear combination of the rows of \(\boldsymbol{W}\). In this case:

\[\frac{\partial z}{\partial x}=\boldsymbol{W}^{T} \]
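Both identities are easy to sanity-check numerically. A minimal sketch, assuming numpy (with a small hypothetical `numerical_jacobian` helper):

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    # Central-difference approximation of (df/dx)_{ij} = d f_i / d x_j
    J = np.zeros((f(x).size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

n, m = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(n, m))
x_col = rng.normal(size=m)   # column-vector convention: z = W x
x_row = rng.normal(size=n)   # row-vector convention:    z = x W

print(np.allclose(numerical_jacobian(lambda v: W @ v, x_col), W))    # dz/dx = W
print(np.allclose(numerical_jacobian(lambda v: v @ W, x_row), W.T))  # dz/dx = W^T
```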

The general matrix from another angle

We can write the linear transformation as \(\boldsymbol{z}=\boldsymbol{W} \boldsymbol{x}\). Up to now we have always treated \(\boldsymbol{x}\) as the variable to differentiate with respect to; what does the derivative look like if we instead treat \(\boldsymbol{W}\) as the parameter?

Suppose we have:

\[z=\boldsymbol{W} \boldsymbol{x}, \boldsymbol{\delta}=\frac{\partial J}{\partial \boldsymbol{z}} \]

\[\frac{\partial J}{\partial \boldsymbol{W}}=\frac{\partial J}{\partial \boldsymbol{z}} \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{W}}=\delta \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{W}} \]

How do we compute this? For \(\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{W}}\) we can write:

\[\begin{aligned} z_{k} &=\sum_{l=1}^{m} W_{k l} x_{l} \\ \frac{\partial z_{k}}{\partial W_{i j}} &=\sum_{l=1}^{m} x_{l} \frac{\partial}{\partial W_{i j}} W_{k l} \end{aligned} \]

Since \(\frac{\partial W_{k l}}{\partial W_{i j}}\) equals 1 only when \(k=i\) and \(l=j\) (and 0 otherwise), the sum reduces to \(x_{j}\) when \(k=i\) and to 0 otherwise. That is, with \(x_{j}\) in the \(i\)-th position:

\[\frac{\partial z}{\partial W_{i j}}=\left[ \begin{array}{c}{0} \\ {\vdots} \\ {0} \\ {x_{j}} \\ {0} \\ {\vdots} \\ {0}\end{array}\right] \]

Therefore:

\[\frac{\partial J}{\partial W_{i j}}=\frac{\partial J}{\partial z} \frac{\partial z}{\partial W_{i j}}=\delta \frac{\partial z}{\partial W_{i j}}=\sum_{k=1}^{n} \delta_{k} \frac{\partial z_{k}}{\partial W_{i j}}=\delta_{i} x_{j} \]

So \(\frac{\partial J}{\partial \boldsymbol{W}}=\boldsymbol{\delta}^{T} \boldsymbol{x}^{T}\).

Similarly, if we write the product as \(\boldsymbol{z}=\boldsymbol{x} \boldsymbol{W}\), then \(\frac{\partial J}{\partial \boldsymbol{W}}=\boldsymbol{x}^{T} \boldsymbol{\delta}\).
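To check \(\frac{\partial J}{\partial \boldsymbol{W}}=\boldsymbol{\delta}^{T} \boldsymbol{x}^{T}\), here is a minimal sketch assuming numpy and a made-up scalar loss \(J = \boldsymbol{a} \boldsymbol{W} \boldsymbol{x}\) (chosen so that \(\boldsymbol{\delta} = \partial J / \partial \boldsymbol{z} = \boldsymbol{a}\)):

```python
import numpy as np

n, m = 4, 3
rng = np.random.default_rng(1)
W = rng.normal(size=(n, m))
x = rng.normal(size=(m, 1))      # column vector
a = rng.normal(size=(1, n))      # fixed coefficients defining the toy loss J = a W x

def J(W):
    return (a @ W @ x).item()

delta = a                        # dJ/dz for z = W x is the row vector a
analytic = delta.T @ x.T         # the identity dJ/dW = delta^T x^T

# Finite-difference check, entry by entry
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        E = np.zeros_like(W)
        E[i, j] = eps
        numeric[i, j] = (J(W + E) - J(W - E)) / (2 * eps)

print(np.allclose(analytic, numeric))   # True
```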

Example: a one-layer neural network

A simple neural network is described below; we use the cross-entropy loss as the objective for optimizing the parameters:

\[\begin{array}{l}{x=\text { input }} \\ {z=W x+b_{1}} \\ {h=\operatorname{ReLU}(z)} \\ {\theta=U h+b_{2}} \\ {\hat{y}=\operatorname{softmax}(\theta)} \\ {J=C E(y, \hat{y})}\end{array} \]

The dimensions of the quantities involved are:

\[\boldsymbol{x} \in \mathbb{R}^{D_{x} \times 1} \quad \boldsymbol{b}_{1} \in \mathbb{R}^{D_{h} \times 1} \quad \boldsymbol{W} \in \mathbb{R}^{D_{h} \times D_{x}} \quad \boldsymbol{b}_{2} \in \mathbb{R}^{N_{c} \times 1} \quad \boldsymbol{U} \in \mathbb{R}^{N_{c} \times D_{h}} \]

where \(D_{x}\) is the input dimension, \(D_{h}\) is the hidden-layer dimension, and \(N_{c}\) is the number of classes.

The gradients we need are:

\[\frac{\partial J}{\partial U} \quad \frac{\partial J}{\partial b_{2}} \quad \frac{\partial J}{\partial W} \quad \frac{\partial J}{\partial b_{1}} \quad \frac{\partial J}{\partial x} \]

These are all fairly easy to compute once we define:

\[\delta_{1}=\frac{\partial J}{\partial \theta} \quad \delta_{2}=\frac{\partial J}{\partial z} \]

\[\begin{aligned} \delta_{1} &=\frac{\partial J}{\partial \theta}=(\hat{y}-y)^{T} \\ \delta_{2} &=\frac{\partial J}{\partial z}=\frac{\partial J}{\partial \theta} \frac{\partial \theta}{\partial h} \frac{\partial h}{\partial z} \\ &=\delta_{1} \frac{\partial \theta}{\partial h} \frac{\partial h}{\partial z} \\ &=\delta_{1} U \frac{\partial h}{\partial z} \\ &=\delta_{1} U \circ \operatorname{ReLU}^{\prime}(z) \\ &=\delta_{1} U \circ \operatorname{sgn}(h) \end{aligned} \]
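Putting the pieces together, here is a minimal sketch (assuming numpy; the shapes and random data are made up) of the forward pass and the gradients built from \(\delta_1\) and \(\delta_2\) as derived above, with a finite-difference spot check on one entry of \(\frac{\partial J}{\partial W}\):

```python
import numpy as np

rng = np.random.default_rng(2)
Dx, Dh, Nc = 5, 4, 3

# Parameters and data
W  = rng.normal(size=(Dh, Dx));  b1 = rng.normal(size=(Dh, 1))
U  = rng.normal(size=(Nc, Dh));  b2 = rng.normal(size=(Nc, 1))
x  = rng.normal(size=(Dx, 1))
y  = np.zeros((Nc, 1)); y[1] = 1.0            # one-hot label

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def forward(W, b1, U, b2, x):
    z = W @ x + b1
    h = np.maximum(z, 0)                       # ReLU
    theta = U @ h + b2
    y_hat = softmax(theta)
    J = -(y.T @ np.log(y_hat)).item()          # cross-entropy loss
    return z, h, theta, y_hat, J

z, h, theta, y_hat, J = forward(W, b1, U, b2, x)

# Backward pass, following the derivation above (deltas are row vectors)
delta1 = (y_hat - y).T                         # dJ/dtheta
delta2 = (delta1 @ U) * np.sign(h).T           # dJ/dz = delta1 U ∘ sgn(h)

dU  = delta1.T @ h.T                           # dJ/dU  = delta1^T h^T
db2 = delta1.T                                 # dJ/db2
dW  = delta2.T @ x.T                           # dJ/dW  = delta2^T x^T
db1 = delta2.T                                 # dJ/db1
dx  = (delta2 @ W).T                           # dJ/dx

# Finite-difference check of one entry of dJ/dW
eps = 1e-6
i, j = 2, 3
E = np.zeros_like(W); E[i, j] = eps
numeric = (forward(W + E, b1, U, b2, x)[-1] - forward(W - E, b1, U, b2, x)[-1]) / (2 * eps)
print(dW[i, j], numeric)                       # should agree closely
```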

The backpropagation here is worked out for the cross-entropy loss; for the least-squares (mean squared error) case, the video below is recommended:

BP
