Fully Connected Neural Networks
![Fully connected neural network diagram](http://image1.bubuko.com/info/202107/20210726223441349804.jpg)
Preface:
This article offers some ways of understanding the backpropagation algorithm. Complete derivations of backpropagation are rare; most sources just state a few formulas without a detailed derivation, which is very unfriendly to beginners from outside mathematics.
This article sidesteps the Hadamard product and instead uses Jacobian matrices (see the multivariable part of a calculus course) for the explanation and derivation. This may look more complicated, but it should be friendlier to beginners who have only studied linear algebra.
My abilities are limited; corrections are welcome.
0. Definitions
1. The input and output of every layer are one-dimensional column vectors. If the previous layer's output is a k×1 column vector and the current layer's output is a j×1 column vector, then the weight matrix has dimensions j×k and the bias has dimensions j×1.
2. Output of layer \(l\) before the activation function: \(z^l = W^l a^{l-1} + b^l\)
3. Output of layer \(l\) after the activation function: \(a^l = \sigma(z^l)\)
4. Loss function: take \(C = \frac{1}{2}\|a^l - y\|_2^2\) as the example
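To make these definitions concrete, here is a minimal NumPy sketch of one layer's forward pass; the sigmoid activation, the sizes k = 4 and j = 3, and all variable names are illustrative choices of mine, not something fixed by the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

k, j = 4, 3                     # previous layer width k, current layer width j
a_prev = np.random.randn(k, 1)  # a^{l-1}: k x 1 column vector
W = np.random.randn(j, k)       # W^l: j x k weight matrix
b = np.random.randn(j, 1)       # b^l: j x 1 bias

z = W @ a_prev + b              # z^l = W^l a^{l-1} + b^l  -> shape (j, 1)
a = sigmoid(z)                  # a^l = sigma(z^l)         -> shape (j, 1)
```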
1. The intermediate quantity \(\delta^l\)
For convenience of computation we introduce the intermediate quantity \(\delta^l\), called the \(\delta\)-error of layer \(l\): the partial derivative of the loss function with respect to the pre-activation output of layer \(l\), i.e.:
\[\begin{aligned}
& \delta^l = \frac{\partial C}{\partial z^l} = \frac{\partial C}{\partial a^l}\frac{\partial a^l}{\partial z^l} = (a^l - y) * \sigma'(z^l) \qquad ①
\end{aligned}
\]
Note: '*' denotes the Hadamard product, i.e. element-wise multiplication, to be distinguished from matrix multiplication.
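Continuing the sketch above, and assuming a sigmoid activation so that \(\sigma'(z) = \sigma(z)(1-\sigma(z))\), equation ① is a single element-wise product:

```python
def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

y = np.random.randn(j, 1)           # target vector, same shape as a^l
delta = (a - y) * sigmoid_prime(z)  # equation ①: '*' is element-wise in NumPy
```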
2. The weight matrix and the bias
\[\begin{aligned}
& \frac{\partial C}{\partial W^l} = \frac{\partial C}{\partial z^l}\frac{\partial z^l}{\partial W^l} = \delta^l (a^{l-1})^T \qquad ② \\
& \frac{\partial C}{\partial b^l} = \frac{\partial C}{\partial z^l}\frac{\partial z^l}{\partial b^l} = \delta^l \cdot 1 = \delta^l \qquad ③
\end{aligned}
\]
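In code, equations ② and ③ come out as one outer product and one copy (again continuing the sketch; `delta` and `a_prev` are from the snippets above):

```python
dW = delta @ a_prev.T   # equation ②: (j,1) @ (1,k) -> (j,k), matches W's shape
db = delta              # equation ③: gradient w.r.t. the bias is delta itself
```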
3. The previous layer's error \(\delta^{l-1}\)
\[\begin{aligned}
& \delta^{l-1} = \frac{\partial C}{\partial z^{l-1}} = \frac{\partial C}{\partial z^{l}}\frac{\partial z^{l}}{\partial z^{l-1}} = \delta^{l}\frac{\partial z^{l}}{\partial z^{l-1}} \\
& \because\ z^{l} = W^{l}a^{l-1} + b^{l},\quad a^{l-1} = \sigma(z^{l-1}) \\
& \therefore\ \delta^{l-1} = (W^{l})^T\delta^{l} * \sigma'(z^{l-1}) \qquad ④
\end{aligned}
\]
Proceeding in this way, the error of every layer except the input layer can be obtained.
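A minimal sketch of this backward recursion, under the same sigmoid assumption (the function name `backprop_delta` and the cached `z_prev` are hypothetical):

```python
def backprop_delta(W_next, delta_next, z):
    # equation ④: delta^{l-1} = (W^l)^T delta^l, element-wise times sigma'(z^{l-1})
    return (W_next.T @ delta_next) * sigmoid_prime(z)

z_prev = np.random.randn(k, 1)                 # stand-in for the cached z^{l-1}
delta_prev = backprop_delta(W, delta, z_prev)  # shape (k, 1)
```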
4. Updating the parameters with gradient descent
Once the \(\delta\)-error of every layer is known, equations ② and ③ give the gradient of the loss function C with respect to each layer's parameters:
\[\begin{aligned}
& \frac{\partial C}{\partial W^l} = \frac{\partial C}{\partial z^l}\frac{\partial z^l}{\partial W^l} = \delta^l (a^{l-1})^T \\
& \frac{\partial C}{\partial b^l} = \frac{\partial C}{\partial z^l}\frac{\partial z^l}{\partial b^l} = \delta^l \cdot 1 = \delta^l
\end{aligned}
\]
Update the parameters:
\[\begin{aligned}
& W^l = W^l - \eta\frac{\partial C}{\partial W^l} \\
& b^l = b^l - \eta\frac{\partial C}{\partial b^l}
\end{aligned}
\]
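Putting ① through ④ together, one complete gradient-descent step for a toy two-layer network might look like the following self-contained sketch (the layer sizes, the learning rate `eta`, and the sigmoid activation are all arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1)); y = rng.normal(size=(2, 1))
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(3, 1))
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=(2, 1))
eta = 0.1                                    # learning rate, arbitrary

# forward pass, caching z^l and a^l for every layer
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# backward pass: equation ① at the output, equation ④ one layer back
delta2 = (a2 - y) * sigmoid_prime(z2)
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)

# gradients from ② and ③, then the gradient-descent update
W2 -= eta * (delta2 @ a1.T); b2 -= eta * delta2
W1 -= eta * (delta1 @ x.T);  b1 -= eta * delta1
```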
The above is based on: link
Detailed explanation of the formulas (1, 2, 3)
1. The derivations above are somewhat difficult, but if you want to learn deep learning well, understanding these formulas is indispensable.
2. Given my own limitations, this is as far as my understanding goes; if anything is wrong, corrections are welcome.
Background needed to understand the following:
1. Advanced calculus: multivariable calculus
2. Linear algebra
3. Jacobian matrices and the chain rule (listed separately for a reason)
1. The intermediate quantity \(\delta^l\)
\[\begin{aligned}
& \delta^l = \frac{\partial C}{\partial z^l} = \frac{\partial C}{\partial a^l}\frac{\partial a^l}{\partial z^l} = (a^l - y) * \sigma'(z^l) \qquad ①
\end{aligned}
\]
Preliminaries
\[\begin{aligned}
& C = \frac{1}{2}\|a^l - y\|_2^2 = \frac{1}{2}\left[(a_1^l - y_1)^2 + (a_2^l - y_2)^2 + \dots + (a_j^l - y_j)^2\right] \\
& a^l = \sigma(z^l) \\
& z^l = \left[ {\begin{matrix} z_1^l & z_2^l & \dots & z_j^l \end{matrix}} \right]^T
\end{aligned}
\]
Derivation
\[\begin{aligned}
\delta^l = \frac{\partial C}{\partial z^l}
&= \left[ {\begin{matrix}
\frac{\partial C}{\partial z_1^l} &
\frac{\partial C}{\partial z_2^l} &
\dots &
\frac{\partial C}{\partial z_j^l}
\end{matrix}} \right]^T \quad \text{(Jacobian matrix)} \\
&= \left[ {\begin{matrix}
\frac{\partial C}{\partial a_1^l}\frac{\partial a_1^l}{\partial z_1^l} &
\frac{\partial C}{\partial a_2^l}\frac{\partial a_2^l}{\partial z_2^l} &
\dots &
\frac{\partial C}{\partial a_j^l}\frac{\partial a_j^l}{\partial z_j^l}
\end{matrix}} \right]^T \\
&= \left[ {\begin{matrix}
(a_1^l - y_1)\sigma'(z_1^l) &
(a_2^l - y_2)\sigma'(z_2^l) &
\dots &
(a_j^l - y_j)\sigma'(z_j^l)
\end{matrix}} \right]^T \\
&= (a^l - y) * \sigma'(z^l)
\end{aligned}
\]
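One way to convince yourself of this component-wise result is a numerical check: perturb each \(z_i^l\), recompute C by finite differences, and compare with equation ①. A sketch, reusing the `sigmoid` helpers from the earlier snippets and again assuming a sigmoid activation:

```python
def loss(z, y):
    # C = 1/2 * ||sigma(z) - y||^2, the squared-error loss from section 0
    a = sigmoid(z)
    return 0.5 * np.sum((a - y) ** 2)

z = np.random.randn(5, 1)
y = np.random.randn(5, 1)
analytic = (sigmoid(z) - y) * sigmoid_prime(z)   # equation ①

eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.shape[0]):
    dz = np.zeros_like(z); dz[i] = eps
    # central difference approximates dC/dz_i
    numeric[i] = (loss(z + dz, y) - loss(z - dz, y)) / (2 * eps)

print(np.allclose(analytic, numeric))            # expected: True
```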
2. The weight matrix and the bias
The weight matrix
\[\begin{aligned}
& \frac{\partial C}{\partial W^l} = \frac{\partial C}{\partial z^l}\frac{\partial z^l}{\partial W^l} = \delta^l (a^{l-1})^T \qquad ②
\end{aligned}
\]
Preliminaries
\[\begin{aligned}
& C = \frac{1}{2}\|a^l - y\|_2^2 = \frac{1}{2}\left[(a_1^l - y_1)^2 + (a_2^l - y_2)^2 + \dots + (a_j^l - y_j)^2\right] \\
& a^l = \sigma(z^l) \\
& a^l = \left[ {\begin{matrix} a_1^l & a_2^l & \dots & a_j^l \end{matrix}} \right]^T \qquad
a^{l-1} = \left[ {\begin{matrix} a_1^{l-1} & a_2^{l-1} & \dots & a_k^{l-1} \end{matrix}} \right]^T \\
& z^l = \left[ {\begin{matrix} z_1^l & z_2^l & \dots & z_j^l \end{matrix}} \right]^T \\
& W_{j \times k} =
\left[ {\begin{matrix}
w_{11} & w_{12} & \dots & w_{1k} \\
w_{21} & w_{22} & \dots & w_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
w_{j1} & w_{j2} & \dots & w_{jk}
\end{matrix}} \right]
= \left[ {\begin{matrix} w_1^l & w_2^l & \dots & w_j^l \end{matrix}} \right]^T \\
& a_j^l = \sigma\Big(\sum_{k}{w_{jk}^l a_k^{l-1}} + b_j^l\Big) = \sigma(z_j^l) \\
& z_j^l = \sum_{k}{w_{jk}^l a_k^{l-1}} + b_j^l
\end{aligned}
\]
Derivation
\[\begin{aligned}
\frac{\partial C}{\partial W^l}
&= \left[ {\begin{matrix}
\frac{\partial C}{\partial a_1^l}\frac{\partial a_1^l}{\partial z_1^l}\frac{\partial z_1^l}{\partial w_1^l} &
\frac{\partial C}{\partial a_2^l}\frac{\partial a_2^l}{\partial z_2^l}\frac{\partial z_2^l}{\partial w_2^l} &
\dots &
\frac{\partial C}{\partial a_j^l}\frac{\partial a_j^l}{\partial z_j^l}\frac{\partial z_j^l}{\partial w_j^l}
\end{matrix}} \right]^T \\
&= \left[ {\begin{matrix}
\frac{\partial C}{\partial a_1^l}\frac{\partial a_1^l}{\partial z_1^l}\frac{\partial z_1^l}{\partial w_{11}^l} &
\frac{\partial C}{\partial a_1^l}\frac{\partial a_1^l}{\partial z_1^l}\frac{\partial z_1^l}{\partial w_{12}^l} &
\dots &
\frac{\partial C}{\partial a_1^l}\frac{\partial a_1^l}{\partial z_1^l}\frac{\partial z_1^l}{\partial w_{1k}^l} \\
\frac{\partial C}{\partial a_2^l}\frac{\partial a_2^l}{\partial z_2^l}\frac{\partial z_2^l}{\partial w_{21}^l} &
\frac{\partial C}{\partial a_2^l}\frac{\partial a_2^l}{\partial z_2^l}\frac{\partial z_2^l}{\partial w_{22}^l} &
\dots &
\frac{\partial C}{\partial a_2^l}\frac{\partial a_2^l}{\partial z_2^l}\frac{\partial z_2^l}{\partial w_{2k}^l} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial C}{\partial a_j^l}\frac{\partial a_j^l}{\partial z_j^l}\frac{\partial z_j^l}{\partial w_{j1}^l} &
\frac{\partial C}{\partial a_j^l}\frac{\partial a_j^l}{\partial z_j^l}\frac{\partial z_j^l}{\partial w_{j2}^l} &
\dots &
\frac{\partial C}{\partial a_j^l}\frac{\partial a_j^l}{\partial z_j^l}\frac{\partial z_j^l}{\partial w_{jk}^l}
\end{matrix}} \right] \quad \text{(Jacobian matrix)} \\
&= \delta^l (a^{l-1})^T
\end{aligned}
\]
Since \(\frac{\partial z_i^l}{\partial w_{im}^l} = a_m^{l-1}\), the entry in row \(i\), column \(m\) is \(\delta_i^l\,a_m^{l-1}\): exactly the outer product \(\delta^l (a^{l-1})^T\).
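The same finite-difference trick can confirm that the whole Jacobian collapses to the outer product of equation ② (a sketch, again reusing the sigmoid helpers defined earlier):

```python
k, j = 4, 3
a_prev = np.random.randn(k, 1); y = np.random.randn(j, 1)
W = np.random.randn(j, k);      b = np.random.randn(j, 1)

def loss_W(W):
    # C as a function of the weight matrix alone
    return 0.5 * np.sum((sigmoid(W @ a_prev + b) - y) ** 2)

z = W @ a_prev + b
delta = (sigmoid(z) - y) * sigmoid_prime(z)
analytic = delta @ a_prev.T                      # equation ②

eps = 1e-6
numeric = np.zeros_like(W)
for r in range(j):
    for c in range(k):
        dW = np.zeros_like(W); dW[r, c] = eps
        numeric[r, c] = (loss_W(W + dW) - loss_W(W - dW)) / (2 * eps)

print(np.allclose(analytic, numeric))            # expected: True
```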
The bias
The same reasoning applies (this one should not be hard to see): since \(\frac{\partial z^l}{\partial b^l}\) is the identity matrix, equation ③ follows at once.
3. The previous layer's error \(\delta^{l-1}\)
\[\begin{aligned}
& \delta^{l-1} = (W^{l})^T\delta^{l} * \sigma'(z^{l-1}) \qquad ④
\end{aligned}
\]
Preliminaries
\[\begin{aligned}
& z^{l-1} = \left[ {\begin{matrix} z_1^{l-1} & z_2^{l-1} & \dots & z_k^{l-1} \end{matrix}} \right]^T \\
& C = \frac{1}{2}\|a^l - y\|_2^2 = \frac{1}{2}\left[(a_1^l - y_1)^2 + (a_2^l - y_2)^2 + \dots + (a_j^l - y_j)^2\right] \\
& a_j^l = \sigma\Big(\sum_{k}{w_{jk}^l a_k^{l-1}} + b_j^l\Big) = \sigma(z_j^l) \\
& z_j^l = \sum_{k}{w_{jk}^l a_k^{l-1}} + b_j^l \\
& a_k^{l-1} = \sigma(z_k^{l-1})
\end{aligned}
\]
Derivation
\[\begin{aligned}
\delta^{l-1} = \frac{\partial C}{\partial z^{l-1}}
&= \left[ {\begin{matrix}
\frac{\partial C}{\partial z_1^{l-1}} &
\frac{\partial C}{\partial z_2^{l-1}} &
\dots &
\frac{\partial C}{\partial z_k^{l-1}}
\end{matrix}} \right]^T \\
&= \left[ {\begin{matrix}
\sum_{i=1}^{j}{\frac{\partial C}{\partial a_i^{l}}\frac{\partial a_i^{l}}{\partial z_i^{l}}\frac{\partial z_i^{l}}{\partial a_1^{l-1}}\frac{\partial a_1^{l-1}}{\partial z_1^{l-1}}} &
\sum_{i=1}^{j}{\frac{\partial C}{\partial a_i^{l}}\frac{\partial a_i^{l}}{\partial z_i^{l}}\frac{\partial z_i^{l}}{\partial a_2^{l-1}}\frac{\partial a_2^{l-1}}{\partial z_2^{l-1}}} &
\dots &
\sum_{i=1}^{j}{\frac{\partial C}{\partial a_i^{l}}\frac{\partial a_i^{l}}{\partial z_i^{l}}\frac{\partial z_i^{l}}{\partial a_k^{l-1}}\frac{\partial a_k^{l-1}}{\partial z_k^{l-1}}} 
\end{matrix}} \right]^T \\
&= (W^{l})^T\delta^{l} * \sigma'(z^{l-1})
\end{aligned}
\]
Note: the summation appears because the loss function C contains \([a_1^l\ a_2^l\ \dots\ a_j^l]\), and each current-layer output \(a_i^l\) is influenced by every neuron output \(a_k^{l-1}\) of the previous layer (if this is still unclear, look again at the fully connected network diagram and the formulas in the preliminaries).
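As a last sanity check, the same finite-difference comparison over \(z^{l-1}\) confirms equation ④, summation and all (a sketch reusing the sigmoid helpers from earlier):

```python
k, j = 4, 3
W = np.random.randn(j, k); b = np.random.randn(j, 1)
z_prev = np.random.randn(k, 1); y = np.random.randn(j, 1)

def loss_from_zprev(z_prev):
    a_prev = sigmoid(z_prev)                   # a^{l-1} = sigma(z^{l-1})
    return 0.5 * np.sum((sigmoid(W @ a_prev + b) - y) ** 2)

z = W @ sigmoid(z_prev) + b
delta = (sigmoid(z) - y) * sigmoid_prime(z)
analytic = (W.T @ delta) * sigmoid_prime(z_prev)   # equation ④

eps = 1e-6
numeric = np.zeros_like(z_prev)
for i in range(k):
    dz = np.zeros_like(z_prev); dz[i] = eps
    # each component implicitly sums the contributions of all j output neurons
    numeric[i] = (loss_from_zprev(z_prev + dz) - loss_from_zprev(z_prev - dz)) / (2 * eps)

print(np.allclose(analytic, numeric))              # expected: True
```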
Closing remarks:
That is about all. Overall, this article offers some ways of thinking about these formulas; to truly understand them, there is no way around working through the derivations on paper from start to finish.
2021.7.26 ghb
Derivation of the backpropagation algorithm for fully connected neural networks
Original: https://www.cnblogs.com/430442-CmjAndGhb/p/15062772.html