lab2 word2vec
part 1 Understanding word2vec
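In word2vec, the conditional probability of an outside word given a center word is obtained by taking vector dot products and applying the softmax function:
\[P(O=o|C=c) = \frac{\exp(u_o^Tv_c)}{\sum_{w\in Vocab}\exp(u_w^Tv_c)}
\]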
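Here \(u_o\) is the outside vector of word \(o\) and \(v_c\) is the center vector of word \(c\). To hold these parameters, the computation uses two parameter matrices \(U\) and \(V\): \(U\) is the outside-vector matrix, each column of which is the outside vector of one word, and \(V\) is the center-vector matrix, each column of which is the center vector of one word.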
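To make the definition concrete, here is a minimal sketch that evaluates the softmax probability for a toy vocabulary. The values, the names `U_cols`, `v_c`, `dot`, and `softmax`, and the choice to store \(U\) as a list of its columns are all assumptions made for illustration.

```python
import math

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary of 3 words, embedding dimension 2 (made-up values).
# U_cols[w] is the outside vector u_w, i.e. column w of U.
U_cols = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v_c = [0.5, -0.5]

scores = [dot(u_w, v_c) for u_w in U_cols]  # u_w^T v_c for every word w
y_hat = softmax(scores)                     # y_hat[o] = P(O=o | C=c)
```

The entries of `y_hat` sum to 1, and the word whose outside vector has the largest dot product with `v_c` receives the largest probability.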
The loss function for this method is
\[J_{naive-softmax}(v_c,o,U) = -\log P(O=o|C=c)
\]
(a) Show that the naive-softmax loss is equivalent to the cross-entropy loss.
The cross-entropy loss is
\[-\sum_{w\in Vocab}y_w\log(\hat{y_w})
\]
Since \(y\) is a one-hot vector with \(y_o=1\) and all other entries 0, the expression above becomes
\[-\sum_{w\in Vocab}y_w\log(\hat{y_w}) = -\log(\hat{y_o}) = -\log P(O=o|C=c)
\]
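The equivalence can be checked numerically. The sketch below uses a hypothetical softmax output `y_hat` and true index `o`; both are made-up values, not taken from the assignment.

```python
import math

y_hat = [0.2, 0.7, 0.1]  # hypothetical softmax output over 3 words
o = 1                    # index of the true outside word

# One-hot target: 1 at position o, 0 elsewhere.
y = [1.0 if w == o else 0.0 for w in range(len(y_hat))]

cross_entropy = -sum(y_w * math.log(p_w) for y_w, p_w in zip(y, y_hat))
naive_softmax = -math.log(y_hat[o])
```

Every term with \(y_w=0\) drops out of the sum, so the two quantities agree.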
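(b) Compute the partial derivative of the loss above with respect to \(v_c\), expressing the result in terms of \(y,\hat{y},U\).
Recall the softmax function \(S_i = \frac{\exp(a_i)}{\sum_j\exp(a_j)}\). Writing \(sum = \sum_j\exp(a_j)\), the partial derivative of \(S_i\) with respect to \(a_i\) is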
\[\frac{\partial S_i}{\partial a_i} =\frac{e^{a_i}sum - e^{a_i}e^{a_i}}{sum^2} = \frac{e^{a_i}}{sum} \frac{sum-e^{a_i}}{sum} = S_i(1-S_i)
\]
Hence
\[\frac{\partial -\log S_i}{\partial a_i} = S_i-1
\]
The partial derivative of \(S_i\) with respect to \(a_j\) (\(i\neq j\)) is
\[\frac{\partial S_i}{\partial a_j} =-\frac{ e^{a_i}e^{a_j}}{sum^2} = - S_iS_j
\]
Hence
\[\frac{\partial -\log S_i}{\partial a_j} = S_j
\]
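From these we can obtain the partial derivative of \(J\) with respect to \(v_c\). Let \(a_i = u_i^Tv_c\). Recall that \(y\) is a one-hot vector with \(y_o=1\) and 0 everywhere else, and that \(\hat{y}\) is the output of the softmax function, so \(\hat{y}_i=S_i\). Then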
\[\frac{\partial J(v_c,o,U)}{\partial v_c} = \sum_i\frac{\partial J}{\partial a_i}\frac{\partial a_i}{\partial v_c} = (S_o-1)u_o+\sum_{w\neq o,w\in vocab}S_wu_w=\sum_{w\in vocab}S_wu_w-u_o=U(\hat{y}-y)
\]
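A finite-difference check is a common way to gain confidence in a result like \(U(\hat{y}-y)\). The sketch below is one such check under assumed toy values; the helper names (`loss`, `grad_v_c`) and the column-list layout `U_cols` are illustrative choices, not part of the assignment code.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss(v_c, o, U_cols):
    """Naive-softmax loss -log P(O=o | C=c)."""
    return -math.log(softmax([dot(u, v_c) for u in U_cols])[o])

def grad_v_c(v_c, o, U_cols):
    """Analytic gradient U(y_hat - y), with U_cols[w] = column w of U."""
    y_hat = softmax([dot(u, v_c) for u in U_cols])
    diff = [p - (1.0 if w == o else 0.0) for w, p in enumerate(y_hat)]
    return [sum(d * u[k] for d, u in zip(diff, U_cols))
            for k in range(len(v_c))]

# Made-up toy values: 3 words, dimension 2.
U_cols = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2]]
v_c, o, eps = [0.7, -0.1], 2, 1e-6

analytic = grad_v_c(v_c, o, U_cols)
numeric = []
for k in range(len(v_c)):
    plus, minus = list(v_c), list(v_c)
    plus[k] += eps
    minus[k] -= eps
    # Central difference approximation of dJ/dv_c[k].
    numeric.append((loss(plus, o, U_cols) - loss(minus, o, U_cols)) / (2 * eps))

max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
```

If the derivation is right, `max_error` should be tiny compared with the gradient entries themselves.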
(c) Compute the partial derivative of the loss above with respect to \(u_w\), treating the cases \(w=o\) and \(w\neq o\) separately, and express the result in terms of \(y,\hat{y},v_c\).
Let \(a_i=u_i^Tv_c\). Among the scores \(a\), only \(a_w\) depends on \(u_w\), so
\[\frac{\partial J(v_c,o,U)}{\partial u_w}=\frac{\partial J}{\partial a_w}\frac{\partial a_w}{\partial u_w}
\]
When \(w=o\),
\[\frac{\partial J(v_c,o,U)}{\partial u_w}=\frac{\partial J}{\partial a_w}\frac{\partial a_w}{\partial u_w}=(S_o-1)v_c
\]
When \(w \neq o\),
\[\frac{\partial J(v_c,o,U)}{\partial u_w}=\frac{\partial J}{\partial a_w}\frac{\partial a_w}{\partial u_w}=S_wv_c
\]
Hence
\[\frac{\partial J(v_c,o,U)}{\partial U} = [S_1v_c,S_2v_c,...,(S_o-1)v_c,...S_nv_c]=v_c(\hat{y}-y)^T
\]
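The same kind of finite-difference check applies to the gradient with respect to \(U\). In this sketch each column \(w\) of the analytic gradient is \((\hat{y}_w-y_w)v_c\), which is column \(w\) of \(v_c(\hat{y}-y)^T\). The toy values and helper names are assumptions for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss(v_c, o, U_cols):
    """Naive-softmax loss -log P(O=o | C=c)."""
    return -math.log(softmax([dot(u, v_c) for u in U_cols])[o])

# Made-up toy values: 3 words, dimension 2; U_cols[w] is column w of U.
U_cols = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2]]
v_c, o, eps = [0.7, -0.1], 0, 1e-6

y_hat = softmax([dot(u, v_c) for u in U_cols])
# Column w of dJ/dU is (y_hat_w - y_w) * v_c.
analytic = [[(p - (1.0 if w == o else 0.0)) * v for v in v_c]
            for w, p in enumerate(y_hat)]

numeric = []
for w in range(len(U_cols)):
    column = []
    for k in range(len(v_c)):
        plus = [list(u) for u in U_cols]
        minus = [list(u) for u in U_cols]
        plus[w][k] += eps
        minus[w][k] -= eps
        column.append((loss(v_c, o, plus) - loss(v_c, o, minus)) / (2 * eps))
    numeric.append(column)

max_error = max(abs(a - n)
                for col_a, col_n in zip(analytic, numeric)
                for a, n in zip(col_a, col_n))
```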
(d) Explore the properties of the sigmoid function and, when its argument is a vector \(\boldsymbol{x}\), compute its derivative with respect to \(\boldsymbol{x}\).
\[\sigma(x) = \frac{1}{1+e^{-x}} = \frac{e^x}{1+e^{x}}
\]
Basic properties of the sigmoid function:
\[\begin{aligned}
\sigma(-x) &=\frac{1}{1+e^{x}} = 1-\sigma(x)\\
\sigma'(x) &=\frac{e^x}{(e^x+1)^2} = \sigma(x)(1-\sigma(x))=\sigma(x)\sigma(-x)
\end{aligned}
\]
Since the sigmoid is applied elementwise, we have
\[\begin{aligned}
\frac{\partial \sigma(x_i)}{\partial x_i}&=\sigma'(x_i)\\
\frac{\partial \sigma(x_i)}{\partial x_j}&=0\quad(i\neq j)\\
\frac{\partial \sigma(\boldsymbol{x})}{\partial \boldsymbol{x}}&=\left[\frac{\partial \sigma(x_i)}{\partial x_j}\right]_{d\times d}=diag(\sigma'(\boldsymbol{x}))
\end{aligned}
\]
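As a quick illustration of the diagonal Jacobian, the sketch below builds the \(d\times d\) matrix of partial derivatives by central differences for a made-up input `x` and compares its diagonal with \(\sigma(x_i)(1-\sigma(x_i))\). The variable names are illustrative only.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

x = [-1.0, 0.0, 2.0]  # made-up input vector
eps = 1e-6
d = len(x)

# jacobian[i][j] approximates d sigma(x_i) / d x_j.
jacobian = [[0.0] * d for _ in range(d)]
for j in range(d):
    plus, minus = list(x), list(x)
    plus[j] += eps
    minus[j] -= eps
    for i in range(d):
        jacobian[i][j] = (sigmoid(plus[i]) - sigmoid(minus[i])) / (2 * eps)

expected_diag = [sigmoid(t) * (1.0 - sigmoid(t)) for t in x]
```

The off-diagonal entries come out as zero because perturbing \(x_j\) leaves \(\sigma(x_i)\) unchanged for \(i\neq j\).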
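(e) Now consider the negative-sampling loss, an alternative to the softmax loss. Select \(k\) outside words \(w_1,w_2,...,w_k\) with outside vectors \(u_1,u_2,u_3,...,u_k\); note that the true outside word satisfies \(o\notin \{w_i\}\). The negative-sampling loss is given below. Repeat the steps of (b) and (c) to compute its partial derivatives with respect to \(v_c\), \(u_o\), and each negative-sample vector \(u_i\).
\[J_{neg-sample}(v_c,o,U)=-\log(\sigma(u_o^Tv_c))-\sum_{i=1}^k\log(\sigma(-u_i^Tv_c))
\]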
Solution: we know that
\[\frac{d (-\log(\sigma(x)))}{d x}=-\frac{1}{\sigma(x)}\sigma(x)\sigma(-x)=-\sigma(-x)
\]
Hence
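\[\begin{aligned}
\frac{\partial J_{neg-sample}}{\partial v_c}&=-\sigma(-u_o^Tv_c)u_o+\sum_{i=1}^k\sigma(u_i^Tv_c)u_i \\
\frac{\partial J_{neg-sample}}{\partial u_o}&=-\sigma(-u_o^Tv_c)v_c\\
\frac{\partial J_{neg-sample}}{\partial u_i}&=\sigma(u_i^Tv_c)v_c\quad(i=1,\dots,k)
\end{aligned}
\]
The positive term keeps the minus sign from \(-\sigma(-x)\), while each negative-sample term picks up a second minus sign from the inner argument \(-u_i^Tv_c\) and therefore ends up positive. The last line assumes the \(k\) negative samples are distinct.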
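Because sign errors are easy to make in this derivation, a numerical check is worthwhile. The sketch below differentiates the negative-sampling loss by central differences and compares against the analytic forms \(-\sigma(-u_o^Tv_c)u_o+\sum_i\sigma(u_i^Tv_c)u_i\), \(-\sigma(-u_o^Tv_c)v_c\), and \(\sigma(u_i^Tv_c)v_c\). All values and helper names (`neg_loss`, `numeric_grad`) are assumptions for illustration, with two distinct negative samples.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_loss(v_c, u_o, negatives):
    """Negative-sampling loss for one (center, outside) pair."""
    pos = -math.log(sigmoid(dot(u_o, v_c)))
    neg = -sum(math.log(sigmoid(-dot(u_i, v_c))) for u_i in negatives)
    return pos + neg

def numeric_grad(f, x, eps=1e-6):
    """Central-difference gradient of scalar function f at list x."""
    grad = []
    for k in range(len(x)):
        plus, minus = list(x), list(x)
        plus[k] += eps
        minus[k] -= eps
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad

# Made-up toy values, dimension 2, k = 2 distinct negative samples.
v_c = [0.4, -0.3]
u_o = [0.2, 0.5]
negatives = [[-0.1, 0.3], [0.6, 0.1]]

s_o = sigmoid(-dot(u_o, v_c))
grad_v = [-s_o * u_o[k] + sum(sigmoid(dot(u, v_c)) * u[k] for u in negatives)
          for k in range(2)]
grad_u_o = [-s_o * v_c[k] for k in range(2)]
grad_u_1 = [sigmoid(dot(negatives[0], v_c)) * v_c[k] for k in range(2)]

num_v = numeric_grad(lambda v: neg_loss(v, u_o, negatives), v_c)
num_u_o = numeric_grad(lambda u: neg_loss(v_c, u, negatives), u_o)
num_u_1 = numeric_grad(lambda u: neg_loss(v_c, u_o, [u, negatives[1]]),
                       negatives[0])

max_error = max(abs(a - n) for a, n in
                zip(grad_v + grad_u_o + grad_u_1, num_v + num_u_o + num_u_1))
```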
(f) Suppose the center word is \(c=w_t\) and the context window is \([w_{t-m},...,w_{t-1},w_{t+1},...,w_{t+m}]\), where \(m\) is the size of the sliding window. Then
\[J_{skip-gram}(v_c,w_{t-m},...,w_{t+m},U) = \sum_{-m\leq j\leq m,j\neq 0}J(v_c,w_{t+j},U)
\]
It follows that
\[\begin{aligned}
\frac{\partial J_{sg}}{\partial U} &= \sum_{-m\leq j\leq m,j\neq 0}\frac{\partial J(v_c,w_{t+j},U)}{\partial U}\\
\frac{\partial J_{sg}}{\partial v_c}& =\sum_{-m\leq j\leq m,j\neq 0}\frac{\partial J(v_c,w_{t+j},U)}{\partial v_c} \\
\frac{\partial J_{sg}}{\partial v_w}&=0\quad(w\neq c)
\end{aligned}
\]
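In code, the skip-gram loss and its gradient are accumulated by looping over the context words. The sketch below does this for the gradient with respect to \(v_c\), using the naive-softmax loss as \(J\). The function names, the toy values, and the context indices are hypothetical and do not reproduce the assignment's starter code.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def naive_loss_and_grad(v_c, o, U_cols):
    """Loss -log y_hat[o] and its gradient U(y_hat - y) w.r.t. v_c."""
    y_hat = softmax([dot(u, v_c) for u in U_cols])
    diff = [p - (1.0 if w == o else 0.0) for w, p in enumerate(y_hat)]
    grad = [sum(d * u[k] for d, u in zip(diff, U_cols))
            for k in range(len(v_c))]
    return -math.log(y_hat[o]), grad

def skipgram_loss_and_grad(v_c, context_ids, U_cols):
    """Sum the per-word loss and gradient over the context window."""
    total_loss = 0.0
    total_grad = [0.0] * len(v_c)
    for o in context_ids:
        loss, grad = naive_loss_and_grad(v_c, o, U_cols)
        total_loss += loss
        total_grad = [a + b for a, b in zip(total_grad, grad)]
    return total_loss, total_grad

# Made-up toy values: 3 words, dimension 2, window size m = 2.
U_cols = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2]]
v_c = [0.7, -0.1]
context_ids = [0, 2, 2, 1]  # indices of w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}

sg_loss, sg_grad = skipgram_loss_and_grad(v_c, context_ids, U_cols)
```

Only the center word's vector \(v_c\) receives a nonzero gradient among the center vectors, which matches the last line of the derivation.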
cs224n assignment2 word2vec
Original post: https://www.cnblogs.com/YaokaiCheng/p/14451671.html