
XLNet and RoBERTa


XLNet


But the AE language model also has its disadvantages. It uses [MASK] tokens during pretraining, but these artificial symbols are absent from real data at fine-tuning time, resulting in a pretrain-finetune discrepancy. Another disadvantage of [MASK] is that it assumes the predicted (masked) tokens are independent of each other given the unmasked tokens. For example, take the sentence "It shows that the housing crisis was turned into a banking crisis" and mask "banking" and "crisis". Note that the masked "banking" and "crisis" are implicitly related to each other, yet the AE model predicts "banking" given the unmasked tokens and predicts "crisis" given the unmasked tokens separately. It ignores the relation between "banking" and "crisis"; in other words, it assumes the predicted (masked) tokens are independent of each other, even though a model should learn such correlations among the masked tokens in order to predict any one of them.
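
Schematically (this formulation is an illustration added here, not part of the quoted explanation), writing $c$ for the unmasked context, the two model families factorize the joint probability of the masked words differently:

$$
p_{\text{AE}}(\text{banking},\ \text{crisis} \mid c) \approx p(\text{banking} \mid c)\, p(\text{crisis} \mid c)
$$

$$
p_{\text{AR}}(\text{banking},\ \text{crisis} \mid c) = p(\text{banking} \mid c)\, p(\text{crisis} \mid c,\ \text{banking})
$$

The AE objective drops the dependency between the two masked tokens, while an autoregressive factorization keeps it.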


Consider the sentence "I like cats more than dogs". A traditional language model would predict the tokens in the order

"I", "like", "cats", "more", "than", "dogs"

where each token uses all previous tokens as context.

In permutation language modeling, the order of prediction is not necessarily left to right and is sampled randomly instead. For instance, it could be

"cats", "than", "I", "more", "dogs", "like"

where "than" would be conditioned on seeing "cats", "I" would be conditioned on seeing "cats,?than" and so on. The following animation demonstrates this.


The input still looks like x1, x2, x3, x4; by applying different attention mask matrices, the current token x_i is only allowed to see the tokens that precede it in the permuted order x3 -> x2 -> x4 -> x1.
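
Below is a minimal NumPy sketch (not from the original post, and not XLNet's actual two-stream implementation) of how such a mask matrix could be built for the permuted order x3 -> x2 -> x4 -> x1 while the input is fed in its natural order x1, x2, x3, x4:

```python
import numpy as np

# Sampled factorization order, as 0-based positions: x3, x2, x4, x1
order = [2, 1, 3, 0]
T = len(order)

# rank[pos] = the step at which position `pos` is predicted in the permuted order
rank = np.empty(T, dtype=int)
for step, pos in enumerate(order):
    rank[pos] = step

# mask[i, j] == 1 means token x_{i+1} may attend to token x_{j+1}: only tokens
# that come earlier in the permuted order are visible to x_{i+1}.
mask = np.zeros((T, T), dtype=int)
for i in range(T):
    for j in range(T):
        if rank[j] < rank[i]:
            mask[i, j] = 1

print(mask)
# Row for x1 (predicted last) attends to x2, x3 and x4;
# row for x3 (predicted first) attends to nothing.
```

In XLNet itself this idea is realized with two attention streams (a content stream that can see the current token and a query stream that cannot), but the principle of masking out "future" tokens in the permuted order is the same.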


RoBERTa


Differences from BERT

Facebook researchers found that hyperparameter choices have a significant impact on BERT's final results, so they revisited BERT's pretraining procedure, measuring the effect of different hyperparameters and training-set sizes, and found that BERT was significantly undertrained.

With these adjustments, BERT can match or exceed the performance of every model published after it. These results highlight the importance of previously overlooked design choices.

RoBERTa differs from BERT in that it relies on pretraining with much more data and on changing the masking pattern of the training data; it also removes next sentence prediction (NSP).

RoBERTa's modifications are simple and include:

  1. Training for longer, with larger batches and more data;
  2. Removing the next sentence prediction (NSP) objective;
  3. Training on longer sequences;
  4. Dynamically changing the masking pattern applied to the training data (a sketch of this follows the list). (The original BERT implementation performed masking once during data preprocessing, resulting in a single static mask. To avoid using the same mask for each training instance in every epoch, training data was duplicated 10 times so that each sequence was masked in 10 different ways over the 40 epochs of training; thus, each training sequence was seen with the same mask four times during training.)
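
The following is a minimal sketch of the difference between static and dynamic masking. It is an illustration only: it ignores BERT's 80%/10%/10% mask/random/keep replacement rule, special tokens, and whole-word masking, and the names `mask_tokens` and `mask_prob` are placeholders rather than anything from the BERT/RoBERTa code.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Return a copy of `tokens` with roughly `mask_prob` of positions replaced by [MASK]."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]

sentence = "it shows that the housing crisis was turned into a banking crisis".split()

# Static masking (original BERT): the mask is sampled once at preprocessing time
# and the same corrupted sequence is reused in every epoch.
static_masked = mask_tokens(sentence)
for epoch in range(3):
    model_input = static_masked          # identical mask every epoch

# Dynamic masking (RoBERTa): a fresh mask is sampled every time the sequence is
# fed to the model, so each epoch sees a different corruption.
for epoch in range(3):
    model_input = mask_tokens(sentence)  # a new mask each epoch
```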


