NLP’s state of the art changed completely in 2018, when researchers at Google open-sourced BERT (Bidirectional Encoder Representations from Transformers). The idea of moving from a sequence-to-sequence transformer to self-supervised training of just the encoder, whose representations can then be reused for downstream tasks such as classification, was mind-blowing. Ever since, efforts have been made to improve such encoder-based models in different ways to do better on NLP benchmarks. In 2019, Facebook AI open-sourced RoBERTa, which ruled as the best performer across tasks, but the throne now seems to be shifting to a new king: DeBERTa, released by Microsoft Research. DeBERTa-v3 has beaten RoBERTa by large margins, not only in recent NLP Kaggle competitions but also on major NLP benchmarks.
In this article, we will take a deep dive into the DeBERTa paper by Pengcheng He et al. (2020) and see how it improves over BERT and RoBERTa. We will also explore the results and the techniques for using the model efficiently on downstream tasks. DeBERTa gets its name from the two novel techniques it introduces, through which it claims to improve over BERT and RoBERTa:
- Disentangled Attention Mechanism
- Enhanced Mask Decoder
Decoding-enhanced BERT with disentangled attention (DeBERTa)
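Before we get to the details, here is a toy sketch of the disentangled-attention idea to keep in mind: each token carries a content vector and a (relative) position vector, and the attention score decomposes into content-to-content, content-to-position, and position-to-content terms. This is an illustrative simplification with made-up dimensions and random weights, not the paper's exact formulation (which uses relative-position buckets and shared projections).

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 4, 8  # toy sizes, chosen for illustration

H = rng.normal(size=(seq_len, d))  # content embeddings (one per token)
P = rng.normal(size=(seq_len, d))  # position embeddings (simplified)

# Separate query/key projections for content and position (assumed names)
Wq_c, Wk_c = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Wq_p, Wk_p = rng.normal(size=(d, d)), rng.normal(size=(d, d))

Qc, Kc = H @ Wq_c, H @ Wk_c  # content queries and keys
Qp, Kp = P @ Wq_p, P @ Wk_p  # position queries and keys

c2c = Qc @ Kc.T  # content-to-content scores
c2p = Qc @ Kp.T  # content-to-position scores
p2c = Qp @ Kc.T  # position-to-content scores

# Sum the three terms and scale, as in scaled dot-product attention
scores = (c2c + c2p + p2c) / np.sqrt(3 * d)

# Numerically stable row-wise softmax
scores -= scores.max(axis=-1, keepdims=True)
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

print(attn.shape)  # one attention distribution per query token
```

The key contrast with BERT is that position information is not just added into the input embeddings once; it participates in every attention score through its own projections. The next sections make this precise.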
To understand the above techniques, the first step is to understand how RoBERTa and other encoder-style networks work. Let’s call this the context and discuss it in the next section.