# DeBERTa is the new King!

The state of NLP changed completely in 2018, when researchers from Google open-sourced BERT (Bidirectional Encoder Representations from Transformers).

The idea of moving from a sequence-to-sequence Transformer model to self-supervised training of just the encoder, whose representations can then be used for downstream tasks such as classification, was mind-blowing. Ever since, efforts have been made to improve such encoder-based models in different ways to do better on NLP benchmarks. In 2019, Facebook AI open-sourced RoBERTa, which has been ruling as the best performer across tasks until now, but the throne seems to be shifting towards the new king, DeBERTa, first released by Microsoft Research in 2020. DeBERTa-v3 has beaten RoBERTa by big margins not only in recent NLP Kaggle competitions but also on major NLP benchmarks.

## Introduction

In this article, we will take a deep dive into the DeBERTa paper by Pengcheng He et al. (2020) and see how it improves over the previous state of the art, BERT and RoBERTa. We will also explore the results and techniques for using the model efficiently on downstream tasks. DeBERTa gets its name from the two novel techniques it introduces, through which it claims to improve over BERT and RoBERTa:

• Disentangled Attention Mechanism

• Enhanced Mask Decoder

Hence the name: Decoding-enhanced BERT with disentangled attention (DeBERTa).

To understand these techniques, we first need to understand how RoBERTa and other encoder-type networks work. Let’s call this context and discuss it in the next section.

## Getting some Context

In this section, we will discuss the working and flow of three key techniques that Transformer-based models introduced.

### Positional Encoding

A Transformer-based model is composed of stacked Transformer encoder blocks. Each block contains a multi-head self-attention layer followed by a fully connected position-wise feed-forward network. Replacing sequential RNNs with attention and feed-forward layers allowed the model to process all tokens in parallel, but because these layers are not sequential, they cannot by themselves capture positional information (i.e. which word sits at which position). To tackle this, the Transformer authors introduced positional encodings: a position vector is added to each word vector to produce the final vector representation of every word. Let us understand this through the example shown in the figure below, Embeddings in Transformers.

In our example sentence “I am a good boy”, we assume that tokenization takes place at the word level for simplicity's sake. After tokenization, we have the tokens [ I, am, a, good, boy ] and their respective positions [1, 2, 3, 4, 5]. Before sending the tokens into the Transformer, we convert them into vectors of a certain dimension, as we would for an LSTM, but here the vector for each token is the sum of its word vector and its position vector. For the token “I”, the final vector is (word_vector_of_I + position_vector_of_position_one), and similarly for all other tokens. Each token is thus represented by a vector whose value depends on both its content and its position, and during training the Transformer can learn to recover the position of a word from this combined signal.

In the DeBERTa paper, the authors argue that adding the position embedding and the word embedding together is not ideal, because the position information gets too mixed with the content signal of the word. This introduces noise, which leads to lower performance. They therefore propose the Disentangled Attention mechanism, which uses two separate vectors for content and position and computes attention with disentangled matrices over both vectors.
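The word-plus-position sum described above can be sketched in a few lines of plain Python. This is only an illustration: the embedding size, the random values standing in for learned embeddings, and the names `word_vectors`, `position_vectors` and `input_vector` are assumptions made for this example, not part of any library.

```python
import random

random.seed(0)
DIM = 4  # toy embedding size, chosen only for illustration

tokens = ["I", "am", "a", "good", "boy"]
positions = [1, 2, 3, 4, 5]


def random_vector(dim):
    """Stand-in for a learned embedding; real models learn these values."""
    return [random.uniform(-1.0, 1.0) for _ in range(dim)]


# Hypothetical lookup tables: one vector per word and one per position.
word_vectors = {tok: random_vector(DIM) for tok in tokens}
position_vectors = {pos: random_vector(DIM) for pos in positions}


def input_vector(token, position):
    """BERT-style input: element-wise sum of word and position vectors."""
    w = word_vectors[token]
    p = position_vectors[position]
    return [wi + pi for wi, pi in zip(w, p)]


inputs = [input_vector(t, p) for t, p in zip(tokens, positions)]
print(inputs[0])  # combined vector for "I" at position 1
```

Once the two vectors are summed, nothing in `inputs[0]` tells the model which part came from the word and which from the position. That mixing is what the DeBERTa authors object to.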

### Multi-Head Self-Attention

Standard self-attention computes, for every word in the input text, an attention weight that gauges the influence each other word has on it. This mechanism operates on an embedding vector in which position and content information are mixed, which helps the model understand the absolute positions of the words. Each token in the input text produces a query vector ($Q_i$) and a key vector ($K_i$), and the inner product of a query with a key yields an attention score. When we stack all queries and keys, we get the query matrix and the key matrix. Their product gives the attention matrix, where the entry $A_{ij}$ is the attention weight of $token_j$ on $token_i$, which gauges the influence of $token_j$ on $token_i$.
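As a rough sketch of this query/key computation, the snippet below implements single-head scaled dot-product self-attention in plain Python. The matrix sizes and random weights are assumptions for illustration, and the helper names (`matmul`, `softmax`, `self_attention`) are made up for this example. The scaling by $\sqrt{d}$ and the softmax follow the standard Transformer formulation.

```python
import math
import random

random.seed(0)


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


n_tokens, d = 5, 4               # 5 tokens ("I am a good boy"), toy dimension 4
H = random_matrix(n_tokens, d)   # stand-in for the input embeddings
W_q, W_k, W_v = (random_matrix(d, d) for _ in range(3))


def self_attention(H, W_q, W_k, W_v):
    Q, K, V = matmul(H, W_q), matmul(H, W_k), matmul(H, W_v)
    scale = math.sqrt(len(Q[0]))
    # A[i][j]: unnormalized score of token j when looking from token i.
    A = [[s / scale for s in row] for row in matmul(Q, transpose(K))]
    weights = [softmax(row) for row in A]
    return weights, matmul(weights, V)


weights, H_o = self_attention(H, W_q, W_k, W_v)
print(weights[1])  # how strongly the second token attends to every token
```

Each row of `weights` sums to 1, so row `i` can be read as how `token_i` distributes its attention over the sentence.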

However, self-attention cannot naturally gauge the relative positions of the words. This is where the positional encoding comes in. In addition, we use multiple heads instead of a single head, each doing the same thing on a different part of the embeddings, which allows the model to learn different representations of the same word. In the DeBERTa paper, the authors claim that this too is not ideal and that positions and contents should have separate signals.
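The idea of each head working on a different part of the vector can be illustrated with a small helper. The function `split_heads` is a hypothetical name used only here. In real Transformer implementations the split is applied to the projected queries, keys and values rather than to the raw embedding, so treat this purely as a picture of the slicing.

```python
def split_heads(vector, num_heads):
    """Slice one vector into equally sized chunks, one chunk per head."""
    if len(vector) % num_heads != 0:
        raise ValueError("vector length must be divisible by num_heads")
    head_dim = len(vector) // num_heads
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]


embedding = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
heads = split_heads(embedding, num_heads=2)
print(heads)  # [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
```

Each head then runs its own attention over its slice, and the per-head outputs are concatenated back into a single vector.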

### Masked Language Modeling

With the BERT paper, the authors introduced a self-supervised pre-training technique that revolutionized NLP. They showed that a Transformer encoder can be trained with a Masked Language Modeling objective (an objective where 15 percent of the tokens in an input sequence are masked and the model has to predict the masked tokens) together with Next Sentence Prediction to absorb knowledge of a language, and that the pre-trained model can then be used for downstream tasks in that language.
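A simplified sketch of the masking step is shown below. The function name `mask_tokens` and its parameters are assumptions for this example. Note that BERT's actual recipe does not always substitute `[MASK]` for a selected token (some selected tokens are replaced by a random token or left unchanged). This sketch deliberately skips that detail.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Independently mask each token with probability mask_prob.

    Returns the masked sequence and the labels the model must predict
    (None where a token was left untouched).
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK_TOKEN
            labels[i] = tok
    return masked, labels


sentence = ["I", "am", "a", "good", "boy"]
masked, labels = mask_tokens(sentence, mask_prob=0.15, seed=42)
print(masked)
print(labels)
```

During pre-training, the loss is computed only at positions where `labels` is not `None`, so the model is rewarded for reconstructing the hidden tokens from their context.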

## The Components of DeBERTa

Now that we have the context of how the competitors work and where they fall short, we can dive deep into how DeBERTa works and improves upon its predecessors.

### Disentangled Attention Mechanism

Unlike BERT, where each word in the input layer is represented by a single vector that is the sum of its word embedding and position embedding, in DeBERTa each word is represented by two vectors that encode its content and position respectively, and the attention weights among words are computed using disentangled matrices. This is motivated by the observation that the attention weight of a word pair depends not only on their contents but on their relative positions as well.

In BERT, as explained above, every token produces a single query vector and a single key vector, and their inner product gives the attention weights. Since every token is represented by a single vector, the equations look like this:

$$Q = HW_q,\quad K = HW_k,\quad V = HW_v,\quad A = \frac{QK^T}{\sqrt{d}},\quad H_o = \mathrm{softmax}(A)V$$

where $H$ is the hidden state or embedding matrix for the whole input, $W_q$, $W_k$ and $W_v$ are the linear projection matrices for queries, keys and values respectively, $A$ is the attention matrix and $H_o$ is the output of self-attention.

In DeBERTa, a token at position $i$ in a sequence is represented using two vectors, $H_i$ and $P_{i|j}$, which represent its content and its relative position with respect to the token at position $j$. Let’s go back to our example to understand this better. In “I am a good boy”, if we have to calculate how $token_2$ (“am”) attends to $token_5$ (“boy”), we first get two vectors for $token_2$: $H_2$ (the word vector for “am”) and $P_{2|5}$ (the relative position vector of position 2 with respect to position 5, i.e. an offset of +3). Let’s look at the picture below to understand better.

General equation:

$$A_{i,j} = \{H_i, P_{i|j}\} \times \{H_j, P_{j|i}\}^T = H_i H_j^T + H_i P_{j|i}^T + P_{i|j} H_j^T + P_{i|j} P_{j|i}^T$$

where $A_{i,j}$ is the attention weight for $token_j$ when looking from $token_i$. This new disentangled attention is a sum of four components, whereas previously in BERT and RoBERTa it was the single term $H_i H_j^T$.
Thus this mechanism can capture much more information than standard self-attention. Let’s look at the components:

• Content to Content

• Content to Position

• Position to Content

• Position to Position

Content to Content is similar to standard self-attention, where each word looks at all the other words in the input text and tries to gauge their importance to itself.

The Content to Position term can be interpreted as $token_i$ (“am” in our case) trying to find out which positions around it are important to look at, and from which positions it should request more information than others. For example, let’s say the model has already figured out that “I” should come before “am”, so for the token “am” the information about “I” is no longer of much use. Using this term in attention, the word “am” can decide: since I already know that the word before me is “I”, I want to look at the words after me.

The Position to Content term can be interpreted as $token_i$ saying: I am at position $i$, so which words in the input sentence should I look at with respect to this position $i$ so that I can be better at predicting masked tokens?

The Position to Position term is not that useful because we are already working with relative positions, and the authors drop it from their implementation.
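The four terms discussed above can be computed directly for the example pair (“am”, “boy”). The sketch below is a simplification under stated assumptions: the content vectors `H` and the relative-position table `P` are random stand-ins for learned embeddings, the table is keyed by the offset from position $i$ to position $j$ (so +3 for positions 2 and 5, matching the example in this article), and the learned projection matrices that the paper applies to content and position vectors are omitted. The name `disentangled_score` is hypothetical.

```python
import random

random.seed(0)
DIM = 4  # toy dimension, chosen only for illustration


def random_vector(dim):
    return [random.uniform(-1.0, 1.0) for _ in range(dim)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


tokens = {1: "I", 2: "am", 3: "a", 4: "good", 5: "boy"}

# Content vectors, one per token position (stand-ins for learned embeddings).
H = {pos: random_vector(DIM) for pos in tokens}
# Relative-position vectors keyed by offset; -4..+4 covers a 5-token sentence.
P = {offset: random_vector(DIM) for offset in range(-4, 5)}


def disentangled_score(i, j):
    """Return the four components of A[i][j] and their sum."""
    P_i_given_j = P[j - i]  # position of i relative to j, e.g. offset +3 for (2, 5)
    P_j_given_i = P[i - j]  # position of j relative to i
    terms = {
        "content_to_content": dot(H[i], H[j]),
        "content_to_position": dot(H[i], P_j_given_i),
        "position_to_content": dot(P_i_given_j, H[j]),
        "position_to_position": dot(P_i_given_j, P_j_given_i),
    }
    terms["total"] = sum(terms.values())
    return terms


score = disentangled_score(2, 5)  # how "am" attends to "boy"
for name, value in score.items():
    print(f"{name}: {value:.4f}")
```

Removing the `position_to_position` entry from `terms` gives the three-term variant, which keeps only the parts that involve content.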

In this way, by feeding in relative position information at each step and keeping it separate from content information, DeBERTa can gather more information about the words as well as their relative positions.