Consider the input sentence: "I am good".
In RNNs, we feed the sentence to the network word by word. That is, first the word "I" is passed as input, next the word "am" is passed, and so on. We feed the sentence word by word so that our network understands the sentence completely.
But the transformer network does not use a recurrence mechanism. So, instead of feeding the sentence word by word, we feed all the words in the sentence to the network in parallel. Feeding the words in parallel helps decrease training time and also helps the network learn long-term dependencies.
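To make the contrast concrete, here is a minimal NumPy sketch of the two feeding styles. The toy embed() lookup, the embedding size d_model, and the placeholder rnn_cell() recurrence are illustrative assumptions, not part of any actual RNN or transformer implementation:

```python
import numpy as np

d_model = 4                                    # assumed embedding size
vocab = {"I": 0, "am": 1, "good": 2}
embedding_table = np.random.rand(len(vocab), d_model)

def embed(word):
    return embedding_table[vocab[word]]        # toy word-to-vector lookup

# RNN-style feeding: one word per time step; the hidden state carries
# information about earlier words forward to later steps
def rnn_cell(x, h):
    return np.tanh(x + h)                      # placeholder recurrence

hidden = np.zeros(d_model)
for word in "I am good".split():
    hidden = rnn_cell(embed(word), hidden)

# Transformer-style feeding: stack all word embeddings into a single input
# matrix and hand them to the network in one step (no loop over time steps)
X = np.stack([embed(w) for w in "I am good".split()])   # shape: (3, d_model)
```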
When we feed the words to the transformer in parallel, however, the network receives no built-in notion of word order (the position of each word in the sentence). Since word order matters for the meaning of the sentence, we have to supply this information to the transformer so that it can understand the sentence.
That is, if we pass the input matrix directly to the transformer, it cannot tell which word comes first and which comes next. So, instead of feeding the input matrix directly, we add to it some information indicating the word order (the position of each word) so that the network can understand the meaning of the sentence. To do this, we introduce a technique called positional encoding. Positional encoding, as the name suggests, is an encoding that indicates the position of a word in a sentence (the word order).
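As a sketch of one common choice, the sinusoidal encoding used in the original transformer paper ("Attention Is All You Need"), we can compute a positional encoding matrix of the same shape as the input matrix and add the two, so that each row carries both the word's embedding and its position. The value d_model = 4 and the random stand-in for the "I am good" input matrix are assumptions for illustration:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # P[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # P[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    P = np.zeros((seq_len, d_model))
    positions = np.arange(seq_len)[:, np.newaxis]               # (seq_len, 1)
    div_terms = 10000 ** (np.arange(0, d_model, 2) / d_model)   # (d_model/2,)
    P[:, 0::2] = np.sin(positions / div_terms)
    P[:, 1::2] = np.cos(positions / div_terms)
    return P

d_model = 4
X = np.random.rand(3, d_model)       # stand-in for the "I am good" input matrix
X_with_position = X + positional_encoding(3, d_model)
```

Because the encoding is added element-wise, two identical words appearing at different positions end up with different rows in X_with_position, which is exactly the order information the transformer was missing.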