Before the Transformer, sequence-to-sequence models were typically built with the well-known LSTM in an Encoder-Decoder architecture, where
- the "Encoder" creates a vector representation of the input sequence of words, and
- the "Decoder" generates a sequence of words from that vector representation.
The LSTM captures the interdependence of the words, but each step needs the hidden state from the previous step before it can process the current one. This is the model's main limitation: the input sequence cannot be processed in parallel, so training is relatively slow.
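To make this concrete, here is a minimal sketch of an LSTM encoder-decoder in PyTorch. The layer sizes, the greedy decoding loop, and names such as `Seq2SeqLSTM` are illustrative assumptions, not a reference implementation; the point to notice is that the decoder must be called one time step at a time.

```python
import torch
import torch.nn as nn

class Seq2SeqLSTM(nn.Module):
    """Illustrative LSTM encoder-decoder (sizes chosen arbitrarily)."""

    def __init__(self, vocab_size=10_000, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids, tgt_len, bos_id=1):
        # Encoder: compress the source sequence into a fixed-size state.
        src_emb = self.embed(src_ids)             # (batch, src_len, emb_dim)
        _, state = self.encoder(src_emb)          # state = (h_n, c_n)

        # Decoder: generate one token at a time, feeding each prediction
        # back in. The current step depends on the previous one, so this
        # loop cannot be parallelised over time steps.
        batch = src_ids.size(0)
        token = torch.full((batch, 1), bos_id, dtype=torch.long)
        outputs = []
        for _ in range(tgt_len):
            step_emb = self.embed(token)          # (batch, 1, emb_dim)
            dec_out, state = self.decoder(step_emb, state)
            logits = self.out(dec_out)            # (batch, 1, vocab)
            token = logits.argmax(dim=-1)         # greedy next token
            outputs.append(logits)
        return torch.cat(outputs, dim=1)          # (batch, tgt_len, vocab)

model = Seq2SeqLSTM()
dummy_src = torch.randint(0, 10_000, (2, 7))      # batch of 2 toy sequences
print(model(dummy_src, tgt_len=5).shape)          # torch.Size([2, 5, 10000])
```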
Now, the idea of the Transformer is to keep modeling the interdependence of the words in a sequence without any recurrent network, using only the attention mechanism at the center of its architecture. Attention measures how closely two elements of two sequences are related.
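In practice this is usually the scaled dot-product attention of the original Transformer paper: "query" vectors are compared against "key" vectors, and the resulting weights are used to mix the "value" vectors. Below is a minimal sketch; the tensor shapes in the toy example are assumptions made for illustration.

```python
import math
import torch

def scaled_dot_product_attention(query, key, value):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = query.size(-1)
    # How closely each query element relates to each key element.
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)   # each row sums to 1
    return weights @ value, weights

# Toy example: 3 query positions attending over 4 key/value positions.
q = torch.randn(3, 8)
k = torch.randn(4, 8)
v = torch.randn(4, 8)
context, weights = scaled_dot_product_attention(q, k, v)
print(context.shape, weights.shape)   # torch.Size([3, 8]) torch.Size([3, 4])
```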
In Transformer-based architectures, the attention mechanism is applied within a single sequence, in what is known as a self-attention layer. The self-attention layer determines how the words of the same sequence depend on one another, so that each word can be given a context-aware representation. Take for example the sentence: "The dog didn't cross the street because it was too tired". It is obvious to a human being that "it" refers to the "dog" and not to the "street". The objective of self-attention is therefore to detect the link between "dog" and "it". Because attention over all positions is computed at once rather than step by step, Transformers are much faster to train than their recurrent predecessors, and they have also been shown to be more robust to noisy and missing data.
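The sketch below applies self-attention to that very sentence: queries, keys and values all come from the same sequence, so every token, including "it", receives a weight over every other token. Note that the embeddings and projections here are random and untrained, so the printed weights are not meaningful; in a trained Transformer, the weight linking "it" to "dog" would be high.

```python
import torch
import torch.nn as nn

# Self-attention: Q, K and V are all derived from the *same* sequence.
tokens = "The dog didn't cross the street because it was too tired".split()
d_model = 16

# Stand-in embeddings and projection layers (random, untrained).
x = torch.randn(len(tokens), d_model)
w_q, w_k, w_v = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))

q, k, v = w_q(x), w_k(x), w_v(x)
# One matrix product covers all positions at once -- no recurrence needed.
scores = q @ k.transpose(0, 1) / d_model ** 0.5
weights = torch.softmax(scores, dim=-1)          # (11, 11): one row per token

it_row = weights[tokens.index("it")]             # how "it" attends to each word
for word, w in zip(tokens, it_row.tolist()):
    print(f"{word:>8s}: {w:.2f}")
```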
As a plus, because their embeddings are contextual, Transformers can draw on the surrounding context to compensate for missing or noisy data, something that earlier neural networks could not offer.
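For instance, a Transformer trained with masked-language modeling can reconstruct a missing word from its context. The sketch below assumes the Hugging Face `transformers` package is installed and that the `bert-base-uncased` checkpoint can be downloaded; it is an illustration of the idea, not part of the architecture described above.

```python
from transformers import pipeline

# A masked-language model uses the surrounding context to recover a
# missing (masked) token.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

sentence = "The dog didn't cross the [MASK] because it was too tired."
for prediction in unmasker(sentence)[:3]:
    print(f"{prediction['token_str']:>10s}  (score: {prediction['score']:.2f})")
```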