Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language and output its translation in another.
Looking inside the black box, a transformer contains:
- An encoding component: a stack of N encoders.
- A decoding component: a stack of N decoders.
- Connections between them, as sketched below.
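Here is a rough sketch of that wiring in PyTorch, using the built-in `TransformerEncoderLayer` and `TransformerDecoderLayer` as stand-ins for the layers described in the rest of this section; the layer count N = 6 and model dimension of 512 follow the original paper, and the random tensors just stand in for embedded sentences.

```python
import torch
import torch.nn as nn

N, d_model, n_heads = 6, 512, 8  # the original paper stacks N = 6 layers with d_model = 512

# Encoding component: a stack of N identical encoder layers.
encoders = [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True) for _ in range(N)]
# Decoding component: a stack of N identical decoder layers.
decoders = [nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True) for _ in range(N)]

src = torch.randn(1, 10, d_model)  # an embedded source sentence: (batch, tokens, d_model)
tgt = torch.randn(1, 7, d_model)   # the embedded target tokens produced so far

# The source representation flows through every encoder in order.
memory = src
for enc in encoders:
    memory = enc(memory)

# Each decoder layer sees the previous decoder's output and also attends to the
# final encoder output ("memory"): the connection between the two stacks.
out = tgt
for dec in decoders:
    out = dec(out, memory)

print(out.shape)  # torch.Size([1, 7, 512])
```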
Now, each encoder is broken down into two sub-layers: a self-attention layer and a feed-forward neural network layer.
The inputs first flow through the self-attention layer, and its outputs are fed to the feed-forward neural network. This sequence is repeated until it reaches the last encoder in the stack.
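To make that data flow concrete, here is a minimal sketch of one encoder layer. The hyperparameters (d_model = 512, 8 heads, a 2048-unit feed-forward layer) follow the original paper, and the residual connections and layer normalization that wrap each sub-layer in the full model are left out so the two sub-layers stay visible.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # Sub-layer 1: self-attention over the encoder's own inputs.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Sub-layer 2: a position-wise feed-forward network.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        # Each position attends to every position in the same input sequence.
        attn_out, _ = self.self_attn(x, x, x)
        # The attention outputs are then passed through the feed-forward network.
        return self.ffn(attn_out)

x = torch.randn(1, 10, 512)   # embedded input sentence
y = EncoderLayer()(x)         # output that would feed the next encoder in the stack
print(y.shape)                # torch.Size([1, 10, 512])
```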
Finally, the decoder receives the output of the encoder component. Like the encoder, it has both a self-attention layer and a feed-forward layer, and the flow is similar, but between them sits an encoder-decoder attention layer that helps the decoder focus on relevant parts of the input sentence.
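A matching sketch of one decoder layer, under the same assumptions as the encoder sketch above (residuals, layer normalization, and masking omitted), shows the extra encoder-decoder attention sub-layer sitting between self-attention and the feed-forward network.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # Sub-layer 1: self-attention over the decoder's own inputs.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Sub-layer 2: encoder-decoder attention, which lets the decoder
        # focus on relevant parts of the input sentence.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Sub-layer 3: a position-wise feed-forward network.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x, memory):
        # The decoder first attends over its own inputs...
        x, _ = self.self_attn(x, x, x)
        # ...then queries the encoder component's output ("memory").
        x, _ = self.cross_attn(x, memory, memory)
        return self.ffn(x)

memory = torch.randn(1, 10, 512)   # output of the encoder component
tgt = torch.randn(1, 7, 512)       # embedded target tokens produced so far
out = DecoderLayer()(tgt, memory)
print(out.shape)                   # torch.Size([1, 7, 512])
```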