Consider the example sentence “Where is Wally”, which should be translated into its Italian counterpart “Dove è Wally”. In an attention-based encoder-decoder architecture, the encoder processes the input word by word, producing one hidden state per input word, i.e. three hidden states in this example.
Then, at each decoding step, the attention layer produces a single fixed-size context vector from all the encoder hidden states (usually as a weighted sum), where the weights express how much “attention” each input word should receive when generating the current output word. This is where global and local attention come into play.
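As a minimal sketch of this idea, the snippet below builds a context vector as a weighted sum of three made-up encoder hidden states. The dimensionality, the random values, and the attention weights are all illustrative assumptions, not the output of a real model.

```python
import numpy as np

# Hypothetical encoder hidden states for "Where", "is", "Wally":
# one d-dimensional vector per input word (here d = 4, values are random).
hidden_states = np.random.randn(3, 4)      # shape: (source_len, d)

# Hypothetical attention weights for the current decoding step;
# in a real model they come from a scoring function followed by a softmax.
weights = np.array([0.1, 0.2, 0.7])        # non-negative, sum to 1

# The context vector is the weighted sum of the encoder hidden states.
context = weights @ hidden_states          # shape: (d,)
print(context)
```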
Global attention considers all the hidden states when creating the context vector. When it is applied, a lot of computation is required: every hidden state must be taken into account, stacked into a matrix, and passed through a neural network to compute its attention weight.
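A rough sketch of global attention is shown below. It scores every encoder hidden state against the current decoder state with a simple dot product (one of several common scoring functions; the paragraph above describes the variant that uses a small neural network instead), normalizes the scores with a softmax, and takes the weighted sum. The function name `global_attention` and all tensor sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_attention(decoder_state, hidden_states):
    """Score ALL encoder hidden states against the decoder state
    (dot-product score), then build the context vector."""
    scores = hidden_states @ decoder_state     # (source_len,)
    weights = softmax(scores)                  # attention over every position
    context = weights @ hidden_states          # (d,)
    return context, weights

hidden_states = np.random.randn(3, 4)   # "Where", "is", "Wally"
decoder_state = np.random.randn(4)      # current decoder hidden state
context, weights = global_attention(decoder_state, hidden_states)
```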
On the other hand, local attention considers only a subset of the hidden states when creating the context vector. The subset can be chosen in different ways, for example with Monotonic Alignment (which assumes the source and target sequences are roughly aligned) or Predictive Alignment (which learns to predict where the attention window should be centered), as sketched below.
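For comparison, here is a sketch of local attention with Predictive Alignment in the spirit of Luong et al.: the model predicts a window center p_t from the decoder state, attends only to the 2D + 1 hidden states around it, and adds a Gaussian term that favors positions close to p_t. The parameters `W_p` and `v_p`, the window size `D`, and the renormalization of the weights are illustrative assumptions of this sketch, not a faithful reimplementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def local_attention(decoder_state, hidden_states, W_p, v_p, D=1):
    """Attend only to a window of 2*D + 1 encoder states centered on a
    predicted position p_t (predictive alignment), assuming D >= 1."""
    S = len(hidden_states)
    # Predict the window center p_t in (0, S) from the decoder state.
    p_t = S / (1 + np.exp(-(v_p @ np.tanh(W_p @ decoder_state))))
    lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
    window = hidden_states[lo:hi]
    # Dot-product scores for the windowed states only.
    weights = softmax(window @ decoder_state)
    # Gaussian term favoring positions close to p_t (sigma = D / 2).
    positions = np.arange(lo, hi)
    weights *= np.exp(-((positions - p_t) ** 2) / (2 * (D / 2) ** 2))
    weights /= weights.sum()
    context = weights @ window                 # (d,)
    return context, weights

hidden_states = np.random.randn(3, 4)          # "Where", "is", "Wally"
decoder_state = np.random.randn(4)
W_p, v_p = np.random.randn(4, 4), np.random.randn(4)
context, weights = local_attention(decoder_state, hidden_states, W_p, v_p)
```

Because only a small window of hidden states is scored at each step, local attention avoids most of the computation that global attention spends on long input sentences.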