Consider the example sentence “Where is Wally”, which should be translated into its Italian counterpart “Dove è Wally”. In an attention-based encoder-decoder architecture, the encoder processes the input word by word, producing one hidden state per input word, i.e. three hidden states in this example.
Then, at each decoding step, the attention layer produces a single fixed-size context vector from all the encoder hidden states (usually as a weighted sum), where the weights express how much “attention” each input word should receive when generating the current output word. This is where global and local attention come into play.
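As a minimal sketch of this idea, the snippet below builds a context vector as a weighted sum of three made-up encoder hidden states. The dimensionality, the random values, and the attention weights are all illustrative assumptions, not the output of a real model.

```python
import numpy as np

# Hypothetical encoder hidden states for "Where", "is", "Wally":
# one d-dimensional vector per input word (here d = 4, values are random).
hidden_states = np.random.randn(3, 4)      # shape: (source_len, d)

# Hypothetical attention weights for the current decoding step;
# in a real model they come from a scoring function followed by a softmax.
weights = np.array([0.1, 0.2, 0.7])        # non-negative, sum to 1

# The context vector is the weighted sum of the encoder hidden states.
context = weights @ hidden_states          # shape: (d,)
print(context)
```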
Global attention considers all the hidden states when creating the context vector. When it is applied, a lot of computation is required: every hidden state must be taken into account, stacked into a matrix, and passed through a neural network to compute its attention weight.
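A rough sketch of global attention is shown below. It scores every encoder hidden state against the current decoder state with a simple dot product (one of several common scoring functions; the paragraph above describes the variant that uses a small neural network instead), normalizes the scores with a softmax, and takes the weighted sum. The function name `global_attention` and all tensor sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_attention(decoder_state, hidden_states):
    """Score ALL encoder hidden states against the decoder state
    (dot-product score), then build the context vector."""
    scores = hidden_states @ decoder_state     # (source_len,)
    weights = softmax(scores)                  # attention over every position
    context = weights @ hidden_states          # (d,)
    return context, weights

hidden_states = np.random.randn(3, 4)   # "Where", "is", "Wally"
decoder_state = np.random.randn(4)      # current decoder hidden state
context, weights = global_attention(decoder_state, hidden_states)
```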
On the other hand, local attention considers only a subset of the hidden states when creating the context vector. The subset can be chosen in different ways, for example with Monotonic Alignment (which assumes the source and target sequences are roughly aligned) or Predictive Alignment (which learns to predict where the attention window should be centered), as sketched below.
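For comparison, here is a sketch of local attention with Predictive Alignment in the spirit of Luong et al.: the model predicts a window center p_t from the decoder state, attends only to the 2D + 1 hidden states around it, and adds a Gaussian term that favors positions close to p_t. The parameters `W_p` and `v_p`, the window size `D`, and the renormalization of the weights are illustrative assumptions of this sketch, not a faithful reimplementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def local_attention(decoder_state, hidden_states, W_p, v_p, D=1):
    """Attend only to a window of 2*D + 1 encoder states centered on a
    predicted position p_t (predictive alignment), assuming D >= 1."""
    S = len(hidden_states)
    # Predict the window center p_t in (0, S) from the decoder state.
    p_t = S / (1 + np.exp(-(v_p @ np.tanh(W_p @ decoder_state))))
    lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
    window = hidden_states[lo:hi]
    # Dot-product scores for the windowed states only.
    weights = softmax(window @ decoder_state)
    # Gaussian term favoring positions close to p_t (sigma = D / 2).
    positions = np.arange(lo, hi)
    weights *= np.exp(-((positions - p_t) ** 2) / (2 * (D / 2) ** 2))
    weights /= weights.sum()
    context = weights @ window                 # (d,)
    return context, weights

hidden_states = np.random.randn(3, 4)          # "Where", "is", "Wally"
decoder_state = np.random.randn(4)
W_p, v_p = np.random.randn(4, 4), np.random.randn(4)
context, weights = local_attention(decoder_state, hidden_states, W_p, v_p)
```

Because only a small window of hidden states is scored at each step, local attention avoids most of the computation that global attention spends on long input sentences.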