Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language and output its translation in another.
Looking inside the black box, a transformer contains:
- An encoding component: a stack of N encoders.
- A decoding component: a stack of N decoders.
- Connections between them, as sketched below.
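Here is a rough sketch of that wiring in PyTorch, using the built-in `TransformerEncoderLayer` and `TransformerDecoderLayer` as stand-ins for the layers described in the rest of this section; the layer count N = 6 and model dimension of 512 follow the original paper, and the random tensors just stand in for embedded sentences.

```python
import torch
import torch.nn as nn

N, d_model, n_heads = 6, 512, 8  # the original paper stacks N = 6 layers with d_model = 512

# Encoding component: a stack of N identical encoder layers.
encoders = [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True) for _ in range(N)]
# Decoding component: a stack of N identical decoder layers.
decoders = [nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True) for _ in range(N)]

src = torch.randn(1, 10, d_model)  # an embedded source sentence: (batch, tokens, d_model)
tgt = torch.randn(1, 7, d_model)   # the embedded target tokens produced so far

# The source representation flows through every encoder in order.
memory = src
for enc in encoders:
    memory = enc(memory)

# Each decoder layer sees the previous decoder's output and also attends to the
# final encoder output ("memory"): the connection between the two stacks.
out = tgt
for dec in decoders:
    out = dec(out, memory)

print(out.shape)  # torch.Size([1, 7, 512])
```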
Now, each encoder is broken down into two sub-layers: a self-attention layer and a feed-forward neural network layer.
The inputs first flow through the self-attention layer, and its outputs are fed to the feed-forward neural network. This sequence is repeated until it reaches the last encoder in the stack.
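To make that data flow concrete, here is a minimal sketch of one encoder layer. The hyperparameters (d_model = 512, 8 heads, a 2048-unit feed-forward layer) follow the original paper, and the residual connections and layer normalization that wrap each sub-layer in the full model are left out so the two sub-layers stay visible.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # Sub-layer 1: self-attention over the encoder's own inputs.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Sub-layer 2: a position-wise feed-forward network.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        # Each position attends to every position in the same input sequence.
        attn_out, _ = self.self_attn(x, x, x)
        # The attention outputs are then passed through the feed-forward network.
        return self.ffn(attn_out)

x = torch.randn(1, 10, 512)   # embedded input sentence
y = EncoderLayer()(x)         # output that would feed the next encoder in the stack
print(y.shape)                # torch.Size([1, 10, 512])
```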
Finally, the decoder receives the output of the encoder component. Like the encoder, it has both a self-attention layer and a feed-forward layer, and the flow is similar, but between them sits an encoder-decoder attention layer that helps the decoder focus on relevant parts of the input sentence.
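A matching sketch of one decoder layer, under the same assumptions as the encoder sketch above (residuals, layer normalization, and masking omitted), shows the extra encoder-decoder attention sub-layer sitting between self-attention and the feed-forward network.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # Sub-layer 1: self-attention over the decoder's own inputs.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Sub-layer 2: encoder-decoder attention, which lets the decoder
        # focus on relevant parts of the input sentence.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Sub-layer 3: a position-wise feed-forward network.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x, memory):
        # The decoder first attends over its own inputs...
        x, _ = self.self_attn(x, x, x)
        # ...then queries the encoder component's output ("memory").
        x, _ = self.cross_attn(x, memory, memory)
        return self.ffn(x)

memory = torch.randn(1, 10, 512)   # output of the encoder component
tgt = torch.randn(1, 7, 512)       # embedded target tokens produced so far
out = DecoderLayer()(tgt, memory)
print(out.shape)                   # torch.Size([1, 7, 512])
```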