Encoder models: They use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence. The pretraining of these models usually revolves around somehow corrupting a given sentence (for instance, by masking random words in it) and tasking the model with finding or reconstructing the initial sentence. They are best suited for tasks requiring an understanding of the full sentence, such as sentence classification, named entity recognition (and more general word classification), and extractive question answering.

Decoder models: They use only the decoder of a Transformer model. At each stage, for a given word the attention layers can only access the words positioned before it in the sentence. The pretraining of decoder models usually revolves around predicting the next word in the sentence. They are best suited for tasks involving text generation.
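To make the contrast concrete, here is a minimal sketch using the 🤗 Transformers `pipeline` API. The checkpoints `bert-base-uncased` (an encoder model) and `gpt2` (a decoder model) are illustrative choices, not prescribed by the descriptions above: the encoder reconstructs a masked word using context on both sides, while the decoder continues the text by predicting the next words.

```python
from transformers import pipeline

# Encoder model (BERT): bidirectional attention, pretrained by masking
# random words and asking the model to reconstruct them.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("The capital of France is [MASK].", top_k=3))

# Decoder model (GPT-2): causal attention, so each position only sees the
# words before it; pretrained by predicting the next word.
generator = pipeline("text-generation", model="gpt2")
print(generator("The capital of France is", max_new_tokens=10))
```

The fill-mask task only makes sense for the encoder, which can look at the whole sentence at once, whereas generation relies on the decoder's left-to-right attention pattern.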