Take for example the sentence:
Bark is very cute and he is a dog.
Here, if we take the word ‘dog’, grammatically we understand that the words ‘Bark’, ‘cute’, and ‘he’ should have some significance or relevance to the word ‘dog’. These words tell us that the dog’s name is Bark, that it is a male dog, and that he is a cute dog.
In simple terms, a single attention mechanism may not be able to correctly identify all three of these words as relevant to ‘dog’, and we can sense that three attention heads would serve better here, each relating one of the three words to ‘dog’.
Therefore, to overcome some of the pitfalls of single attention, multi-head attention is used. It reduces the burden on any one attention head to find all the significant words, and it increases the chances of capturing more relevant relationships.
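To make this concrete, here is a minimal multi-head attention sketch in PyTorch. The class name, model dimension, and head count are illustrative choices, not details from the text above; the idea is that each head gets its own learned query/key/value projections, so one head is free to relate ‘dog’ to ‘Bark’ while others relate it to ‘he’ or ‘cute’.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Separate learned projections for queries, keys, and values,
        # plus a final projection to recombine the heads.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        # Project, then split into heads: (batch, num_heads, seq_len, d_head)
        q = self.w_q(x).view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        k = self.w_k(x).view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        v = self.w_v(x).view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention, computed independently per head,
        # so each head can attend to a different relevant word.
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        weights = F.softmax(scores, dim=-1)
        out = weights @ v  # (batch, num_heads, seq_len, d_head)
        # Concatenate the heads back together and project.
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.w_o(out)

# Toy usage: 9 tokens for "Bark is very cute and he is a dog",
# with random vectors standing in for real word embeddings.
x = torch.randn(1, 9, 64)                      # (batch, seq_len, d_model)
mha = MultiHeadAttention(d_model=64, num_heads=4)
print(mha(x).shape)                            # torch.Size([1, 9, 64])
```

Note that the heads run in parallel on smaller slices of the model dimension (d_head = d_model / num_heads), so multi-head attention costs roughly the same as a single full-width attention while letting each head specialize.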