There are two ways to evaluate language models in NLP: intrinsic evaluation and extrinsic evaluation.
- Intrinsic evaluation measures how well the model captures what it is designed to capture, such as the probabilities it assigns to text.
- Extrinsic evaluation (or task-based evaluation) measures how useful the model is for a particular downstream task.
A common intrinsic evaluation of a language model is perplexity: the geometric mean of the inverse probabilities the model assigns to the words of a test set. Intuitively, perplexity measures how surprised the model is when it sees new data. The lower the perplexity, the better the model. Another common measure is cross-entropy, which is the base-2 logarithm of perplexity. As a rule of thumb, a reduction of 10-20% in perplexity is noteworthy.
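As a minimal sketch (the per-word probabilities below are made up for illustration and would normally come from a trained model), cross-entropy and perplexity can be computed directly from the probabilities assigned to each word of a test set:

```python
import math

def cross_entropy_and_perplexity(word_probs):
    """Compute cross-entropy (bits per word) and perplexity from the
    probabilities a model assigns to each word in a test set."""
    n = len(word_probs)
    # Cross-entropy H = -(1/N) * sum(log2 P(w_i | history))
    h = -sum(math.log2(p) for p in word_probs) / n
    # Perplexity = 2^H, i.e. the geometric mean of the inverse probabilities
    perplexity = 2 ** h
    return h, perplexity

# Hypothetical per-word probabilities from a language model
probs = [0.2, 0.1, 0.05, 0.3]
h, ppl = cross_entropy_and_perplexity(probs)
print(f"cross-entropy: {h:.3f} bits/word, perplexity: {ppl:.2f}")
```

Note that lowering cross-entropy and lowering perplexity are equivalent goals, since one is a monotonic function of the other.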
Extrinsic evaluation depends on the task. For example, in speech recognition we can compare two language models by running the speech recognizer twice, once with each language model, and seeing which yields the more accurate transcription.
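As a sketch of how such a comparison might be scored (the transcripts below are hypothetical and not tied to any particular recognizer), a common accuracy metric is word error rate, the word-level edit distance between the recognizer's output and a reference transcription:

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: edit distance between word sequences,
    normalized by the length of the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical transcripts produced by the same recognizer with two different LMs
reference = "the cat sat on the mat"
with_lm_a = "the cat sat on a mat"
with_lm_b = "the bat sat on the mat today"
print("LM A WER:", word_error_rate(reference, with_lm_a))
print("LM B WER:", word_error_rate(reference, with_lm_b))
```

In this setup, the language model with the lower word error rate is the better one for the task, regardless of which one had the lower perplexity.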