BLEU (Bilingual Evaluation Understudy) is a widely used metric for evaluating the quality of machine translation output. It offers several advantages but also has notable limitations. Here are the pros and cons of using BLEU as an evaluation metric:
Pros of BLEU:
Ease of Computation: BLEU is relatively easy to compute, making it a fast and efficient metric for assessing machine translation systems. It combines simple clipped (modified) n-gram precision counts with a brevity penalty (see the sketch after this list).
Correlation with Human Judgments: BLEU scores have been found to correlate reasonably well with human judgments of translation quality, particularly at the corpus level when comparing whole systems rather than individual sentences.
Widely Adopted: BLEU has become a standard metric for machine translation evaluation, and its widespread use allows for consistent and comparable evaluations across different systems and datasets.
Availability: BLEU implementations are readily available (for example, in NLTK and sacrebleu), making it accessible for researchers and practitioners in the field of machine translation.
N-gram Flexibility: BLEU can be customized to consider different n-gram orders and weights (unigrams, bigrams, up to the standard maximum of 4-grams), depending on the requirements of a translation task.
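To make the computation concrete, here is a minimal from-scratch sketch of the two ingredients BLEU combines: clipped (modified) n-gram precision and a brevity penalty. The function names, the toy sentences, and the max_n values are illustrative only; for real evaluations, prefer an established implementation such as NLTK or sacrebleu.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram count is capped by the
    maximum count of that n-gram in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda rl: (abs(rl - c), rl))
    return 1.0 if c > r else math.exp(1 - r / max(c, 1))

def bleu(candidate, references, max_n=4):
    """Geometric mean of modified precisions for n = 1..max_n, times the brevity penalty."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # any zero precision collapses the geometric mean
    return brevity_penalty(candidate, references) * math.exp(
        sum(math.log(p) for p in precisions) / max_n
    )

candidate = "the cat sat on the mat".split()
references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(bleu(candidate, references, max_n=2))  # restrict to unigrams and bigrams
print(bleu(candidate, references, max_n=4))  # standard BLEU-4 (0.0 here: no 4-gram matches)
```

Note how the maximum n-gram order is just a parameter, which is the flexibility mentioned above; with the standard order of 4, a candidate that shares no 4-gram with any reference scores zero in this sketch.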
Cons of BLEU:
N-gram Matching Emphasis: BLEU primarily focuses on n-gram precision, which measures the overlap of n-grams (subsequences of words) between the reference and candidate translations. It may not capture global translation quality, fluency, or coherence.
Sensitivity to Length: BLEU is tightly coupled to output length: candidates shorter than the reference are penalized by the brevity penalty, and longer candidates lose precision. This can be problematic when evaluating translations between languages with different word orders or morphologies, where acceptable translations legitimately vary in length.
Lack of Semantic Understanding: BLEU is insensitive to the semantic accuracy or meaningfulness of translations. It can assign high scores to translations that are syntactically well formed but semantically wrong; for example, "the treaty was signed" and "the treaty was not signed" share most of their n-grams yet mean the opposite.
Reference Dependency: BLEU relies on one or more reference translations to compute scores. The choice of reference translations can significantly impact the evaluation, and it may not always be possible to have multiple high-quality references.
Sparse Data Issues: BLEU is sensitive to the scarcity of specific n-grams in the reference translations, leading to low scores for translations that are not identical to the references, even when they are perfectly acceptable. At the sentence level, a single unmatched higher-order n-gram can collapse the score unless smoothing is applied (see the sketch after this list).
Inadequate for Evaluation of Creative Texts: For tasks requiring creative or paraphrased translations (e.g., poetry, marketing copy), BLEU may not provide meaningful evaluations, as it emphasizes exact matches with reference texts.
Lack of Discrimination: BLEU may not reliably distinguish between translations of similar quality; small score differences between systems often do not correspond to quality differences that human evaluators would recognize.
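The sparse-data and semantic-insensitivity points are easy to see at the sentence level. The snippet below is a small illustration using NLTK's BLEU implementation; the sentences are made up. Without smoothing, unmatched higher-order n-grams drag the score to effectively zero, and smoothing only softens the number without recognizing that the candidate is a reasonable paraphrase.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "committee", "approved", "the", "new", "budget", "yesterday"]]
candidate = ["the", "panel", "passed", "the", "revised", "budget", "yesterday"]

# None of the candidate's 3-grams or 4-grams appear in the reference, so the
# default (unsmoothed) sentence-level BLEU collapses to essentially zero and
# NLTK warns about the zero counts.
print(sentence_bleu(reference, candidate))

# Smoothing avoids the hard zero, but the score still says nothing about
# whether "panel passed" is an acceptable paraphrase of "committee approved".
smooth = SmoothingFunction().method1
print(sentence_bleu(reference, candidate, smoothing_function=smooth))
```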
In practice, BLEU is a valuable metric for initial screening and comparison of machine translation systems, but it should not be the sole measure of quality. Researchers and practitioners typically complement it with other automatic metrics such as METEOR, TER, and ROUGE, together with human evaluation, to address some of BLEU's limitations and obtain a more holistic assessment of translation quality.
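As one way to put that advice into practice, the sketch below scores the same made-up outputs with several corpus-level metrics via the sacrebleu package, which bundles BLEU, chrF, and TER (METEOR and ROUGE implementations live in other toolkits). The hypothesis and reference strings are placeholders.

```python
from sacrebleu.metrics import BLEU, CHRF, TER

# One hypothesis string per segment, and one list of references per reference
# stream (a single stream here, with one reference per segment).
hypotheses = ["the cat sat on the mat", "he read the book quickly"]
references = [["the cat is on the mat", "he read the book fast"]]

for metric in (BLEU(), CHRF(), TER()):
    print(metric.corpus_score(hypotheses, references))
```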