What is BLEU Score?

BLEU Score is an automatic metric that evaluates machine-generated text quality, mainly for machine translation, by measuring overlap with human-written reference translations.

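BLEU computes modified n-gram precision (for 1- to 4-grams) between the candidate sentence and one or more references, then applies a brevity penalty to discourage overly short outputs. The precision is "modified" because each candidate n-gram is credited at most as many times as it occurs in any single reference, so repeating a correct word does not inflate the score.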
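The clipping step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes whitespace tokenization, and the helper names `ngrams` and `modified_precision` are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Return (clipped matches, total candidate n-grams) for order n."""
    cand_counts = Counter(ngrams(candidate, n))
    # For each n-gram, keep the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Clip each candidate count by that reference maximum.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped, sum(cand_counts.values())

# Assumption: whitespace tokenization is good enough for this illustration.
candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split()]
print(modified_precision(candidate, references, 1))  # (2, 7)
```

Without clipping, all seven occurrences of 'the' would count as matches. With clipping, only two do, because the reference contains 'the' twice.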

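The final score is the geometric mean of these precisions multiplied by the brevity penalty, producing a value between 0 and 1, where a higher score indicates closer overlap with the human references. Because the mean is geometric, a precision of zero at any n-gram order makes the unsmoothed score zero.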
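Written out, the score takes roughly the following form. The symbols are notation chosen here, not taken from the text above: p_n is the modified n-gram precision, w_n is the weight for order n (uniform, w_n = 1/N with N = 4, in the standard setup), c is the candidate length, and r is the reference length.

```latex
\mathrm{BP} =
\begin{cases}
  1 & \text{if } c > r \\
  e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
\qquad
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right)
```

With uniform weights, the exponential term is exactly the geometric mean of p_1 through p_N.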

It is fast, largely language-independent given a suitable tokenizer, and correlates reasonably with human judgments at the corpus level, making it a standard benchmark despite known limitations such as ignoring meaning and synonymy.

Example

If the reference translation is 'the cat sat on the mat' and the model outputs 'a cat is on the mat', BLEU awards partial credit for the shared unigrams 'cat', 'on', 'the', and 'mat' and for the shared bigrams 'on the' and 'the mat'. Because both sentences contain six tokens, no brevity penalty applies.
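The n-gram counts for this sentence pair can be checked directly. The sketch below assumes whitespace tokenization and a single reference, and the function names `sentence_bleu_parts` and `bleu` are hypothetical. It applies no smoothing.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu_parts(candidate, reference, max_n=4):
    """Return per-order (matches, total) pairs and the brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clip each candidate count by its count in the reference.
        matches = sum(min(count, ref[gram]) for gram, count in cand.items())
        precisions.append((matches, sum(cand.values())))
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return precisions, bp

def bleu(precisions, bp):
    """Geometric mean of the precisions times the brevity penalty (unsmoothed)."""
    if any(matches == 0 for matches, _ in precisions):
        return 0.0
    log_mean = sum(math.log(m / t) for m, t in precisions) / len(precisions)
    return bp * math.exp(log_mean)

reference = "the cat sat on the mat".split()
candidate = "a cat is on the mat".split()

precisions, bp = sentence_bleu_parts(candidate, reference)
print(precisions)  # [(4, 6), (2, 5), (1, 4), (0, 3)]
print(bp)          # 1.0
print(bleu(precisions, bp))  # 0.0
```

No 4-gram of the output appears in the reference, so the unsmoothed 4-gram score for this single sentence is 0. This is one reason BLEU is usually computed over a whole corpus or with smoothing rather than on isolated sentences.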

Why it matters

BLEU remains the most widely reported automatic score for comparing translation and text-generation models, enabling reproducible progress tracking across research papers and leaderboards.

Frequently asked questions

BLEU stands for Bilingual Evaluation Understudy.