What is Perplexity?

Perplexity is an evaluation metric that measures how well a language model predicts a given sequence of text. Lower values indicate the model is less surprised by the data and thus performs better at next-token prediction.

Perplexity is computed as the exponential of the average negative log-likelihood of the tokens in a test set. It can be interpreted as the effective number of choices the model considers at each step.
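As a minimal sketch of that computation, assume the probabilities the model assigned to each correct token are already available. The function name `perplexity` and the sample values are illustrative and not taken from any particular library.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each correct token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood (natural log), i.e. cross-entropy in nats.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiating turns the average back into an "effective number of choices".
    return math.exp(avg_nll)

# Hypothetical probabilities for a four-token sequence.
probs = [0.5, 0.25, 0.1, 0.05]
print(perplexity(probs))
```

In practice, evaluation code usually accumulates log-probabilities directly rather than raw probabilities, which avoids numerical underflow on long sequences.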

A lower perplexity means the model assigns higher probability to the correct next words, reflecting stronger predictive performance. It is closely related to cross-entropy loss: perplexity is the exponential of the cross-entropy, which expresses the same quantity in more intuitive units.
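Written as a formula, the relationship to cross-entropy looks like this. The notation is introduced here for illustration: $N$ is the number of tokens, $x_i$ is the $i$-th token, and $p(x_i \mid x_{<i})$ is the probability the model assigns to it given the preceding tokens.

```latex
\[
H = -\frac{1}{N} \sum_{i=1}^{N} \ln p(x_i \mid x_{<i}),
\qquad
\mathrm{PPL} = e^{H}
\]
```

If the cross-entropy is measured in bits (base-2 logarithm) instead of nats, the same perplexity is $2^{H}$.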

While widely used for comparing language models, perplexity only captures next-token prediction and does not directly measure downstream task performance or generation quality.

Example

If a model has a perplexity of 20 on a news corpus, it is roughly as uncertain as if it were choosing among 20 equally likely words at every position.
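The "equally likely words" reading can be checked numerically. The sketch below assumes a hypothetical model that gives every correct token a probability of exactly 1/20; the variable names and the sequence length are arbitrary choices for illustration.

```python
import math

vocab_size = 20        # number of equally likely words, as in the example above
sequence_length = 50   # arbitrary; the result does not depend on it

# Under a uniform model, every correct token gets probability 1/20.
token_probs = [1.0 / vocab_size] * sequence_length

# Average negative log-likelihood, then exponentiate.
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
ppl = math.exp(avg_nll)

print(ppl)  # approximately 20, up to floating-point rounding
```

A real model reaches a perplexity of 20 with uneven probabilities; the uniform case is only the simplest distribution that produces the same average uncertainty.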

Why it matters

Perplexity remains a standard, computationally cheap benchmark for pretraining progress in large language models and enables quick comparisons across architectures and datasets.

Frequently asked questions

A lower perplexity indicates better next-token prediction on the evaluated data, but it does not guarantee better performance on real-world tasks.