How is perplexity different from accuracy?

Perplexity is a continuous measure based on the probability the model assigns to the true tokens, whereas accuracy only counts exact matches and ignores how confident the model was.
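The difference can be illustrated with a small sketch. The two models, their predictions, and the probabilities here are made up for illustration; both models pick the correct token every time, so accuracy cannot tell them apart, while perplexity can.

```python
import math

def perplexity(true_token_probs):
    """Exponential of the average negative log-probability of the true tokens."""
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))

def accuracy(predictions, targets):
    """Fraction of positions where the predicted token equals the target."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)

targets = ["the", "cat", "sat"]

# Hypothetical models: both rank the true token first at every position...
preds_a = ["the", "cat", "sat"]
preds_b = ["the", "cat", "sat"]

# ...but model A assigns it much more probability than model B does.
probs_a = [0.9, 0.9, 0.9]
probs_b = [0.4, 0.4, 0.4]

acc_a, acc_b = accuracy(preds_a, targets), accuracy(preds_b, targets)
ppl_a, ppl_b = perplexity(probs_a), perplexity(probs_b)

print(acc_a, acc_b)  # identical accuracy for both models
print(ppl_a, ppl_b)  # model A has the lower perplexity
```

With a constant probability p on the true token, perplexity reduces to 1/p, which is why the more confident model scores lower.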

Can perplexity be used for non-language tasks?

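Perplexity is mainly defined for sequence models that output probability distributions over discrete tokens, so it applies to non-language data only when that data is modeled as such a token sequence.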

What is Perplexity?

Perplexity is an evaluation metric that measures how well a language model predicts a given sequence of text. Lower values indicate the model is less surprised by the data and thus performs better at next-token prediction.

Perplexity is computed as the exponential of the average negative log-likelihood of the tokens in a test set. It can be interpreted as the effective number of equally likely choices the model is deciding among at each step.
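Written out, the definition above takes the following form. The notation is an assumption introduced here for illustration: N is the number of tokens in the test set, x_i is the i-th token, and p(x_i | x_{<i}) is the probability the model assigns to that token given the preceding ones.

```latex
\mathrm{PPL}(x_1, \dots, x_N)
  = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p\left(x_i \mid x_{<i}\right) \right)
```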

A lower perplexity means the model assigns higher probability to the correct next words, reflecting stronger predictive performance. It is closely related to cross-entropy loss but expressed in more intuitive units.
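The link to cross-entropy can be checked numerically. This is a minimal sketch with made-up probabilities: perplexity is the exponential of the cross-entropy when the loss is measured in nats, or equivalently 2 raised to the cross-entropy when it is measured in bits.

```python
import math

# Hypothetical probabilities a model assigned to the true token at each position.
true_token_probs = [0.5, 0.25, 0.125, 0.125]
n = len(true_token_probs)

# Average negative log-likelihood, i.e. the cross-entropy, in two units.
cross_entropy_nats = -sum(math.log(p) for p in true_token_probs) / n
cross_entropy_bits = -sum(math.log2(p) for p in true_token_probs) / n

# Both conversions yield the same perplexity.
ppl_from_nats = math.exp(cross_entropy_nats)
ppl_from_bits = 2 ** cross_entropy_bits

print(cross_entropy_bits)            # 2.25 bits per token
print(ppl_from_nats, ppl_from_bits)  # the same value, up to rounding
```

Here a loss of 2.25 bits per token corresponds to a perplexity of about 4.76, which reads more directly as "a bit under five equally likely choices".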

While widely used for comparing language models, perplexity only captures next-token prediction and does not directly measure downstream task performance or generation quality.

Example

If a model has a perplexity of 20 on a news corpus, it is roughly as uncertain as if it were choosing among 20 equally likely words at every position.
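The example can be reproduced with a short sketch. The sequence length is an arbitrary assumption; the point is that a model giving every true word a probability of 1/20 ends up with a perplexity of 20.

```python
import math

vocab_size = 20       # number of equally likely words in the example
num_positions = 1000  # hypothetical sequence length; any positive value works

# Probability the model gives the true word at every position.
true_token_probs = [1 / vocab_size] * num_positions

# Average negative log-likelihood, then exponentiate.
avg_nll = -sum(math.log(p) for p in true_token_probs) / num_positions
ppl = math.exp(avg_nll)

print(ppl)  # approximately 20
```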

Why it matters

Perplexity remains a standard, computationally cheap measure of pretraining progress in large language models and enables quick comparisons between models, provided they are evaluated on the same data with the same tokenization.

Frequently asked questions

Lower perplexity indicates better next-token prediction on the evaluated data, but it does not guarantee better performance on real-world tasks.

Related terms

Accuracy

Accuracy measures the proportion of correct predictions made by a machine learning model out of all predictions. It is calculated as the number of correct predictions divided by the total number of predictions.

Benchmark

A benchmark is a standardized dataset and task used to measure and compare how well different AI models perform.

BLEU Score

BLEU Score is an automatic metric that evaluates machine-generated text quality, mainly for machine translation, by measuring overlap with human-written reference translations.

Confusion Matrix

A confusion matrix is a table that shows how well a classification model performs by comparing its predictions to the actual labels.
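As a rough sketch, a confusion matrix can be built by counting (actual, predicted) label pairs. The labels and data below are invented for illustration; rows are actual labels and columns are predicted labels.

```python
from collections import Counter

# Hypothetical labels for six examples.
actual    = ["spam", "spam", "ham", "ham",  "ham", "spam"]
predicted = ["spam", "ham",  "ham", "spam", "ham", "spam"]

# Count how often each (actual, predicted) pair occurs.
counts = Counter(zip(actual, predicted))

labels = ["spam", "ham"]
# matrix[i][j] = number of examples with actual labels[i] and predicted labels[j].
matrix = [[counts[(a, p)] for p in labels] for a in labels]

for label, row in zip(labels, matrix):
    print(label, row)
```

The diagonal holds the correct predictions; the off-diagonal cells show which classes the model confuses with each other.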

F1 Score

The F1 Score is a single metric that balances precision and recall to evaluate how well a classification model performs, especially when classes are imbalanced.
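A minimal sketch of the computation, assuming the usual definition of F1 as the harmonic mean of precision and recall. The function name and the data are hypothetical; the data is deliberately imbalanced to show why accuracy alone can look better than the model deserves.

```python
def precision_recall_f1(actual, predicted, positive):
    """Return (precision, recall, F1) for the given positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical imbalanced data: 2 positives among 10 examples.
actual    = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(actual, predicted, positive=1)
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

print(accuracy)               # 0.9, despite missing half of the positives
print(precision, recall, f1)  # recall of 0.5 pulls F1 down to about 0.67
```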

Precision

Precision is an evaluation metric for classification models that measures the proportion of true positive predictions among all positive predictions made.