What is Perplexity?
Perplexity is an evaluation metric that measures how well a language model predicts a given sequence of text. Lower values indicate the model is less surprised by the data and thus performs better at next-token prediction.
Perplexity is computed as the exponential of the average negative log-likelihood of the tokens in a test set. It can be interpreted as the effective number of choices the model considers at each step.
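The computation described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `perplexity` and the sample probabilities are hypothetical, and it assumes you already have the natural-log probability the model assigned to each observed token.

```python
import math


def perplexity(token_log_probs):
    """Return exp of the average negative log-likelihood.

    token_log_probs: natural-log probabilities that the model assigned
    to each token that actually occurred in the test text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical per-token probabilities, chosen only for illustration.
probs = [0.5, 0.25, 0.125, 0.125]
log_probs = [math.log(p) for p in probs]
ppl = perplexity(log_probs)
print(f"perplexity = {ppl:.4f}")
```

Working in log space avoids multiplying many small probabilities together, which would underflow on long sequences.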
A lower perplexity means the model assigns higher probability to the correct next words, reflecting stronger predictive performance. It is closely related to cross-entropy loss but expressed in more intuitive units.
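The relationship to cross-entropy can be written out explicitly. The symbols here are assumptions introduced for illustration: N is the number of tokens in the test set, p(x_i | x_{<i}) is the probability the model assigns to token x_i given the preceding tokens, and H is the average cross-entropy in nats.

```latex
H = -\frac{1}{N} \sum_{i=1}^{N} \ln p(x_i \mid x_{<i}),
\qquad
\mathrm{PPL} = e^{H} = 2^{H / \ln 2}
```

If the cross-entropy is measured in bits (base-2 logarithms) rather than nats, perplexity is 2 raised to that value. Either way it is the same quantity, so ranking models by cross-entropy and ranking them by perplexity give the same order.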
While widely used for comparing language models, perplexity only captures next-token prediction and does not directly measure downstream task performance or generation quality.
Example
If a model has a perplexity of 20 on a news corpus, it is roughly as uncertain as if it were choosing among 20 equally likely words at every position.
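The "20 equally likely words" reading can be checked numerically. The sketch below assumes a hypothetical model that assigns probability 1/20 to every observed token. The number of positions is arbitrary and does not affect the result.

```python
import math

vocab_size = 20                   # number of equally likely choices
uniform_prob = 1.0 / vocab_size   # probability assigned to each observed token
num_positions = 1000              # arbitrary sequence length, for illustration

# Average negative log-likelihood over all positions, then exponentiate.
avg_nll = -sum(math.log(uniform_prob) for _ in range(num_positions)) / num_positions
ppl = math.exp(avg_nll)
print(f"perplexity = {ppl:.2f}")
```

A uniform guess over k options always yields a perplexity of k, which is why a perplexity of 20 is described as choosing among 20 equally likely words.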
Why it matters
Perplexity remains a standard, computationally cheap measure of pretraining progress in large language models. It enables quick comparisons across architectures and training datasets, provided the models are scored on the same test set with the same tokenization.
Frequently asked questions
Lower perplexity indicates better next-token prediction on the evaluated data, but does not guarantee better performance on real-world tasks.
Related terms
Accuracy measures the proportion of correct predictions made by a machine learning model out of all predictions. It is calculated as the number of correct predictions divided by the total number of predictions.
A benchmark is a standardized dataset and task used to measure and compare how well different AI models perform.
BLEU Score is an automatic metric that evaluates machine-generated text quality, mainly for machine translation, by measuring overlap with human-written reference translations.
A confusion matrix is a table that shows how well a classification model performs by comparing its predictions to the actual labels.
The F1 Score is a single metric that balances precision and recall to evaluate how well a classification model performs, especially when classes are uneven.
Precision is an evaluation metric for classification models that measures the proportion of true positive predictions among all positive predictions made.