What is Temperature?
Temperature is a parameter in large language models that controls the randomness of generated text. Lower values produce more focused and deterministic outputs, while higher values increase creativity and variability.
It works by dividing the model's raw output scores (logits) by the temperature before they are converted into probabilities via the softmax function. A temperature below 1 sharpens the distribution toward high-probability tokens, while a value above 1 flattens it, giving lower-probability tokens a greater chance of selection.
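The scaling step described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not any particular model's implementation; the function name softmax_with_temperature and the three example logits are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive for softmax scaling")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability; the result is unchanged.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three candidate tokens.
logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running the loop shows the top token's probability shrinking and the bottom token's probability growing as the temperature rises, which is the sharpening and flattening effect in action.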
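At temperature 0 the model always picks the single most likely next token (greedy decoding). Because dividing by zero is undefined, implementations treat this as a special case rather than as literal scaling. At a value of 1, a common default, the logits are left unchanged and the model samples from its learned distribution. Values greater than 1 make unlikely tokens more probable, producing more diverse but sometimes less coherent text.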
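To make the three regimes concrete, here is a hedged sketch of a sampler that treats temperature 0 as greedy decoding and otherwise samples from the temperature-scaled distribution. The name sample_token, the rng parameter, and the example logits are assumptions made for illustration; real inference libraries differ in their details.

```python
import math
import random

def sample_token(logits, temperature, rng=random):
    """Return a token index; temperature 0 is handled as greedy decoding."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Division by zero is undefined, so pick the argmax directly.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    # Unnormalized softmax weights; random.choices normalizes them.
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)       # seeded so repeated runs match
logits = [2.0, 1.0, 0.1]     # hypothetical scores for three tokens
for t in (0, 1.0, 2.0):
    picks = [sample_token(logits, t, rng) for _ in range(1000)]
    # Count how often each token index was chosen at this temperature.
    print(t, [picks.count(i) for i in range(len(logits))])
```

At temperature 0 every draw returns the same index, while higher temperatures spread the counts across more tokens.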
The key idea is a simple trade-off between coherence and diversity that lets users tune generation style without retraining the model.
Example
When writing a product description, temperature 0.2 yields safe, repetitive phrasing, while temperature 1.2 may introduce unexpected metaphors or unusual word choices.
Why it matters
Temperature gives users direct control over output style, enabling the same model to handle factual Q&A, creative storytelling, or code generation by simply adjusting one number.
Frequently asked questions
Setting temperature to 0 forces the model to always choose the single highest-probability token, producing the most deterministic and repeatable output.
Related terms
Top-p sampling (nucleus sampling) is a text-generation technique that dynamically selects the smallest set of most likely next tokens whose combined probability exceeds a threshold p (e.g. 0.9), then samples from that set.
Greedy decoding is a text generation strategy in NLP where, at each step, the model selects the single token with the highest probability as the next output.
In LLMs, hallucination is when the model generates fluent, confident text that is factually incorrect, fabricated, or not supported by its training data.
Retrieval-Augmented Generation (RAG) is a technique that improves large language models by retrieving relevant external information before generating a response.
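Of the related terms above, top-p sampling is the one most often combined with temperature, so a small sketch may help show how it differs. The code below assumes the probabilities have already been computed (for example after temperature scaling); the names top_p_filter and sample_top_p and the example distribution are hypothetical, and the cutoff uses "reaches or exceeds p", which is one common convention.

```python
import random

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Renormalize so the kept probabilities sum to 1.
    total = sum(probs[i] for i in kept)
    return kept, [probs[i] / total for i in kept]

def sample_top_p(probs, p, rng=random):
    """Sample a token index from the nucleus selected by top_p_filter."""
    kept, renormalized = top_p_filter(probs, p)
    return rng.choices(kept, weights=renormalized, k=1)[0]

# Hypothetical next-token distribution over four tokens.
probs = [0.5, 0.3, 0.15, 0.05]
kept, renormalized = top_p_filter(probs, 0.9)
print(kept, [round(q, 3) for q in renormalized])
print(sample_top_p(probs, 0.9, random.Random(0)))
```

Temperature reshapes the whole distribution, whereas top-p removes the low-probability tail entirely; with p = 0.9 here, the least likely token can never be sampled.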