What value of p should I choose?

Values between 0.8 and 0.95 are common; lower p produces safer, more repetitive text while higher p increases creativity and risk of incoherence.
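To make the effect of p concrete, here is a small sketch that counts how many tokens end up in the nucleus for a few common values of p. The probabilities are made up for illustration, and nucleus_size is a hypothetical helper name, not part of any library.

```python
import numpy as np

# Hypothetical next-token probabilities, already sorted from most to least likely.
probs = np.array([0.40, 0.20, 0.12, 0.10, 0.07, 0.05, 0.04, 0.02])

def nucleus_size(sorted_probs, p):
    """Number of top tokens needed for their cumulative probability to reach p."""
    cumulative = np.cumsum(sorted_probs)
    # Index of the first cumulative value >= p, converted to a token count.
    size = int(np.searchsorted(cumulative, p, side="left")) + 1
    return min(size, len(sorted_probs))

sizes = {p: nucleus_size(probs, p) for p in (0.8, 0.9, 0.95)}
print(sizes)  # {0.8: 4, 0.9: 6, 0.95: 7}
```

On this toy distribution, raising p from 0.8 to 0.95 nearly doubles the number of candidate tokens, which is where the extra variety (and the extra risk) comes from.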

Does top-p replace the need for temperature?

No; temperature is often still applied to sharpen or soften the distribution before top-p is computed.
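One common arrangement, sketched below under the assumption that the model exposes raw logits, is to divide the logits by the temperature, apply softmax, and only then build the nucleus. The function names and the logits are illustrative, not taken from a specific library.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after scaling by the temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()

def sample_top_p(logits, p=0.9, temperature=1.0, rng=None):
    """Apply temperature first, then sample a token index from the top-p nucleus."""
    if rng is None:
        rng = np.random.default_rng()
    probs = softmax(logits, temperature)
    order = np.argsort(-probs, kind="stable")  # most likely token first
    cumulative = np.cumsum(probs[order])
    cutoff = min(int(np.searchsorted(cumulative, p, side="left")) + 1, len(probs))
    keep = order[:cutoff]
    renormalized = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=renormalized))

logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical logits for four tokens
token = sample_top_p(logits, p=0.9, temperature=0.5, rng=np.random.default_rng(0))
```

With these logits, a temperature of 0.5 sharpens the distribution enough that the nucleus shrinks to the single most likely token, while a temperature of 1.0 leaves two tokens in play.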

What is Top-p Sampling?

Also known as: Nucleus Sampling

Top-p sampling (nucleus sampling) is a text-generation technique that, at each step, selects the smallest set of most likely next tokens whose combined probability is at least a threshold p (e.g. 0.9), renormalizes the probabilities within that set, and then samples from it.

Fixed-size methods like top-k always keep the same number of candidates. Top-p instead looks at the model's probability distribution and adds tokens in descending order of probability until their cumulative probability reaches p, forming a variable-sized 'nucleus'.
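The procedure can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the next-token probabilities are already available as an array; top_p_filter is a hypothetical name, and production implementations typically work on logits in batches.

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Return (token indices in the nucleus, their renormalized probabilities)."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(-probs, kind="stable")  # indices sorted by descending probability
    cumulative = np.cumsum(probs[order])
    # Keep tokens up to and including the first one that brings the total to p.
    cutoff = min(int(np.searchsorted(cumulative, p, side="left")) + 1, len(probs))
    keep = order[:cutoff]
    return keep, probs[keep] / probs[keep].sum()

# Hypothetical distribution over a four-token vocabulary.
keep, nucleus = top_p_filter([0.05, 0.60, 0.25, 0.10], p=0.8)

# Sample one token index from the nucleus.
rng = np.random.default_rng(0)
sampled = int(rng.choice(keep, p=nucleus))
```

Here the two most likely tokens (indices 1 and 2) already cover 0.85 of the probability mass, so the remaining two are dropped before sampling.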

This approach automatically adapts to the model's confidence: when the distribution is peaked, fewer tokens are kept; when it is flatter, more tokens are included, helping balance coherence and diversity.
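The adaptive behaviour is easy to see by comparing a peaked and a flat distribution at the same threshold. Both distributions below are invented for illustration, and nucleus_size is a hypothetical helper.

```python
import numpy as np

def nucleus_size(probs, p=0.9):
    """Count the tokens needed, from most to least likely, to reach total probability p."""
    sorted_probs = np.sort(np.asarray(probs, dtype=float))[::-1]
    cumulative = np.cumsum(sorted_probs)
    return min(int(np.searchsorted(cumulative, p, side="left")) + 1, len(sorted_probs))

peaked = [0.92, 0.04, 0.02, 0.01, 0.01]  # the model is confident
flat = [0.22, 0.21, 0.20, 0.19, 0.18]    # the model is uncertain

peaked_size = nucleus_size(peaked)
flat_size = nucleus_size(flat)
print(peaked_size, flat_size)  # 1 5
```

The same p=0.9 keeps a single token when the model is confident and all five when it is not.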

The parameter p (typically 0.8–0.95) controls the trade-off; lower p yields more focused output while higher p allows greater variety.

Example

When generating the next word after 'The cat sat on the', the model might assign high probability to 'mat' and 'sofa'. With p=0.9 the nucleus might contain only those two tokens, so the model samples between them rather than risking a low-probability word like 'airplane'.
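The same example can be worked through with concrete numbers. The probabilities below are assumptions chosen to match the description, not output from a real model, and nucleus is a hypothetical helper.

```python
import random

# Assumed next-token probabilities after 'The cat sat on the'.
next_token_probs = {"mat": 0.62, "sofa": 0.30, "floor": 0.05, "roof": 0.02, "airplane": 0.01}

def nucleus(probs, p):
    """Keep the most likely tokens until their total probability reaches p, then renormalize."""
    kept, total = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = prob
        total += prob
        if total >= p:
            break
    return {token: prob / total for token, prob in kept.items()}

candidates = nucleus(next_token_probs, p=0.9)

# Sample one word from the nucleus using the renormalized weights.
rng = random.Random(0)
choice = rng.choices(list(candidates), weights=list(candidates.values()), k=1)[0]
```

With these numbers 'mat' and 'sofa' together cover 0.92 of the probability, so the nucleus stops there and 'airplane' can never be drawn.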

Why it matters

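Top-p sampling is widely used with modern LLMs because it tends to produce text that is more varied than greedy decoding yet more coherent than sampling from the full distribution, and it adapts the candidate set at each step in a way fixed-k decoding cannot. This helps the output quality of chatbots, story generators, and other creative applications.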

Frequently asked questions

How is top-p different from top-k?

Top-k always keeps a fixed number k of tokens; top-p keeps a variable number whose probabilities sum to at least p, adapting to each prediction.
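A side-by-side sketch makes the contrast visible. The two distributions are hypothetical, and top_k_indices and top_p_indices are illustrative helper names.

```python
import numpy as np

def top_k_indices(probs, k):
    """Indices of the k most likely tokens, regardless of how probability is spread."""
    return np.argsort(-np.asarray(probs, dtype=float), kind="stable")[:k]

def top_p_indices(probs, p):
    """Indices of the smallest set of most likely tokens with total probability >= p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(-probs, kind="stable")
    cumulative = np.cumsum(probs[order])
    cutoff = min(int(np.searchsorted(cumulative, p, side="left")) + 1, len(probs))
    return order[:cutoff]

confident = [0.95, 0.02, 0.01, 0.01, 0.01]
uncertain = [0.28, 0.24, 0.20, 0.16, 0.12]

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    print(name, len(top_k_indices(dist, k=3)), len(top_p_indices(dist, p=0.9)))
```

Top-k with k=3 keeps three tokens in both cases, while top-p with p=0.9 keeps one token for the confident prediction and all five for the uncertain one.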

Related terms

Greedy Decoding

Greedy decoding is a text generation strategy in NLP where, at each step, the model selects the single token with the highest probability as the next output.
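For comparison with sampling, here is a minimal sketch of greedy decoding over a toy lookup table. The table stands in for a real model and is entirely hypothetical, as are the start and end markers.

```python
# Toy "model": maps the previous token to a distribution over next tokens.
toy_model = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "mat": 0.2},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"</s>": 0.9, "down": 0.1},
}

def greedy_decode(model, start="<s>", end="</s>", max_steps=10):
    """At each step, pick the single most probable next token; no randomness involved."""
    tokens, current = [], start
    for _ in range(max_steps):
        distribution = model.get(current)
        if not distribution:
            break
        current = max(distribution, key=distribution.get)
        if current == end:
            break
        tokens.append(current)
    return tokens

print(greedy_decode(toy_model))  # ['the', 'cat', 'sat']
```

Because the choice is always the arg-max, running this twice gives the same output, which is the determinism that top-p sampling relaxes.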

Hallucination

In LLMs, hallucination is when the model generates fluent, confident text that is factually incorrect, fabricated, or not supported by its training data or the provided context.

Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) is a technique that improves large language models by retrieving relevant external information before generating a response.

Temperature

Temperature is a parameter in large language models that controls the randomness of generated text. Lower values produce more focused and deterministic outputs, while higher values increase creativity and variability.
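A common way to apply temperature is to divide the logits by it before the softmax. The sketch below assumes that formulation and uses made-up logits to show how the probability of the most likely token changes.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (temperature must be > 0)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()  # numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()

logits = [2.0, 1.0, 0.0]  # hypothetical logits for three tokens

# Probability of the most likely token at a low, neutral, and high temperature.
top_probability = {
    t: float(softmax_with_temperature(logits, t)[0]) for t in (0.5, 1.0, 2.0)
}
```

The top token's probability falls as the temperature rises, so low temperatures concentrate probability on the favourite and high temperatures spread it out.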