What is Top-p Sampling?
Also known as: Nucleus Sampling
Top-p sampling (nucleus sampling) is a text-generation technique that dynamically selects the smallest set of most likely next tokens whose combined probability reaches a threshold p (e.g. 0.9), then samples from that set.
Fixed-size methods like top-k always keep the same number of candidates. Top-p instead looks at the model's probability distribution and keeps adding tokens in descending order of probability until their cumulative probability reaches p, forming a variable-sized 'nucleus'. The probabilities of the kept tokens are then renormalized to sum to 1 before one of them is sampled.
This approach automatically adapts to the model's confidence: when the distribution is peaked, fewer tokens are kept; when it is flatter, more tokens are included, helping balance coherence and diversity.
The parameter p (typically 0.8–0.95) controls the trade-off; lower p yields more focused output while higher p allows greater variety.
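The selection procedure described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation of any particular framework; the function names top_p_filter and sample_top_p, and the representation of the distribution as a dict mapping tokens to probabilities, are assumptions made for this example.

```python
import random

def top_p_filter(probs, p):
    """Return the nucleus: the smallest set of top-ranked tokens whose
    cumulative probability reaches p, renormalized to sum to 1.

    `probs` is a hypothetical dict {token: probability} that sums to 1.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Rank tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus = []
    cumulative = 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:  # threshold reached: stop adding tokens
            break
    # Renormalize so the kept probabilities form a valid distribution.
    total = sum(prob for _, prob in nucleus)
    return {token: prob / total for token, prob in nucleus}

def sample_top_p(probs, p, rng=random):
    """Draw one token from the renormalized nucleus."""
    nucleus = top_p_filter(probs, p)
    tokens = list(nucleus)
    weights = list(nucleus.values())
    return rng.choices(tokens, weights=weights, k=1)[0]
```

With a peaked distribution the nucleus shrinks to one or two tokens, while a flatter distribution keeps more candidates for the same p, which is the adaptive behavior described above.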
Example
When generating the next word after 'The cat sat on the', the model might assign high probability to 'mat' and 'sofa'. If those two tokens together account for at least 0.9 of the probability mass, then with p=0.9 the nucleus contains only them, and the next word is sampled from those two rather than risking a low-probability word like 'airplane'.
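To make the example concrete, the sketch below works through it with made-up numbers. The probabilities are hypothetical values chosen for illustration, not output from a real model, and the helper name nucleus is likewise an assumption.

```python
from itertools import accumulate

# Hypothetical next-token probabilities for "The cat sat on the"
# (illustrative values only, not taken from any real model).
next_token_probs = {
    "mat": 0.62,
    "sofa": 0.30,
    "floor": 0.05,
    "roof": 0.02,
    "airplane": 0.01,
}

def nucleus(probs, p):
    """Return the top-ranked tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    running_totals = accumulate(prob for _, prob in ranked)
    kept = []
    for (token, _), total in zip(ranked, running_totals):
        kept.append(token)
        if total >= p:
            break
    return kept

kept = nucleus(next_token_probs, 0.9)
print(kept)  # ['mat', 'sofa']

# Renormalize the kept probabilities before sampling.
mass = sum(next_token_probs[t] for t in kept)
renormalized = {t: next_token_probs[t] / mass for t in kept}
```

Under these assumed values, 'mat' and 'sofa' cover 0.92 of the probability mass, so the nucleus stops there. After renormalization 'mat' is sampled roughly two times out of three, and 'airplane' can never be chosen.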
Why it matters
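Top-p sampling is widely used in modern LLMs because it tends to produce text that is more varied than greedy decoding, which always picks the single most likely token, while staying more coherent than fixed-k sampling, which can admit unlikely tokens when the model is confident and cut off plausible ones when it is not. This improves the quality of chatbots, story generators, and other creative applications.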
Frequently asked questions
How does top-p differ from top-k sampling?
Top-k always keeps a fixed number k of tokens; top-p keeps a variable number whose probabilities sum to at least p, adapting to each prediction.
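The difference is easiest to see by applying both filters to a confident and an uncertain distribution. The following is a small sketch under stated assumptions: the two distributions are invented for illustration, and top_k_tokens and top_p_tokens are hypothetical helper names rather than functions from any library.

```python
def top_k_tokens(probs, k):
    """Keep the k most likely tokens, regardless of their probabilities."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:k]

def top_p_tokens(probs, p):
    """Keep the most likely tokens until their cumulative probability reaches p."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    kept, cumulative = [], 0.0
    for token in ranked:
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    return kept

# Invented distributions: one peaked, one nearly flat.
confident = {"Paris": 0.96, "Lyon": 0.02, "Rome": 0.01, "Nice": 0.01}
uncertain = {"red": 0.28, "blue": 0.26, "green": 0.24, "black": 0.22}

for name, dist in [("confident", confident), ("uncertain", uncertain)]:
    k_count = len(top_k_tokens(dist, 2))
    p_count = len(top_p_tokens(dist, 0.9))
    print(f"{name}: top-k keeps {k_count}, top-p keeps {p_count}")
# confident: top-k keeps 2, top-p keeps 1
# uncertain: top-k keeps 2, top-p keeps 4
```

With k=2, top-k lets the unlikely 'Lyon' through in the confident case and discards two plausible colors in the uncertain case, whereas top-p with p=0.9 narrows to one token in the first case and keeps all four in the second.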
Related terms
Greedy decoding is a text generation strategy in NLP where, at each step, the model selects the single token with the highest probability as the next output.
In LLMs, hallucination is when the model generates fluent, confident text that is factually incorrect, fabricated, or not supported by its training data.
Retrieval-Augmented Generation (RAG) is a technique that improves large language models by retrieving relevant external information before generating a response.
Temperature is a parameter in large language models that controls the randomness of generated text. Lower values produce more focused and deterministic outputs, while higher values increase creativity and variability.