Why do some models advertise 128k context lengths?

A 128k-token context window lets a model process much longer documents or conversations in a single pass without dropping details from the start.

What is Context Length?

Context length (also called the context window) is the maximum number of tokens an LLM can process in a single pass, typically counting both the input prompt and the generated output. It acts as the model's working memory.

It determines how much preceding text the model can reference when generating a response, directly shaping coherence over long inputs.

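Context length is measured in tokens rather than words. In a standard transformer, self-attention compares every token with every other token, so compute and memory grow roughly quadratically with sequence length, which limits how long a context is practical during inference.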
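
As a rough illustration of that scaling, the sketch below counts the pairwise scores in one dense self-attention matrix (a single head in a single layer). The function name is made up for this example, and it assumes "4k" means 4,096 tokens and "128k" means 131,072 tokens; real models often reduce this cost with optimized or sparse attention.

```python
def attention_score_entries(num_tokens: int) -> int:
    """Entries in one dense self-attention score matrix (one head, one layer)."""
    return num_tokens * num_tokens


# Assumed sizes: 4k = 4,096 tokens, 128k = 131,072 tokens.
for n in (4_096, 131_072):
    print(f"{n:>7} tokens -> {attention_score_entries(n):,} pairwise scores")

# 32x more tokens means 32 * 32 = 1,024x more pairwise scores.
ratio = attention_score_entries(131_072) // attention_score_entries(4_096)
print(f"growth factor: {ratio}x")
```

The point of the comparison is that the cost grows with the square of the length, so a 32x longer context is far more than 32x as expensive for dense attention.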

Input that exceeds the limit must be truncated, summarized, or split into chunks, while techniques like sliding-window attention or sparse attention aim to extend the usable context by reducing how many token pairs are compared.
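
A minimal sketch of two input-side workarounds, assuming the text has already been converted to a list of token IDs. Both helper names are hypothetical. Note that this is chunking done before the model sees the input, not sliding-window attention inside the model itself.

```python
from typing import List


def truncate_to_recent(tokens: List[int], max_tokens: int) -> List[int]:
    """Keep only the most recent max_tokens tokens, dropping the oldest."""
    if max_tokens <= 0:
        return []
    return tokens[-max_tokens:]


def sliding_windows(tokens: List[int], window: int, stride: int) -> List[List[int]]:
    """Split tokens into overlapping windows of at most `window` tokens."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        # Stop once the current window reaches the end of the input.
        if start + window >= len(tokens):
            break
        start += stride
    return windows


tokens = list(range(10))                 # stand-in for real token IDs
recent = truncate_to_recent(tokens, 4)   # [6, 7, 8, 9]
chunks = sliding_windows(tokens, window=4, stride=2)
```

Truncation is simple but discards the earliest tokens. Overlapping windows keep everything, at the cost of processing the input in several separate passes.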

Example

A model with a 4k-token context length can read and answer questions about a short article of a few thousand words, while a 128k-token model can take in a book-length text, on the order of a short novel, in one pass.
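
To make the comparison concrete, here is a back-of-the-envelope check. It assumes a common rule of thumb of about 4 tokens for every 3 English words; the true ratio depends on the tokenizer and the text. The word counts and function names are illustrative, not measurements.

```python
def estimate_tokens(word_count: int) -> int:
    """Estimate tokens as ceil(words * 4 / 3); an assumed rule of thumb."""
    return (word_count * 4 + 2) // 3


def fits_in_context(word_count: int, context_length: int) -> bool:
    """True if the estimated token count fits within the context length."""
    return estimate_tokens(word_count) <= context_length


SHORT_ARTICLE_WORDS = 1_500   # hypothetical article length
NOVEL_WORDS = 90_000          # hypothetical novel length

for name, words in (("article", SHORT_ARTICLE_WORDS), ("novel", NOVEL_WORDS)):
    for limit in (4_096, 131_072):
        verdict = "fits" if fits_in_context(words, limit) else "does not fit"
        print(f"{name} (~{estimate_tokens(words):,} tokens) {verdict} in {limit:,}")
```

Under these assumptions the article fits in both windows, while the novel fits only in the larger one. A much longer book could still exceed 128k tokens.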

Why it matters

Larger context lengths enable LLMs to manage lengthy documents, multi-turn conversations, and complex reasoning tasks that are central to real-world applications today.

Frequently asked questions

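When the input exceeds the context length, the model or the application around it typically truncates older or excess tokens, which can cause it to lose important earlier information. Some APIs reject the request with an error instead.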
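
One way an application might apply this kind of truncation to a chat history is sketched below. The whitespace-based counter is only a stand-in for a real tokenizer, and `trim_history` is a hypothetical helper, not part of any library.

```python
from typing import Callable, List


def count_tokens_naive(text: str) -> int:
    """Stand-in token counter: whitespace-separated words (an assumption)."""
    return len(text.split())


def trim_history(
    messages: List[str],
    budget: int,
    count_tokens: Callable[[str], int] = count_tokens_naive,
) -> List[str]:
    """Keep the newest messages whose combined token count fits the budget."""
    kept: List[str] = []
    used = 0
    # Walk from newest to oldest and stop at the first message that overflows.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    "my name is Ada",
    "what is a token",
    "explain context length briefly",
]
trimmed = trim_history(history, budget=8)
```

With a budget of 8, the oldest message is dropped, so the model would no longer see the user's name. This is the information loss described above.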

Related terms

Token

A token is the basic unit of text that an LLM reads and generates. It may be a whole word, part of a word, or punctuation, depending on the model's tokenizer.
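
The toy example below shows how one word can become several tokens. It uses greedy longest-match against a tiny, made-up vocabulary. This is a simplification for illustration and not the algorithm or vocabulary of any real model's tokenizer.

```python
from typing import List, Set

# Hypothetical toy vocabulary; real tokenizers learn far larger ones from data.
VOCAB: Set[str] = {"token", "ization", "un", "believ", "able", "is", "!", " "}


def greedy_tokenize(text: str, vocab: Set[str] = VOCAB) -> List[str]:
    """Greedy longest-match split; unknown characters become one-char tokens."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


pieces = greedy_tokenize("tokenization is unbelievable!")
print(pieces)
# ['token', 'ization', ' ', 'is', ' ', 'un', 'believ', 'able', '!']
```

Here three words and one punctuation mark become nine tokens, which is why context length is counted in tokens rather than words.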

Transformer

A Transformer is a neural network architecture that processes sequential data like text using self-attention to weigh relationships between all parts of the input at once.

Attention Mechanism

The attention mechanism is a technique in neural networks that lets the model dynamically focus on the most relevant parts of the input when processing each element, rather than treating all inputs equally.
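
A minimal sketch of scaled dot-product attention for a single query, written in plain Python for readability. The vectors are made-up numbers, and real implementations operate on batched tensors with learned projections that are omitted here.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query: Vector, keys: List[Vector]) -> Vector:
    """Score the query against each key, scale by sqrt(dim), then softmax."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Return the attention-weighted average of the value vectors."""
    weights = attention_weights(query, keys)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
query = [5.0, 0.0]  # points in the same direction as the first key

weights = attention_weights(query, keys)
output = attend(query, keys, values)
```

Because the query aligns with the first key, most of the weight goes to the first value vector, so the output lies close to it rather than being an equal mix of both.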

Prompt

A prompt is the input text, question, or instruction given to an AI model (especially a large language model) to guide what it should generate or how it should respond.

Inference

Inference is the stage where a trained machine learning model is used to generate predictions or outputs on new, unseen data. In infrastructure contexts, the emphasis is on deploying and serving models efficiently in production.

Embedding

An embedding (or vector embedding) is a way to represent words, sentences, or other data as dense numerical vectors in a high-dimensional space so that similar items end up close together.
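
"Close together" is often measured with cosine similarity. The sketch below uses hand-picked three-dimensional vectors purely for illustration; real embeddings come from a trained model and usually have hundreds or thousands of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration only, not output from a real model.
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]
car = [0.1, 0.2, 0.9]

sim_cat_kitten = cosine_similarity(cat, kitten)
sim_cat_car = cosine_similarity(cat, car)
print(f"cat vs kitten: {sim_cat_kitten:.3f}")
print(f"cat vs car:    {sim_cat_car:.3f}")
```

With these example vectors, "cat" scores much higher against "kitten" than against "car", which is the behavior an embedding space is meant to capture.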