What is a Context Window?
A context window is the maximum number of tokens an LLM can process together in one pass, including the user's input, any conversation history, and the output the model generates.
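It defines the span of text the model can attend to when making predictions. The limit comes from how the model was built and trained, chiefly its positional encoding scheme and the sequence lengths seen during training, together with the memory and compute cost of attention.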
When the input exceeds the window, many applications truncate the oldest tokens, while some APIs reject the request outright. After truncation, the model has no access to the dropped text for the current generation step.
Recent models expand windows to 128k–1M tokens, but the compute cost of standard self-attention grows quadratically with sequence length, and memory use rises with it.
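To make the quadratic growth concrete, here is a minimal sketch that counts the pairwise query-key scores a single attention head computes over a full sequence. The function name is hypothetical, and the count assumes standard (dense) self-attention; causal masking and optimized kernels reduce the constant factor or avoid storing the full matrix, but the number of comparisons still scales with the square of the length.

```python
def attention_score_entries(seq_len: int) -> int:
    """Pairwise query-key scores one head computes for a sequence (dense attention)."""
    return seq_len * seq_len


# Compare a small window with two long-context sizes from the text.
for n in (4_000, 128_000, 1_000_000):
    entries = attention_score_entries(n)
    print(f"{n:>9,} tokens -> {entries:,} scores per head per layer")
```

Doubling the window quadruples this count, which is why long-context models often rely on more efficient attention variants.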
Example
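If a user pastes a 10,000-word document (typically more than 10,000 tokens, since tokenizers often split words into several pieces) into a model whose window holds only 4,000 tokens, most of the document cannot fit. If the application truncates from the start, the earliest paragraphs are dropped and the model cannot reference them when answering questions.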
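The truncation in this example can be sketched as follows. This is an illustration under stated assumptions, not how any particular model or API behaves: `truncate_to_window` is a hypothetical helper, the token list is a synthetic stand-in for real tokenizer output, and the 13,000-token length is only a rough guess for a 10,000-word document.

```python
def truncate_to_window(tokens: list[str], window: int) -> list[str]:
    """Keep only the most recent `window` tokens, dropping the oldest ones."""
    if window <= 0:
        return []
    return tokens[-window:]


# Synthetic stand-in for a tokenized document (assumed length, not measured).
document = [f"tok{i}" for i in range(13_000)]

visible = truncate_to_window(document, window=4_000)
print(len(visible))   # 4000
print(visible[0])     # tok9000: everything before this token is invisible to the model
```

A real application would also reserve part of the window for the user's question and the model's answer, so even less of the document would fit.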
Why it matters
Context-window size sets hard limits on coherent long conversations, document analysis, and agent memory, so it directly affects how useful an LLM is in practice.
Frequently asked questions
Context windows range from a few thousand tokens in older models to 128k–1M+ tokens in current frontier models.
Related terms
A token is the basic unit of text that an LLM reads and generates. It may be a whole word, part of a word, or punctuation, depending on the model's tokenizer.
A Transformer is a neural network architecture that processes sequential data like text using self-attention to weigh relationships between all parts of the input at once.
The attention mechanism is a technique in neural networks that lets the model dynamically focus on the most relevant parts of the input when processing each element, rather than treating all inputs equally.
Positional encoding adds information about token order to input embeddings in transformer models, which lack any built-in sense of sequence because they process tokens in parallel.
A prompt is the input text, question, or instruction given to an AI model (especially a large language model) to guide what it should generate or how it should respond.
Retrieval-Augmented Generation (RAG) is a technique that improves large language models by retrieving relevant external information before generating a response.