What is Positional Encoding?
Positional encoding adds information about token order to input embeddings in transformer models, which lack any built-in sense of sequence because they process tokens in parallel.
Transformers rely on self-attention, which compares tokens by their content alone and ignores where each one sits in the sequence. Without an extra signal, the model sees 'cat sat on mat' and 'mat on sat cat' as the same collection of tokens.
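The order-blindness described above can be checked numerically. The following is a minimal sketch, not a full attention layer: it assumes a single head with identity query, key, and value projections, and the function name self_attention and the random "embeddings" are made up for illustration.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention with identity projections (a simplification)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))   # 4 made-up token embeddings of width 8
perm = np.array([3, 2, 1, 0])      # reverse the token order

out = self_attention(tokens)
out_shuffled = self_attention(tokens[perm])

# Shuffling the input only shuffles the output rows in the same way:
# each token gets the same result no matter where it stood.
equivariant = np.allclose(out[perm], out_shuffled)
```

Here equivariant comes out true: every token receives the same output vector regardless of its position, which is why position information has to be supplied separately.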
The standard solution adds a fixed or learned vector to each token embedding. Sinusoidal encodings use sine and cosine functions of different frequencies so the model can easily learn relative positions; learned encodings treat position as another trainable embedding.
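A sinusoidal encoding can be sketched in a few lines. This assumes the formulation from the original Transformer paper, with even dimensions holding sines and odd dimensions holding cosines, and a base of 10000; the function name sinusoidal_encoding and the random token embeddings are illustrative only.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """PE[pos, 2i] = sin(pos / base**(2i/d_model)); PE[pos, 2i+1] = cos of the same angle."""
    assert d_model % 2 == 0, "this sketch assumes an even embedding width"
    positions = np.arange(num_positions)[:, None]           # shape (num_positions, 1)
    freqs = base ** (-np.arange(0, d_model, 2) / d_model)   # shape (d_model / 2,)
    angles = positions * freqs                              # shape (num_positions, d_model / 2)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(5, 16))   # stand-in for real token embeddings
pe = sinusoidal_encoding(5, 16)

# The encoding is simply added to the embeddings before the first layer.
model_input = token_embeddings + pe
```

A learned encoding would replace sinusoidal_encoding with a trainable table of the same shape, indexed by position.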
Because each position receives a distinct vector, and because a sinusoidal encoding at a fixed offset is a linear function of the encoding at the current position, attention heads can learn patterns such as 'the word two positions after a verb'.
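For sinusoidal encodings, the arithmetic in question is a rotation: moving k positions forward corresponds to multiplying the position vector by a fixed matrix that depends only on k. The sketch below checks this under the same assumptions as the original Transformer formulation (base 10000, interleaved sine and cosine); the helper names encode and shift_matrix are hypothetical.

```python
import numpy as np

d_model, base = 8, 10000.0
freqs = base ** (-np.arange(0, d_model, 2) / d_model)

def encode(pos):
    """Sinusoidal vector for one position, laid out as [sin, cos, sin, cos, ...]."""
    vec = np.empty(d_model)
    vec[0::2] = np.sin(pos * freqs)
    vec[1::2] = np.cos(pos * freqs)
    return vec

def shift_matrix(k):
    """Block-diagonal matrix that maps encode(pos) to encode(pos + k) for any pos."""
    m = np.zeros((d_model, d_model))
    for i, w in enumerate(freqs):
        c, s = np.cos(k * w), np.sin(k * w)
        # sin(a + b) = sin(a)cos(b) + cos(a)sin(b)
        # cos(a + b) = cos(a)cos(b) - sin(a)sin(b)
        m[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return m

shift_by_2 = shift_matrix(2)

# The same matrix works from any starting position.
same_from_3 = np.allclose(shift_by_2 @ encode(3), encode(5))
same_from_10 = np.allclose(shift_by_2 @ encode(10), encode(12))
```

Both checks hold, so a single linear map captures "two positions later" wherever it is applied in the sequence.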
Example
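In the sentence 'I saw a saw', both occurrences of 'saw' share the same token embedding. Adding positional vectors gives the word at position 2 a different input than the word at position 4, so the model can tell the two apart and, using the surrounding context, learn that the first is a verb and the second is a noun.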
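The effect on a repeated word can be sketched with a learned-style position table. Everything here is illustrative: the three-word vocabulary, the table sizes, and the random values standing in for trained parameters are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"I": 0, "saw": 1, "a": 2}   # toy vocabulary
d_model, max_len = 8, 16

# In a real model both tables are trained; random values stand in for them here.
token_table = rng.normal(size=(len(vocab), d_model))
position_table = rng.normal(size=(max_len, d_model))

sentence = ["I", "saw", "a", "saw"]
token_ids = np.array([vocab[w] for w in sentence])
positions = np.arange(len(sentence))

token_vecs = token_table[token_ids]
model_input = token_vecs + position_table[positions]

# The two 'saw' tokens sit at indices 1 and 3 (0-based).
same_before = np.array_equal(token_vecs[1], token_vecs[3])   # identical embeddings
same_after = np.allclose(model_input[1], model_input[3])     # distinct once position is added
```

Before the positional vectors are added, the two 'saw' rows are identical; afterwards they differ, which gives later layers something to distinguish.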
Why it matters
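Positional encoding is what allows transformers to model word order while still processing all tokens in parallel. That combination helped make them the backbone of most modern LLMs and let them displace recurrent networks in many sequence tasks. The choice of encoding scheme also affects how well a model handles sequences longer than those it saw during training.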
Frequently asked questions
Why do transformers need positional encoding?
Because all tokens are processed simultaneously, the architecture itself has no notion of 'before' or 'after' unless explicit position information is supplied.
Related terms
A Transformer is a neural network architecture that processes sequential data like text using self-attention to weigh relationships between all parts of the input at once.
Word embedding is a technique that represents words as dense numerical vectors in a continuous space, allowing machines to capture semantic relationships between words.
The attention mechanism is a technique in neural networks that lets the model dynamically focus on the most relevant parts of the input when processing each element, rather than treating all inputs equally.
Context length is the maximum number of tokens an LLM can process in a single input at once, acting as its effective memory window.
A context window is the maximum number of tokens an LLM can process together in one pass, including the user's input and any conversation history.
A foundation model is a large-scale AI model trained on massive, diverse datasets that can be adapted to perform many different tasks with minimal additional training.