Why do we need positional encodings with attention?

Because attention itself has no notion of order, positional encodings are added to input embeddings so the model can distinguish word positions.

Is attention only used in language models?

No, attention is also widely used in computer vision (Vision Transformers), speech recognition, and multimodal models.

What is the Attention Mechanism?

Also known as: Self-Attention

The attention mechanism is a technique in neural networks that lets the model dynamically focus on the most relevant parts of the input when processing each element, rather than treating all inputs equally.

It works by computing similarity scores between a query vector and all key vectors from the input sequence, then normalizing those scores (typically with a softmax) into attention weights. These weights determine how much each value vector contributes to the output representation for that position.

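In self-attention, queries, keys, and values all come from the same sequence, allowing every token to attend to every other token (including itself). This is typically implemented as scaled dot-product attention: the query-key dot products are divided by the square root of the key dimension, passed through a softmax, and used to take a weighted sum of the value vectors.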
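The steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names are hypothetical, the toy vectors are made up, and the learned query, key, and value projections that real models apply are omitted (treated as identity) to keep the example short.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Self-attention: queries, keys, and values all come from the same sequence.
# Assumption: learned projections are skipped, so the raw vectors are used.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1 and says how strongly one token attends to every token in the sequence, itself included.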

Multi-head attention runs several attention operations in parallel to capture different types of relationships, then combines the results.
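As a rough sketch of the idea, the example below splits each input vector into equal slices, runs attention separately on each slice, and concatenates the results. It simplifies heavily: real multi-head attention uses separate learned projections per head plus a final output projection, both of which are left out here, and the function names and toy data are hypothetical.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

def multi_head_self_attention(sequence, num_heads):
    """Split each vector into num_heads slices, attend per slice, concatenate."""
    d_model = len(sequence[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        # Assumption: slicing stands in for the learned per-head projections.
        head_input = [vec[lo:hi] for vec in sequence]
        head_outputs.append(attention(head_input, head_input, head_input))
    # Concatenate the heads back into d_model-sized vectors, position by position.
    return [sum((head[pos] for head in head_outputs), [])
            for pos in range(len(sequence))]

sequence = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
combined = multi_head_self_attention(sequence, num_heads=2)
```

Because each head sees a different slice of the representation, it can produce a different attention pattern over the same tokens.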

Example

When translating 'The cat sat on the mat' to French, the model uses attention to strongly link the English word 'cat' with the French word 'chat' while down-weighting less relevant words like 'on'.

Why it matters

Attention is the core building block of the Transformer architecture that powers nearly all modern large language models, enabling efficient parallel training and better handling of long-range dependencies than RNNs.

Frequently asked questions

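What is the difference between attention and self-attention?

Attention is the general mechanism. In its narrower, original sense it usually means encoder-decoder (cross-) attention, where queries come from the decoder and keys and values come from the encoder. Self-attention uses queries, keys, and values from the same sequence.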

Related terms

Transformer

A Transformer is a neural network architecture that processes sequential data like text using self-attention to weigh relationships between all parts of the input at once.

Positional Encoding

Positional encoding adds information about token order to input embeddings in transformer models, which lack any built-in sense of sequence because they process tokens in parallel.
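One common scheme is the fixed sinusoidal encoding from the original Transformer paper. The sketch below assumes that scheme and an even embedding size. Many models use learned or rotary position encodings instead, and the function names and toy embeddings here are hypothetical.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return seq_len vectors of size d_model (d_model assumed even)."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the positional vector to each embedding, element-wise."""
    pe = sinusoidal_positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos_vec)]
            for emb, pos_vec in zip(embeddings, pe)]

# Two identical token embeddings become distinguishable once positions are added.
embeddings = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
encoded = add_positions(embeddings)
```

After this step the same token at two different positions no longer has the same input vector, which is what lets attention take order into account.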

Recurrent Neural Network

A Recurrent Neural Network (RNN) is a type of neural network built to handle sequential data by passing information from one step to the next through a hidden state that acts like a memory.

Embedding

An embedding (or vector embedding) is a way to represent words, sentences, or other data as dense numerical vectors in a high-dimensional space so that similar items end up close together.

Tokenization

Tokenization is the process of breaking text into smaller units called tokens that language models can process numerically.

Context Length

Context length is the maximum number of tokens an LLM can process at once, acting as its effective memory window.