What is an Attention Mechanism?
Also known as: Self-Attention
The attention mechanism is a technique in neural networks that lets the model dynamically focus on the most relevant parts of the input when processing each element, rather than treating all inputs equally.
It works by computing similarity scores between a query vector and every key vector from the input sequence, then normalizing those scores into attention weights that sum to 1. These weights determine how much each value vector contributes to the output representation for that position.
In self-attention, queries, keys, and values all come from the same sequence, allowing every token to attend to every other token (including itself). This is typically implemented as scaled dot-product attention: the dot products of queries and keys are divided by the square root of the key dimension, passed through a softmax to produce the weights, and then used to take a weighted sum of the values.
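The steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function names, the projection matrices w_q, w_k, and w_v, and the random toy input are all assumptions made for the example, and real models learn these matrices during training.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x:   (seq_len, d_model) token representations
    w_*: (d_model, d_k) projection matrices (hypothetical, random here)
    """
    q = x @ w_q                          # queries, from the same sequence
    k = x @ w_k                          # keys, from the same sequence
    v = x @ w_v                          # values, from the same sequence
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights          # weighted sum of values per position

# Toy input: 4 tokens with 8-dimensional representations (random, for illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
output, weights = self_attention(x, w_q, w_k, w_v)
```

Row i of weights shows how strongly token i attends to each token in the sequence, itself included, and row i of output is the resulting representation for that position.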
Multi-head attention runs several attention operations in parallel to capture different types of relationships, then combines the results.
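A rough sketch of multi-head self-attention in NumPy follows. It assumes the combination step is a concatenation of the heads followed by an output projection w_o, as in the standard Transformer design; the function names, the weight matrices, and the toy dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); w_q, w_k, w_v, w_o: (d_model, d_model)."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Each head computes its own scaled dot-product attention.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                    # (heads, seq, d_head)

    # Combine: concatenate the heads, then apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Toy input: 5 tokens, model width 16, 4 heads (random values, for illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
w_q, w_k, w_v, w_o = (rng.normal(size=(16, 16)) for _ in range(4))
out, head_weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads=4)
```

Because each head works on its own slice of the projected vectors, the heads can produce different attention patterns over the same sequence; head_weights[h] holds the pattern for head h.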
Example
When translating 'The cat sat on the mat' to French, the model uses attention at the step where it produces the French word 'chat' to place a high weight on the English word 'cat' while down-weighting less relevant words like 'on'.
Why it matters
Attention is the core building block of the Transformer architecture that powers nearly all modern large language models, enabling efficient parallel training and better handling of long-range dependencies than RNNs.
Frequently asked questions
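Attention and self-attention differ in where the queries, keys, and values come from. Attention is the general mechanism, and the term often refers to encoder-decoder attention (also called cross-attention), where queries come from the decoder and keys/values come from the encoder. Self-attention uses queries, keys, and values from the same sequence.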
Related terms
A Transformer is a neural network architecture that processes sequential data like text using self-attention to weigh relationships between all parts of the input at once.
Positional encoding adds information about token order to input embeddings in transformer models, which lack any built-in sense of sequence because they process tokens in parallel.
A Recurrent Neural Network (RNN) is a type of neural network built to handle sequential data by passing information from one step to the next through a hidden state that acts like a memory.
An embedding (or vector embedding) is a way to represent words, sentences, or other data as dense numerical vectors in a high-dimensional space so that similar items end up close together.
Tokenization is the process of breaking text into smaller units called tokens that language models can process numerically.
Context length is the maximum number of tokens an LLM can process in a single input at once, acting as its effective memory window.