Skip to content
Sign in

What is Attention Mechanism?

Also known as: Self-Attention

The attention mechanism is a technique in neural networks that lets the model dynamically focus on the most relevant parts of the input when processing each element, rather than treating all inputs equally.

It works by computing similarity scores (attention weights) between a query vector and all key vectors from the input sequence. These weights determine how much each value vector contributes to the output representation for that position.

In self-attention, queries, keys, and values all come from the same sequence, allowing every token to attend to every other token (including itself). This is typically implemented via scaled dot-product attention followed by a softmax.

Multi-head attention runs several attention operations in parallel to capture different types of relationships, then combines the results.

Example

When translating 'The cat sat on the mat' to French, the model uses attention to strongly link the English word 'cat' with the French word 'chat' while down-weighting less relevant words like 'on'.

Why it matters

Attention is the core building block of the Transformer architecture that powers nearly all modern large language models, enabling efficient parallel training and better handling of long-range dependencies than RNNs.

Frequently asked questions

Attention generally refers to encoder-decoder attention where queries come from the decoder and keys/values from the encoder; self-attention uses queries, keys, and values from the same sequence.