What is self-attention in simple terms?

Self-attention lets each word 'look at' every other word in the input and decide how much weight to give each one, capturing context without needing recurrence.
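As a rough illustration, here is a minimal sketch of that weighting step in plain Python. It assumes toy 2-dimensional word vectors and skips the learned query, key, and value projections that a real model applies first. The names `softmax` and `self_attention` and the example vectors are illustrative, not taken from any library.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values.

    Simplification: real Transformers first multiply the inputs by learned
    projection matrices; that step is left out here.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this word against every word in the input (itself included).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # The output is a weighted average of all word vectors.
        mixed = [sum(w * vec[i] for w, vec in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy embeddings for a three-word input.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(words)
```

Each row of `weights` sums to 1 and says how much one word draws on every word in the input; the matching row of `outputs` is the resulting context-aware vector.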

Why do Transformers need positional encodings?

Because attention treats inputs as a set, positional encodings add information about word order so the model knows the sequence structure.
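To see why order has to be added explicitly, the sketch below runs a stripped-down attention step (no learned projections, an assumption made for brevity) on a toy input and on a shuffled copy. The helper `attend` and the vectors are hypothetical.

```python
import math

def attend(vectors):
    """Unprojected dot-product self-attention over a list of vectors."""
    dim = len(vectors[0])
    result = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        result.append([sum(w * v[i] for w, v in zip(weights, vectors))
                       for i in range(dim)])
    return result

tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
order = [2, 0, 1]                       # an arbitrary reordering
shuffled = [tokens[i] for i in order]

original_out = attend(tokens)
shuffled_out = attend(shuffled)

# Each token receives the same output no matter where it appears.
same = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for i, src in enumerate(order)
    for a, b in zip(shuffled_out[i], original_out[src])
)
```

Here `same` comes out true: shuffling the input only shuffles the output, so without extra position information the model could not tell one word order from another.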

What is a Transformer?

A Transformer is a neural network architecture that processes sequential data like text using self-attention to weigh relationships between all parts of the input at once.

It replaces the step-by-step processing of older sequential models such as RNNs with parallel attention, allowing the model to focus on relevant words regardless of their position in a sentence.

Key components include multi-head self-attention layers, feed-forward networks, positional encodings to retain order information, and often an encoder-decoder structure.

This design enables efficient training on massive datasets and scales well to large models used in modern language tasks.

Example

In a sentence like 'The cat sat on the mat because it was tired,' a Transformer can directly connect 'it' to 'cat' through attention scores, helping generate accurate translations or answers.

Why it matters

Transformers power nearly all state-of-the-art LLMs today, enabling breakthroughs in chatbots, translation, and content generation because they train efficiently at scale and can relate distant parts of a long context.

Frequently asked questions

How do Transformers differ from RNNs?

Transformers process all input tokens in parallel using attention, while RNNs handle them one at a time, making Transformers much faster to train on long sequences.
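The contrast can be sketched in plain Python. In the recurrent loop, step t cannot start until step t-1 has finished, whereas each attention score depends only on the inputs, so all of them could be computed at once on parallel hardware. The weights `w_in` and `w_rec` and both function names are illustrative assumptions; the sketch shows the dependency structure only, not an actual speed-up.

```python
import math

def rnn_states(inputs, w_in=0.5, w_rec=0.9):
    """Toy scalar RNN: every state depends on the previous state."""
    h = 0.0
    states = []
    for x in inputs:                 # must run in order, one step at a time
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def pairwise_scores(inputs):
    """Toy attention scores: each entry uses only two inputs.

    No entry depends on any other entry, so the whole table could be
    filled in simultaneously.
    """
    return [[a * b for b in inputs] for a in inputs]

sequence = [1.0, 2.0, 3.0]
states = rnn_states(sequence)
scores = pairwise_scores(sequence)
```

Changing the first input changes every later entry of `states`, but it only touches the first row and first column of `scores`.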

Related terms

Attention Mechanism

The attention mechanism is a technique in neural networks that lets the model dynamically focus on the most relevant parts of the input when processing each element, rather than treating all inputs equally.

Positional Encoding

Positional encoding adds information about token order to input embeddings in transformer models, which lack any built-in sense of sequence because they process tokens in parallel.
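One well-known scheme is the fixed sinusoidal encoding from the original Transformer paper; many models use learned or other position schemes instead. The sketch below assumes an even embedding size and toy embedding values, and the function names are illustrative.

```python
import math

def sinusoidal_encoding(position, dim):
    """Fixed sinusoidal encoding for one position (dim must be even)."""
    encoding = []
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        encoding.append(math.sin(angle))  # even index
        encoding.append(math.cos(angle))  # odd index
    return encoding

def add_positions(embeddings):
    """Add the encoding for each position to that token's embedding."""
    dim = len(embeddings[0])
    return [
        [e + p for e, p in zip(vec, sinusoidal_encoding(pos, dim))]
        for pos, vec in enumerate(embeddings)
    ]

# Two identical toy embeddings become distinguishable once positions are added.
embeddings = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
with_positions = add_positions(embeddings)
```

The same token at positions 0 and 1 now has two different vectors, which is what gives the attention layers something to tell the positions apart.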

Encoder-Decoder

An Encoder-Decoder is a neural network architecture that uses one model (the encoder) to compress input data into a compact representation and a second model (the decoder) to generate output from that representation.

Context Length

Context length is the maximum number of tokens an LLM can process in a single input, acting as its effective memory window.

Context Window

A context window is the maximum number of tokens an LLM can process together in one pass, including the user's input and any conversation history.
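A minimal sketch of how an application might keep a conversation inside a context window, assuming messages are already split into tokens and using a made-up limit of 8 tokens. The function `fit_to_window` is hypothetical, and dropping the oldest messages first is only one possible policy.

```python
def fit_to_window(history, user_input, max_tokens):
    """Keep the newest history messages that fit alongside the user input.

    Each message is a list of tokens. Older messages are dropped first.
    """
    if len(user_input) > max_tokens:
        raise ValueError("user input alone exceeds the context window")
    budget = max_tokens - len(user_input)
    kept = []
    for message in reversed(history):   # walk from newest to oldest
        if len(message) > budget:
            break
        kept.append(message)
        budget -= len(message)
    kept.reverse()                       # restore chronological order
    return kept + [user_input]

history = [["hi"], ["hello", "there"], ["how", "are", "you", "?"]]
prompt = fit_to_window(history, ["fine", "thanks"], max_tokens=8)
```

With these toy values the oldest message, `["hi"]`, is dropped because the two newer messages plus the user's input already fill all 8 tokens.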

Foundation Model

A foundation model is a large-scale AI model trained on massive, diverse datasets that can be adapted to perform many different tasks with minimal additional training.