What is a Transformer?
A Transformer is a neural network architecture that processes sequential data like text using self-attention to weigh relationships between all parts of the input at once.
Instead of reading tokens one at a time as recurrent neural networks (RNNs) do, it processes them in parallel through attention, so the model can focus on relevant words regardless of their position in a sentence.
Key components include multi-head self-attention layers, feed-forward networks, positional encodings to retain order information, and, in the original design, an encoder-decoder structure (many current language models use only the decoder stack).
This design enables efficient training on massive datasets and scales well to large models used in modern language tasks.
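The core self-attention step described above can be sketched in plain Python. This is a minimal single-head illustration, not a production implementation: the helper names (`matmul`, `transpose`, `self_attention`), the identity projection matrices, and the toy 2-dimensional inputs are all assumptions chosen for readability, and real models use tensor libraries, learned weights, and several heads.

```python
import math

def softmax(row):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # Multiply an (n x k) matrix by a (k x m) matrix, both stored as lists of rows.
    return [[sum(x * y for x, y in zip(a_row, b_col)) for b_col in zip(*b)]
            for a_row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over x (seq_len x d_model)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(q[0])
    # Every token is scored against every other token in one matrix product.
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Each output row is a weighted mix of all value vectors.
    return matmul(weights, v), weights

# Toy input: three "tokens" with hypothetical 2-dimensional embeddings.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections
output, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` sums to 1 and records how strongly one token attends to every token in the sequence, including itself.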
Example
In a sentence like 'The cat sat on the mat because it was tired,' a Transformer can directly connect 'it' to 'cat' through attention scores, helping generate accurate translations or answers.
Why it matters
Transformers underpin nearly all state-of-the-art large language models (LLMs) today. Their parallel training and their ability to relate distant tokens within a long context have enabled breakthroughs in chatbots, translation, and content generation.
Frequently asked questions
How do Transformers differ from RNNs?
Transformers process all input tokens in parallel using attention, while RNNs handle them one at a time, which makes Transformers much faster to train on long sequences.
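The difference in dependency structure can be sketched with scalars standing in for token vectors. Everything here is illustrative: `rnn_states`, `attend`, the weights `w_h` and `w_x`, and the input values are hypothetical. The point is only that each recurrent state needs the previous one, while each attention output depends on the raw inputs alone and so can be computed in any order, or all at once on parallel hardware.

```python
import math

def rnn_states(inputs, w_h=0.5, w_x=1.0):
    # Each hidden state depends on the previous one, so the loop must run in order.
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def attend(position, inputs):
    # The output for one position depends only on the inputs,
    # never on the output computed for another position.
    scores = [inputs[position] * x for x in inputs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * x for e, x in zip(exps, inputs))

inputs = [0.1, -0.4, 0.7, 0.2]
sequential = rnn_states(inputs)

# Attention outputs computed front-to-back and back-to-front are identical.
in_order = [attend(i, inputs) for i in range(len(inputs))]
reordered = {i: attend(i, inputs) for i in reversed(range(len(inputs)))}
```

Because no attention output waits on another, a real implementation computes them all in a single batched matrix operation.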
Related terms
The attention mechanism is a technique in neural networks that lets the model dynamically focus on the most relevant parts of the input when processing each element, rather than treating all inputs equally.
Positional encoding adds information about token order to input embeddings in Transformer models, which lack any built-in sense of sequence because they process tokens in parallel.
An Encoder-Decoder is a neural network architecture that uses one model (the encoder) to compress input data into a compact representation and a second model (the decoder) to generate output from that representation.
Context length is the maximum number of tokens an LLM can process in a single input at once, acting as its effective memory window.
A context window is the maximum number of tokens an LLM can process together in one pass, including the user's input and any conversation history.
A foundation model is a large-scale AI model trained on massive, diverse datasets that can be adapted to perform many different tasks with minimal additional training.
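Of the related terms above, positional encoding is the easiest to show concretely. The sketch below uses fixed sinusoidal encodings, which is one common scheme and only an assumption here, since the entry does not say which scheme is meant (learned position embeddings are another option). The function names and the toy embeddings are hypothetical.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table: even dimensions use sine, odd use cosine."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Paired dimensions (2k, 2k + 1) share one frequency.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    # Add the position vector to each token embedding, element by element.
    pe = sinusoidal_positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, pe)]

# Two identical token embeddings become distinguishable once positions are added.
embeddings = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
encoded = add_positions(embeddings)
```

Without this step, the same word at two different positions would look identical to the attention layers.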