What is Mixture of Experts?
Also known as: MoE
Mixture of Experts (MoE) is a neural network design that combines multiple specialized sub-networks called experts, using a gating or routing mechanism to activate only a subset of them for each input.
Instead of a single dense model processing every input, MoE splits the work across several expert networks. A lightweight router or gating network examines the input and decides which experts are most relevant, sending the data only to those experts.
During both training and inference, only the selected experts compute results while the rest stay inactive. This sparse activation keeps compute costs low even as the total number of parameters grows very large.
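The routing step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's implementation: the function name `moe_forward`, the weight names, the use of single linear layers as experts, and the choice of top-2 routing with a softmax over the selected logits are all assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, router_w, expert_ws, top_k=2):
    """Route each token to its top_k experts and mix their outputs.

    x:         (num_tokens, d_model) token representations
    router_w:  (d_model, num_experts) router weights (hypothetical name)
    expert_ws: list of (d_model, d_model) matrices, one toy linear expert each
    """
    logits = x @ router_w                               # (num_tokens, num_experts)
    top_idx = np.argsort(logits, axis=-1)[:, -top_k:]   # top_k expert indices per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    gates = softmax(top_logits)                         # weights over the chosen experts only

    out = np.zeros_like(x)
    calls = np.zeros(len(expert_ws), dtype=int)         # tokens processed by each expert
    for e, w in enumerate(expert_ws):
        token_ids, slot = np.nonzero(top_idx == e)      # tokens that selected expert e
        if token_ids.size == 0:
            continue                                    # an unselected expert does no work
        out[token_ids] += gates[token_ids, slot][:, None] * (x[token_ids] @ w)
        calls[e] = token_ids.size
    return out, top_idx, calls

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
num_tokens, d_model, num_experts = 4, 8, 8
x = rng.normal(size=(num_tokens, d_model))
router_w = rng.normal(size=(d_model, num_experts))
expert_ws = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]

out, top_idx, calls = moe_forward(x, router_w, expert_ws, top_k=2)
print(top_idx)  # the two experts each token was sent to
print(calls)    # how many tokens each expert processed
```

Each token triggers exactly `top_k` expert evaluations no matter how many experts exist, which is where the compute savings come from. Production systems typically add pieces this sketch leaves out, such as load-balancing losses and batched dispatch across devices.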
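The approach lets models scale to hundreds of billions of parameters without a matching rise in FLOPs, because each token or example touches only a small fraction of the full network. The savings are in compute rather than memory: every expert's weights must still be stored and loaded, even though only a few run for any given input.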
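The gap between total and active parameters is simple arithmetic. The sketch below uses made-up round numbers (2B shared parameters, 8 experts of 5B each, 2 experts active per token) that do not describe any real model; `moe_param_counts` is a hypothetical helper written for this example.

```python
def moe_param_counts(shared, per_expert, num_experts, top_k):
    """Return (total, active-per-token) parameter counts for a toy MoE model."""
    total = shared + num_experts * per_expert   # everything that must be stored
    active = shared + top_k * per_expert        # what a single token actually uses
    return total, active

# Illustrative numbers only, not taken from any real model.
total, active = moe_param_counts(
    shared=2_000_000_000, per_expert=5_000_000_000, num_experts=8, top_k=2
)
print(f"total:  {total / 1e9:.0f}B")                                   # total:  42B
print(f"active: {active / 1e9:.0f}B ({active / total:.0%} of total)")  # active: 12B (29% of total)
```

Adding more experts raises `total` but leaves `active` unchanged, so per-token compute stays roughly flat while capacity grows.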
Example
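Imagine a large language model with eight expert sub-networks: one strong at math, one at code, one at creative writing, and so on. When the router sees a coding question, it activates only the code and general-language experts, leaving the others idle. This picture is a simplification: in real MoE language models, routing is usually decided per token at each MoE layer, and the specialties that experts learn rarely line up with tidy human topics.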
Why it matters
MoE makes it possible to train and serve far larger, more capable models at a per-token compute cost comparable to that of a much smaller dense model, which is why recent high-performance open models such as Mixtral and Grok use this architecture.
Frequently asked questions
Is Mixture of Experts the same as ensemble learning?
No. In an ensemble, every model runs on every input and the predictions are combined afterward; in MoE, the router picks only a few experts per input, which keeps computation low.
Related terms
A Transformer is a neural network architecture that processes sequential data like text using self-attention to weigh relationships between all parts of the input at once.
Ensemble learning is a machine learning approach that combines predictions from multiple models to achieve better accuracy and robustness than any individual model.
The attention mechanism is a technique in neural networks that lets the model dynamically focus on the most relevant parts of the input when processing each element, rather than treating all inputs equally.
Context length is the maximum number of tokens an LLM can process in a single input at once, acting as its effective memory window.
A context window is the maximum number of tokens an LLM can process together in one pass, including the user's input and any conversation history.
A foundation model is a large-scale AI model trained on massive, diverse datasets that can be adapted to perform many different tasks with minimal additional training.