Do all experts get trained on every example?

Only the experts selected by the router receive gradients for that example, so each expert specializes while the router learns to pick the right ones.

Does MoE always improve speed?

It improves parameter efficiency and can increase speed when the number of active experts is kept small, but poorly tuned routing can add overhead.

What is Mixture of Experts?

Also known as: MoE

Mixture of Experts (MoE) is a neural network design that combines multiple specialized sub-networks called experts, using a gating or routing mechanism to activate only a subset of them for each input.

Instead of a single dense model processing every input, MoE splits the work across several expert networks. A lightweight router or gating network examines the input and decides which experts are most relevant, sending the data only to those experts.

During both training and inference, only the selected experts compute results while the rest stay inactive. This sparse activation keeps compute costs low even as the total number of parameters grows very large.

The approach lets models scale to hundreds of billions of parameters without a matching rise in FLOPs, because each token or example touches only a small fraction of the full network.

Example

Imagine a large language model with eight expert sub-networks: one strong at math, one at code, one at creative writing, etc. When the router sees a coding question, it activates only the code and general-language experts, leaving the others idle.

Why it matters

MoE enables training and serving far larger, more capable models at roughly the same inference cost as smaller dense models, which is why recent high-performance open models such as Mixtral and Grok use this architecture.

Frequently asked questions

No. In an ensemble every model runs on every input; in MoE only a few experts are chosen per input by the router, keeping computation low.

Related terms

Transformer

A Transformer is a neural network architecture that processes sequential data like text using self-attention to weigh relationships between all parts of the input at once.

Ensemble Learning

Ensemble learning is a machine learning approach that combines predictions from multiple models to achieve better accuracy and robustness than any individual model.

Attention Mechanism

The attention mechanism is a technique in neural networks that lets the model dynamically focus on the most relevant parts of the input when processing each element, rather than treating all inputs equally.

Context Length

Context length is the maximum number of tokens an LLM can process in a single input at once, acting as its effective memory window.

Context Window

A context window is the maximum number of tokens an LLM can process together in one pass, including the user's input and any conversation history.

Foundation Model

A foundation model is a large-scale AI model trained on massive, diverse datasets that can be adapted to perform many different tasks with minimal additional training.