What is Mixture of Experts?

Also known as: MoE

Mixture of Experts (MoE) is a neural network design that combines multiple specialized sub-networks called experts, using a gating or routing mechanism to activate only a subset of them for each input.

Instead of a single dense model processing every input, MoE splits the work across several expert networks. A lightweight router or gating network examines the input and decides which experts are most relevant, sending the data only to those experts.

During both training and inference, only the selected experts compute results while the rest stay inactive. This sparse activation keeps compute costs low even as the total number of parameters grows very large.
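The routing and sparse activation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the experts are hypothetical random linear maps, the names `make_expert`, `moe_forward`, and `router_weights` are invented for this example, and picking the top 2 experts with a softmax over their scores is one common choice among several.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(dim, rng):
    """Hypothetical expert: a single random linear map (dim -> dim)."""
    weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

    def expert(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

    return expert


def moe_forward(x, router_weights, experts, top_k=2):
    """Route one input vector to its top_k experts and mix their outputs."""
    # Router: one score per expert (a dot product with the input).
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]

    # Keep only the indices of the top_k highest-scoring experts.
    ranked = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:top_k]

    # Gate weights: softmax over the chosen experts' scores only.
    gates = softmax([scores[i] for i in chosen])

    # Sparse activation: only the chosen experts are ever called.
    output = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):
        y = experts[idx](x)
        output = [o + gate * yi for o, yi in zip(output, y)]
    return output, chosen, gates


rng = random.Random(0)  # fixed seed so the sketch is repeatable
dim, num_experts = 4, 8
experts = [make_expert(dim, rng) for _ in range(num_experts)]
router_weights = [
    [rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)
]

x = [0.5, -1.0, 0.25, 2.0]
output, chosen, gates = moe_forward(x, router_weights, experts, top_k=2)
print("experts used:", chosen)
print("gate weights:", gates)
```

With eight experts and `top_k=2`, six of the eight expert functions are never evaluated for this input, which is where the compute saving comes from. Production systems add details omitted here, such as batching tokens per expert and balancing load across experts.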

The approach lets models scale to hundreds of billions of parameters without a matching rise in FLOPs, because each token or example touches only a small fraction of the full network.
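A rough back-of-the-envelope calculation shows why total parameters and per-token compute come apart. The figures below are made up for illustration and do not describe any real model. The helper `moe_param_counts` is hypothetical, and it assumes a simplified split into shared parameters (used by every token) and expert parameters (used only when that expert is selected).

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, top_k):
    """Return (total, active) parameter counts under a simplified model.

    total  = everything that must be stored
    active = what a single token actually passes through
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active


# Illustrative numbers only: 2B shared, 8 experts of 10B each, 2 active.
total, active = moe_param_counts(
    shared_params=2_000_000_000,
    params_per_expert=10_000_000_000,
    num_experts=8,
    top_k=2,
)
print(f"total parameters:  {total:,}")
print(f"active per token:  {active:,}")
print(f"active fraction:   {active / total:.1%}")
```

Under these assumed numbers the model stores 82 billion parameters, but each token passes through only 22 billion of them. Doubling `num_experts` while keeping `top_k` fixed would raise the total without changing the active count.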

Example

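Imagine a large language model with eight expert sub-networks: one strong at math, one at code, one at creative writing, one at general language, and so on. When the router sees a coding question, it activates only the code and general-language experts, leaving the others idle. This is a simplified picture: in practice, routing usually happens per token rather than per question, and the specializations that experts learn rarely map onto such clean topics.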

Why it matters

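MoE makes it possible to train and serve far larger, more capable models at roughly the per-token compute cost of a much smaller dense model, which is why recent high-performance open models such as Mixtral and Grok use this architecture. The memory footprint does not shrink in the same way, because every expert's parameters must still be stored even though only a few are used per input.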

Frequently asked questions

Is Mixture of Experts the same as an ensemble?

No. In an ensemble, every model runs on every input. In MoE, the router chooses only a few experts per input, which keeps computation low.