What is Distillation?

Also known as: Knowledge Distillation

Knowledge distillation is a technique that transfers knowledge from a large, complex 'teacher' model to a smaller 'student' model so the student can achieve similar performance with far less compute and memory.

A large teacher model is first trained on the original task. Its output probabilities (soft labels) are then used, often with a temperature-scaled softmax, to train the smaller student model alongside the usual hard labels.
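The combined objective described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the weighting parameter alpha, and the default temperature of 4.0 are all hypothetical choices made for this example. The soft term is written as a KL divergence between the teacher's and student's temperature-softened distributions and is multiplied by the squared temperature, a common convention that keeps the two terms on a similar gradient scale.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-label term and a hard-label term for one example.

    alpha and temperature are illustrative hyperparameters (assumed values).
    """
    # Soft term: KL(teacher || student), both softened with the same temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * (math.log(t) - math.log(s))
               for t, s in zip(p_teacher, p_student))

    # Hard term: ordinary cross-entropy against the true class at temperature 1.
    hard = -math.log(softmax(student_logits)[hard_label])

    return alpha * (temperature ** 2) * soft + (1.0 - alpha) * hard


# Made-up logits for a 3-class problem; class 0 is the true label.
teacher = [5.0, 3.0, -2.0]
student = [2.0, 2.5, 0.0]
print(distillation_loss(student, teacher, hard_label=0))
```

Setting alpha to 1.0 trains on the teacher's soft labels only, while alpha of 0.0 reduces to ordinary hard-label training.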

The student learns not only the correct answers but also the teacher's uncertainty and inter-class relationships, allowing it to capture nuanced patterns that would be hard to learn from hard labels alone.
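To make the idea of inter-class relationships concrete, the short sketch below softens one set of teacher logits at two temperatures. The class names and logit values are invented for illustration, and the helper softmax function is an assumption of this example rather than part of any particular library.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits for an image of a cat.
classes = ["cat", "dog", "truck"]
teacher_logits = [5.0, 3.0, -2.0]

sharp = softmax(teacher_logits, temperature=1.0)     # close to a hard label
softened = softmax(teacher_logits, temperature=4.0)  # exposes the runner-up classes

for name, p1, p4 in zip(classes, sharp, softened):
    print(f"{name:>5}: T=1 -> {p1:.3f}   T=4 -> {p4:.3f}")
```

At temperature 1 almost all of the probability mass sits on "cat". At temperature 4 the ranking is unchanged, but "dog" receives a much larger share than "truck", which is the kind of similarity signal a hard label cannot carry.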

After training, only the compact student is deployed, delivering most of the teacher's accuracy at a fraction of the inference cost.

Example

A large ResNet-152 teacher is trained on ImageNet; its softened output distributions are used to train a tiny MobileNet student that runs efficiently on a smartphone while retaining most of the original accuracy.

Why it matters

Distillation is a key method for shrinking ever-larger foundation models into deployable sizes, cutting cloud costs and enabling on-device AI.

Frequently asked questions

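How is distillation different from training a small model from scratch?

Training from scratch uses only hard labels. Distillation adds the teacher's soft probabilities, which give the student richer information per example and usually yield better accuracy than the same student trained on hard labels alone.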