What is Distillation?
Also known as: Knowledge Distillation
Knowledge distillation is a technique that transfers knowledge from a large, complex 'teacher' model to a smaller 'student' model so the student can achieve similar performance with far less compute and memory.
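A large teacher model is first trained on the original task. Its output probabilities (soft labels) are then used to train the smaller student model alongside the usual hard labels. The teacher's outputs are often softened with a temperature-scaled softmax, where a temperature above 1 flattens the distribution so that the less likely classes carry more signal.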
The student learns not only the correct answers but also the teacher's uncertainty and inter-class relationships, allowing it to capture nuanced patterns that would be hard to learn from hard labels alone.
After training, only the compact student is deployed, delivering most of the teacher's accuracy at a fraction of the inference cost.
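The training objective described above can be sketched for a single example as a weighted sum of a soft-label term and a hard-label term. This is a minimal illustration in plain Python, not a reference implementation: the function names, the `temperature` and `alpha` defaults, and the sample logits are all assumptions chosen for the example, and the temperature-squared scaling is a common convention rather than something stated in this entry.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=4.0, alpha=0.5):
    """Loss for one example: alpha * soft-label term + (1 - alpha) * hard-label term.

    temperature and alpha are illustrative hyperparameters (assumed values).
    """
    # Soft-label term: KL divergence between the teacher's and the student's
    # distributions, both softened with the same temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)

    # Hard-label term: ordinary cross-entropy against the true class at temperature 1.
    ce = -math.log(softmax(student_logits)[hard_label])

    # Multiplying the soft term by temperature**2 is a common convention that keeps
    # its gradient scale comparable when the temperature changes.
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce

# Hypothetical logits for a three-class problem; class 0 is the true label.
teacher = [6.0, 2.0, 1.0]
student = [3.0, 1.5, 0.5]
loss = distillation_loss(student, teacher, hard_label=0)
```

In a real training loop this quantity would be averaged over a batch and minimized with respect to the student's parameters only, with the teacher kept frozen.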
Example
A large ResNet-152 teacher is trained on ImageNet; its softened output distributions are used to train a tiny MobileNet student that runs efficiently on a smartphone while retaining most of the original accuracy.
Why it matters
Distillation is a key method for shrinking ever-larger foundation models to deployable sizes, cutting cloud inference costs and enabling on-device AI.
Frequently asked questions
Compared with training the small model from scratch: training from scratch uses only hard labels, whereas distillation adds the teacher's soft probabilities, giving the student richer information and usually better results.
Related terms
Quantization is a model optimization technique that lowers the numerical precision of weights and activations, usually converting 32-bit floats to 8-bit integers or similar lower-bit formats.
Transfer learning is a machine learning method that reuses a model trained on one task as the starting point for a different but related task.
A neural network, or artificial neural network (ANN), is a computational model inspired by the human brain that learns to recognize patterns in data by passing information through layers of interconnected artificial neurons.
An API (Application Programming Interface) is a standardized set of rules that lets software applications request services or data from each other. In AI infrastructure, it typically means exposing machine learning models as callable endpoints for inference or training.
CUDA is NVIDIA's platform and programming model that lets developers run general-purpose computations on NVIDIA GPUs instead of just CPUs.
Edge AI runs AI models directly on local devices such as phones, cameras, or sensors instead of sending data to remote cloud servers.