What is the 'temperature' parameter in distillation?

Temperature is a scalar that divides the teacher's logits before the softmax. Values above 1 soften the resulting probability distribution, revealing more about which classes the teacher finds similar and helping the student learn those relationships.
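
As a rough illustration of that softening effect, the sketch below applies a temperature-scaled softmax to a made-up set of teacher logits. The function name, the three class names, and the logit values are assumptions chosen for the example, not part of any particular library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for three classes: cat, dog, truck
teacher_logits = [6.0, 4.0, -2.0]

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

print([round(p, 3) for p in sharp])  # roughly [0.881, 0.119, 0.0]
print([round(p, 3) for p in soft])   # roughly [0.574, 0.348, 0.078]
```

At temperature 1 almost all of the probability sits on "cat". At temperature 4 the same logits show that the teacher considers "dog" a far more plausible alternative than "truck", which is the kind of inter-class information the student can learn from.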

Can distillation be used with any model architecture?

Yes, the teacher and student can have completely different architectures, as long as both produce outputs over the same set of classes (or tokens) so the student can be trained to match the teacher's output distribution.

What is Distillation?

Also known as: Knowledge Distillation

Knowledge distillation is a technique that transfers knowledge from a large, complex 'teacher' model to a smaller 'student' model so the student can achieve similar performance with far less compute and memory.

A large teacher model is first trained on the original task. Its output probabilities (soft labels) are then used, often with a temperature-scaled softmax, to train the smaller student model alongside the usual hard labels.
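
One common way to combine the two training signals is a weighted sum of a soft-label term and a hard-label term. The sketch below computes such a loss for a single example in plain Python. The function names, the `alpha` weighting, and the `temperature ** 2` scaling of the soft term are assumptions based on a widely used formulation, not a definitive implementation; real training code would use a deep learning framework and operate on batches.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-label KL term and a hard-label cross-entropy term.

    alpha and temperature are illustrative hyperparameters (assumptions).
    """
    # Soft term: KL(teacher || student), both softened with the same temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)
    kl = sum(p * (math.log(p) - math.log(q))
             for p, q in zip(p_teacher, p_student_soft) if p > 0)

    # Hard term: ordinary cross-entropy against the true class index (temperature 1).
    p_student = softmax(student_logits, 1.0)
    ce = -math.log(p_student[hard_label])

    # Scaling the soft term by temperature**2 keeps its gradient magnitude
    # comparable to the hard term as the temperature changes.
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce

# Hypothetical logits for one training example whose true class is index 0.
teacher_logits = [6.0, 4.0, -2.0]
close_student = [5.5, 3.8, -1.5]
far_student = [0.1, 1.0, 2.0]

loss_close = distillation_loss(close_student, teacher_logits, hard_label=0)
loss_far = distillation_loss(far_student, teacher_logits, hard_label=0)
print(loss_close < loss_far)  # True: the closer student is penalized less
```

A student whose logits already resemble the teacher's gets a small loss, while one that ranks the classes in the opposite order is penalized by both terms.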

The student learns not only the correct answers but also the teacher's uncertainty and inter-class relationships, allowing it to capture nuanced patterns that would be hard to learn from hard labels alone.

After training, only the compact student is deployed, delivering most of the teacher's accuracy at a fraction of the inference cost.

Example

A large ResNet-152 teacher is trained on ImageNet; its softened output distributions are then used to train a much smaller MobileNet student that runs efficiently on a smartphone while retaining most of the teacher's accuracy.

Why it matters

Distillation is a key method for shrinking ever-larger foundation models into deployable sizes, cutting cloud costs and enabling on-device AI.

Frequently asked questions

Compared with training a small model from scratch, which uses only hard labels, distillation adds the teacher's soft probabilities, giving the student a richer training signal and usually better results.

Related terms

Quantization

Quantization is a model optimization technique that lowers the numerical precision of weights and activations, usually converting 32-bit floats to 8-bit integers or similar lower-bit formats.

Transfer Learning

Transfer learning is a machine learning method that reuses a model trained on one task as the starting point for a different but related task.

Neural Network

A neural network, or artificial neural network (ANN), is a computational model inspired by the human brain that learns to recognize patterns in data by passing information through layers of interconnected artificial neurons.

API

An API (Application Programming Interface) is a standardized set of rules that lets software applications request services or data from each other. In AI infrastructure, it typically means exposing machine learning models as callable endpoints for inference or training.

CUDA

CUDA is NVIDIA's platform and programming model that lets developers run general-purpose computations on NVIDIA GPUs instead of just CPUs.

Edge AI

Edge AI runs AI models directly on local devices such as phones, cameras, or sensors instead of sending data to remote cloud servers.