What is Inference?
Inference is the stage where a trained machine learning model is used to generate predictions or outputs on new, unseen data. In infrastructure contexts, the term usually refers to the work of deploying and serving models efficiently in production.
After a model is trained on data to learn patterns, inference applies those learned patterns to fresh inputs without further updating the model's weights. This phase emphasizes speed, cost, and reliability rather than learning.
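As a minimal sketch of the idea above, the snippet below runs a forward pass with fixed parameters. The two-feature linear model, its weight values, and the `predict` function are hypothetical stand-ins for a real trained model, chosen only to show that inference reads the weights and never changes them.

```python
# Hypothetical two-feature linear model. The weights stand in for values
# learned during training; nothing below modifies them.
WEIGHTS = (0.8, -0.5)
BIAS = 0.1


def predict(features):
    """Forward pass only: compute a score and threshold it into a label."""
    score = sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS
    return "positive" if score > 0 else "negative"


weights_before = WEIGHTS
new_inputs = [(1.0, 0.2), (0.1, 0.9)]  # fresh data the model has not seen
labels = [predict(x) for x in new_inputs]
```

There is no loss function, gradient, or optimizer step here, which is the practical difference from a training loop.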
In ML infrastructure, inference often involves model serving systems that expose predictions via APIs, handle scaling, and optimize hardware usage such as GPUs or specialized accelerators. Techniques like batching, caching, and model compression help meet latency and throughput requirements.
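Two of the techniques named above, batching and caching, can be illustrated with a small sketch. Everything here is assumed for illustration: `model_forward` is a toy function standing in for a real model call, and `predict_batched` and `predict_cached` are hypothetical helpers, not the API of any particular serving system.

```python
from functools import lru_cache

# Counts forward passes so the effect of batching and caching is visible.
stats = {"model_calls": 0}


def model_forward(batch):
    """Hypothetical model: one call scores a whole batch of inputs."""
    stats["model_calls"] += 1
    return [2 * x + 1 for x in batch]


def predict_batched(inputs, max_batch_size=4):
    """Group inputs into batches and run one forward pass per batch."""
    outputs = []
    for start in range(0, len(inputs), max_batch_size):
        outputs.extend(model_forward(inputs[start:start + max_batch_size]))
    return outputs


@lru_cache(maxsize=1024)
def predict_cached(x):
    """Cache results so a repeated input skips the model entirely."""
    return model_forward([x])[0]


results = predict_batched(list(range(10)))  # 10 inputs, 3 forward passes
calls_after_batching = stats["model_calls"]

predict_cached(7)  # first request calls the model
predict_cached(7)  # repeat request is served from the cache
calls_after_caching = stats["model_calls"]
```

Production servers typically also form batches dynamically from concurrent requests, waiting a short time for a batch to fill; this sketch leaves that out and only shows how fewer, larger calls reduce per-request overhead.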
Inference workloads can be online (real-time requests) or offline (batch processing), each with different infrastructure trade-offs around resource allocation and monitoring.
Example
A mobile app uses a trained image-classification model to instantly label photos taken by users; the model runs inference on each new photo to return labels like 'cat' or 'beach' without retraining.
Why it matters
Inference is where models deliver real value to users and businesses, so its efficiency directly affects application responsiveness, operational costs, and the ability to scale AI services.
Frequently asked questions
How does inference differ from training?
Training teaches the model by adjusting its weights on data, while inference uses the fixed, trained model to make predictions on new data.
Related terms
Training is the process of feeding data into a machine learning model so it can learn patterns and adjust its internal parameters to make accurate predictions.
Quantization is a model optimization technique that lowers the numerical precision of weights and activations, usually converting 32-bit floats to 8-bit integers or similar lower-bit formats.
Throughput measures how much work an AI system completes in a given time, such as the number of model inferences or training examples processed per second.
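To make the quantization entry above concrete, here is a rough sketch of symmetric 8-bit quantization in plain Python. The function names and sample weights are hypothetical, and real toolkits use more elaborate schemes (per-channel scales, zero points, calibration data); this only shows the core float-to-integer mapping and the rounding error it introduces.

```python
def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.05, 0.9]  # illustrative 32-bit float weights
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# Rounding to the nearest integer bounds the error by half a step (scale / 2).
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored as a small integer plus one shared scale factor, which is where the memory and bandwidth savings at inference time come from.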