What is Inference?

Inference is the stage where a trained machine learning model is used to generate predictions or outputs on new, unseen data. In infrastructure contexts, the term usually refers to the work of deploying and serving models efficiently in production.

After a model is trained on data to learn patterns, inference applies those learned patterns to fresh inputs without further updating the model's weights. This phase emphasizes speed, cost, and reliability rather than learning.
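The split between the two phases can be sketched in a few lines of Python. This is a minimal illustration, not a production pattern: the `LinearModel` class, the `train` and `predict` functions, and the toy data are all hypothetical names and values chosen for the example. Training adjusts the weights in a loop; inference only reads them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearModel:
    """Weights produced by training; frozen so inference cannot modify them."""
    weight: float
    bias: float


def train(xs, ys, lr=0.05, epochs=2000):
    """Training: repeatedly adjust weight and bias to reduce mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return LinearModel(weight=w, bias=b)


def predict(model, x):
    """Inference: a forward pass only; the weights are read, never updated."""
    return model.weight * x + model.bias


# Toy training data that follows y = 2x + 1 (an assumption for illustration).
model = train([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

# Inference on an input the model never saw during training.
prediction = predict(model, 10.0)
```

Because `predict` contains no gradient computation and no parameter update, it is much cheaper per call than a training step, which is why this phase can be tuned for speed and cost.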

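In ML infrastructure, inference often involves model serving systems that expose predictions via APIs, handle scaling, and make efficient use of hardware such as GPUs or specialized accelerators. Several techniques help meet latency and throughput requirements: batching groups multiple requests into a single model call, caching reuses results for repeated inputs, and model compression (for example, quantization or pruning) reduces the model's size and compute cost.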
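Two of the techniques mentioned above, batching and caching, can be sketched with the standard library alone. Everything here is an assumption made for illustration: `WEIGHTS` stands in for a trained model, and `predict_batch`, `chunked`, and `predict_cached` are hypothetical helpers, not part of any serving framework.

```python
from functools import lru_cache

WEIGHTS = (0.5, -0.25, 1.0)  # hypothetical frozen weights standing in for a model


def predict_batch(batch):
    """Score many inputs in one call; on real hardware this amortizes per-call overhead."""
    return [sum(w * x for w, x in zip(WEIGHTS, features)) for features in batch]


def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


@lru_cache(maxsize=1024)
def predict_cached(features):
    """Reuse the result for a repeated input; features must be hashable (a tuple)."""
    return predict_batch([features])[0]


requests = [
    (1.0, 2.0, 3.0),
    (0.0, 0.0, 1.0),
    (1.0, 2.0, 3.0),  # repeat of the first request
    (2.0, 0.0, 0.0),
    (4.0, 4.0, 4.0),
]

# Batching: group the requests and call the model once per group.
batched_results = []
for batch in chunked(requests, batch_size=2):
    batched_results.extend(predict_batch(batch))

# Caching: the repeated request is answered without calling the model again.
cached_results = [predict_cached(r) for r in requests]
cache_stats = predict_cached.cache_info()
```

In a real serving system the batch would typically be assembled from concurrent requests within a short time window, and the cache would need an eviction and invalidation policy tied to model versions. Both are omitted from this sketch.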

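Inference workloads can be online (real-time requests) or offline (batch processing), and the two carry different infrastructure trade-offs. Online serving prioritizes low latency per request and needs enough capacity to absorb traffic spikes, while offline processing prioritizes total throughput and can be scheduled when resources are cheaper or idle. The two modes also differ in what is monitored: typically response times and error rates for online serving, and job completion and overall run time for batch jobs.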
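The same model can back both modes. The sketch below is a simplified illustration under stated assumptions: the scoring rule inside `predict`, the field names `clicks` and `bounces`, and the functions `handle_request` and `run_batch_job` are all hypothetical.

```python
def predict(features):
    """Hypothetical fixed model: a weighted score compared against a threshold."""
    score = 0.8 * features["clicks"] - 0.3 * features["bounces"]
    return "engaged" if score > 1.0 else "not_engaged"


def handle_request(payload):
    """Online: one request in, one prediction out; latency per call matters most."""
    return {"user_id": payload["user_id"], "label": predict(payload)}


def run_batch_job(records, chunk_size=2):
    """Offline: sweep a stored dataset in chunks; total throughput matters most."""
    labels = {}
    for start in range(0, len(records), chunk_size):
        for record in records[start:start + chunk_size]:
            labels[record["user_id"]] = predict(record)
    return labels


records = [
    {"user_id": "u1", "clicks": 3, "bounces": 1},
    {"user_id": "u2", "clicks": 1, "bounces": 2},
    {"user_id": "u3", "clicks": 2, "bounces": 0},
]

online_response = handle_request(records[0])  # answers a single live request
offline_labels = run_batch_job(records)       # labels the whole dataset in one job
```

The model code is identical in both paths; what changes is the surrounding infrastructure, such as an always-on API endpoint for the online path versus a scheduled job for the offline one.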

Example

A mobile app uses a trained image-classification model to instantly label photos taken by users; the model runs inference on each new photo to return labels like 'cat' or 'beach' without retraining.
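The final step of such a classifier, turning raw model scores into a label, might look like the following sketch. The label set and the example scores are made up for illustration; in a real app the scores would come from running the trained model on the photo's pixels, which is not shown here.

```python
import math

LABELS = ["cat", "dog", "beach"]  # hypothetical label set


def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, labels=LABELS):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Stand-in scores for one photo (an assumption; a real model would produce these).
label, confidence = top_label([2.0, 0.5, -1.0])
```

Here `label` is the class with the highest score and `confidence` is its softmax probability, which the app could use to decide whether to show the label at all.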

Why it matters

Inference is where models deliver real value to users and businesses, so its efficiency directly affects application responsiveness, operational costs, and the ability to scale AI services.

Frequently asked questions

What is the difference between training and inference?

Training teaches the model by adjusting its weights to fit the training data, while inference uses the fixed, trained model to make predictions on new data.