What is Model Serving?
Model serving is the process of deploying a trained ML model into production so that it can receive input data and return predictions through an API or service.
It involves loading the saved model artifacts into a runtime environment, exposing an endpoint (often REST or gRPC), and handling incoming inference requests efficiently while managing resources like CPU/GPU.
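As a minimal sketch of these steps, the Python snippet below loads a toy model from a JSON artifact, exposes a `/predict` endpoint over HTTP, and handles inference requests. The artifact format, the `LinearModel` class, the `/predict` path, and port 8080 are all assumptions made for illustration. A production system would typically use a dedicated serving framework rather than the standard-library server shown here.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical artifact: a real system would read this from disk or a model store.
ARTIFACT = json.dumps({"version": "v1", "weights": [0.5, -0.25], "bias": 0.1})


class LinearModel:
    """Toy stand-in for a trained model restored from saved artifacts."""

    def __init__(self, artifact_json: str):
        data = json.loads(artifact_json)
        self.version = data["version"]
        self.weights = data["weights"]
        self.bias = data["bias"]

    def predict(self, features):
        if len(features) != len(self.weights):
            raise ValueError("expected %d features" % len(self.weights))
        return sum(w * x for w, x in zip(self.weights, features)) + self.bias


# Load once at startup and reuse for every request.
MODEL = LinearModel(ARTIFACT)


def handle_inference(body: bytes):
    """Parse a JSON request body and return (status_code, response_dict)."""
    try:
        features = json.loads(body)["features"]
        prediction = MODEL.predict(features)
        return 200, {"model_version": MODEL.version, "prediction": prediction}
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON, a missing field, or wrongly shaped input.
        return 400, {"error": str(exc)}


class PredictHandler(BaseHTTPRequestHandler):
    """Exposes the model as a REST-style endpoint at POST /predict."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_inference(self.rfile.read(length))
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def make_server(port: int = 8080) -> HTTPServer:
    """Build the server; call .serve_forever() on the result to start serving."""
    return HTTPServer(("127.0.0.1", port), PredictHandler)
```

Calling `make_server().serve_forever()` starts the endpoint. Keeping the request logic in `handle_inference` separates it from the HTTP layer, so it can be exercised without opening a socket.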
Key concerns include minimizing latency, maximizing throughput, versioning models, auto-scaling with traffic, and monitoring metrics such as prediction accuracy and system health.
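Two of these concerns, model versioning and latency monitoring, can be sketched in a few lines of Python. The `ModelRegistry` class and its method names below are hypothetical and not taken from any particular serving framework. Auto-scaling and accuracy monitoring are out of scope for this sketch.

```python
import time
from statistics import quantiles
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Model = Callable[[Sequence[float]], float]


class ModelRegistry:
    """Holds several model versions and records per-request latency."""

    def __init__(self) -> None:
        self._models: Dict[str, Model] = {}
        self._default: Optional[str] = None
        self.latencies_ms: Dict[str, List[float]] = {}

    def register(self, version: str, model: Model, make_default: bool = False) -> None:
        self._models[version] = model
        self.latencies_ms.setdefault(version, [])
        # The first registered version becomes the default unless overridden.
        if make_default or self._default is None:
            self._default = version

    def predict(self, features: Sequence[float], version: Optional[str] = None) -> Tuple[str, float]:
        chosen = version or self._default
        if chosen not in self._models:
            raise KeyError(f"unknown model version: {chosen!r}")
        start = time.perf_counter()
        result = self._models[chosen](features)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        self.latencies_ms[chosen].append(elapsed_ms)
        return chosen, result

    def p95_latency_ms(self, version: str) -> float:
        samples = self.latencies_ms[version]
        if len(samples) < 2:
            return samples[0] if samples else 0.0
        # With n=20, the last of the 19 cut points is the 95th percentile.
        return quantiles(samples, n=20)[-1]


# Usage: serve v2 by default while keeping v1 available for pinning or rollback.
registry = ModelRegistry()
registry.register("v1", lambda xs: sum(xs))
registry.register("v2", lambda xs: sum(xs) / len(xs), make_default=True)
served_by, value = registry.predict([1.0, 3.0])
```

Returning the version alongside each prediction lets callers and logs attribute every result to the model that produced it, which helps when comparing versions or rolling one back.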
Modern serving often uses specialized frameworks or containers to isolate the model from training code and integrate with orchestration tools for reliability.
Example
A retail app loads its trained recommendation model into a serving system. When a user browses products, the app sends that user's data to the model's API endpoint and receives personalized suggestions in near real time.
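From the app's side, that call might look roughly like the Python sketch below. The URL, the request fields (`user_id`, `recently_viewed`), and the `product_ids` response field are all hypothetical; a real recommendation service would define its own contract.

```python
import json
import urllib.request

# Hypothetical endpoint; the example above does not name a real URL.
RECOMMENDER_URL = "http://localhost:8080/recommendations"


def build_recommendation_request(user_id: str, recently_viewed: list) -> urllib.request.Request:
    """Package the user's browsing context as a JSON POST request."""
    payload = json.dumps(
        {"user_id": user_id, "recently_viewed": recently_viewed}
    ).encode("utf-8")
    return urllib.request.Request(
        RECOMMENDER_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_recommendations(user_id: str, recently_viewed: list, timeout_s: float = 0.5) -> list:
    """Send the request and return the suggested product IDs.

    The short timeout reflects the low-latency expectation on serving:
    the app should fall back to a default rather than keep the user waiting.
    """
    request = build_recommendation_request(user_id, recently_viewed)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read())["product_ids"]
```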
Why it matters
Without reliable model serving, even highly accurate models remain unusable in real applications. Serving is the bridge between ML research and production impact.
Frequently asked questions
How does model serving differ from training?
Training builds the model from data, while serving runs the finished model to generate predictions on new inputs.
Related terms
Inference is the stage where a trained machine learning model is used to generate predictions or outputs on new, unseen data. In infrastructure contexts, it focuses on efficiently deploying and serving models in production.
MLOps is the practice of combining machine learning, DevOps, and data engineering to reliably build, deploy, and maintain ML models in production.
An API (Application Programming Interface) is a standardized set of rules that lets software applications request services or data from each other. In AI infrastructure, it typically means exposing machine learning models as callable endpoints for inference or training.
CUDA is NVIDIA's platform and programming model that lets developers run general-purpose computations on NVIDIA GPUs instead of just CPUs.
Knowledge distillation is a technique that transfers knowledge from a large, complex 'teacher' model to a smaller 'student' model so the student can achieve similar performance with far less compute and memory.
Edge AI runs AI models directly on local devices such as phones, cameras, or sensors instead of sending data to remote cloud servers.