What is an Endpoint?
In AI/ML infrastructure, an endpoint is a deployed URL or network address that exposes a trained model so applications can send data and receive predictions via API calls.
An endpoint is created when a model is packaged and hosted on a server or cloud platform, turning the static model file into a live service that listens for requests.
When a request arrives (usually JSON over HTTP), the endpoint passes the input to the model, which is typically already loaded in memory rather than reloaded per request, runs inference, and returns the output. Most deployments also wrap this core loop with authentication, logging, and scaling.
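The request flow described above can be sketched with Python's standard library alone. This is a minimal illustration, not a production server: the `/v1/predict` path, the `features` field, and the toy `load_model` function are all assumptions made for the example, and real deployments usually rely on a dedicated serving framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def load_model():
    """Stand-in for loading a trained model file (hypothetical toy model)."""
    def predict(features):
        # Average the inputs and threshold at 0.5; a real model would go here.
        score = sum(features) / len(features) if features else 0.0
        return {"label": "positive" if score >= 0.5 else "negative", "score": score}
    return predict


MODEL = load_model()  # loaded once at startup, not on every request


def handle_payload(body: bytes):
    """Parse a JSON request body, run inference, return (status, response dict)."""
    try:
        payload = json.loads(body)
        features = [float(x) for x in payload["features"]]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected JSON with a numeric 'features' list"}
    return 200, MODEL(features)


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/predict":  # assumed route for this sketch
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, result = handle_payload(self.rfile.read(length))
        data = json.dumps(result).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8000):
    """Start the endpoint on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the service locally; a POST to `http://127.0.0.1:8000/v1/predict` with a body such as `{"features": [0.2, 0.9]}` would return a JSON prediction. Authentication, logging, and scaling are deliberately left out to keep the core loop visible.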
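Because callers depend only on the endpoint's address and request format, the model behind it can be swapped or split across versions. Endpoints commonly support versioning, A/B testing, and monitoring so teams can update models without breaking downstream apps.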
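One way an endpoint might split traffic between model versions for an A/B test is deterministic, hash-based routing. The sketch below is an assumption-laden illustration: the version names, the 90/10 split, and the toy models are hypothetical, and managed platforms usually provide this as a configuration option rather than hand-written code.

```python
import hashlib

# Hypothetical model versions behind one endpoint; each is a toy function here.
MODELS = {
    "v1": lambda x: {"version": "v1", "prediction": x * 2},
    "v2": lambda x: {"version": "v2", "prediction": x * 2 + 1},
}

# Share of traffic per version (should sum to 1.0); values are illustrative.
TRAFFIC_SPLIT = [("v1", 0.9), ("v2", 0.1)]


def pick_version(user_id: str) -> str:
    """Map a user id to a version deterministically, so a user keeps the same model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1]
    cumulative = 0.0
    for version, share in TRAFFIC_SPLIT:
        cumulative += share
        if bucket < cumulative:
            return version
    return TRAFFIC_SPLIT[-1][0]  # fallback for floating-point rounding at the top edge


def predict(user_id: str, x: float) -> dict:
    """Route the request to the chosen version and tag the response with it."""
    return MODELS[pick_version(user_id)](x)
```

Hashing the user id instead of picking at random keeps each user on one variant across requests, and returning the version in the response makes it possible to attribute monitoring metrics to the model that produced them.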
Example
A mobile app sends a photo to https://api.company.com/v1/classify; the endpoint runs an image-classification model on it and replies with the predicted labels and their confidence scores.
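From the client's side, the call in this example might look like the following sketch. Only the URL comes from the example; the JSON body with a base64-encoded `image` field, the bearer-token header, and the shape of the response are assumptions, since the actual contract depends on how the endpoint was built.

```python
import base64
import json
import urllib.request

ENDPOINT_URL = "https://api.company.com/v1/classify"  # URL from the example above


def build_request(image_bytes: bytes, api_key: str) -> urllib.request.Request:
    """Package a photo as a JSON POST request (assumed request format)."""
    body = json.dumps({"image": base64.b64encode(image_bytes).decode("ascii")})
    return urllib.request.Request(
        ENDPOINT_URL,
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


def classify(image_bytes: bytes, api_key: str, timeout: float = 10.0) -> dict:
    """Send the photo to the endpoint and return the parsed JSON response."""
    request = build_request(image_bytes, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Under these assumptions, `classify(photo_bytes, api_key)` would return whatever JSON the endpoint sends back, for instance a list of labels with confidence scores.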
Why it matters
Endpoints are the bridge that turns trained models into usable products, enabling real-time inference at scale in production systems.
Frequently asked questions
Is an endpoint the same as a model?
No. The model is the trained file; the endpoint is the running service that makes the model accessible over the network.
Related terms
Model serving is the infrastructure process of deploying a trained ML model into production so it can receive data and return predictions via an API or service.
Inference is the stage where a trained machine learning model is used to generate predictions or outputs on new, unseen data. In infrastructure contexts, it focuses on efficiently deploying and serving models in production.
An API (Application Programming Interface) is a standardized set of rules that lets software applications request services or data from each other. In AI infrastructure, it typically means exposing machine learning models as callable endpoints for inference or training.
CUDA is NVIDIA's platform and programming model that lets developers run general-purpose computations on NVIDIA GPUs instead of just CPUs.
Knowledge distillation is a technique that transfers knowledge from a large, complex 'teacher' model to a smaller 'student' model so the student can achieve similar performance with far less compute and memory.
Edge AI runs AI models directly on local devices such as phones, cameras, or sensors instead of sending data to remote cloud servers.