How do endpoints handle many users at once?

Serving platforms run several instances of the endpoint behind a load balancer, and autoscaling (often through container orchestration) adds instances as traffic increases.
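As a rough illustration of the scaling decision, the sketch below estimates how many instances to run from the current request rate and then hands requests out in rotation. The function names, the per-instance capacity figure, and the minimum and maximum bounds are assumptions made for this example, not any platform's actual API.

```python
import math
from itertools import cycle


def desired_instances(requests_per_sec: float,
                      capacity_per_instance: float,
                      min_instances: int = 1,
                      max_instances: int = 10) -> int:
    """Instances needed to serve the load, clamped to [min_instances, max_instances]."""
    needed = math.ceil(requests_per_sec / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))


def round_robin(instances):
    """Return an iterator that yields instances in rotation (a very simple load balancer)."""
    return cycle(instances)


if __name__ == "__main__":
    # Hypothetical numbers: 450 requests/s, each instance handles about 100 requests/s.
    count = desired_instances(requests_per_sec=450, capacity_per_instance=100)
    balancer = round_robin([f"instance-{i}" for i in range(count)])
    print(count, [next(balancer) for _ in range(6)])
```

Real autoscalers usually act on measured metrics such as CPU, queue depth, or latency rather than a fixed capacity figure, and they add cooldown periods to avoid scaling up and down too quickly.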

Can I update a model without changing the endpoint URL?

Yes, most platforms let you deploy a new model version behind the same endpoint while keeping the URL and API contract stable.
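One way to picture this is a small in-memory sketch in which the URL is fixed and only the "live" version pointer changes. The `Endpoint` class, its method names, and the toy models are hypothetical and exist only for illustration; managed platforms expose this idea through their own deployment APIs.

```python
from typing import Callable, Dict, Optional


class Endpoint:
    """Hypothetical endpoint: a fixed URL that routes to whichever model version is live."""

    def __init__(self, url: str) -> None:
        self.url = url
        self._versions: Dict[str, Callable[[dict], dict]] = {}
        self._live: Optional[str] = None

    def deploy(self, version: str, model: Callable[[dict], dict]) -> None:
        """Register a model version without sending traffic to it yet."""
        self._versions[version] = model

    def promote(self, version: str) -> None:
        """Point the endpoint at an already deployed version."""
        if version not in self._versions:
            raise KeyError(f"unknown version: {version}")
        self._live = version

    def predict(self, payload: dict) -> dict:
        """Run the live version; the response shape stays the same across versions."""
        if self._live is None:
            raise RuntimeError("no version has been promoted")
        result = self._versions[self._live](payload)
        return {"model_version": self._live, **result}


if __name__ == "__main__":
    endpoint = Endpoint("https://api.company.com/v1/classify")
    endpoint.deploy("v1", lambda payload: {"label": "cat"})
    endpoint.promote("v1")
    print(endpoint.url, endpoint.predict({}))

    endpoint.deploy("v2", lambda payload: {"label": "tabby cat"})
    endpoint.promote("v2")  # callers keep using the same URL
    print(endpoint.url, endpoint.predict({}))
```

Because callers only ever see the URL and the response shape, rolling back is just another `promote` call to the earlier version.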

What is an endpoint?

In AI/ML infrastructure, an endpoint is a deployed URL or network address that exposes a trained model so applications can send data and receive predictions via API calls.

An endpoint is created when a model is packaged and hosted on a server or cloud platform, turning the static model file into a live service that listens for requests.

When a request arrives (usually JSON over HTTP), the endpoint passes the input to the model, which is typically already loaded in memory, runs inference, and returns the output. Production endpoints usually add features such as authentication, logging, and scaling.
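The request flow described above can be sketched as a single handler function. This is a minimal sketch, not a production server: the `API_KEYS` set, the `features` and `prediction` field names, and the toy averaging "model" are all assumptions chosen to keep the example self-contained, and the model is loaded once at startup rather than on every request.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("endpoint")

API_KEYS = {"demo-key"}  # hypothetical credential store


def load_model():
    """Stand-in for loading trained weights; returns a toy model that averages its inputs."""
    def model(features):
        return sum(features) / len(features)
    return model


MODEL = load_model()  # loaded once at startup and reused for every request


def handle_request(api_key: str, body: bytes):
    """Authenticate, parse the JSON body, run inference, and return (status, JSON body)."""
    if api_key not in API_KEYS:
        return 401, json.dumps({"error": "unauthorized"})
    try:
        features = json.loads(body)["features"]
        prediction = MODEL(features)
    except (ValueError, KeyError, TypeError, ZeroDivisionError):
        # Malformed JSON, a missing field, non-numeric input, or an empty list.
        return 400, json.dumps({"error": "bad request"})
    log.info("served prediction for %d features", len(features))
    return 200, json.dumps({"prediction": prediction})


if __name__ == "__main__":
    print(handle_request("demo-key", b'{"features": [1, 2, 3]}'))
    print(handle_request("wrong-key", b'{"features": [1, 2, 3]}'))
```

In practice a web framework or a managed serving platform would sit in front of a handler like this and take care of the HTTP layer.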

Endpoints support versioning, A/B testing, and monitoring so teams can update models without breaking downstream apps.
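A/B testing behind an endpoint usually comes down to splitting traffic between two model versions. The sketch below shows one common approach, hashing a user ID so that each user consistently lands on the same variant. The function name, the variant labels, and the 10% default share are assumptions for illustration.

```python
import hashlib


def assign_variant(user_id: str, b_share: float = 0.1) -> str:
    """Deterministically map a user to variant 'A' or 'B'.

    b_share is the fraction of users (0.0 to 1.0) routed to variant 'B'.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "B" if bucket < b_share else "A"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    share_b = sum(assign_variant(u) == "B" for u in users) / len(users)
    print(f"share routed to B: {share_b:.3f}")
```

Hashing rather than choosing at random per request keeps a given user's experience stable, which makes the two variants easier to compare.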

Example

A mobile app sends a photo to https://api.company.com/v1/classify; the endpoint runs an image-classification model and replies with the predicted labels and their confidence scores.
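From the client's side, the exchange might look like the sketch below, written with Python's standard library. The bearer-token header, the JPEG content type, and the `predictions` response shape are assumptions; the actual contract depends on how the endpoint is defined. The request is only built here, not sent, since the URL is illustrative.

```python
import json
import urllib.request

ENDPOINT_URL = "https://api.company.com/v1/classify"  # URL from the example above


def build_request(image_bytes: bytes, token: str) -> urllib.request.Request:
    """Build a POST request carrying the raw image bytes (assumed request format)."""
    return urllib.request.Request(
        ENDPOINT_URL,
        data=image_bytes,
        method="POST",
        headers={
            "Content-Type": "image/jpeg",
            "Authorization": f"Bearer {token}",
        },
    )


def top_label(response_body: str):
    """Return the (label, confidence) pair with the highest confidence.

    Assumes a response like {"predictions": [{"label": ..., "confidence": ...}, ...]}.
    """
    predictions = json.loads(response_body)["predictions"]
    best = max(predictions, key=lambda p: p["confidence"])
    return best["label"], best["confidence"]


if __name__ == "__main__":
    request = build_request(b"\xff\xd8 placeholder image bytes", token="example-token")
    print(request.get_method(), request.full_url)

    # Hypothetical response body, used here instead of a real network call.
    sample = '{"predictions": [{"label": "cat", "confidence": 0.92}, {"label": "dog", "confidence": 0.05}]}'
    print(top_label(sample))
```

Against a real endpoint, the request would be sent with `urllib.request.urlopen(request)` and the response body decoded before being passed to `top_label`.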

Why it matters

Endpoints are the bridge that turns trained models into usable products, enabling real-time inference at scale in production systems.

Frequently asked questions

Is an endpoint the same as a model?

No. The model is the trained file; the endpoint is the running service that makes the model accessible over the network.

Related terms

Model Serving

Model serving is the infrastructure process of deploying a trained ML model into production so it can receive data and return predictions via an API or service.

Inference

Inference is the stage where a trained machine learning model is used to generate predictions or outputs on new, unseen data. In infrastructure contexts, it focuses on efficiently deploying and serving models in production.

API

An API (Application Programming Interface) is a standardized set of rules that lets software applications request services or data from each other. In AI infrastructure, it typically means exposing machine learning models as callable endpoints for inference or training.

CUDA

CUDA is NVIDIA's platform and programming model that lets developers run general-purpose computations on NVIDIA GPUs instead of just CPUs.

Distillation

Knowledge distillation is a technique that transfers knowledge from a large, complex 'teacher' model to a smaller 'student' model so the student can achieve similar performance with far less compute and memory.

Edge AI

Edge AI runs AI models directly on local devices such as phones, cameras, or sensors instead of sending data to remote cloud servers.