What is an API?
An API (Application Programming Interface) is a standardized set of rules that lets software applications request services or data from each other. In AI infrastructure, it typically means exposing machine learning models as callable endpoints for inference or training.
APIs work by defining request formats (inputs such as JSON payloads) and response formats (outputs such as predictions), along with a transport such as REST over HTTP or gRPC. The server side handles authentication, rate limiting, and scaling, while the client simply makes calls without needing to know the model's internal details.
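The request/response contract described above can be sketched without any network code. This is a minimal illustration, not a production server: the `input` and `prediction` field names, the `VALID_KEYS` set, and the `fake_model` stub are all assumptions made up for the example.

```python
import json
from typing import Tuple

# Hypothetical key store; a real server would consult a secrets backend.
VALID_KEYS = {"demo-key"}


def fake_model(text: str) -> str:
    """Stand-in for a real model: it just reverses the input text."""
    return text[::-1]


def handle_request(api_key: str, body: str) -> Tuple[int, str]:
    """Validate a request and return (HTTP-style status code, JSON body)."""
    # Authentication happens on the server, before the model is touched.
    if api_key not in VALID_KEYS:
        return 401, json.dumps({"error": "invalid API key"})

    # The request format is part of the contract: the body must be JSON.
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "body must be valid JSON"})

    # The (assumed) schema requires a string field called "input".
    if not isinstance(payload, dict) or not isinstance(payload.get("input"), str):
        return 400, json.dumps({"error": "'input' must be a string"})

    # The response format is equally fixed: one "prediction" field.
    return 200, json.dumps({"prediction": fake_model(payload["input"])})
```

The client only ever sees the status code and the JSON body, so the stub model could be swapped for a real one without changing any calling code.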
Key ideas include abstraction (hiding model complexity), interoperability (connecting different languages or systems), and versioning (managing updates without breaking clients). In ML infrastructure, the API is often served by containerized model servers sitting behind load balancers.
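One common way to version an API is to put the version in the path, so that older clients keep working while newer ones opt in to a changed response. The sketch below assumes hypothetical `/v1/predict` and `/v2/predict` paths and a toy keyword-based "model"; none of these names come from a real service.

```python
from typing import Callable, Dict


def predict_v1(text: str) -> dict:
    """Original contract: the response carries only a label."""
    label = "positive" if "good" in text.lower() else "negative"
    return {"label": label}


def predict_v2(text: str) -> dict:
    """Newer contract: same label, plus an extra score field."""
    label = predict_v1(text)["label"]
    return {"label": label, "score": 1.0 if label == "positive" else 0.0}


# Each version keeps its own handler, so v1 clients never see v2 changes.
ROUTES: Dict[str, Callable[[str], dict]] = {
    "/v1/predict": predict_v1,
    "/v2/predict": predict_v2,
}


def route(path: str, text: str) -> dict:
    """Dispatch a request to the handler registered for its path."""
    handler = ROUTES.get(path)
    if handler is None:
        return {"error": f"unknown path: {path}"}
    return handler(text)
```

Calling `route("/v1/predict", "Good stuff")` and `route("/v2/predict", "Good stuff")` exercises both contracts side by side, which is the property that lets a provider ship updates without breaking existing clients.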
Modern AI APIs also incorporate monitoring, logging, and A/B testing to track performance and cost in production environments.
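As a rough illustration of monitoring, a handler can be wrapped so that every call updates simple counters. This is a sketch under stated assumptions: the `METRICS` dictionary, the `monitored` decorator, and the toy `predict` function are hypothetical, and a production system would normally export such figures to a dedicated metrics backend instead of keeping them in memory.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory counters keyed by endpoint name (illustrative only).
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def monitored(name):
    """Decorator that records call count, error count, and total latency."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                METRICS[name]["errors"] += 1
                raise  # the caller still sees the original error
            finally:
                # Runs on both success and failure.
                METRICS[name]["calls"] += 1
                METRICS[name]["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@monitored("predict")
def predict(text: str) -> dict:
    """Toy endpoint: rejects empty input, otherwise returns the length."""
    if not text:
        raise ValueError("empty input")
    return {"length": len(text)}
```

Dividing `total_seconds` by `calls` gives an average latency per endpoint, and the ratio of `errors` to `calls` gives an error rate.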
Example
A developer sends a text prompt via HTTP POST to an OpenAI-style inference API and receives a JSON response containing the model's generated text, without installing any models locally.
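The exchange above might look roughly like the following from the client side. The URL, the `model` and `prompt` request fields, and the `text` response field are placeholders chosen for illustration; real providers define their own schemas, so check the provider's documentation before relying on any of these names. The request is only built here, not sent.

```python
import json
import urllib.request

# Placeholder endpoint; not a real service.
API_URL = "https://api.example.com/v1/completions"


def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build an HTTP POST request carrying the prompt as a JSON payload."""
    body = json.dumps({"model": "example-model", "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def extract_text(response_body: bytes) -> str:
    """Pull the generated text out of an (assumed) JSON response shape."""
    return json.loads(response_body)["text"]


# Against a real endpoint, the call would be sent along these lines:
#   with urllib.request.urlopen(build_request("Hello", key)) as resp:
#       print(extract_text(resp.read()))
```

Everything the client needs is the URL, a key, and the two JSON shapes; no model weights or GPU drivers are involved on its side.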
Why it matters
APIs turn trained models into reusable services that any application can consume, enabling rapid integration of AI into products and lowering the barrier for teams without deep ML expertise.
Frequently asked questions
In AI infrastructure, an API is a web-accessible interface that lets you send data to a model and receive predictions without managing the underlying infrastructure.
Related terms
Model serving is the infrastructure process of deploying a trained ML model into production so it can receive data and return predictions via an API or service.
CUDA is NVIDIA's platform and programming model that lets developers run general-purpose computations on NVIDIA GPUs instead of just CPUs.
Knowledge distillation is a technique that transfers knowledge from a large, complex 'teacher' model to a smaller 'student' model so the student can achieve similar performance with far less compute and memory.
Edge AI runs AI models directly on local devices such as phones, cameras, or sensors instead of sending data to remote cloud servers.
In AI/ML infrastructure, an endpoint is a deployed URL or network address that exposes a trained model so applications can send data and receive predictions via API calls.
FLOPs stands for floating-point operations and counts the total number of arithmetic calculations (additions, multiplications) a neural network performs during a forward or backward pass.