What is Latency?
Latency is the time delay between sending input to an AI system and receiving its output. In infrastructure, it measures how quickly a model processes a request and returns results.
It is typically measured in milliseconds and includes the full round-trip from request arrival through model inference to response delivery. Factors like hardware speed, model size, batching, and network overhead directly influence it.
Low latency is essential for interactive applications, while higher latency may be acceptable in batch processing. Techniques such as model optimization, caching, and specialized accelerators help reduce it.
Latency differs from throughput, which counts how many requests can be handled per second; the two often trade off against each other in production systems.
Example
When you ask a voice assistant a question, latency is the short pause between finishing your sentence and hearing its spoken reply.
Why it matters
High latency degrades user experience in real-time AI apps like chatbots and recommendation engines, while low latency enables responsive services and can be a competitive advantage.
Frequently asked questions
It depends on the use case; under 100 ms is often ideal for chat, while under 10 ms may be needed for autonomous systems.
Related terms
Inference is the stage where a trained machine learning model is used to generate predictions or outputs on new, unseen data. In infrastructure contexts, it focuses on efficiently deploying and serving models in production.
Throughput measures how much work an AI system completes in a given time, such as the number of model inferences or training examples processed per second.
Model serving is the infrastructure process of deploying a trained ML model into production so it can receive data and return predictions via an API or service.
Quantization is a model optimization technique that lowers the numerical precision of weights and activations, usually converting 32-bit floats to 8-bit integers or similar lower-bit formats.
An API (Application Programming Interface) is a standardized set of rules that lets software applications request services or data from each other. In AI infrastructure, it typically means exposing machine learning models as callable endpoints for inference or training.
CUDA is NVIDIA's platform and programming model that lets developers run general-purpose computations on NVIDIA GPUs instead of just CPUs.