What is Latency?

Latency is the delay between sending input to an AI system and receiving its output. In AI infrastructure, it measures how long a model takes to process a request and return a result.

It is typically measured in milliseconds and covers the full round trip: sending the request, running model inference, and delivering the response. Hardware speed, model size, batching, and network overhead all directly influence it.
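As a rough illustration of how latency can be measured in milliseconds, the sketch below times repeated calls to a handler and reports the median and 95th percentile. The names `measure_latency_ms` and `fake_model` are hypothetical, and the 5 ms sleep only stands in for real inference. It times in-process work only, so network overhead is not captured.

```python
import statistics
import time


def measure_latency_ms(handler, payload, runs=20):
    """Time handler(payload) repeatedly and return per-call latencies in ms."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        handler(payload)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def fake_model(payload):
    # Hypothetical stand-in for model inference: sleeps for about 5 ms.
    time.sleep(0.005)
    return payload.upper()


latencies = measure_latency_ms(fake_model, "hello")
median_ms = statistics.median(latencies)
# With n=20, quantiles() returns 19 cut points; the last is the 95th percentile.
p95_ms = statistics.quantiles(latencies, n=20)[-1]
print(f"median: {median_ms:.1f} ms, p95: {p95_ms:.1f} ms")
```

Reporting a high percentile alongside the median is a common choice because a few slow requests can dominate how responsive a service feels.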

Low latency is essential for interactive applications, while higher latency may be acceptable in batch processing. Techniques such as model optimization, caching, and specialized accelerators help reduce it.
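To make the caching technique concrete, here is a minimal sketch using Python's standard `functools.lru_cache`. The function `cached_predict` and its 50 ms sleep are assumptions standing in for a slow model call. Caching like this only helps when identical inputs repeat and the output for a given input does not change.

```python
import functools
import time


@functools.lru_cache(maxsize=1024)
def cached_predict(prompt):
    # Hypothetical stand-in for slow model inference (about 50 ms).
    time.sleep(0.05)
    return prompt[::-1]


def timed_ms(fn, arg):
    """Return how long fn(arg) takes, in milliseconds."""
    start = time.perf_counter()
    fn(arg)
    return (time.perf_counter() - start) * 1000.0


cold_ms = timed_ms(cached_predict, "hello")  # cache miss: runs the slow path
warm_ms = timed_ms(cached_predict, "hello")  # cache hit: served from memory
print(f"cold: {cold_ms:.1f} ms, warm: {warm_ms:.3f} ms")
```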

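Latency differs from throughput, which measures how many requests a system can handle per second. The two often trade off against each other in production systems: batching requests, for example, raises throughput but makes each individual request wait longer.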
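The trade-off can be sketched with a toy cost model in which each batch pays a fixed overhead plus a per-item cost. The function `batch_stats` and the numbers (20 ms overhead, 2 ms per item) are made up for illustration and do not describe any particular system.

```python
def batch_stats(batch_size, fixed_ms=20.0, per_item_ms=2.0):
    """Return (latency_ms, throughput_rps) for one batch under a toy cost model."""
    # Every request in the batch waits for the whole batch to finish.
    latency_ms = fixed_ms + per_item_ms * batch_size
    # Requests completed per second if batches run back to back.
    throughput_rps = batch_size / (latency_ms / 1000.0)
    return latency_ms, throughput_rps


for size in (1, 8, 32):
    latency_ms, rps = batch_stats(size)
    print(f"batch={size:>2}  latency={latency_ms:.0f} ms  throughput={rps:.0f} req/s")
```

Under these assumed numbers, going from a batch size of 1 to 32 raises latency from 22 ms to 84 ms while throughput climbs from about 45 to about 381 requests per second.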

Example

When you ask a voice assistant a question, latency is the short pause between finishing your sentence and hearing its spoken reply.

Why it matters

High latency degrades user experience in real-time AI apps like chatbots and recommendation engines, while low latency enables responsive services and can be a competitive advantage.

Frequently asked questions

What counts as good latency depends on the use case: under 100 ms is often ideal for chat, while under 10 ms may be needed for autonomous systems.