
What is Throughput?

Throughput measures how much work an AI system completes in a given time, such as the number of model inferences or training examples processed per second.

In AI infrastructure, throughput tracks the rate at which hardware and software together handle tasks like inference or training. It is usually expressed in queries per second (QPS), tokens per second, or samples per second.
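As a rough illustration, throughput can be measured by counting completed items and dividing by elapsed wall-clock time. The sketch below assumes a caller-supplied `process_batch` function and an iterable of batches; both names are hypothetical stand-ins for a real inference or training step.

```python
import time

def measure_throughput(process_batch, batches):
    """Return items processed per second across all batches.

    `process_batch` is a hypothetical callable standing in for a real
    inference or training step; `batches` is any iterable of sized batches.
    """
    total_items = 0
    start = time.perf_counter()
    for batch in batches:
        process_batch(batch)
        total_items += len(batch)
    elapsed = time.perf_counter() - start
    # Guard against a zero-length measurement window.
    return total_items / elapsed if elapsed > 0 else float("inf")
```

Depending on what one item represents (a request, a token, or a training sample), the same ratio gives QPS, tokens per second, or samples per second.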

Throughput depends on factors including batch size, hardware accelerators (GPUs/TPUs), model size, and optimization techniques such as quantization or parallelism. Larger batches often raise throughput until memory or compute limits are reached.

It is distinct from latency, which measures the time a single request takes. Systems can gain throughput at the cost of higher latency by processing requests in batches, because each request must wait for its whole batch to finish.
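The batching trade-off can be sketched with a toy cost model. It assumes each batch costs a fixed overhead plus a constant time per item; the function name and the default timings are illustrative assumptions, not measurements from any real system, and the model ignores the memory limits that cap batch size in practice.

```python
def batch_stats(batch_size, overhead_s=0.010, per_item_s=0.002):
    """Toy model: one batch takes a fixed overhead plus a per-item cost.

    Returns (throughput in items/s, latency of the batch in seconds).
    The default timings are made-up values for illustration only.
    """
    batch_latency_s = overhead_s + per_item_s * batch_size
    throughput = batch_size / batch_latency_s
    return throughput, batch_latency_s

for b in (1, 8, 64):
    tput, lat = batch_stats(b)
    print(f"batch={b:3d}  throughput={tput:7.1f} items/s  latency={lat * 1000:6.1f} ms")
```

Under these assumptions, throughput rises with batch size but flattens as it approaches 1 / per_item_s (500 items per second here), while the time each request waits keeps growing.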

Example

A vision model running on four GPUs might sustain a throughput of 1,200 images per second at a batch size of 64, which lets an image-classification service label about 72,000 uploaded photos every minute.
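The arithmetic behind this example is short enough to check directly. The per-GPU figure assumes the load is split evenly across the four GPUs, which the example does not state.

```python
images_per_second = 1200   # aggregate throughput from the example
num_gpus = 4

# Per-minute volume for the whole service.
images_per_minute = images_per_second * 60

# Per-GPU rate, assuming an even split across GPUs (an assumption).
images_per_gpu_per_second = images_per_second / num_gpus

print(images_per_minute, images_per_gpu_per_second)
```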

Why it matters

High throughput is essential for cost-effective, real-time AI services: the more requests each accelerator completes per second, the less hardware a service needs to handle millions of user requests a day.

Frequently asked questions

How does throughput differ from latency?

Throughput counts the total work completed over a period of time, while latency measures how long one individual request takes.