What is Throughput?
Throughput measures how much work an AI system completes in a given time, such as the number of model inferences or training examples processed per second.
In AI infrastructure, throughput tracks the rate at which hardware and software together handle tasks like inference or training. It is usually expressed in queries per second (QPS), tokens per second, or samples per second.
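As a rough illustration of how such a rate is obtained, the sketch below times a workload and divides the number of items processed by the elapsed seconds. The names `throughput_per_second`, `measure_throughput`, and `fake_model` are hypothetical, and `fake_model` is only a stand-in for a real inference call.

```python
import time


def throughput_per_second(items, seconds):
    """Throughput is simply work completed divided by elapsed time."""
    return items / seconds


def measure_throughput(process_batch, batches):
    """Time a full run over `batches` and return items processed per second."""
    total_items = 0
    start = time.perf_counter()
    for batch in batches:
        process_batch(batch)
        total_items += len(batch)
    elapsed = time.perf_counter() - start
    return throughput_per_second(total_items, elapsed)


def fake_model(batch):
    # Hypothetical stand-in for a model's inference call.
    return [x * 2 for x in batch]


# 100 batches of 64 samples each (illustrative sizes, not a benchmark).
batches = [list(range(64)) for _ in range(100)]
measured = measure_throughput(fake_model, batches)
print(f"{measured:.0f} samples/second")
```

The same calculation yields tokens per second or queries per second if the counted item is a token or a query instead of a sample.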
Throughput depends on factors including batch size, hardware accelerators (GPUs/TPUs), model size, and optimization techniques such as quantization or parallelism. Larger batches often raise throughput until memory or compute limits are reached.
It is distinct from latency, which measures the time a single request takes. Systems often gain throughput at the cost of latency by processing requests in batches, because each request must wait for its whole batch to complete.
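The trade-off can be sketched with a toy cost model, assuming each batch pays a fixed overhead plus a per-item cost. The function `batch_stats` and the values of 10 ms overhead and 1 ms per item are illustrative assumptions, not measurements from any real system.

```python
def batch_stats(batch_size, overhead_ms=10.0, per_item_ms=1.0):
    """Return (batch latency in ms, throughput in items/second).

    Assumed toy model: every batch costs a fixed overhead plus a
    per-item cost, and all items in a batch finish together.
    """
    batch_time_ms = overhead_ms + per_item_ms * batch_size
    throughput = batch_size / (batch_time_ms / 1000.0)
    return batch_time_ms, throughput


for size in (1, 8, 64):
    latency_ms, items_per_second = batch_stats(size)
    print(f"batch={size:3d}  latency={latency_ms:6.1f} ms  "
          f"throughput={items_per_second:7.1f} items/s")
```

Under this model both numbers rise with batch size: the fixed overhead is spread across more items, but every request waits for the whole batch. The sketch ignores the time requests spend queued while a batch fills, which adds further latency in practice.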
Example
A vision model running on four GPUs might achieve a throughput of 1,200 images per second at a batch size of 64, allowing an image-classification service to label about 72,000 uploaded photos every minute.
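The arithmetic behind this example can be checked in a few lines. The per-GPU figure assumes the load is split evenly across the four GPUs, and the batches-per-second figure assumes 64 is the total batch size across all GPUs; the example itself states neither.

```python
images_per_second = 1200
num_gpus = 4
batch_size = 64

# Sustained rate converted to a per-minute figure.
images_per_minute = images_per_second * 60

# Assumption: work is divided evenly across the GPUs.
images_per_second_per_gpu = images_per_second / num_gpus

# Assumption: 64 is the combined batch size, not a per-GPU batch.
batches_per_second = images_per_second / batch_size

print(images_per_minute)          # 72000
print(images_per_second_per_gpu)  # 300.0
print(batches_per_second)         # 18.75
```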
Why it matters
High throughput is essential for cost-effective, real-time AI services that must handle millions of user requests daily without adding expensive hardware.
Frequently asked questions
What is the difference between throughput and latency?
Throughput counts total work completed over time, while latency measures how long one individual request takes.
Related terms
Batch size is the number of examples processed together in a single pass: a forward and backward pass during model training, or a forward pass during inference.
Inference is the stage where a trained machine learning model is used to generate predictions or outputs on new, unseen data. In infrastructure contexts, it focuses on efficiently deploying and serving models in production.
Quantization is a model optimization technique that lowers the numerical precision of weights and activations, usually converting 32-bit floats to 8-bit integers or similar lower-bit formats.