Does increasing batch size always improve throughput?

It usually does, up to the point where memory or compute resources become saturated and throughput plateaus or drops.

Why do teams track throughput in production AI systems?

It directly affects how many users or requests the system can serve and helps determine the hardware needed to meet demand.

What is Throughput?

Throughput measures how much work an AI system completes in a given time, such as the number of model inferences or training examples processed per second.

In AI infrastructure, throughput tracks the rate at which hardware and software together handle tasks like inference or training. It is usually expressed in queries per second (QPS), tokens per second, or samples per second.

Throughput depends on factors including batch size, hardware accelerators (GPUs/TPUs), model size, and optimization techniques such as quantization or parallelism. Larger batches often raise throughput until memory or compute limits are reached.

It is distinct from latency, which measures the time for a single request; systems can trade higher throughput for increased latency by processing requests in batches.

Example

A vision model running on four GPUs might achieve 1,200 images per second of throughput when using a batch size of 64, allowing an image-classification service to label thousands of uploaded photos every minute.

Why it matters

High throughput is essential for cost-effective, real-time AI services that must handle millions of user requests daily without adding expensive hardware.

Frequently asked questions

Throughput counts total work completed over time, while latency measures how long one individual request takes.

Related terms

Batch Size

Batch size is the number of training examples processed together in a single forward and backward pass during model training.

Inference

Inference is the stage where a trained machine learning model is used to generate predictions or outputs on new, unseen data. In infrastructure contexts, it focuses on efficiently deploying and serving models in production.

Quantization

Quantization is a model optimization technique that lowers the numerical precision of weights and activations, usually converting 32-bit floats to 8-bit integers or similar lower-bit formats.