Serving, hardware and MLOps.
Inference is the stage where a trained machine learning model is used to generate predictions or outputs on new, unseen data. In infrastructure contexts, it focuses on efficiently deploying and serving models in production.
Quantization is a model optimization technique that lowers the numerical precision of weights and activations, usually converting 32-bit floats to 8-bit integers or similar lower-bit formats.
Throughput measures how much work an AI system completes in a given time, such as the number of model inferences or training examples processed per second.