What is Quantization?

Quantization is a model optimization technique that lowers the numerical precision of a model's weights and activations, typically by converting 32-bit floating-point values to 8-bit integers or another low-bit format.

It works by scaling the original high-precision values onto a smaller, discrete range of representable numbers, rounding each value to the nearest representable level, and clipping any value that falls outside the range. This reduces memory usage and enables faster arithmetic on hardware that supports low-precision operations.
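The scale, round, and clip steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a signed 8-bit target range and a hand-picked scale; the names `quantize` and `dequantize` are hypothetical and not taken from any particular library.

```python
def quantize(values, scale, zero_point=0, qmin=-128, qmax=127):
    """Map floats to integers: q = clip(round(x / scale) + zero_point)."""
    quantized = []
    for x in values:
        q = round(x / scale) + zero_point        # scale, then round
        quantized.append(max(qmin, min(qmax, q)))  # clip to the integer range
    return quantized


def dequantize(quantized, scale, zero_point=0):
    """Approximate the original floats: x ~ (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in quantized]


# Assumed setup: map the float interval [-1, 1] onto the integers [-127, 127].
weights = [-1.0, -0.4, 0.0, 0.25, 1.0, 3.0]
scale = 1.0 / 127

q = quantize(weights, scale)
restored = dequantize(q, scale)

print(q)         # integer codes; 3.0 lies outside [-1, 1] and is clipped
print(restored)  # close to the inputs, except for the clipped value
```

Values inside the covered interval come back with an error of at most half a step (`scale / 2`), while out-of-range values such as `3.0` saturate at the largest representable code.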

Common approaches include post-training quantization, applied after a model is fully trained, and quantization-aware training, which simulates lower precision during training to preserve accuracy. A small set of representative calibration data is often used to estimate value ranges and determine appropriate scaling factors.
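As a rough sketch of both ideas, the snippet below derives a scaling factor from calibration data using the largest absolute value observed, then simulates low precision with a quantize-then-dequantize round trip of the kind quantization-aware training typically inserts into the forward pass. The max-absolute-value rule is only one possible calibration choice, and `calibrate_scale` and `fake_quantize` are hypothetical helper names.

```python
def calibrate_scale(calibration_batches, qmax=127):
    """Pick a scale so the largest observed magnitude maps to qmax."""
    max_abs = max(abs(x) for batch in calibration_batches for x in batch)
    return max_abs / qmax if max_abs > 0 else 1.0


def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Round-trip a float through the integer grid; the result stays a float."""
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale


# Hypothetical activation samples collected from a few calibration inputs.
batches = [[0.1, -2.54], [1.0, 0.5]]
scale = calibrate_scale(batches)         # about 0.02 for this data

simulated = fake_quantize(0.033, scale)  # snaps to the nearest grid point
print(scale, simulated)
```

In practice, outlier-robust choices such as percentile clipping are often preferred over the plain maximum, since a single extreme value would otherwise widen the step size for every other value.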

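The process can be applied to weights only or to both weights and activations. Symmetric schemes center the integer range on zero and suit roughly zero-centered distributions such as weights, while asymmetric schemes add a zero-point offset and suit skewed distributions such as activations that are never negative.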
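The difference between the two schemes shows up in how the scale and zero point are chosen. The sketch below assumes signed 8-bit codes for the symmetric case and unsigned 8-bit codes for the asymmetric case; `symmetric_params` and `asymmetric_params` are illustrative names, not a library API.

```python
def symmetric_params(values, qmax=127):
    """Symmetric: range is [-max_abs, max_abs], zero point is always 0."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax or 1.0   # fall back to 1.0 for all-zero input
    return scale, 0


def asymmetric_params(values, qmin=0, qmax=255):
    """Asymmetric: range is [min, max], shifted by an integer zero point."""
    lo = min(min(values), 0.0)      # keep 0.0 inside the range so that
    hi = max(max(values), 0.0)      # it maps exactly onto an integer code
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = qmin - round(lo / scale)
    return scale, zero_point


# Skewed, non-negative data (for example, values after a ReLU).
activations = [0.0, 0.5, 2.0, 5.1]
sym_scale, sym_zp = symmetric_params(activations)
asym_scale, asym_zp = asymmetric_params(activations)
print(sym_scale, sym_zp)    # half of the signed range goes unused
print(asym_scale, asym_zp)  # finer step over the same data

# Data spanning both signs gets a non-zero zero point.
mixed_scale, mixed_zp = asymmetric_params([-1.0, 3.0])
print(mixed_scale, mixed_zp)
```

For the non-negative data, the asymmetric scheme spends all 256 codes on the interval that actually occurs, so its step size is about half that of the symmetric scheme.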

Example

A 100 MB model stored as 32-bit floats can be quantized to 8-bit integers, shrinking it to roughly 25 MB (a 4x reduction) and allowing real-time inference on a mobile phone with only a minor drop in accuracy.
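The size estimate follows directly from the ratio of bit widths. A small sketch of the arithmetic, where `quantized_size_mb` is a hypothetical helper that ignores the small overhead of storing scales and zero points:

```python
def quantized_size_mb(size_mb, original_bits=32, target_bits=8):
    """Estimate model size after changing the bits used per value."""
    return size_mb * target_bits / original_bits


print(quantized_size_mb(100))                 # 25.0
print(quantized_size_mb(100, target_bits=4))  # 12.5
```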

Why it matters

Quantization enables large AI models to run efficiently on edge devices and resource-constrained hardware, cutting latency, power consumption, and cloud costs while broadening AI accessibility.

Frequently asked questions

Does quantization always reduce accuracy?

Not always; careful calibration or quantization-aware training can keep accuracy loss very small or even negligible for many tasks.