How is quantization different from pruning?

Quantization reduces the bit-width used to store each weight or activation, while pruning removes weights or connections that contribute little to the model's output.

Can I quantize any model?

Most neural networks can be quantized, but models that are sensitive to precision loss may need extra techniques such as fine-tuning to maintain performance.

What is Quantization?

Quantization is a model optimization technique that lowers the numerical precision of weights and activations, usually converting 32-bit floats to 8-bit integers or similar lower-bit formats.

It works by scaling the original high-precision values onto a smaller discrete range of representable numbers, rounding each value to the nearest level and clipping any that fall outside the range. This reduces memory usage and enables faster arithmetic on hardware that supports low-precision operations.
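The scale, round, and clip steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names quantize and dequantize and the sample weights are assumptions made for the example, and it uses one scale for the whole list, derived from the largest absolute value.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers using one scale factor for the whole list."""
    qmax = 2 ** (num_bits - 1) - 1      # 127 for 8 bits
    qmin = -qmax - 1                    # -128 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Scale, round to the nearest integer, then clip to the representable range.
    q = [min(qmax, max(qmin, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers."""
    return [x * scale for x in q]


weights = [0.8, -1.0, 0.3, 0.0]         # hypothetical example weights
q, scale = quantize(weights)
restored = dequantize(q, scale)

print(q)         # [102, -127, 38, 0]
print(restored)  # close to the original weights, but not identical
```

Each restored value differs from the original by at most half a quantization step (scale / 2), which is the rounding error the technique trades for the smaller representation.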

Common approaches include post-training quantization, applied after a model is fully trained, and quantization-aware training, which simulates lower precision during training to preserve accuracy. Calibration data is often used to determine appropriate scaling factors.
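As a rough sketch of how calibration data can set a scaling factor in post-training quantization, the example below records the largest absolute activation seen across a few calibration batches and reuses the resulting scale on new inputs. The helper names, the batch values, and the simple max-based rule are all assumptions for illustration; real toolkits offer other rules, such as percentile-based ranges.

```python
def calibrate_scale(batches, num_bits=8):
    """Derive one scale from the largest absolute value in the calibration data."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for batch in batches for v in batch)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize_with_scale(values, scale, num_bits=8):
    """Quantize with a fixed, precomputed scale; out-of-range values are clipped."""
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -qmax - 1
    return [min(qmax, max(qmin, round(v / scale))) for v in values]


# Hypothetical activations collected while running calibration inputs.
calibration_batches = [[0.1, -0.4, 0.9], [2.0, -1.5, 0.3]]
scale = calibrate_scale(calibration_batches)    # 2.0 / 127

# New data at inference time; 5.0 lies outside the calibrated range.
new_activations = [1.2, -2.0, 5.0]
q = quantize_with_scale(new_activations, scale)
print(q)  # [76, -127, 127]
```

The last value saturates at 127 because it exceeds anything seen during calibration, which is why calibration data should be representative of real inputs.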

The process can be applied to weights only or to both weights and activations, with symmetric or asymmetric schemes depending on the data distribution.
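The difference between symmetric and asymmetric schemes can be illustrated on skewed data. In this sketch the symmetric scheme uses signed 8-bit integers centered on zero, while the asymmetric scheme uses unsigned 8-bit integers plus a zero point that shifts the range to fit the data. The function names and sample activations are assumptions for the example.

```python
def symmetric_params(values, num_bits=8):
    """Signed range centered on zero; the zero point is always 0."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return scale, 0, -qmax - 1, qmax


def asymmetric_params(values, num_bits=8):
    """Unsigned range fitted to [min, max]; a zero point maps 0.0 to an integer."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo = min(min(values), 0.0)          # range must include 0.0 so it stays exact
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point, qmin, qmax


def roundtrip(values, scale, zero_point, qmin, qmax):
    """Quantize, then dequantize, to expose the rounding error."""
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return [(x - zero_point) * scale for x in q]


def max_error(values, restored):
    return max(abs(a - b) for a, b in zip(values, restored))


# Hypothetical skewed activations: mostly positive, with a short negative tail.
activations = [-1.0, 0.0, 0.37, 1.42, 2.9, 6.0]

sym = symmetric_params(activations)
asym = asymmetric_params(activations)
sym_err = max_error(activations, roundtrip(activations, *sym))
asym_err = max_error(activations, roundtrip(activations, *asym))

print(asym[1])             # zero point: 36
print(asym_err < sym_err)  # True for this data
```

For this data the symmetric scheme wastes most of its negative range, so its step size is larger. On data that is roughly centered on zero the two schemes behave similarly, and the symmetric one is simpler because it needs no zero point.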

Example

A 100 MB model stored as 32-bit floats can be quantized to 8-bit integers, shrinking it to roughly 25 MB and often allowing real-time inference on a mobile phone with only a minor accuracy drop.
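The size estimate follows directly from the ratio of bit-widths. The helper below is a back-of-the-envelope sketch with a hypothetical name; it assumes every parameter is quantized and ignores the small overhead of storing scales and zero points.

```python
def quantized_size_mb(size_mb, source_bits=32, target_bits=8):
    """Estimate model size after changing the bit-width of every parameter."""
    return size_mb * target_bits / source_bits


print(quantized_size_mb(100))                 # 25.0
print(quantized_size_mb(100, target_bits=4))  # 12.5
```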

Why it matters

Quantization enables large AI models to run efficiently on edge devices and resource-constrained hardware, cutting latency, power consumption, and cloud costs while broadening AI accessibility.

Frequently asked questions

Does quantization always reduce accuracy?

Not always; careful calibration or quantization-aware training can keep accuracy loss very small or even negligible for many tasks.

Related terms

Inference

Inference is the stage where a trained machine learning model is used to generate predictions or outputs on new, unseen data. In infrastructure contexts, it focuses on efficiently deploying and serving models in production.

Throughput

Throughput measures how much work an AI system completes in a given time, such as the number of model inferences or training examples processed per second.