A model compression technique that reduces the numerical precision of a model's weights, significantly decreasing memory requirements and inference cost with minimal loss in output quality.
Neural network weights are typically stored as 32-bit or 16-bit floating point numbers. Quantization reduces this precision — commonly to 8-bit integers (INT8) or even 4-bit (INT4) — dramatically shrinking the model's memory footprint. A 7B parameter model that requires 14GB in FP16 can be compressed to under 4GB with 4-bit quantization.
This compression makes it practical to run large models on consumer hardware. Tools like llama.cpp, Ollama, and Hugging Face's bitsandbytes library make quantized model deployment accessible to any developer. Running a quantized Llama 3 8B model on a laptop with 8GB of RAM is now entirely feasible.
The tradeoff is a small degradation in output quality, which varies by task and quantization level. For most practical applications, 4-bit quantized models perform remarkably close to their full-precision counterparts. Quantization-aware training (QAT) — where quantization is incorporated during training — further closes this quality gap.
Weekly AI tool reviews, news digests, and how-to guides.
Join 12,000+ builders