Quantization
Reducing model precision to fit larger models in less memory, serve them faster, and often give up little or no quality.
What is quantization?
Neural network weights are stored as floating-point numbers. By default, most models train in FP32 (32 bits per parameter). A 30B parameter model in FP32 needs ~120 GB of memory just for the weights. That does not fit on most GPUs.
Quantization reduces the precision of these weights. Use fewer bits per parameter, and the model gets smaller, faster, and cheaper to serve. The tradeoff is quality — but modern quantization methods are surprisingly good at preserving it.
The precision ladder
From highest fidelity to most compressed:
FP32 (training precision). Full fidelity, but 4x the memory of FP8. Almost never used for inference.
FP16/BF16 (half precision). The default for most inference. BF16 trades mantissa bits for a wider dynamic range, which handles large values better. A 30B model needs ~60 GB.
FP8. The sweet spot for quality-conscious serving. Near-lossless on most models. A 30B model needs ~30 GB. Supported natively on Hopper and Blackwell GPUs.
INT8 (integer quantization). Slightly less flexible than FP8 but widely supported. Good for older hardware without FP8 tensor cores.
INT4 (aggressive compression). AWQ (Activation-aware Weight Quantization) preserves important channels; GPTQ minimizes quantization error using calibration data. A 30B model needs ~15 GB. Quality loss is measurable but often acceptable.
GGUF (the llama.cpp format). Supports mixed quantization (important layers keep higher precision). Best for CPU and CPU+GPU hybrid inference. Not used with vLLM.
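The memory figures on the ladder above all come from the same arithmetic: parameter count times bits per parameter. A minimal sketch (weights only; KV cache, activations, and the scale/zero-point metadata that INT4 formats carry are ignored):

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

PARAMS = 30e9  # a 30B-parameter model
for name, bits in [("FP32", 32), ("BF16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_memory_gb(PARAMS, bits):.0f} GB")
# FP32: ~120 GB, BF16: ~60 GB, FP8: ~30 GB, INT4: ~15 GB
```

Real deployments need headroom beyond this for the KV cache, which is why "fits in memory" and "fits with generous KV cache headroom" are different thresholds.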
When to use each
Decision framework
Quality is paramount? Use FP8. Near-lossless, and on modern GPUs (Blackwell, Hopper) there is no throughput penalty because FP8 tensor cores are just as fast.
Memory-constrained? Use AWQ (INT4). Half the memory of FP8, and AWQ preserves quality better than naive INT4. This is the go-to when you need a model to fit.
Running on CPU? Use GGUF with Q4_K_M or Q5_K_M. These mixed-precision formats are designed for llama.cpp and work well on CPU and Apple Silicon.
Older GPU without FP8? Use GPTQ or AWQ. Both are well-supported in vLLM and work on any GPU with INT4/INT8 support.
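The decision framework above can be encoded as a small helper. This is a hypothetical function for illustration (the names and the three boolean inputs are our own simplification), not an API from vLLM or llama.cpp:

```python
def pick_quantization(cpu_only: bool, has_fp8_cores: bool,
                      weights_fit_in_fp8: bool) -> str:
    """Map the decision framework to a recommended format."""
    if cpu_only:
        return "GGUF Q4_K_M"       # llama.cpp mixed precision for CPU/Apple Silicon
    if has_fp8_cores and weights_fit_in_fp8:
        return "FP8"               # near-lossless, no throughput penalty
    if not weights_fit_in_fp8:
        return "AWQ (INT4)"        # half the memory of FP8
    return "GPTQ or AWQ"           # older GPU without FP8 tensor cores
```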
Real example: Qwen3-coder on GB10
We benchmarked Qwen3-coder-30B on NVIDIA GB10 (128 GB unified memory) across quantization levels. Here is what the numbers look like:
| Quantization | Memory | Throughput | Quality |
|---|---|---|---|
| BF16 | ~60 GB | Lower (memory-bound) | Baseline |
| FP8 | ~30 GB | 35 tok/s (single request) | Near-lossless |
| AWQ (INT4) | ~15 GB | Potentially higher | Slight degradation |
Why we chose FP8
On GB10, we have 128 GB of unified memory — enough for FP8 with generous KV cache headroom. FP8 gives us the best quality with no throughput penalty on Blackwell tensor cores. AWQ would save memory but the GB10 has memory to spare, so there is no reason to trade quality.
How quantization affects throughput
The relationship between quantization and speed is not straightforward. Two factors matter:
Memory bandwidth
LLM inference is memory-bandwidth bound during the decode phase (generating tokens one at a time). Smaller weights mean fewer bytes to read from memory per token, which directly increases tok/s. This is why INT4 models can be faster than FP16 even on GPUs with FP16 tensor cores.
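A back-of-the-envelope model of this: in bandwidth-bound decode, every generated token must stream all the weight bytes through memory, so tok/s is bounded by bandwidth divided by weight size. The 900 GB/s bandwidth below is an illustrative number, not a GB10 spec; real throughput is lower (KV cache reads, kernel overheads), but the scaling with precision holds:

```python
def decode_toks_per_sec(bandwidth_gbps: float, num_params: float,
                        bits_per_param: float) -> float:
    """Upper bound on decode tok/s for a bandwidth-bound model."""
    weight_bytes = num_params * bits_per_param / 8
    return bandwidth_gbps * 1e9 / weight_bytes

# Illustrative 900 GB/s of memory bandwidth, 30B parameters:
for name, bits in [("BF16", 16), ("FP8", 8), ("INT4", 4)]:
    bound = decode_toks_per_sec(900, 30e9, bits)
    print(f"{name}: ~{bound:.0f} tok/s upper bound")
```

Halving the bits per parameter doubles the ceiling, which is exactly why INT4 can out-run FP16 even when the arithmetic itself runs in FP16.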
Compute precision
Modern GPUs have specialized hardware for different precisions. Blackwell and Hopper GPUs have FP8 tensor cores that are just as fast as FP16. Older GPUs may need to dequantize INT4 weights to FP16 before computation, which adds overhead.
Quality impact
Quantization does lose information. The question is whether it matters for your use case.
FP8: Virtually indistinguishable from FP16 on most benchmarks. Safe default.
AWQ/GPTQ (INT4): 1-3% degradation on reasoning benchmarks. Usually imperceptible in production coding tasks. Occasionally struggles with very long chains of precise arithmetic.
GGUF Q4_K_M: Similar to AWQ in quality. The mixed-precision approach keeps critical layers at higher precision.
Below 4-bit: Quality drops sharply. 2-bit and 3-bit schemes are experimental and not recommended for production.
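To make the information loss concrete, here is a minimal sketch of naive symmetric per-tensor INT4 quantization and the round-trip error it introduces. This is the baseline that AWQ and GPTQ improve on; real schemes quantize per-channel or per-group with calibrated scales:

```python
def quantize_int4(weights):
    """Symmetric per-tensor INT4: map floats to integers in -8..7."""
    scale = max(abs(w) for w in weights) / 7  # use the symmetric range +/-7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.12, -0.53, 0.07, 0.91, -0.24]
q, s = quantize_int4(w)
w_hat = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(f"max round-trip error: {max_err:.4f}")  # bounded by scale/2
```

With only 16 representable levels, the worst-case error per weight is half the scale step, and the scale is set by the largest weight in the tensor. That is why outlier weights hurt naive quantization, and why the methods below treat channels unequally.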
AWQ vs GPTQ
Both are INT4 quantization methods, but they work differently:
AWQ (Activation-aware Weight Quantization)
Identifies which weight channels matter most by analyzing activation magnitudes over a small calibration set. Scales important channels so they keep higher effective precision. Fast to apply, since it needs no backpropagation or layer-by-layer reconstruction. Generally preferred for vLLM serving.
GPTQ (post-training quantization for generative pre-trained transformers)
Uses calibration data to minimize quantization error layer by layer. Slightly better quality on some models but slower to produce. Requires a representative calibration dataset.
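The core intuition behind AWQ can be shown in a few lines. This is a toy sketch, not the AWQ algorithm (which grid-searches the exponent over a calibration set and folds the inverse scales into the preceding layer): channels with larger average activation magnitude get larger weight scales, shrinking their relative quantization error.

```python
def channel_scales(activation_magnitudes, alpha=0.5):
    """Per-channel scales s_i = |a_i|^alpha, normalized to mean 1.

    alpha=0 is plain quantization (all channels equal); alpha=1
    fully follows the activations. AWQ searches for the best alpha.
    """
    scales = [m ** alpha for m in activation_magnitudes]
    mean = sum(scales) / len(scales)
    return [v / mean for v in scales]

# A channel with 10x the activation magnitude gets sqrt(10) ~ 3.2x the
# scale at alpha=0.5, protecting it from round-off:
acts = [0.1, 0.1, 1.0, 0.1]
print(channel_scales(acts))
```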
Practical advice
For vLLM serving, start with AWQ. It is faster to load, well-supported, and quality is on par with GPTQ for most models. Use GPTQ only if you have a specific model where AWQ quality is noticeably worse.
Related serving cards
See quantization in action with real benchmark data: