AI Model Quantization
Reduce model size and inference cost through quantization. Covers INT8, INT4, and mixed-precision quantization, post-training vs. quantization-aware training, the GGUF format, and the patterns that shrink models by 4x with minimal accuracy loss.
Quantization reduces model precision from 32-bit or 16-bit floating point to 8-bit or 4-bit integers. A 7B parameter model in FP16 is 14 GB. In INT4, it is 3.5 GB. This 4x reduction enables deploying large models on consumer GPUs, mobile devices, and cost-effective cloud instances — opening production deployment to models that would otherwise require expensive hardware.
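The fastest way to see the tradeoff in practice is load-time quantization. The sketch below assumes the Transformers + bitsandbytes stack and a placeholder Hugging Face model ID; it loads a 7B model with 4-bit NF4 weights, and the exact footprint varies slightly by architecture.

```python
# Minimal sketch: on-the-fly 4-bit quantization at load time (Transformers + bitsandbytes).
# The model ID is a placeholder; any causal LM on the Hub works the same way.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 generally beats plain INT4 for LLM weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for the actual matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model ID
    quantization_config=bnb_config,
    device_map="auto",
)
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")  # roughly a quarter of the FP16 footprint
```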
Quantization Levels
Precision vs. quality vs. size for a 7B-parameter model (a quick size calculation follows the table):

| Precision | Size | Quality | Speed | Typical use |
|---|---|---|---|---|
| FP32 (32-bit float) | 28 GB | Baseline (100%) | 1.0x | Training only |
| FP16 / BF16 (16-bit float) | 14 GB | ~99.5% of FP32 | ~2x | Standard inference |
| INT8 (8-bit integer) | 7 GB | ~99% of FP16 | ~2-3x | Production inference (good balance) |
| INT4 (4-bit integer) | 3.5 GB | ~95-98% of FP16 | ~3-4x | Edge deployment, cost-sensitive production |
| INT2 (2-bit integer) | 1.75 GB | ~85-90% of FP16 (significant degradation) | ~4-5x | Experimental, extreme resource constraints |
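The size column is just parameter count times bits per weight; a quick sketch of that arithmetic, ignoring the KV cache, activations, and quantization metadata such as per-group scales, which add modest overhead:

```python
# Back-of-the-envelope weight storage for a 7B-parameter model at each precision.
def weight_size_gb(n_params: float, bits_per_weight: int) -> float:
    return n_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    print(f"{name}: {weight_size_gb(7e9, bits):.2f} GB")
# FP32: 28.00 GB, FP16: 14.00 GB, INT8: 7.00 GB, INT4: 3.50 GB, INT2: 1.75 GB
```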
GPTQ vs. AWQ vs. GGUF
```python
# GPTQ: Post-Training Quantization (GPU focused)
# Best for: GPU inference with maximum speed
# Method: Layer-by-layer weight quantization with calibration data
# Format: Safetensors with quantization config
# Tools: AutoGPTQ, ExLlamaV2
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-GPTQ",  # 4-bit (INT4) checkpoint; bit width is read from its quantization config
    device_map="auto",
    use_safetensors=True,
)
```
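Loading a prebuilt checkpoint like the one above is the common path. To quantize your own fine-tuned model, a minimal sketch following AutoGPTQ's basic-usage flow looks like this; the model ID, output directory, and calibration text are placeholders, and real calibration data should come from your deployment domain.

```python
# Minimal sketch: producing a GPTQ checkpoint with AutoGPTQ (placeholder paths and texts).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_id)

quantize_config = BaseQuantizeConfig(
    bits=4,          # INT4 weights
    group_size=128,  # one scale/zero-point per 128 weights
    desc_act=False,  # faster inference, slightly lower quality than activation-order quantization
)

# Calibration examples: use a few hundred representative samples in practice.
examples = [tokenizer("Representative text drawn from the deployment domain.")]

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)                 # layer-by-layer quantization against the calibration set
model.save_quantized("llama-2-7b-gptq")  # reload later with from_quantized(...)
```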
```python
# AWQ: Activation-Aware Weight Quantization
# Best for: Balanced quality/speed, protects important weights
# Method: Identifies salient weights based on activation patterns
# Key insight: Not all weights are equally important
# Tools: AutoAWQ
from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-AWQ",
    fuse_layers=True,  # Kernel fusion for speed
)
```
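AWQ checkpoints are produced in much the same way. Below is a minimal sketch with AutoAWQ, where the model paths are placeholders and the quant_config values are the library's commonly used defaults rather than tuned recommendations.

```python
# Minimal sketch: producing an AWQ checkpoint with AutoAWQ (placeholder paths).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"  # placeholder base model
quant_path = "llama-2-7b-awq"            # placeholder output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)  # measures activations to find salient weights
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```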
```python
# GGUF: CPU/mixed inference format (llama.cpp)
# Best for: CPU inference, Apple Silicon, consumer hardware
# Method: Multiple quantization levels within the same format
# Variants: Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0
# Tools: llama.cpp, ollama
# Usage with llama-cpp-python
from llama_cpp import Llama

model = Llama(
    model_path="llama-2-7b.Q4_K_M.gguf",
    n_ctx=4096,       # Context window in tokens
    n_gpu_layers=35,  # Offload layers to GPU
)
```
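Once loaded, the model object is called directly for completions; the prompt and sampling parameters below are purely illustrative.

```python
# Illustrative completion call against the GGUF model loaded above.
output = model(
    "Q: Why quantize a language model? A:",
    max_tokens=64,    # cap generated length
    temperature=0.2,  # mostly deterministic output
    stop=["Q:"],      # stop before the model starts a new question
)
print(output["choices"][0]["text"])
```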
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Quantize without benchmarking | Accuracy impact is unknown | Evaluate on your specific task before and after (see the sketch after this table) |
| INT2 for production | Too much quality loss for most tasks | INT4 minimum for production, INT8 preferred |
| Quantize embedding layers | Disproportionate quality loss | Keep embeddings at higher precision |
| No calibration data | Poor quantization quality | Use representative dataset for calibration |
| Same quantization for all layers | Sensitive layers degrade quality | Mixed-precision: higher precision for sensitive layers |
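For the first anti-pattern, even a small before/after check beats none. Below is a minimal sketch comparing perplexity on held-out text, assuming the FP16 base model and a GPTQ INT4 checkpoint both load through Transformers (the GPTQ path needs optimum and auto-gptq installed); model IDs and texts are placeholders, and perplexity is only a proxy for your real task metric.

```python
# Minimal sketch: before/after perplexity check (placeholder model IDs and texts).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, texts):
    losses = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])  # causal LM loss = mean NLL per token
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

texts = ["Representative held-out text from your deployment domain..."]  # use a real eval set
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

fp16 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
int4 = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ", device_map="auto"  # GPTQ weights load via optimum + auto-gptq
)

print("FP16 perplexity:", perplexity(fp16, tok, texts))
print("INT4 perplexity:", perplexity(int4, tok, texts))
```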
Quantization is the single most impactful optimization for model deployment. Before scaling your GPU fleet, quantize first. The 4x cost reduction from INT4 quantization often exceeds the accuracy cost — and for most production use cases, users cannot tell the difference.