Model Compression Techniques
Deploy machine learning models efficiently on edge devices and in production. Covers quantization, pruning, knowledge distillation, and the patterns that can reduce model size by roughly 10x while retaining about 95% of the original accuracy.
A state-of-the-art language model can have 175 billion parameters, far too many for a mobile device, embedded system, or cost-effective API deployment. Model compression reduces model size and inference cost while preserving performance. The goal is not merely a smaller model; it is the same intelligence in less space.
Compression Techniques
Quantization:
Reduce numerical precision of weights and activations
Float32 (default): 32 bits per weight → 100% size
Float16 (half): 16 bits per weight → 50% size
INT8 (integer): 8 bits per weight → 25% size
INT4 (aggressive): 4 bits per weight → 12.5% size
Example: LLaMA-2 7B (7 billion parameters × 4 bytes/parameter in FP32)
FP32: 28 GB → FP16: 14 GB → INT8: 7 GB → INT4: 3.5 GB
Accuracy impact:
FP16: ~0% loss (standard practice)
INT8: <1% loss (good for most applications)
INT4: 2-5% loss (acceptable for many use cases)
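To make the mechanism concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization. The function names are illustrative rather than taken from any library; real quantizers typically use per-channel scales and calibration data.
import torch

def quantize_int8(weights: torch.Tensor):
    # Map [-max_abs, +max_abs] onto the signed 8-bit range [-127, 127]
    scale = weights.abs().max() / 127
    q = torch.clamp(torch.round(weights / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover approximate float weights for computation
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)        # one FP32 weight matrix: 64 MB
q, scale = quantize_int8(w)        # stored as INT8 (16 MB) plus a single scale
w_hat = dequantize_int8(q, scale)  # rounding error is at most ~scale/2 per weight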
Pruning:
Remove weights that contribute least to output
Unstructured: Remove individual weights (sparse matrix)
→ 90% of weights can be zeroed with <2% accuracy loss
→ Requires sparse matrix hardware for speedup
Structured: Remove entire neurons, channels, or layers
→ 50-70% reduction with <3% accuracy loss
→ Works on standard hardware
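As a sketch of unstructured magnitude pruning (the simplest criterion for "contributes least"), the following zeroes the smallest 90% of weights in one layer; the layer size and sparsity are illustrative, and torch.nn.utils.prune offers the same idea as a ready-made utility.
import torch

def magnitude_prune(weights: torch.Tensor, sparsity: float = 0.9) -> torch.Tensor:
    # Magnitude threshold below which `sparsity` fraction of the weights fall
    threshold = torch.quantile(weights.abs().flatten(), sparsity)
    # Zero everything below the threshold; surviving weights are unchanged
    return weights * (weights.abs() > threshold)

layer = torch.nn.Linear(1024, 1024)
with torch.no_grad():
    layer.weight.copy_(magnitude_prune(layer.weight, sparsity=0.9))

print((layer.weight == 0).float().mean())  # ≈ 0.90
Note that the dense tensor occupies the same memory after pruning; the size and speed win only materializes with sparse storage formats or sparse-aware hardware, which is why structured pruning is often preferred in practice.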
Knowledge Distillation:
Train a small "student" model to mimic a large "teacher"
Teacher (large): BERT-base (110M params)
Student (small): DistilBERT (66M params)
Result: 40% smaller, 60% faster, 97% of teacher's accuracy
Process:
1. Train teacher model on task
2. Generate teacher's predictions (soft labels)
3. Train student on both hard labels AND soft labels
4. Student learns the teacher's "reasoning", not just its answers (see the loss sketch below)
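A minimal sketch of the standard distillation loss with softened teacher targets; the temperature T and mixing weight alpha are hyperparameters you would tune, and the batch size and class count below are placeholders.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-label loss: KL divergence between temperature-softened teacher and student distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients stay comparable across temperatures
    # Hard-label loss: ordinary cross-entropy against the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example shapes: batch of 8, 10 classes
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()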
Quantization Implementation
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# Post-Training Quantization (PTQ)
# No retraining needed — just convert weights
# Load full-precision model
model_fp32 = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # ~28 GB of FP32 weights
# INT8 quantization (bitsandbytes, via BitsAndBytesConfig)
model_int8 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
# Result: 7 GB instead of 28 GB, runs on consumer GPU
# INT4 quantization (bitsandbytes NF4; GPTQ and AWQ are alternative INT4 methods)
model_int4 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_quant_type="nf4",  # NormalFloat4
    ),
    device_map="auto",
)
# Result: 3.5 GB, fits on laptop GPU
# Comparison:
# FP32: 28 GB, latency: 500ms/token, cost: $4/hour (A100)
# INT8: 7 GB, latency: 200ms/token, cost: $1/hour (T4)
# INT4: 3.5 GB, latency: 150ms/token, cost: $0.50/hour (T4)
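For completeness, a sketch of running the INT4 model; the prompt text and generation settings are arbitrary examples, not recommended values.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Model compression matters because", return_tensors="pt").to(model_int4.device)
outputs = model_int4.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))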
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Deploy full-precision in production | 4x higher cost, slower inference | INT8 quantization as baseline |
| Quantize without evaluation | Silent accuracy degradation | Benchmark on domain-specific test set |
| One compression technique only | Leaves performance on the table | Combine: distill + quantize + prune |
| Ignore calibration data | Quantized weights poorly aligned | Use representative calibration dataset |
| Same compression for all layers | Sensitive layers lose too much accuracy | Mixed-precision: keep sensitive layers at higher precision |
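As one concrete form of the mixed-precision fix, bitsandbytes lets you exclude named modules from quantization (the output head is a common choice); a sketch assuming the same Llama-2 checkpoint as above.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize most layers to INT8 but keep the listed modules in higher precision
mixed_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["lm_head"],  # modules left un-quantized
)
model_mixed = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=mixed_config,
    device_map="auto",
)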
Model compression is the bridge between research and production. A 10x smaller model that serves 10x more users at 10x lower cost is more impactful than a marginally better model that only runs on a $30,000 GPU.