Model Compression Techniques
Deploy machine learning models efficiently on edge devices and in production. Covers quantization, pruning, knowledge distillation, and the patterns that can reduce model size by roughly 10x while retaining about 95% of the original accuracy.
A state-of-the-art language model can have 175 billion parameters, far too many for a mobile device, embedded system, or cost-effective API deployment. Model compression reduces model size and inference cost while preserving performance. The goal is not merely a smaller model; it is the same intelligence in less space.
Compression Techniques
Quantization:
Reduce numerical precision of weights and activations
Float32 (default): 32 bits per weight → 100% size
Float16 (half): 16 bits per weight → 50% size
INT8 (integer): 8 bits per weight → 25% size
INT4 (aggressive): 4 bits per weight → 12.5% size
Example: LLaMA-2 7B (7 billion parameters × 4 bytes/parameter in FP32)
FP32: 28 GB → FP16: 14 GB → INT8: 7 GB → INT4: 3.5 GB
Accuracy impact:
FP16: ~0% loss (standard practice)
INT8: <1% loss (good for most applications)
INT4: 2-5% loss (acceptable for many use cases)
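To make the mechanism concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization. The function names are illustrative rather than taken from any library; real quantizers typically use per-channel scales and calibration data.
import torch

def quantize_int8(weights: torch.Tensor):
    # Map [-max_abs, +max_abs] onto the signed 8-bit range [-127, 127]
    scale = weights.abs().max() / 127
    q = torch.clamp(torch.round(weights / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover approximate float weights for computation
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)        # one FP32 weight matrix: 64 MB
q, scale = quantize_int8(w)        # stored as INT8 (16 MB) plus a single scale
w_hat = dequantize_int8(q, scale)  # rounding error is at most ~scale/2 per weight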
Pruning:
Remove weights that contribute least to output
Unstructured: Remove individual weights (sparse matrix)
→ 90% of weights can be zeroed with <2% accuracy loss
→ Requires sparse matrix hardware for speedup
Structured: Remove entire neurons, channels, or layers
→ 50-70% reduction with <3% accuracy loss
→ Works on standard hardware
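As a sketch of unstructured magnitude pruning (the simplest criterion for "contributes least"), the following zeroes the smallest 90% of weights in one layer; the layer size and sparsity are illustrative, and torch.nn.utils.prune offers the same idea as a ready-made utility.
import torch

def magnitude_prune(weights: torch.Tensor, sparsity: float = 0.9) -> torch.Tensor:
    # Magnitude threshold below which `sparsity` fraction of the weights fall
    threshold = torch.quantile(weights.abs().flatten(), sparsity)
    # Zero everything below the threshold; surviving weights are unchanged
    return weights * (weights.abs() > threshold)

layer = torch.nn.Linear(1024, 1024)
with torch.no_grad():
    layer.weight.copy_(magnitude_prune(layer.weight, sparsity=0.9))

print((layer.weight == 0).float().mean())  # ≈ 0.90
Note that the dense tensor occupies the same memory after pruning; the size and speed win only materializes with sparse storage formats or sparse-aware hardware, which is why structured pruning is often preferred in practice.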
Knowledge Distillation:
Train a small "student" model to mimic a large "teacher"
Teacher (large): BERT-base (110M params)
Student (small): DistilBERT (66M params)
Result: 40% smaller, 60% faster, 97% of teacher's accuracy
Process:
1. Train teacher model on task
2. Generate teacher's predictions (soft labels)
3. Train student on both hard labels AND soft labels
4. Student learns the teacher's "reasoning", not just its answers (see the loss sketch below)
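A minimal sketch of the standard distillation loss with softened teacher targets; the temperature T and mixing weight alpha are hyperparameters you would tune, and the batch size and class count below are placeholders.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-label loss: KL divergence between temperature-softened teacher and student distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients stay comparable across temperatures
    # Hard-label loss: ordinary cross-entropy against the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example shapes: batch of 8, 10 classes
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()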
Quantization Implementation
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# Post-Training Quantization (PTQ)
# No retraining needed — just convert weights
# Load full-precision model
model_fp32 = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # ~28 GB of FP32 weights
# INT8 quantization (bitsandbytes, via BitsAndBytesConfig)
model_int8 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
# Result: 7 GB instead of 28 GB, runs on consumer GPU
# INT4 quantization (bitsandbytes NF4; GPTQ and AWQ are alternative INT4 methods)
model_int4 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_quant_type="nf4",  # NormalFloat4
    ),
    device_map="auto",
)
# Result: 3.5 GB, fits on laptop GPU
# Comparison:
# FP32: 28 GB, latency: 500ms/token, cost: $4/hour (A100)
# INT8: 7 GB, latency: 200ms/token, cost: $1/hour (T4)
# INT4: 3.5 GB, latency: 150ms/token, cost: $0.50/hour (T4)
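For completeness, a sketch of running the INT4 model; the prompt text and generation settings are arbitrary examples, not recommended values.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Model compression matters because", return_tensors="pt").to(model_int4.device)
outputs = model_int4.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))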
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Deploy full-precision in production | 4x higher cost, slower inference | INT8 quantization as baseline |
| Quantize without evaluation | Silent accuracy degradation | Benchmark on domain-specific test set |
| One compression technique only | Leaves performance on the table | Combine: distill + quantize + prune |
| Ignore calibration data | Quantized weights poorly aligned | Use representative calibration dataset |
| Same compression for all layers | Sensitive layers lose too much accuracy | Mixed-precision: keep sensitive layers at higher precision |
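As one concrete form of the mixed-precision fix, bitsandbytes lets you exclude named modules from quantization (the output head is a common choice); a sketch assuming the same Llama-2 checkpoint as above.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize most layers to INT8 but keep the listed modules in higher precision
mixed_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["lm_head"],  # modules left un-quantized
)
model_mixed = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=mixed_config,
    device_map="auto",
)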
Model compression is the bridge between research and production. A 10x smaller model that serves 10x more users at 10x lower cost is more impactful than a marginally better model that only runs on a $30,000 GPU.