Model Compression Techniques

Deploy machine learning models efficiently on edge devices and in production. Covers quantization, pruning, knowledge distillation, and the patterns that reduce model size by 10x while retaining 95% accuracy.

A state-of-the-art language model can be 175 billion parameters — far too large for a mobile device, embedded system, or cost-effective API deployment. Model compression reduces model size and inference cost while preserving performance. The goal is not a smaller model — it is the same intelligence in less space.


Compression Techniques

Quantization:
  Reduce numerical precision of weights and activations
  
  Float32 (default):  32 bits per weight  → 100% size
  Float16 (half):     16 bits per weight  → 50% size
  INT8 (integer):      8 bits per weight  → 25% size
  INT4 (aggressive):   4 bits per weight  → 12.5% size
  
  Example: LLaMA-2 7B
  FP32: 28 GB → FP16: 14 GB → INT8: 7 GB → INT4: 3.5 GB
  
  Accuracy impact:
  FP16: ~0% loss (standard practice)
  INT8: <1% loss (good for most applications)
  INT4: 2-5% loss (acceptable for many use cases)
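
To make the arithmetic above concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in PyTorch. The helper names are illustrative, not a library API; production toolchains (bitsandbytes, GPTQ, AWQ) use per-channel or block-wise scales to reduce error.

import torch

def quantize_int8(w: torch.Tensor):
    # Symmetric per-tensor quantization: map floats onto [-127, 127]
    # using a single scale derived from the largest magnitude.
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover approximate float weights for computation.
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)          # one full-precision weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

print(w.element_size(), "vs", q.element_size(), "bytes per weight")  # 4 vs 1: the 25% figure
print("max abs error:", (w - w_hat).abs().max().item())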

Pruning:
  Remove weights that contribute least to output
  
  Unstructured: Remove individual weights (sparse matrix)
    → 90% of weights can be zeroed with <2% accuracy loss
    → Requires sparse-aware kernels or hardware to realize a speedup
  
  Structured: Remove entire neurons, channels, or layers
    → 50-70% reduction with <3% accuracy loss
    → Works on standard hardware
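
Both flavors are a few lines with PyTorch's built-in pruning utilities. A minimal sketch; the 90% and 50% amounts echo the figures above rather than tuned values:

import torch.nn as nn
import torch.nn.utils.prune as prune

fc_sparse = nn.Linear(1024, 1024)
fc_slim = nn.Linear(1024, 1024)

# Unstructured: zero the 90% of weights with the smallest magnitude.
# Storage stays dense, so speedups need sparse-aware kernels or hardware.
prune.l1_unstructured(fc_sparse, name="weight", amount=0.9)

# Structured: zero the 50% of output channels (rows) with the smallest
# L2 norm; whole rows can then be physically removed, so the smaller
# layer runs faster on standard hardware.
prune.ln_structured(fc_slim, name="weight", amount=0.5, n=2, dim=0)

# Bake the masks into the weight tensors permanently.
prune.remove(fc_sparse, "weight")
prune.remove(fc_slim, "weight")

print(f"sparsity: {(fc_sparse.weight == 0).float().mean().item():.0%}")  # ~90%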

Knowledge Distillation:
  Train a small "student" model to mimic a large "teacher"
  
  Teacher (large):  BERT-Base (110M params)
  Student (small):  DistilBERT (66M params)
  Result: 40% smaller, 60% faster, 97% of teacher's accuracy
  
  Process:
  1. Train teacher model on task
  2. Generate teacher's predictions (soft labels)
  3. Train student on both hard labels AND soft labels
  4. Student learns teacher's "reasoning" not just answers
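
Step 3 usually combines the two signals as a weighted loss. A minimal sketch of the classic temperature-scaled distillation objective; the temperature T and mixing weight alpha are conventional defaults, not values from this article:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-label term: match the teacher's temperature-softened output
    # distribution, which encodes similarities between classes.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so soft gradients match the hard-label term
    # Hard-label term: ordinary cross-entropy against ground truth.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard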

Quantization Implementation

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Post-Training Quantization (PTQ)
# No retraining needed — just convert weights

# Load full-precision model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b")

# INT8 Quantization (bitsandbytes)
model_int8 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
# Result: 7 GB instead of 28 GB, runs on consumer GPU

# INT4 Quantization (bitsandbytes NF4, the QLoRA data type)
model_int4 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_quant_type="nf4",  # NormalFloat4
    ),
    device_map="auto",
)
# Result: 3.5 GB, fits on laptop GPU

# Approximate comparison (illustrative figures):
# FP32: 28 GB,  latency: 500ms/token,  cost: $4/hour (A100)
# INT8:  7 GB,  latency: 200ms/token,  cost: $1/hour (T4)
# INT4: 3.5 GB, latency: 150ms/token,  cost: $0.50/hour (T4)
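
The size figures above can be checked directly: Hugging Face models expose get_memory_footprint(), which sums the bytes of parameters and buffers. Assuming the three models above are loaded:

# Verify the footprints claimed in the comments above
for name, m in [("fp32", model), ("int8", model_int8), ("int4", model_int4)]:
    print(f"{name}: {m.get_memory_footprint() / 1e9:.1f} GB")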

Anti-Patterns

Anti-Pattern                          Consequence                               Fix
Deploy full-precision in production   4x higher cost, slower inference          INT8 quantization as baseline
Quantize without evaluation           Silent accuracy degradation               Benchmark on a domain-specific test set
One compression technique only        Performance left on the table             Combine: distill + quantize + prune
Ignore calibration data               Quantized weights poorly aligned          Use a representative calibration dataset
Same compression for all layers       Sensitive layers lose too much accuracy   Mixed precision: keep sensitive layers at higher precision
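
The second anti-pattern is the cheapest to avoid. A minimal sketch of a perplexity check against the models loaded earlier; held_out_texts is a placeholder for your own domain sample:

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b")

@torch.no_grad()
def perplexity(model, texts):
    # Exponentiated mean cross-entropy over a held-out sample.
    losses = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
        losses.append(model(ids, labels=ids).loss.item())
    return float(torch.exp(torch.tensor(losses).mean()))

# Ship INT8 only if the gap fits your accuracy budget:
# perplexity(model, held_out_texts) vs. perplexity(model_int8, held_out_texts)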

Model compression is the bridge between research and production. A 10x smaller model that serves 10x more users at 10x lower cost is more impactful than a marginally better model that only runs on a $30,000 GPU.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
