
AI Model Quantization

Reduce model size and inference cost through quantization. Covers INT8, INT4, and mixed-precision quantization; post-training vs. quantization-aware training; the GPTQ, AWQ, and GGUF formats; and the patterns that shrink models by 4x with minimal accuracy loss.

Quantization reduces model precision from 32-bit or 16-bit floating point to 8-bit or 4-bit integers. A 7B parameter model in FP16 is 14 GB. In INT4, it is 3.5 GB. This 4x reduction enables deploying large models on consumer GPUs, mobile devices, and cost-effective cloud instances — opening production deployment to models that would otherwise require expensive hardware.
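
The size figures follow directly from parameter count times bits per weight. A quick back-of-the-envelope check in Python (weights only; real files add some overhead for metadata and any non-quantized tensors):

params = 7_000_000_000  # 7B parameters
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {params * bits / 8 / 1e9:.1f} GB")  # 28.0, 14.0, 7.0, 3.5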


Quantization Levels

Precision vs. Quality vs. Size (for a 7B model):

FP32 (32-bit float):
  Size: 28 GB
  Quality: Baseline (100%)
  Speed: 1.0x
  Use: Training only

FP16 / BF16 (16-bit):
  Size: 14 GB
  Quality: ~99.5% of FP32
  Speed: ~2x
  Use: Standard inference

INT8 (8-bit integer):
  Size: 7 GB
  Quality: ~99% of FP16
  Speed: ~2-3x
  Use: Production inference (good balance)

INT4 (4-bit integer):
  Size: 3.5 GB
  Quality: ~95-98% of FP16
  Speed: ~3-4x
  Use: Edge deployment, cost-sensitive production

INT2 (2-bit integer):
  Size: 1.75 GB
  Quality: ~85-90% of FP16 (significant degradation)
  Speed: ~4-5x
  Use: Experimental, extreme resource constraints
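
Every level above relies on the same move: map floating-point values onto a small integer grid and keep the scale needed to map them back. A minimal sketch of symmetric per-tensor INT8 quantization, purely illustrative (production quantizers such as GPTQ and AWQ work per-channel or per-group and use calibration data):

import numpy as np

def quantize_int8(w: np.ndarray):
    # Scale so the largest magnitude maps to the integer 127
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # The rounding error introduced here is the accuracy cost of quantization
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)
q, scale = quantize_int8(w)
mean_error = np.abs(w - dequantize(q, scale)).mean()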

GPTQ vs. AWQ vs. GGUF

# GPTQ: Post-Training Quantization (GPU focused)
# Best for: GPU inference with maximum speed
# Method: Layer-by-layer weight quantization with calibration data
# Format: Safetensors with quantization config
# Tools: AutoGPTQ, ExLlamaV2

from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-GPTQ",  # pre-quantized 4-bit (INT4) GPTQ checkpoint
    device_map="auto",
    use_safetensors=True,
)
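
# Loading a pre-quantized checkpoint (above) is the usual path. You can also run
# GPTQ yourself on an FP16 model with your own calibration data. A minimal sketch
# with AutoGPTQ; the base model name and calibration text are placeholders:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base = "meta-llama/Llama-2-7b-hf"  # placeholder: any FP16 causal LM
tokenizer = AutoTokenizer.from_pretrained(base)

# Calibration examples should look like your production traffic
examples = [tokenizer("Replace this with text drawn from your actual workload.")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(base, quantize_config)
model.quantize(examples)
model.save_quantized("llama-2-7b-gptq-4bit")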

# AWQ: Activation-Aware Weight Quantization
# Best for: Balanced quality/speed, protects important weights
# Method: Identifies salient weights based on activation patterns
# Key insight: Not all weights are equally important
# Tools: AutoAWQ

from awq import AutoAWQForCausalLM
model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-AWQ",
    fuse_layers=True,  # Kernel fusion for speed
)

# GGUF: CPU/mixed inference format (llama.cpp)
# Best for: CPU inference, Apple Silicon, consumer hardware
# Method: Multiple quantization levels within same format
# Variants: Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0
# Tools: llama.cpp, ollama

# Usage with llama-cpp-python
from llama_cpp import Llama
model = Llama(
    model_path="llama-2-7b.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=35,  # Offload layers to GPU
)
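
# Once loaded, the GGUF model is called like any other completion API.
# Brief usage sketch; the prompt and sampling parameters are illustrative:
output = model(
    "Q: What does the Q4_K_M suffix mean? A:",
    max_tokens=128,
    temperature=0.2,
)
print(output["choices"][0]["text"])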

Anti-Patterns

Quantize without benchmarking:
  Consequence: You don't know the accuracy impact
  Fix: Evaluate on your specific task before and after (see the sketch below)

INT2 for production:
  Consequence: Too much quality loss for most tasks
  Fix: INT4 minimum for production, INT8 preferred

Quantize embedding layers:
  Consequence: Disproportionate quality loss
  Fix: Keep embeddings at higher precision

No calibration data:
  Consequence: Poor quantization quality
  Fix: Use a representative dataset for calibration

Same quantization for all layers:
  Consequence: Sensitive layers degrade quality
  Fix: Use mixed precision, keeping sensitive layers at higher precision
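
The first fix above is worth making concrete. A minimal before/after evaluation sketch; baseline_generate, quantized_generate, and eval_set are hypothetical stand-ins for your own FP16 wrapper, your quantized wrapper, and a list of (prompt, expected) pairs from your task:

def exact_match_rate(generate, eval_set):
    # Fraction of prompts where the model's answer matches the expected string
    hits = sum(generate(prompt).strip() == expected.strip() for prompt, expected in eval_set)
    return hits / len(eval_set)

# baseline_generate / quantized_generate / eval_set are hypothetical placeholders
fp16_score = exact_match_rate(baseline_generate, eval_set)
int4_score = exact_match_rate(quantized_generate, eval_set)
print(f"FP16: {fp16_score:.1%}  INT4: {int4_score:.1%}  drop: {fp16_score - int4_score:.1%}")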

Quantization is the single most impactful optimization for model deployment. Before scaling your GPU fleet, quantize first. The 4x cost reduction from INT4 quantization usually outweighs the small accuracy loss, and for most production use cases users cannot tell the difference.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
