AI Model Quantization
Reduce model size and inference cost through quantization. Covers INT8, INT4, and mixed-precision quantization, post-training vs. quantization-aware training, the GGUF format, and the patterns that shrink models by 4x with minimal accuracy loss.
Quantization reduces model precision from 32-bit or 16-bit floating point to 8-bit or 4-bit integers. A 7B parameter model in FP16 is 14 GB. In INT4, it is 3.5 GB. This 4x reduction enables deploying large models on consumer GPUs, mobile devices, and cost-effective cloud instances — opening production deployment to models that would otherwise require expensive hardware.
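The fastest way to see the tradeoff in practice is load-time quantization. The sketch below assumes the Transformers + bitsandbytes stack and a placeholder Hugging Face model ID; it loads a 7B model with 4-bit NF4 weights, and the exact footprint varies slightly by architecture.

```python
# Minimal sketch: on-the-fly 4-bit quantization at load time (Transformers + bitsandbytes).
# The model ID is a placeholder; any causal LM on the Hub works the same way.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 generally beats plain INT4 for LLM weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for the actual matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model ID
    quantization_config=bnb_config,
    device_map="auto",
)
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")  # roughly a quarter of the FP16 footprint
```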
Quantization Levels
Precision vs. quality vs. size for a 7B-parameter model (a quick size calculation follows the table):

| Precision | Size | Quality | Speed | Typical use |
|---|---|---|---|---|
| FP32 (32-bit float) | 28 GB | Baseline (100%) | 1.0x | Training only |
| FP16 / BF16 (16-bit float) | 14 GB | ~99.5% of FP32 | ~2x | Standard inference |
| INT8 (8-bit integer) | 7 GB | ~99% of FP16 | ~2-3x | Production inference (good balance) |
| INT4 (4-bit integer) | 3.5 GB | ~95-98% of FP16 | ~3-4x | Edge deployment, cost-sensitive production |
| INT2 (2-bit integer) | 1.75 GB | ~85-90% of FP16 (significant degradation) | ~4-5x | Experimental, extreme resource constraints |
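The size column is just parameter count times bits per weight; a quick sketch of that arithmetic, ignoring the KV cache, activations, and quantization metadata such as per-group scales, which add modest overhead:

```python
# Back-of-the-envelope weight storage for a 7B-parameter model at each precision.
def weight_size_gb(n_params: float, bits_per_weight: int) -> float:
    return n_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    print(f"{name}: {weight_size_gb(7e9, bits):.2f} GB")
# FP32: 28.00 GB, FP16: 14.00 GB, INT8: 7.00 GB, INT4: 3.50 GB, INT2: 1.75 GB
```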
GPTQ vs. AWQ vs. GGUF
```python
# GPTQ: Post-Training Quantization (GPU focused)
# Best for: GPU inference with maximum speed
# Method: Layer-by-layer weight quantization with calibration data
# Format: Safetensors with quantization config
# Tools: AutoGPTQ, ExLlamaV2
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-GPTQ",  # 4-bit (INT4) checkpoint; bit width is read from its quantization config
    device_map="auto",
    use_safetensors=True,
)
```
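Loading a prebuilt checkpoint like the one above is the common path. To quantize your own fine-tuned model, a minimal sketch following AutoGPTQ's basic-usage flow looks like this; the model ID, output directory, and calibration text are placeholders, and real calibration data should come from your deployment domain.

```python
# Minimal sketch: producing a GPTQ checkpoint with AutoGPTQ (placeholder paths and texts).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_id)

quantize_config = BaseQuantizeConfig(
    bits=4,          # INT4 weights
    group_size=128,  # one scale/zero-point per 128 weights
    desc_act=False,  # faster inference, slightly lower quality than activation-order quantization
)

# Calibration examples: use a few hundred representative samples in practice.
examples = [tokenizer("Representative text drawn from the deployment domain.")]

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)                 # layer-by-layer quantization against the calibration set
model.save_quantized("llama-2-7b-gptq")  # reload later with from_quantized(...)
```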
```python
# AWQ: Activation-Aware Weight Quantization
# Best for: Balanced quality/speed, protects important weights
# Method: Identifies salient weights based on activation patterns
# Key insight: Not all weights are equally important
# Tools: AutoAWQ
from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-AWQ",
    fuse_layers=True,  # Kernel fusion for speed
)
```
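AWQ checkpoints are produced in much the same way. Below is a minimal sketch with AutoAWQ, where the model paths are placeholders and the quant_config values are the library's commonly used defaults rather than tuned recommendations.

```python
# Minimal sketch: producing an AWQ checkpoint with AutoAWQ (placeholder paths).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"  # placeholder base model
quant_path = "llama-2-7b-awq"            # placeholder output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)  # measures activations to find salient weights
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```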
```python
# GGUF: CPU/mixed inference format (llama.cpp)
# Best for: CPU inference, Apple Silicon, consumer hardware
# Method: Multiple quantization levels within the same format
# Variants: Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0
# Tools: llama.cpp, ollama
# Usage with llama-cpp-python
from llama_cpp import Llama

model = Llama(
    model_path="llama-2-7b.Q4_K_M.gguf",
    n_ctx=4096,       # Context window in tokens
    n_gpu_layers=35,  # Offload layers to GPU
)
```
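Once loaded, the model object is called directly for completions; the prompt and sampling parameters below are purely illustrative.

```python
# Illustrative completion call against the GGUF model loaded above.
output = model(
    "Q: Why quantize a language model? A:",
    max_tokens=64,    # cap generated length
    temperature=0.2,  # mostly deterministic output
    stop=["Q:"],      # stop before the model starts a new question
)
print(output["choices"][0]["text"])
```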
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Quantize without benchmarking | Accuracy impact is unknown | Evaluate on your specific task before and after (see the sketch after this table) |
| INT2 for production | Too much quality loss for most tasks | INT4 minimum for production, INT8 preferred |
| Quantize embedding layers | Disproportionate quality loss | Keep embeddings at higher precision |
| No calibration data | Poor quantization quality | Use representative dataset for calibration |
| Same quantization for all layers | Sensitive layers degrade quality | Mixed-precision: higher precision for sensitive layers |
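For the first anti-pattern, even a small before/after check beats none. Below is a minimal sketch comparing perplexity on held-out text, assuming the FP16 base model and a GPTQ INT4 checkpoint both load through Transformers (the GPTQ path needs optimum and auto-gptq installed); model IDs and texts are placeholders, and perplexity is only a proxy for your real task metric.

```python
# Minimal sketch: before/after perplexity check (placeholder model IDs and texts).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, texts):
    losses = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])  # causal LM loss = mean NLL per token
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

texts = ["Representative held-out text from your deployment domain..."]  # use a real eval set
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

fp16 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
int4 = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ", device_map="auto"  # GPTQ weights load via optimum + auto-gptq
)

print("FP16 perplexity:", perplexity(fp16, tok, texts))
print("INT4 perplexity:", perplexity(int4, tok, texts))
```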
Quantization is the single most impactful optimization for model deployment. Before scaling your GPU fleet, quantize first. The 4x cost reduction from INT4 quantization often exceeds the accuracy cost — and for most production use cases, users cannot tell the difference.