AI Cost Optimization: GPU vs API vs Edge
Optimize AI infrastructure costs across GPU, API, and edge deployments. Covers cost modeling, deployment architectures, model quantization, batch optimization, and build-vs-buy analysis.
AI infrastructure costs catch organizations off guard. A prototype that costs $50/month at 100 queries/day becomes $15,000/month at 10,000 queries/day — and that’s before you account for GPU idle time, egress charges, and the engineering hours spent managing infrastructure. The difference between a well-optimized AI deployment and a naive one can be 10-50x in monthly spend.
This guide covers the full cost optimization landscape: when to use API providers vs self-hosted GPUs vs edge devices, how to model costs accurately before scaling, and the engineering techniques that reduce costs without sacrificing quality.
The Three Deployment Models
| Model | Best For | Monthly Cost Range | Latency | Control |
|---|---|---|---|---|
| API (hosted) | Prototyping, variable load, state-of-the-art models | $0.50-50K (usage-based) | 100-3000ms | Low |
| GPU (self-hosted) | High volume, data sovereignty, custom models | $2K-100K (fixed infra) | 10-500ms | Full |
| Edge | Real-time, offline, privacy-sensitive | $0-5K (hardware) | 1-50ms | Full |
Cost Modeling Framework
API Cost Calculator
def calculate_api_cost(
model: str,
daily_requests: int,
avg_input_tokens: int,
avg_output_tokens: int,
days: int = 30,
):
"""Calculate monthly API costs for LLM inference."""
PRICING = {
# Per 1M tokens: (input, output)
"gpt-4o": (2.50, 10.00),
"gpt-4o-mini": (0.15, 0.60),
"claude-3.5-sonnet": (3.00, 15.00),
"claude-3.5-haiku": (0.80, 4.00),
"gemini-2.0-flash": (0.10, 0.40),
"gemini-2.0-pro": (1.25, 5.00),
"llama-3.1-405b-fireworks": (3.00, 3.00),
"llama-3.1-70b-fireworks": (0.90, 0.90),
"mistral-large": (2.00, 6.00),
}
input_price, output_price = PRICING.get(model, (5.00, 15.00))
total_requests = daily_requests * days
total_input_tokens = total_requests * avg_input_tokens
total_output_tokens = total_requests * avg_output_tokens
input_cost = (total_input_tokens / 1_000_000) * input_price
output_cost = (total_output_tokens / 1_000_000) * output_price
total_cost = input_cost + output_cost
return {
"model": model,
"total_requests": total_requests,
"input_cost": round(input_cost, 2),
"output_cost": round(output_cost, 2),
"total_monthly_cost": round(total_cost, 2),
"cost_per_request": round(total_cost / total_requests, 6),
"cost_per_1k_requests": round(total_cost / total_requests * 1000, 2),
}
# Example: Compare models at scale
for model in ["gpt-4o", "gpt-4o-mini", "claude-3.5-haiku", "gemini-2.0-flash"]:
result = calculate_api_cost(
model=model,
daily_requests=10_000,
avg_input_tokens=800,
avg_output_tokens=400,
)
print(f"{model}: ${result['total_monthly_cost']}/mo "
f"(${result['cost_per_1k_requests']}/1K req)")
Output at 10K requests/day:
| Model | Monthly Cost | Per 1K Requests | Quality Tier |
|---|---|---|---|
| GPT-4o | $1,800 | $6.00 | Premium |
| GPT-4o-mini | $108 | $0.36 | Good |
| Claude 3.5 Haiku | $672 | $2.24 | Good |
| Gemini 2.0 Flash | $72 | $0.24 | Good |
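Any row of the table can be verified by hand from the `PRICING` rates — a worthwhile sanity check before trusting a cost model. Recomputing the Gemini 2.0 Flash row:

```python
# One table row recomputed from first principles (gemini-2.0-flash rates)
requests = 10_000 * 30                        # 300K requests/month
input_cost = requests * 800 / 1e6 * 0.10      # 240M input tokens at $0.10/1M
output_cost = requests * 400 / 1e6 * 0.40     # 120M output tokens at $0.40/1M
monthly = round(input_cost + output_cost, 2)
print(monthly)  # 72.0
```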
GPU Self-Hosting Cost Model
def calculate_gpu_cost(
    gpu_type: str,
    num_gpus: int,
    utilization: float = 0.6,
    provider: str = "aws",
):
    """Calculate monthly GPU hosting costs."""
    GPU_HOURLY = {
        # (on-demand, reserved_1yr, spot)
        "a100-80gb": {"aws": (32.77, 20.90, 9.83), "gcp": (29.62, 18.77, 8.89)},
        "h100-80gb": {"aws": (65.00, 41.50, 19.50), "gcp": (52.40, 33.02, 15.72)},
        "a10g": {"aws": (10.68, 7.02, 3.21), "gcp": (8.40, 5.50, 2.52)},
        "l4": {"aws": (7.35, 4.92, 2.21), "gcp": (5.67, 3.69, 1.70)},
        "t4": {"aws": (3.91, 2.56, 1.17), "gcp": (2.93, 1.91, 0.88)},
    }
    on_demand, reserved, spot = GPU_HOURLY[gpu_type][provider]
    hours_per_month = 730  # 24 * 365 / 12
    return {
        "gpu_type": gpu_type,
        "num_gpus": num_gpus,
        "provider": provider,
        "monthly_on_demand": round(on_demand * hours_per_month * num_gpus, 2),
        "monthly_reserved": round(reserved * hours_per_month * num_gpus, 2),
        "monthly_spot": round(spot * hours_per_month * num_gpus, 2),
        # Idle time makes each useful hour cost more than the sticker rate
        "effective_hourly_at_utilization": round(on_demand / utilization, 2),
        "annual_savings_reserved": round(
            (on_demand - reserved) * hours_per_month * num_gpus * 12, 2
        ),
    }
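As a quick check on the rates above, the monthly arithmetic for a single a100-80gb on AWS reduces to hourly rate × 730 hours:

```python
HOURS_PER_MONTH = 730                # 24 * 365 / 12
on_demand, reserved = 32.77, 20.90   # a100-80gb AWS hourly rates from the table
monthly_od = round(on_demand * HOURS_PER_MONTH, 2)
monthly_res = round(reserved * HOURS_PER_MONTH, 2)
print(monthly_od, monthly_res)  # 23922.1 15257.0
```

Reserving for one year cuts roughly a third off the on-demand bill at this rate, which is the gap `annual_savings_reserved` captures.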
Break-Even Analysis
import math

def api_vs_gpu_breakeven(
    model: str,
    gpu_type: str,
    throughput_per_gpu: int,  # requests/hour per GPU
    avg_input_tokens: int,
    avg_output_tokens: int,
):
    """Find the daily request volume where self-hosting becomes cheaper."""
    for daily_requests in range(100, 500_001, 100):
        api_cost = calculate_api_cost(model, daily_requests, avg_input_tokens, avg_output_tokens)
        # How many GPUs needed? (assumes uniform load across the day)
        hourly_requests = daily_requests / 24
        gpus_needed = max(1, math.ceil(hourly_requests / throughput_per_gpu))
        gpu_cost = calculate_gpu_cost(gpu_type, gpus_needed)
        # Add engineering overhead for self-hosting ($15K/month for an infra engineer)
        self_host_total = gpu_cost["monthly_reserved"] + 15_000
        if api_cost["total_monthly_cost"] > self_host_total:
            return {
                "breakeven_daily_requests": daily_requests,
                "api_cost_at_breakeven": api_cost["total_monthly_cost"],
                "gpu_cost_at_breakeven": self_host_total,
                "gpus_needed": gpus_needed,
            }
    return {"breakeven": "API is cheaper at all tested volumes"}
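When self-hosting cost can be treated as roughly flat (fixed GPU fleet plus staffing), the crossover volume also has a closed form. A back-of-envelope sketch (`simple_breakeven` is illustrative, not part of the framework above):

```python
def simple_breakeven(cost_per_request: float, fixed_monthly: float, days: int = 30) -> int:
    """Daily volume where usage-based API spend equals a flat self-hosting bill."""
    return int(fixed_monthly / (cost_per_request * days))

# $0.006/request (GPT-4o at 800 in / 400 out tokens) vs $20K/month fixed:
print(simple_breakeven(0.006, 20_000))  # 111111 requests/day
```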
Cost Optimization Techniques
1. Model Routing (Cascading)
Route requests to cheaper models when possible, escalating to expensive models only when needed:
class ModelRouter:
def __init__(self):
self.models = [
{"name": "gemini-2.0-flash", "cost_tier": "cheap", "quality": "good"},
{"name": "gpt-4o-mini", "cost_tier": "medium", "quality": "better"},
{"name": "gpt-4o", "cost_tier": "expensive", "quality": "best"},
]
def route(self, query, complexity_score):
"""Route based on query complexity."""
if complexity_score < 0.3:
return self.models[0] # Simple queries → cheapest model
elif complexity_score < 0.7:
return self.models[1] # Medium queries → mid-tier
else:
return self.models[2] # Complex queries → premium
    def estimate_complexity(self, query, classifier):
        """Classify query complexity (0.0-1.0) using a cheap classifier model."""
        prompt = (
            "Rate this query's complexity from 0.0 to 1.0. "
            f"Reply with only the number: {query}"
        )
        raw = classifier.generate(prompt, max_tokens=5)
        try:
            return min(max(float(raw.strip()), 0.0), 1.0)  # clamp to valid range
        except (ValueError, TypeError):
            return 1.0  # Unparseable reply — route conservatively to premium
Impact: 40-60% cost reduction with minimal quality loss for mixed workloads.
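Calling an LLM just to score complexity adds its own cost and latency. A deterministic heuristic is often good enough as a first pass — a sketch, where the keyword list is illustrative rather than tuned:

```python
def heuristic_complexity(query: str) -> float:
    """Score 0.0-1.0 from length and reasoning keywords — no API call needed."""
    score = min(len(query.split()) / 200, 0.5)  # longer queries score higher
    markers = ("why", "compare", "analyze", "explain", "step by step")
    if any(m in query.lower() for m in markers):
        score += 0.4
    return min(score, 1.0)

print(heuristic_complexity("What time is it?"))  # low score → cheapest model
```

A common pattern is to use the heuristic for the obvious cases and fall back to an LLM classifier only for scores near the routing thresholds.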
2. Prompt Caching
import hashlib
import time

class PromptCache:
    def __init__(self, ttl_seconds=3600):
        self.cache = {}
        self.ttl = ttl_seconds

    def get_or_generate(self, prompt, model, temperature=0):
        """Cache responses for deterministic prompts; bypass cache otherwise."""
        if temperature > 0:
            # Non-deterministic output — generate fresh, never cache
            return model.generate(prompt, temperature=temperature)
        cache_key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
        entry = self.cache.get(cache_key)
        if entry is not None and time.time() - entry["timestamp"] < self.ttl:
            return entry["response"]  # Cache hit: zero API cost
        response = model.generate(prompt, temperature=0)
        self.cache[cache_key] = {"response": response, "timestamp": time.time()}
        return response
Impact: 20-40% cost reduction for applications with repeated similar queries.
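Once you can measure the hit rate, the savings are simple arithmetic — a sketch of the projection, not a library API:

```python
def cache_savings(monthly_api_cost: float, hit_rate: float) -> float:
    """Dollars avoided per month when hit_rate of requests are served from cache."""
    return round(monthly_api_cost * hit_rate, 2)

print(cache_savings(1_800.0, 0.30))  # 540.0 — GPT-4o example workload, 30% hit rate
```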
3. Model Quantization
Reduce model size and inference cost by lowering precision:
| Precision | Model Size | Speed | Quality Loss | Best For |
|---|---|---|---|---|
| FP32 (full) | 100% | 1x | None | Research, training |
| FP16 (half) | 50% | 1.5-2x | Negligible | Standard inference |
| INT8 | 25% | 2-3x | Minimal (< 1%) | Production inference |
| INT4 (GPTQ/AWQ) | 12.5% | 3-4x | Small (1-3%) | Cost-sensitive, edge |
| GGUF (2-bit) | 6% | 4-5x | Moderate (3-8%) | Edge devices, mobile |
# Load a pre-quantized model with AutoGPTQ
from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_quantized(
"TheBloke/Llama-2-70B-GPTQ",
model_basename="model",
use_safetensors=True,
trust_remote_code=True,
device="cuda:0",
quantize_config=None,
)
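The size column above translates directly into VRAM requirements, which is often the deciding factor in GPU selection. A rough sketch — the 1.2 overhead factor for activations and KV cache is an assumption, and real usage varies with context length and batch size:

```python
def model_memory_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Approximate VRAM to serve a model: params × bytes/param × overhead."""
    return round(params_billion * (bits / 8) * overhead, 1)

print(model_memory_gb(70, 16))  # 168.0 — FP16 70B needs multiple GPUs
print(model_memory_gb(70, 4))   # 42.0 — INT4 70B fits a single 80GB card
```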
4. Batch Processing
import asyncio

async def batch_inference(requests, batch_size=20, max_concurrent=5):
    """Process requests in concurrent batches, bounding batches in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_batch(batch):
        async with semaphore:  # at most max_concurrent batches at once
            return await asyncio.gather(*[process_request(req) for req in batch])

    batches = [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]
    nested = await asyncio.gather(*[run_batch(b) for b in batches])
    return [result for batch in nested for result in batch]
# For non-time-sensitive workloads, batch and use cheaper models
# 10K emails to summarize? Batch at 100, use GPT-4o-mini → 90% cheaper than real-time GPT-4o
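The "90% cheaper" figure above checks out against the `PRICING` rates, assuming 800 input / 400 output tokens per email:

```python
# Cost per email at each model's per-1M-token rates
gpt4o_each = 800 / 1e6 * 2.50 + 400 / 1e6 * 10.00  # $0.006
mini_each = 800 / 1e6 * 0.15 + 400 / 1e6 * 0.60    # $0.00036
batch_10k_gpt4o = gpt4o_each * 10_000               # $60.00
batch_10k_mini = mini_each * 10_000                 # $3.60
savings = 1 - batch_10k_mini / batch_10k_gpt4o
print(f"{savings:.0%}")  # 94%
```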
Decision Framework
Start Here
    │
< 1,000 req/day? ──── Yes → API
    │ No
Data residency required? ──── Yes → Self-host
    │ No
> 50K req/day? ──── Yes → Self-host (H100/A100)
    │ No
Latency < 50ms required? ──── Yes → Edge (quantized)
    │ No
API with model routing + caching
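The tree translates into a few lines of code, useful as a first-pass default in planning tools. A sketch — it encodes the thresholds verbatim and is no substitute for a proper break-even analysis:

```python
def choose_deployment(daily_requests: int, data_residency: bool,
                      max_latency_ms: float) -> str:
    """First-pass deployment choice following the decision tree."""
    if daily_requests < 1_000:
        return "API"
    if data_residency:
        return "Self-host"
    if daily_requests > 50_000:
        return "Self-host (H100/A100)"
    if max_latency_ms < 50:
        return "Edge (quantized)"
    return "API with model routing + caching"

print(choose_deployment(10_000, False, 500))  # API with model routing + caching
```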
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| One model for everything | Paying premium prices for simple tasks | Implement model routing by complexity |
| No cost monitoring | Spending spikes go unnoticed until the bill arrives | Set up daily cost alerts and per-user budgets |
| Premature self-hosting | Building GPU infra at 500 req/day to “save money” | Stay on APIs until break-even analysis proves self-hosting ROI |
| Ignoring engineering costs | GPU cost is cheap, but the ML engineer maintaining it costs $15K/month | Include staffing in total cost of ownership |
| No caching | Generating identical responses repeatedly | Cache deterministic responses (temperature=0) |
| Over-provisioning GPUs | Running H100s at 20% utilization | Right-size GPU selection, use auto-scaling, consider spot instances |
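For the "no cost monitoring" row, a minimal spike detector is only a few lines. A sketch — a real deployment would pull spend from the provider's billing API and page on-call:

```python
def spend_alert(today_usd: float, trailing_7d_avg_usd: float,
                multiplier: float = 2.0) -> bool:
    """Fire when today's spend exceeds multiplier × the trailing 7-day average."""
    return today_usd > multiplier * trailing_7d_avg_usd

print(spend_alert(300.0, 100.0))  # True — 3x the trailing average
print(spend_alert(150.0, 100.0))  # False — within normal variance
```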
AI Cost Optimization Checklist
- Cost model built with per-request and monthly projections
- API vs GPU break-even analysis completed at projected volume
- Model routing implemented (cheap model for simple, premium for complex)
- Prompt caching deployed for deterministic responses
- Token usage optimized (shorter prompts, efficient encoding)
- Batch processing pipeline for non-real-time workloads
- Cost monitoring with daily alerts and per-user budgets
- Quantization evaluated for self-hosted models
- Auto-scaling configured for GPU clusters (scale to zero when idle)
- Reserved/spot instances used for predictable workloads
- Monthly cost review meeting with engineering + finance
- 6-month cost forecast updated quarterly
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For AI cost optimization consulting, visit garnetgrid.com. :::