You can’t fix what you can’t see. Observability gives you the ability to understand why a system is behaving a certain way from its external outputs — logs, metrics, and traces. Monitoring tells you when something is broken; observability helps you figure out why. This guide covers how to build an observability stack that actually helps you debug production issues at 3am instead of generating alert fatigue.
The difference between monitoring and observability is the difference between “is the system up?” and “why is this specific request timing out for users in Europe?”
## The Three Pillars

| Pillar | What | Answers | Example Tools | Data Shape |
|---|---|---|---|---|
| Logs | Discrete events with context | "What happened?" | Loki, Elasticsearch, CloudWatch | `{timestamp, level, message, metadata}` |
| Metrics | Numeric measurements over time | "How much? How fast?" | Prometheus, Datadog, CloudWatch | `metric_name{labels} = value @ time` |
| Traces | Request flow across services | "Where is the bottleneck?" | Jaeger, Tempo, X-Ray | `span{trace_id, parent_id, duration}` |
### How They Work Together

A user reports: "The checkout page is slow."

1. METRICS → p99 latency spiked from 200ms to 2s at 14:30
2. TRACES → slow requests are spending 1.8s in the payment service
3. LOGS → payment service logs show "connection pool exhausted" at 14:28
4. ROOT CAUSE → database connection pool maxed out by a slow query
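The pivot from one pillar to the next works because all three share identifiers. A minimal sketch, using plain dicts and hypothetical field names, of how a shared `trace_id` lets you jump from a slow span straight to its log lines:

```python
import json

# Hypothetical telemetry for one slow checkout request. In a real stack the
# trace_id is injected automatically (e.g. by OpenTelemetry context propagation).
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"

span = {"trace_id": trace_id, "service": "payment",
        "name": "charge_card", "duration_ms": 1800}

logs = [
    {"trace_id": trace_id, "level": "ERROR", "service": "payment",
     "message": "connection pool exhausted"},
    {"trace_id": "00f067aa0ba902b7aaaaaaaaaaaaaaaa", "level": "INFO",
     "service": "payment", "message": "healthy request"},
]

def logs_for_span(span, logs):
    """Return only the log lines emitted inside this span's trace."""
    return [line for line in logs if line["trace_id"] == span["trace_id"]]

related = logs_for_span(span, logs)
print(json.dumps(related, indent=2))  # only the "connection pool exhausted" line
```

Without the shared ID, step 3 above would be a full-text search across every payment-service log in the incident window.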
## OpenTelemetry (The Standard)
OpenTelemetry (OTel) is the vendor-neutral CNCF standard for instrumenting applications. Instrument once, send to any backend.
```python
# Python: auto-instrumentation with OpenTelemetry
# (assumes `app`, `OrderRequest`, and the business helpers are defined elsewhere)
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Set up tracing: batch spans and ship them to the collector over gRPC
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)

# Auto-instrument frameworks (no changes to handler code needed)
FastAPIInstrumentor.instrument_app(app)  # `app` is your FastAPI instance
SQLAlchemyInstrumentor().instrument()
RequestsInstrumentor().instrument()

# Custom spans for business logic
tracer = trace.get_tracer(__name__)

@app.post("/orders")
async def create_order(order: OrderRequest):
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("order.customer_id", order.customer_id)
        span.set_attribute("order.total", order.total)
        span.set_attribute("order.item_count", len(order.items))
        with tracer.start_as_current_span("validate_inventory"):
            check_inventory(order.items)
        with tracer.start_as_current_span("process_payment"):
            charge_card(order.payment)
        return {"order_id": order.id}
```
### OTel Collector Architecture

```
Service A ──┐                      ┌──→ Prometheus (metrics)
Service B ──┼──→ OTel Collector ───┼──→ Loki (logs)
Service C ──┘     (pipeline)       └──→ Tempo/Jaeger (traces)
```
Benefits:
- Single exporter endpoint for all services
- Vendor-agnostic (switch backends without code changes)
- Batching, retry, sampling built-in
- Can transform/filter data before export
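A minimal collector pipeline matching the diagram above might look like the following. This is a sketch: exporter names and endpoints are illustrative, and which exporters are available depends on your collector distribution (core vs. contrib), so check its documentation before copying.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  otlp/tempo:
    endpoint: tempo:4317

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
```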
## Structured Logging

```python
import structlog

logger = structlog.get_logger()

# ✅ Structured logging — searchable, parseable, aggregatable
logger.info("order_created",
    order_id="ord-123",
    customer_id="cust-456",
    total=99.99,
    items_count=3,
    payment_method="card",
    processing_time_ms=245,
)
# Output: {"event": "order_created", "order_id": "ord-123",
#          "customer_id": "cust-456", "total": 99.99, ...}

# ❌ Unstructured logging — impossible to filter/aggregate
logger.info("Created order ord-123 for customer cust-456, total $99.99")
```
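If adding structlog isn't an option, the same idea works with the standard library. A sketch of a formatter that emits one JSON object per line; the `fields` extra and the output keys are a convention chosen here, not a standard:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line with structured extras."""
    def format(self, record):
        payload = {
            "event": record.getMessage(),
            "level": record.levelname,
            "logger": record.name,
        }
        # Merge structured key/value pairs passed via extra={"fields": {...}}
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order_created", extra={"fields": {"order_id": "ord-123", "total": 99.99}})
```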
### Log Levels

| Level | Use For | Example | Alert? |
|---|---|---|---|
| DEBUG | Development only (never in prod) | Variable values, flow tracing | No |
| INFO | Normal operations | "Order created", "User logged in" | No |
| WARN | Degraded but working | "Retry attempt 2/3", "Cache miss" | Aggregate (> threshold) |
| ERROR | Failed operation | "Payment failed", "DB connection refused" | Yes (P2) |
| FATAL | System cannot continue | "Out of memory", "Config missing" | Yes (P1, page on-call) |
## Metrics: The RED & USE Methods

### RED Method (for request-driven services)

| Metric | What to Track | Alert On | Example Query (PromQL) |
|---|---|---|---|
| Rate | Requests per second | Sudden drop or spike | `rate(http_requests_total[5m])` |
| Errors | Error rate (% failing) | > 1% error rate | `rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])` |
| Duration | Request latency (p50, p95, p99) | p99 > 500ms | `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))` |
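These queries translate directly into Prometheus alerting rules. A hedged sketch: the metric names assume standard HTTP server instrumentation, and the threshold and runbook path are illustrative.

```yaml
groups:
  - name: red-method
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: P2
        annotations:
          summary: "Error rate above 1% for 5 minutes"
          runbook: "runbooks/api-errors.md"
```

The `for: 5m` clause is what keeps a transient blip from paging anyone: the condition must hold continuously before the alert fires.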
### USE Method (for infrastructure)

| Metric | What to Track | Alert On | Example |
|---|---|---|---|
| Utilization | CPU, memory, disk, network % | > 80% sustained (15 min) | `node_cpu_seconds_total` |
| Saturation | Queue depth, thread pool exhaustion | Growing queues (> 5 min) | `thread_pool_active / thread_pool_max` |
| Errors | Hardware errors, OOM kills, disk failures | Any occurrence | `node_vmstat_oom_kill` |
## SLOs (Service Level Objectives)

```yaml
slos:
  - name: "API Availability"
    target: "99.9%"  # 8.76 hours downtime/year budget
    indicator: "Successful requests / Total requests"
    window: 30 days
  - name: "API Latency"
    target: "95% of requests < 200ms"
    indicator: "p95 latency"
    window: 30 days
  - name: "Data Pipeline Freshness"
    target: "99% of tables updated within 2 hours"
    indicator: "Tables with freshness < 2h / Total tables"
    window: 7 days
```
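The availability indicator is just a ratio of request counts over the window. A sketch, with made-up numbers, assuming a simple success/total counter:

```python
def availability_sli(successful: int, total: int) -> float:
    """Fraction of successful requests over the SLO window."""
    return successful / total if total else 1.0

def meets_slo(successful: int, total: int, target: float = 0.999) -> bool:
    """True if the measured SLI is at or above the SLO target."""
    return availability_sli(successful, total) >= target

# 30-day window: 10M requests, 8,000 failures → 99.92% availability
print(meets_slo(9_992_000, 10_000_000))  # meets the 99.9% target
print(meets_slo(9_980_000, 10_000_000))  # misses it at 99.8%
```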
### Error Budget

```
SLO: 99.9% availability = 43.83 minutes downtime / month

If you've burned 30 minutes this month:
  → 13.83 minutes remaining
  → FREEZE risky deployments
  → Prioritize reliability work

If you've burned 0 minutes:
  → Full budget available
  → Ship features aggressively
  → Take calculated risks (bigger deploys, experiments)
```
### SLO Tiers

| SLO Target | Annual Downtime | Monthly Budget | Requires |
|---|---|---|---|
| 99% | 3.65 days | 7.3 hours | Basic monitoring |
| 99.9% | 8.76 hours | 43.8 minutes | Active alerting, redundancy |
| 99.95% | 4.38 hours | 21.9 minutes | Multi-AZ, auto-failover |
| 99.99% | 52.6 minutes | 4.4 minutes | Multi-region, active-active |
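The budget columns in the tier table follow from simple arithmetic; a sketch using an average month of 365.25/12 ≈ 30.44 days, which is what produces the 43.8-minute figure:

```python
AVG_DAYS_PER_MONTH = 365.25 / 12  # ≈ 30.44

def monthly_budget_minutes(target: float) -> float:
    """Allowed downtime per month for an availability target."""
    return (1 - target) * AVG_DAYS_PER_MONTH * 24 * 60

def annual_budget_hours(target: float) -> float:
    """Allowed downtime per year for an availability target."""
    return (1 - target) * 365.25 * 24

print(round(monthly_budget_minutes(0.999), 1))       # 43.8 minutes/month
print(round(annual_budget_hours(0.9999) * 60, 1))    # 52.6 minutes/year
```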
## Stack Comparison

| Stack | Best For | Monthly Cost (50 hosts) | Complexity |
|---|---|---|---|
| Grafana + Prometheus + Loki + Tempo | Full control, cost-efficient, OSS | $0 (self-hosted infra) | High (operate yourself) |
| Grafana Cloud | OSS stack, managed | $200-$2,000 | Medium |
| Datadog | All-in-one, great UX, fast setup | $2,000-$10,000 | Low |
| New Relic | APM-focused, simple pricing | $1,000-$5,000 | Low |
| AWS CloudWatch + X-Ray | AWS-native, no extra infra | $500-$3,000 | Medium |
| Elastic (ELK) | Log-heavy workloads | $1,000-$5,000 | High (self-hosted) |
## Alerting Best Practices

| Rule | Why | Example |
|---|---|---|
| Alert on symptoms, not causes | Users care about symptoms | "API error rate > 5%", not "CPU > 80%" |
| Every alert must have a runbook link | 3am you won't remember | Alert includes link to `runbooks/api-errors.md` |
| Use severity levels | Not everything is a page | P1 (page), P2 (Slack), P3 (ticket) |
| Deduplicate and group alerts | 100 alerts for 1 incident is noise | Group by service + error type |
| Review alert fatigue monthly | Ignored alerts = no alerts | Track alert-to-action ratio |
| Set up escalation chains | If page not ack'd in 10 min, escalate | On-call → backup → team lead → eng manager |
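Deduplication, grouping, and severity routing can be expressed directly in Alertmanager configuration. A sketch only: the receiver names and timings are illustrative, and the matcher syntax here assumes a recent Alertmanager version, so verify against its configuration reference.

```yaml
route:
  group_by: [service, alertname]  # 100 pod alerts → 1 grouped notification
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: slack-p2              # default: non-paging Slack channel
  routes:
    - matchers: ['severity="P1"']
      receiver: pagerduty-oncall  # pages; escalation chain lives in PagerDuty
    - matchers: ['severity="P3"']
      receiver: ticket-queue
```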
### Alert Anti-Pattern: The Boy Who Cried Wolf

```
Month 1: 200 alerts → team responds to all
Month 2: 200 alerts → team responds to 150
Month 3: 200 alerts → team ignores most
Month 4: Real outage buried in noise → 2-hour delayed response
```

Fix: reduce to fewer than 30 actionable alerts per month. If an alert never leads to action, delete it.
## Dashboard Design

### The Four Golden Signals Dashboard

Every service should have a dashboard showing:
- Latency: p50, p95, p99 over time
- Traffic: Requests per second
- Errors: Error rate (%) and error count by type
- Saturation: CPU, memory, connection pool utilization
### Dashboard Hierarchy

| Level | Audience | Content |
|---|---|---|
| Executive | VP/CTO | SLO status (green/yellow/red), uptime, error budget |
| Service | On-call engineer | Golden signals per service, dependency health |
| Debug | Investigating engineer | Detailed traces, log correlation, resource metrics |
## Stack Selection by Company Stage

| Company Stage | Recommended Stack | Monthly Budget | Why |
|---|---|---|---|
| Startup (under 10 eng) | Grafana Cloud free tier + Sentry | $0-$100 | Maximum value, minimal ops |
| Growth (10-50 eng) | Datadog or Grafana Cloud Pro | $500-$5K | Unified platform, less toolchain management |
| Scale (50-200 eng) | Datadog Enterprise or self-hosted Grafana stack | $5K-$50K | Custom dashboards, high cardinality |
| Enterprise (200+ eng) | Splunk, Dynatrace, or full self-hosted stack | $50K+ | Compliance, data sovereignty, scale |
## Alert Fatigue Prevention

Alert fatigue is the number one observability failure mode. Prevent it by:

- Alert on symptoms, not causes — alert on error rate > 5%, not CPU > 80%
- Require runbooks — Every alert must have a linked runbook explaining what to do
- Review weekly — Delete alerts that have not fired in 90 days or fire too frequently
- Use severity levels — P1 (pages), P2 (Slack), P3 (ticket), P4 (dashboard only)
- Target under 5 pages per week — More than this and on-call engineers stop responding
## Checklist

- [ ] Instrument services with OpenTelemetry and route telemetry through a collector
- [ ] Emit structured (JSON) logs with consistent fields and trace IDs
- [ ] Track RED metrics per service and USE metrics per host
- [ ] Define SLOs with explicit error budgets and review them on each window
- [ ] Alert on symptoms, with a runbook link and severity level on every alert
- [ ] Build a golden-signals dashboard for every service
- [ ] Review alert volume monthly; delete alerts that never lead to action
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For observability consulting, visit garnetgrid.com.
:::