You can’t fix what you can’t see. Observability gives you the ability to understand why a system is behaving a certain way from its external outputs — logs, metrics, and traces. Monitoring tells you when something is broken; observability helps you figure out why. This guide covers how to build an observability stack that actually helps you debug production issues at 3am instead of generating alert fatigue.
The difference between monitoring and observability is the difference between “is the system up?” and “why is this specific request timing out for users in Europe?”
## The Three Pillars

| Pillar | What | Answers | Example Tools | Data Shape |
|---|---|---|---|---|
| Logs | Discrete events with context | "What happened?" | Loki, Elasticsearch, CloudWatch | `{timestamp, level, message, metadata}` |
| Metrics | Numeric measurements over time | "How much? How fast?" | Prometheus, Datadog, CloudWatch | `metric_name{labels} = value @ time` |
| Traces | Request flow across services | "Where is the bottleneck?" | Jaeger, Tempo, X-Ray | `span{trace_id, parent_id, duration}` |
### How They Work Together

A user reports: "The checkout page is slow."

1. METRICS → p99 latency spiked from 200ms to 2s at 14:30
2. TRACES → slow requests are spending 1.8s in the payment service
3. LOGS → payment service logs show "connection pool exhausted" at 14:28
4. ROOT CAUSE → database connection pool maxed out by a slow query
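The pivot from one pillar to the next works because all three share identifiers. A minimal sketch, using plain dicts and hypothetical field names, of how a shared `trace_id` lets you jump from a slow span straight to its log lines:

```python
import json

# Hypothetical telemetry for one slow checkout request. In a real stack the
# trace_id is injected automatically (e.g. by OpenTelemetry context propagation).
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"

span = {"trace_id": trace_id, "service": "payment",
        "name": "charge_card", "duration_ms": 1800}

logs = [
    {"trace_id": trace_id, "level": "ERROR", "service": "payment",
     "message": "connection pool exhausted"},
    {"trace_id": "00f067aa0ba902b7aaaaaaaaaaaaaaaa", "level": "INFO",
     "service": "payment", "message": "healthy request"},
]

def logs_for_span(span, logs):
    """Return only the log lines emitted inside this span's trace."""
    return [line for line in logs if line["trace_id"] == span["trace_id"]]

related = logs_for_span(span, logs)
print(json.dumps(related, indent=2))  # only the "connection pool exhausted" line
```

Without the shared ID, step 3 above would be a full-text search across every payment-service log in the incident window.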
## OpenTelemetry (The Standard)
OpenTelemetry (OTel) is the vendor-neutral CNCF standard for instrumenting applications. Instrument once, send to any backend.
```python
# Python: auto-instrumentation with OpenTelemetry
# (assumes `app`, `OrderRequest`, and the business helpers are defined elsewhere)
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Set up tracing: batch spans and ship them to the collector over gRPC
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)

# Auto-instrument frameworks (no changes to handler code needed)
FastAPIInstrumentor.instrument_app(app)  # `app` is your FastAPI instance
SQLAlchemyInstrumentor().instrument()
RequestsInstrumentor().instrument()

# Custom spans for business logic
tracer = trace.get_tracer(__name__)

@app.post("/orders")
async def create_order(order: OrderRequest):
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("order.customer_id", order.customer_id)
        span.set_attribute("order.total", order.total)
        span.set_attribute("order.item_count", len(order.items))
        with tracer.start_as_current_span("validate_inventory"):
            check_inventory(order.items)
        with tracer.start_as_current_span("process_payment"):
            charge_card(order.payment)
        return {"order_id": order.id}
```
### OTel Collector Architecture

```
Service A ──┐                      ┌──→ Prometheus (metrics)
Service B ──┼──→ OTel Collector ───┼──→ Loki (logs)
Service C ──┘     (pipeline)       └──→ Tempo/Jaeger (traces)
```
Benefits:
- Single exporter endpoint for all services
- Vendor-agnostic (switch backends without code changes)
- Batching, retry, sampling built-in
- Can transform/filter data before export
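A minimal collector pipeline matching the diagram above might look like the following. This is a sketch: exporter names and endpoints are illustrative, and which exporters are available depends on your collector distribution (core vs. contrib), so check its documentation before copying.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  otlp/tempo:
    endpoint: tempo:4317

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
```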
## Structured Logging

```python
import structlog

logger = structlog.get_logger()

# ✅ Structured logging — searchable, parseable, aggregatable
logger.info("order_created",
    order_id="ord-123",
    customer_id="cust-456",
    total=99.99,
    items_count=3,
    payment_method="card",
    processing_time_ms=245,
)
# Output: {"event": "order_created", "order_id": "ord-123",
#          "customer_id": "cust-456", "total": 99.99, ...}

# ❌ Unstructured logging — impossible to filter/aggregate
logger.info("Created order ord-123 for customer cust-456, total $99.99")
```
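If adding structlog isn't an option, the same idea works with the standard library. A sketch of a formatter that emits one JSON object per line; the `fields` extra and the output keys are a convention chosen here, not a standard:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line with structured extras."""
    def format(self, record):
        payload = {
            "event": record.getMessage(),
            "level": record.levelname,
            "logger": record.name,
        }
        # Merge structured key/value pairs passed via extra={"fields": {...}}
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order_created", extra={"fields": {"order_id": "ord-123", "total": 99.99}})
```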
### Log Levels

| Level | Use For | Example | Alert? |
|---|---|---|---|
| DEBUG | Development only (never in prod) | Variable values, flow tracing | No |
| INFO | Normal operations | "Order created", "User logged in" | No |
| WARN | Degraded but working | "Retry attempt 2/3", "Cache miss" | Aggregate (> threshold) |
| ERROR | Failed operation | "Payment failed", "DB connection refused" | Yes (P2) |
| FATAL | System cannot continue | "Out of memory", "Config missing" | Yes (P1, page on-call) |
## Metrics: The RED & USE Methods

### RED Method (for request-driven services)

| Metric | What to Track | Alert On | Example Query (PromQL) |
|---|---|---|---|
| Rate | Requests per second | Sudden drop or spike | `rate(http_requests_total[5m])` |
| Errors | Error rate (% failing) | > 1% error rate | `rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])` |
| Duration | Request latency (p50, p95, p99) | p99 > 500ms | `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))` |
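These queries translate directly into Prometheus alerting rules. A hedged sketch: the metric names assume standard HTTP server instrumentation, and the threshold and runbook path are illustrative.

```yaml
groups:
  - name: red-method
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: P2
        annotations:
          summary: "Error rate above 1% for 5 minutes"
          runbook: "runbooks/api-errors.md"
```

The `for: 5m` clause is what keeps a transient blip from paging anyone: the condition must hold continuously before the alert fires.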
### USE Method (for infrastructure)

| Metric | What to Track | Alert On | Example |
|---|---|---|---|
| Utilization | CPU, memory, disk, network % | > 80% sustained (15 min) | `node_cpu_seconds_total` |
| Saturation | Queue depth, thread pool exhaustion | Growing queues (> 5 min) | `thread_pool_active / thread_pool_max` |
| Errors | Hardware errors, OOM kills, disk failures | Any occurrence | `node_vmstat_oom_kill` |
## SLOs (Service Level Objectives)

```yaml
slos:
  - name: "API Availability"
    target: "99.9%"  # 8.76 hours downtime/year budget
    indicator: "Successful requests / Total requests"
    window: 30 days
  - name: "API Latency"
    target: "95% of requests < 200ms"
    indicator: "p95 latency"
    window: 30 days
  - name: "Data Pipeline Freshness"
    target: "99% of tables updated within 2 hours"
    indicator: "Tables with freshness < 2h / Total tables"
    window: 7 days
```
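The availability indicator is just a ratio of request counts over the window. A sketch, with made-up numbers, assuming a simple success/total counter:

```python
def availability_sli(successful: int, total: int) -> float:
    """Fraction of successful requests over the SLO window."""
    return successful / total if total else 1.0

def meets_slo(successful: int, total: int, target: float = 0.999) -> bool:
    """True if the measured SLI is at or above the SLO target."""
    return availability_sli(successful, total) >= target

# 30-day window: 10M requests, 8,000 failures → 99.92% availability
print(meets_slo(9_992_000, 10_000_000))  # meets the 99.9% target
print(meets_slo(9_980_000, 10_000_000))  # misses it at 99.8%
```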
### Error Budget

```
SLO: 99.9% availability = 43.83 minutes downtime / month

If you've burned 30 minutes this month:
  → 13.83 minutes remaining
  → FREEZE risky deployments
  → Prioritize reliability work

If you've burned 0 minutes:
  → Full budget available
  → Ship features aggressively
  → Take calculated risks (bigger deploys, experiments)
```
### SLO Tiers

| SLO Target | Annual Downtime | Monthly Budget | Requires |
|---|---|---|---|
| 99% | 3.65 days | 7.3 hours | Basic monitoring |
| 99.9% | 8.76 hours | 43.8 minutes | Active alerting, redundancy |
| 99.95% | 4.38 hours | 21.9 minutes | Multi-AZ, auto-failover |
| 99.99% | 52.6 minutes | 4.4 minutes | Multi-region, active-active |
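The budget columns in the tier table follow from simple arithmetic; a sketch using an average month of 365.25/12 ≈ 30.44 days, which is what produces the 43.8-minute figure:

```python
AVG_DAYS_PER_MONTH = 365.25 / 12  # ≈ 30.44

def monthly_budget_minutes(target: float) -> float:
    """Allowed downtime per month for an availability target."""
    return (1 - target) * AVG_DAYS_PER_MONTH * 24 * 60

def annual_budget_hours(target: float) -> float:
    """Allowed downtime per year for an availability target."""
    return (1 - target) * 365.25 * 24

print(round(monthly_budget_minutes(0.999), 1))       # 43.8 minutes/month
print(round(annual_budget_hours(0.9999) * 60, 1))    # 52.6 minutes/year
```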
## Stack Comparison

| Stack | Best For | Monthly Cost (50 hosts) | Complexity |
|---|---|---|---|
| Grafana + Prometheus + Loki + Tempo | Full control, cost-efficient, OSS | $0 (self-hosted infra) | High (operate yourself) |
| Grafana Cloud | OSS stack, managed | $200-$2,000 | Medium |
| Datadog | All-in-one, great UX, fast setup | $2,000-$10,000 | Low |
| New Relic | APM-focused, simple pricing | $1,000-$5,000 | Low |
| AWS CloudWatch + X-Ray | AWS-native, no extra infra | $500-$3,000 | Medium |
| Elastic (ELK) | Log-heavy workloads | $1,000-$5,000 | High (self-hosted) |
## Alerting Best Practices

| Rule | Why | Example |
|---|---|---|
| Alert on symptoms, not causes | Users care about symptoms | "API error rate > 5%", not "CPU > 80%" |
| Every alert must have a runbook link | 3am you won't remember | Alert includes link to `runbooks/api-errors.md` |
| Use severity levels | Not everything is a page | P1 (page), P2 (Slack), P3 (ticket) |
| Deduplicate and group alerts | 100 alerts for 1 incident is noise | Group by service + error type |
| Review alert fatigue monthly | Ignored alerts = no alerts | Track alert-to-action ratio |
| Set up escalation chains | If page not ack'd in 10 min, escalate | On-call → backup → team lead → eng manager |
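Deduplication, grouping, and severity routing can be expressed directly in Alertmanager configuration. A sketch only: the receiver names and timings are illustrative, and the matcher syntax here assumes a recent Alertmanager version, so verify against its configuration reference.

```yaml
route:
  group_by: [service, alertname]  # 100 pod alerts → 1 grouped notification
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: slack-p2              # default: non-paging Slack channel
  routes:
    - matchers: ['severity="P1"']
      receiver: pagerduty-oncall  # pages; escalation chain lives in PagerDuty
    - matchers: ['severity="P3"']
      receiver: ticket-queue
```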
### Alert Anti-Pattern: The Boy Who Cried Wolf

```
Month 1: 200 alerts → team responds to all
Month 2: 200 alerts → team responds to 150
Month 3: 200 alerts → team ignores most
Month 4: Real outage buried in noise → 2-hour delayed response
```

Fix: reduce to fewer than 30 actionable alerts per month. If an alert never leads to action, delete it.
## Dashboard Design

### The Four Golden Signals Dashboard

Every service should have a dashboard showing:
- Latency: p50, p95, p99 over time
- Traffic: Requests per second
- Errors: Error rate (%) and error count by type
- Saturation: CPU, memory, connection pool utilization
### Dashboard Hierarchy

| Level | Audience | Content |
|---|---|---|
| Executive | VP/CTO | SLO status (green/yellow/red), uptime, error budget |
| Service | On-call engineer | Golden signals per service, dependency health |
| Debug | Investigating engineer | Detailed traces, log correlation, resource metrics |
## Stack Selection by Company Stage

| Company Stage | Recommended Stack | Monthly Budget | Why |
|---|---|---|---|
| Startup (under 10 eng) | Grafana Cloud free tier + Sentry | $0-$100 | Maximum value, minimal ops |
| Growth (10-50 eng) | Datadog or Grafana Cloud Pro | $500-$5K | Unified platform, less toolchain management |
| Scale (50-200 eng) | Datadog Enterprise or self-hosted Grafana stack | $5K-$50K | Custom dashboards, high cardinality |
| Enterprise (200+ eng) | Splunk, Dynatrace, or full self-hosted stack | $50K+ | Compliance, data sovereignty, scale |
## Alert Fatigue Prevention

Alert fatigue is the number one observability failure mode. Prevent it by:

- Alert on symptoms, not causes — alert on error rate > 5%, not CPU > 80%
- Require runbooks — Every alert must have a linked runbook explaining what to do
- Review weekly — Delete alerts that have not fired in 90 days or fire too frequently
- Use severity levels — P1 (pages), P2 (Slack), P3 (ticket), P4 (dashboard only)
- Target under 5 pages per week — More than this and on-call engineers stop responding
## Checklist

- [ ] Instrument services with OpenTelemetry and route telemetry through a collector
- [ ] Emit structured (JSON) logs with consistent fields and trace IDs
- [ ] Track RED metrics per service and USE metrics per host
- [ ] Define SLOs with explicit error budgets and review them on each window
- [ ] Alert on symptoms, with a runbook link and severity level on every alert
- [ ] Build a golden-signals dashboard for every service
- [ ] Review alert volume monthly; delete alerts that never lead to action
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For observability consulting, visit garnetgrid.com.
:::