Observability Engineering: Beyond Monitoring
Build observability into your systems from the ground up. Covers the three pillars (metrics, logs, traces), structured logging, custom instrumentation, the difference between monitoring and observability, and building a culture where teams can debug production issues without escalation.
Monitoring tells you when something is broken. Observability tells you why. Monitoring is predefined dashboards and alerts for known failure modes. Observability is the ability to ask arbitrary questions about your system’s behavior — questions you did not anticipate when you built it.
The distinction matters because production fails in ways you have not imagined. A monitoring-only approach catches only the failures you predicted. Observability catches the ones you did not.
The Three Pillars
Metrics
Metrics are numeric measurements aggregated over time:
http_request_duration_seconds{method="GET", path="/api/orders", status="200"}
→ P50: 45ms, P95: 180ms, P99: 890ms
http_requests_total{method="GET", path="/api/orders", status="500"}
→ 47 errors in the last 5 minutes
Strengths: Cheap to store, fast to query, excellent for aggregation and alerting. Weakness: Low cardinality — you cannot drill into individual requests.
Logs
Logs are timestamped records of events:
{
"timestamp": "2026-03-04T15:23:47Z",
"level": "ERROR",
"service": "order-service",
"trace_id": "abc123def456",
"user_id": "usr_789",
"message": "Payment failed",
"error": "Card declined",
"stripe_error_code": "card_declined",
"order_id": "ord_456",
"amount": 2499
}
Strengths: Rich context, high cardinality, searchable. Weakness: Expensive to store at scale, requires structured formatting to be useful.
Traces
Traces follow a single request across multiple services:
Trace abc123def456:
  API Gateway [12ms]
    → Order Service [45ms]
      → Payment Service [890ms] ← SLOW
        → Stripe API [850ms] ← ROOT CAUSE
      → Inventory Check [8ms]
Strengths: Shows exactly where time is spent across service boundaries. Weakness: Sampling required at scale, complex to implement.
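In code, a trace is assembled from spans opened around each unit of work. A minimal sketch using the OpenTelemetry Python API, assuming an SDK and exporter are already configured elsewhere (charge_payment is a hypothetical downstream call):

from opentelemetry import trace

tracer = trace.get_tracer("order-service")

def create_order(order):
    # Each span records its start time, duration, and parent, which is
    # how the backend assembles the cross-service tree shown above.
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("order.id", order.id)
        charge_payment(order)  # spans opened inside nest as children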
Using Them Together
Metrics tell you: “P99 latency spiked to 2 seconds at 3:15 PM”
Traces tell you: “The slow requests are all hitting the Payment Service”
Logs tell you: “Stripe is returning 429 rate limit errors from IP 54.23.x.x”
Structured Logging
Unstructured logs are human-readable but machine-hostile:
ERROR 2026-03-04 15:23:47 - Order 456 failed because payment was declined for user 789
Structured logs are both human-readable and machine-parseable:
{
"level": "ERROR",
"timestamp": "2026-03-04T15:23:47Z",
"event": "order_payment_failed",
"order_id": "456",
"user_id": "789",
"reason": "card_declined"
}
The structured version is queryable:
-- Find all payment failures for user 789
SELECT * FROM logs WHERE event = 'order_payment_failed' AND user_id = '789'
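Producing logs in this shape does not require a special library. A minimal sketch with Python's standard logging module (the JsonFormatter class and the fields convention are illustrative; libraries such as structlog package the same pattern ready-made):

import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "level": record.levelname,
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(timespec="seconds"),
            "event": record.getMessage(),
        }
        # Structured fields arrive via logging's `extra` mechanism.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("order_payment_failed",
             extra={"fields": {"order_id": "456", "user_id": "789",
                               "reason": "card_declined"}})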
Log Levels
DEBUG: Detailed internal state (disable in production; see the sketch after this list)
INFO: Normal but significant events (request processed, job completed)
WARN: Unexpected but handled events (retry succeeded, fallback used)
ERROR: Failures that need attention (request failed, data inconsistency)
FATAL: Service cannot continue (startup failure, unrecoverable state)
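One way to enforce the DEBUG rule above is to make the level an environment concern rather than a code change. A small sketch (the LOG_LEVEL variable name is just a convention):

import logging
import os

# Production defaults to INFO; a debugging session can export
# LOG_LEVEL=DEBUG without touching code.
logging.getLogger().setLevel(os.environ.get("LOG_LEVEL", "INFO"))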
What to Log
Always log: Requests received, responses sent, errors, state transitions, business events. Never log: Passwords, tokens, PII (personal data), credit card numbers, full request bodies with sensitive fields.
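The “never log” list is easier to enforce mechanically than by review alone. A minimal sketch of a redaction filter, reusing the fields convention from the structured-logging sketch above (the SENSITIVE_KEYS set is illustrative; match it to your own schema):

import logging

SENSITIVE_KEYS = {"password", "token", "card_number", "ssn"}

class RedactionFilter(logging.Filter):
    def filter(self, record):
        # Scrub sensitive values from structured fields before emission.
        fields = getattr(record, "fields", None)
        if isinstance(fields, dict):
            for key in SENSITIVE_KEYS & fields.keys():
                fields[key] = "[REDACTED]"
        return True  # never drop the record, only scrub it

logging.getLogger("order-service").addFilter(RedactionFilter())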
Custom Instrumentation
Auto-instrumentation captures infrastructure. Manual instrumentation captures what matters to your business:
from prometheus_client import Counter, Histogram, Gauge

# Business metrics
orders_created = Counter('orders_created_total', 'Orders created', ['payment_method'])
order_value = Histogram('order_value_dollars', 'Order value',
                        buckets=[10, 50, 100, 500, 1000, 5000])
active_carts = Gauge('active_carts', 'Shopping carts with items')

def create_order(order):
    orders_created.labels(payment_method=order.payment_method).inc()
    order_value.observe(order.total)
    active_carts.dec()  # the cart just converted into an order
The RED Method (Request-Driven)
For every service, instrument (a sketch follows the list):
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Distribution of request latency
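A minimal sketch of RED for an HTTP handler, using the same prometheus_client primitives as the business metrics above (the handle wrapper and its arguments are illustrative):

import time
from prometheus_client import Counter, Histogram

# Rate and Errors both fall out of one counter labeled by status;
# Duration comes from a latency histogram.
requests_total = Counter('http_requests_total', 'Requests',
                         ['method', 'path', 'status'])
request_duration = Histogram('http_request_duration_seconds',
                             'Request latency', ['method', 'path'])

def handle(method, path, handler):
    start = time.monotonic()
    status = 500  # assume failure unless the handler returns
    try:
        status = handler()  # returns an HTTP status code
        return status
    finally:
        requests_total.labels(method, path, str(status)).inc()
        request_duration.labels(method, path).observe(time.monotonic() - start)

Rate is then the per-second increase of http_requests_total, errors are the 500-class subset of the same counter, and duration percentiles come from the histogram, as in the alerting example below.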
The USE Method (Resource-Driven)
For every resource (CPU, memory, disk, network), instrument the following, sketched below:
- Utilization: Percentage of resource in use
- Saturation: Queue depth of work waiting
- Errors: Error events on the resource
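A sketch of utilization and saturation for the CPU, pairing prometheus_client gauges with psutil readings (error counters typically come from the kernel or an exporter such as node_exporter rather than application code, so they are omitted here):

import psutil
from prometheus_client import Gauge

cpu_utilization = Gauge('cpu_utilization_percent', 'CPU busy percentage')
cpu_saturation = Gauge('cpu_load_per_core', '1-minute load average per core')

def collect_cpu():
    # Utilization: how busy the CPU is right now.
    cpu_utilization.set(psutil.cpu_percent(interval=None))
    # Saturation: run-queue load relative to core count; sustained
    # values above 1.0 mean work is queuing for the CPU.
    load1, _, _ = psutil.getloadavg()
    cpu_saturation.set(load1 / psutil.cpu_count())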
Alerting Philosophy
Alert on Symptoms, Not Causes
# BAD: Alert on cause (too many alerts, not actionable)
- alert: HighCPU
  expr: node_cpu_usage > 80

# GOOD: Alert on symptom (user-facing impact)
- alert: HighLatency
  expr: histogram_quantile(0.99, rate(http_duration_seconds_bucket[5m])) > 2
  for: 5m
  annotations:
    summary: "P99 latency exceeds 2 seconds"
If CPU is 90% but latency is normal and error rate is zero, nothing is broken. Do not wake someone up.
Alert Severity
Critical: Customer-facing impact NOW → Page on-call
Warning: Degradation detected, may become critical → Slack notification
Info: Noteworthy but not actionable → Dashboard annotation
Building Observability Culture
Technology is necessary but not sufficient. The culture shift is harder:
- Developers own production observability — not the SRE team, not the monitoring team
- Every feature ships with instrumentation — part of the definition of done
- Runbooks are living documents — updated after every incident
- Dashboards are team-specific — each team maintains their service dashboards
- Observability debt is tracked — gaps in instrumentation are logged and prioritized
Anti-Patterns
| Anti-Pattern | Impact | Fix |
|---|---|---|
| Monitoring without observability | Can only debug known failure modes | Add traces and structured logs |
| Unstructured logs | Cannot query or correlate | JSON structured logging everywhere |
| Alerting on every metric | Alert fatigue, nothing gets investigated | Alert on symptoms, not causes |
| Centralized observability team | Teams cannot debug their own services | Embed observability in every team |
| No business metrics | Cannot connect technical issues to revenue impact | Instrument business events |
Observability is not a tool you buy — it is a practice you build. The tools (Prometheus, Grafana, Jaeger, Loki) are commodities. The practice — asking good questions, instrumenting the right things, reducing time-to-understanding — is what separates teams that resolve incidents in minutes from teams that resolve them in hours.