Observability Engineering: Beyond Monitoring
Build observability into your systems from the ground up. Covers the three pillars (metrics, logs, traces), structured logging, custom instrumentation, the difference between monitoring and observability, and building a culture where teams can debug production issues without escalation.
Monitoring tells you when something is broken. Observability tells you why. Monitoring is predefined dashboards and alerts for known failure modes. Observability is the ability to ask arbitrary questions about your system’s behavior — questions you did not anticipate when you built it.
The distinction matters because production fails in ways you have not imagined. A monitoring-only approach catches only the failures you predicted. Observability catches the ones you did not.
The Three Pillars
Metrics
Metrics are numeric measurements aggregated over time:
http_request_duration_seconds{method="GET", path="/api/orders", status="200"}
→ P50: 45ms, P95: 180ms, P99: 890ms
http_requests_total{method="GET", path="/api/orders", status="500"}
→ 47 errors in the last 5 minutes
Strengths: Cheap to store, fast to query, excellent for aggregation and alerting. Weakness: Low cardinality — you cannot drill into individual requests.
Logs
Logs are timestamped records of events:
{
"timestamp": "2026-03-04T15:23:47Z",
"level": "ERROR",
"service": "order-service",
"trace_id": "abc123def456",
"user_id": "usr_789",
"message": "Payment failed",
"error": "Card declined",
"stripe_error_code": "card_declined",
"order_id": "ord_456",
"amount": 2499
}
Strengths: Rich context, high cardinality, searchable. Weakness: Expensive to store at scale, requires structured formatting to be useful.
Traces
Traces follow a single request across multiple services:
Trace abc123def456:
  API Gateway [12ms]
    → Order Service [45ms]
      → Payment Service [890ms] ← SLOW
        → Stripe API [850ms] ← ROOT CAUSE
      → Inventory Check [8ms]
Strengths: Shows exactly where time is spent across service boundaries. Weakness: Sampling required at scale, complex to implement.
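In code, a trace is assembled from spans opened around each unit of work. A minimal sketch using the OpenTelemetry Python API, assuming an SDK and exporter are already configured elsewhere (charge_payment is a hypothetical downstream call):

from opentelemetry import trace

tracer = trace.get_tracer("order-service")

def create_order(order):
    # Each span records its start time, duration, and parent, which is
    # how the backend assembles the cross-service tree shown above.
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("order.id", order.id)
        charge_payment(order)  # spans opened inside nest as children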
Using Them Together
Metrics tell you: “P99 latency spiked to 2 seconds at 3:15 PM”
Traces tell you: “The slow requests are all hitting the Payment Service”
Logs tell you: “Stripe is returning 429 rate limit errors from IP 54.23.x.x”
Structured Logging
Unstructured logs are human-readable but machine-hostile:
ERROR 2026-03-04 15:23:47 - Order 456 failed because payment was declined for user 789
Structured logs are both human-readable and machine-parseable:
{
"level": "ERROR",
"timestamp": "2026-03-04T15:23:47Z",
"event": "order_payment_failed",
"order_id": "456",
"user_id": "789",
"reason": "card_declined"
}
The structured version is queryable:
-- Find all payment failures for user 789
SELECT * FROM logs WHERE event = 'order_payment_failed' AND user_id = '789'
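Producing logs in this shape does not require a special library. A minimal sketch with Python's standard logging module (the JsonFormatter class and the fields convention are illustrative; libraries such as structlog package the same pattern ready-made):

import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "level": record.levelname,
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(timespec="seconds"),
            "event": record.getMessage(),
        }
        # Structured fields arrive via logging's `extra` mechanism.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("order_payment_failed",
             extra={"fields": {"order_id": "456", "user_id": "789",
                               "reason": "card_declined"}})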
Log Levels
DEBUG: Detailed internal state (disable in production; see the sketch after this list)
INFO: Normal but significant events (request processed, job completed)
WARN: Unexpected but handled events (retry succeeded, fallback used)
ERROR: Failures that need attention (request failed, data inconsistency)
FATAL: Service cannot continue (startup failure, unrecoverable state)
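One way to enforce the DEBUG rule above is to make the level an environment concern rather than a code change. A small sketch (the LOG_LEVEL variable name is just a convention):

import logging
import os

# Production defaults to INFO; a debugging session can export
# LOG_LEVEL=DEBUG without touching code.
logging.getLogger().setLevel(os.environ.get("LOG_LEVEL", "INFO"))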
What to Log
Always log: Requests received, responses sent, errors, state transitions, business events. Never log: Passwords, tokens, PII (personal data), credit card numbers, full request bodies with sensitive fields.
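The “never log” list is easier to enforce mechanically than by review alone. A minimal sketch of a redaction filter, reusing the fields convention from the structured-logging sketch above (the SENSITIVE_KEYS set is illustrative; match it to your own schema):

import logging

SENSITIVE_KEYS = {"password", "token", "card_number", "ssn"}

class RedactionFilter(logging.Filter):
    def filter(self, record):
        # Scrub sensitive values from structured fields before emission.
        fields = getattr(record, "fields", None)
        if isinstance(fields, dict):
            for key in SENSITIVE_KEYS & fields.keys():
                fields[key] = "[REDACTED]"
        return True  # never drop the record, only scrub it

logging.getLogger("order-service").addFilter(RedactionFilter())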
Custom Instrumentation
Auto-instrumentation captures infrastructure. Manual instrumentation captures what matters to your business:
from prometheus_client import Counter, Histogram, Gauge

# Business metrics
orders_created = Counter('orders_created_total', 'Orders created', ['payment_method'])
order_value = Histogram('order_value_dollars', 'Order value',
                        buckets=[10, 50, 100, 500, 1000, 5000])
active_carts = Gauge('active_carts', 'Shopping carts with items')

def create_order(order):
    orders_created.labels(payment_method=order.payment_method).inc()
    order_value.observe(order.total)
    active_carts.dec()  # the cart just converted into an order
The RED Method (Request-Driven)
For every service, instrument (a sketch follows the list):
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Distribution of request latency
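A minimal sketch of RED for an HTTP handler, using the same prometheus_client primitives as the business metrics above (the handle wrapper and its arguments are illustrative):

import time
from prometheus_client import Counter, Histogram

# Rate and Errors both fall out of one counter labeled by status;
# Duration comes from a latency histogram.
requests_total = Counter('http_requests_total', 'Requests',
                         ['method', 'path', 'status'])
request_duration = Histogram('http_request_duration_seconds',
                             'Request latency', ['method', 'path'])

def handle(method, path, handler):
    start = time.monotonic()
    status = 500  # assume failure unless the handler returns
    try:
        status = handler()  # returns an HTTP status code
        return status
    finally:
        requests_total.labels(method, path, str(status)).inc()
        request_duration.labels(method, path).observe(time.monotonic() - start)

Rate is then the per-second increase of http_requests_total, errors are the 500-class subset of the same counter, and duration percentiles come from the histogram, as in the alerting example below.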
The USE Method (Resource-Driven)
For every resource (CPU, memory, disk, network), instrument the following, sketched below:
- Utilization: Percentage of resource in use
- Saturation: Queue depth of work waiting
- Errors: Error events on the resource
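A sketch of utilization and saturation for the CPU, pairing prometheus_client gauges with psutil readings (error counters typically come from the kernel or an exporter such as node_exporter rather than application code, so they are omitted here):

import psutil
from prometheus_client import Gauge

cpu_utilization = Gauge('cpu_utilization_percent', 'CPU busy percentage')
cpu_saturation = Gauge('cpu_load_per_core', '1-minute load average per core')

def collect_cpu():
    # Utilization: how busy the CPU is right now.
    cpu_utilization.set(psutil.cpu_percent(interval=None))
    # Saturation: run-queue load relative to core count; sustained
    # values above 1.0 mean work is queuing for the CPU.
    load1, _, _ = psutil.getloadavg()
    cpu_saturation.set(load1 / psutil.cpu_count())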
Alerting Philosophy
Alert on Symptoms, Not Causes
# BAD: Alert on cause (too many alerts, not actionable)
- alert: HighCPU
  expr: node_cpu_usage > 80

# GOOD: Alert on symptom (user-facing impact)
- alert: HighLatency
  expr: histogram_quantile(0.99, rate(http_duration_seconds_bucket[5m])) > 2
  for: 5m
  annotations:
    summary: "P99 latency exceeds 2 seconds"
If CPU is 90% but latency is normal and error rate is zero, nothing is broken. Do not wake someone up.
Alert Severity
Critical: Customer-facing impact NOW → Page on-call
Warning: Degradation detected, may become critical → Slack notification
Info: Noteworthy but not actionable → Dashboard annotation
Building Observability Culture
Technology is necessary but not sufficient. The culture shift is harder:
- Developers own production observability — not the SRE team, not the monitoring team
- Every feature ships with instrumentation — part of the definition of done
- Runbooks are living documents — updated after every incident
- Dashboards are team-specific — each team maintains their service dashboards
- Observability debt is tracked — gaps in instrumentation are logged and prioritized
Anti-Patterns
| Anti-Pattern | Impact | Fix |
|---|---|---|
| Monitoring without observability | Can only debug known failure modes | Add traces and structured logs |
| Unstructured logs | Cannot query or correlate | JSON structured logging everywhere |
| Alerting on every metric | Alert fatigue, nothing gets investigated | Alert on symptoms, not causes |
| Centralized observability team | Teams cannot debug their own services | Embed observability in every team |
| No business metrics | Cannot connect technical issues to revenue impact | Instrument business events |
Observability is not a tool you buy — it is a practice you build. The tools (Prometheus, Grafana, Jaeger, Loki) are commodities. The practice — asking good questions, instrumenting the right things, reducing time-to-understanding — is what separates teams that resolve incidents in minutes from teams that resolve them in hours.