
Observability Engineering: Beyond Monitoring

Build observability into your systems from the ground up. Covers the three pillars (metrics, logs, traces), structured logging, custom instrumentation, the difference between monitoring and observability, and building a culture where teams can debug production issues without escalation.

Monitoring tells you when something is broken. Observability tells you why. Monitoring is predefined dashboards and alerts for known failure modes. Observability is the ability to ask arbitrary questions about your system’s behavior — questions you did not anticipate when you built it.

The distinction matters because production fails in ways you have not imagined. A monitoring-only approach catches only the failures you predicted. Observability catches the ones you did not.


The Three Pillars

Metrics

Metrics are numeric measurements aggregated over time:

http_request_duration_seconds{method="GET", path="/api/orders", status="200"}
  → P50: 45ms, P95: 180ms, P99: 890ms

http_requests_total{method="GET", path="/api/orders", status="500"}
  → 47 errors in the last 5 minutes

Strengths: Cheap to store, fast to query, excellent for aggregation and alerting. Weakness: Low cardinality — you cannot drill into individual requests.
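Conceptually, a counter metric is just a count per unique label combination. A minimal sketch in plain Python (field names are illustrative; in practice a metrics library such as the Prometheus client does this aggregation for you):

```python
from collections import Counter

# Toy request events; a real service would record these via a
# metrics client rather than a list of dicts.
requests = [
    {"method": "GET", "path": "/api/orders", "status": "200"},
    {"method": "GET", "path": "/api/orders", "status": "500"},
    {"method": "GET", "path": "/api/orders", "status": "500"},
]

# Each unique (method, path, status) tuple is one metric series.
http_requests_total = Counter(
    (r["method"], r["path"], r["status"]) for r in requests
)

print(http_requests_total[("GET", "/api/orders", "500")])  # 2
```

This is also why cardinality is the limit: every distinct label combination is a separate series, so high-cardinality labels (like user IDs) explode storage.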

Logs

Logs are timestamped records of events:

{
  "timestamp": "2026-03-04T15:23:47Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "abc123def456",
  "user_id": "usr_789",
  "message": "Payment failed",
  "error": "Card declined",
  "stripe_error_code": "card_declined",
  "order_id": "ord_456",
  "amount": 2499
}

Strengths: Rich context, high cardinality, searchable. Weaknesses: Expensive to store at scale, requires structured formatting to be useful.

Traces

Traces follow a single request across multiple services:

Trace abc123def456:
  API Gateway      [12ms] → 
  Order Service    [45ms] → 
  Payment Service  [890ms] ← SLOW
    Stripe API     [850ms] ← ROOT CAUSE
  Inventory Check  [8ms]

Strengths: Shows exactly where time is spent across service boundaries. Weaknesses: Sampling required at scale, complex to implement.
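The root-cause walk in the trace above can be sketched in plain Python: descend from the slowest span into its slowest child until there are no children left. This is a toy model (span names and fields are illustrative); a real tracer such as OpenTelemetry records and links spans automatically.

```python
# Toy spans mirroring the trace shown above.
spans = [
    {"name": "API Gateway",     "parent": None,              "duration_ms": 12},
    {"name": "Order Service",   "parent": None,              "duration_ms": 45},
    {"name": "Payment Service", "parent": None,              "duration_ms": 890},
    {"name": "Stripe API",      "parent": "Payment Service", "duration_ms": 850},
    {"name": "Inventory Check", "parent": None,              "duration_ms": 8},
]

def root_cause(spans):
    """Follow the slowest span down through its children until the
    latency can no longer be attributed to a deeper call."""
    slowest = max(spans, key=lambda s: s["duration_ms"])
    children = [s for s in spans if s["parent"] == slowest["name"]]
    return root_cause(children) if children else slowest

print(root_cause(spans)["name"])  # Stripe API
```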

Using Them Together

  • Metrics tell you: “P99 latency spiked to 2 seconds at 3:15 PM.”
  • Traces tell you: “The slow requests are all hitting the Payment Service.”
  • Logs tell you: “Stripe is returning 429 rate limit errors from IP 54.23.x.x.”


Structured Logging

Unstructured logs are human-readable but machine-hostile:

ERROR 2026-03-04 15:23:47 - Order 456 failed because payment was declined for user 789

Structured logs are both:

{
  "level": "ERROR",
  "timestamp": "2026-03-04T15:23:47Z",
  "event": "order_payment_failed",
  "order_id": "456",
  "user_id": "789",
  "reason": "card_declined"
}

The structured version is queryable:

-- Find all payment failures for user 789
SELECT * FROM logs WHERE event = 'order_payment_failed' AND user_id = '789'
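A minimal way to emit such JSON logs with Python's standard-library `logging` (a sketch; production systems often use a dedicated library such as structlog or python-json-logger, and the extra field names here are illustrative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "timestamp": self.formatTime(record),
            "event": record.getMessage(),
        }
        # Fields passed via `extra=` are attached to the record object.
        for key in ("order_id", "user_id", "reason"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("order_payment_failed",
             extra={"order_id": "456", "user_id": "789",
                    "reason": "card_declined"})
```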

Log Levels

DEBUG:   Detailed internal state (disable in production)
INFO:    Normal but significant events (request processed, job completed)
WARN:    Unexpected but handled events (retry succeeded, fallback used)
ERROR:   Failures that need attention (request failed, data inconsistency)
FATAL:   Service cannot continue (startup failure, unrecoverable state)

What to Log

Always log: Requests received, responses sent, errors, state transitions, business events. Never log: Passwords, tokens, PII (personal data), credit card numbers, full request bodies with sensitive fields.
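One way to enforce the "never log" list is to scrub payloads before they reach the logger. A minimal sketch (the key names below are illustrative, not an exhaustive denylist; real systems often combine this with pattern-based detection):

```python
# Illustrative denylist of sensitive field names.
SENSITIVE_KEYS = {"password", "token", "card_number", "ssn"}

def redact(payload: dict) -> dict:
    """Return a copy of the payload that is safe to log."""
    return {
        key: "[REDACTED]" if key in SENSITIVE_KEYS else value
        for key, value in payload.items()
    }

safe = redact({"user_id": "usr_789", "password": "hunter2", "amount": 2499})
print(safe)  # {'user_id': 'usr_789', 'password': '[REDACTED]', 'amount': 2499}
```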


Custom Instrumentation

Auto-instrumentation captures infrastructure. Manual instrumentation captures what matters to your business:

from prometheus_client import Counter, Histogram, Gauge

# Business metrics
orders_created = Counter('orders_created_total', 'Orders created', ['payment_method'])
order_value = Histogram('order_value_dollars', 'Order value', 
                       buckets=[10, 50, 100, 500, 1000, 5000])
active_carts = Gauge('active_carts', 'Shopping carts with items')

def create_order(order):
    orders_created.labels(payment_method=order.payment_method).inc()
    order_value.observe(order.total)
    active_carts.dec()

The RED Method (Request-Driven)

For every service, instrument:

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Distribution of request latency

The USE Method (Resource-Driven)

For every resource (CPU, memory, disk, network), instrument:

  • Utilization: Percentage of resource in use
  • Saturation: Queue depth of work waiting
  • Errors: Error events on the resource
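Applied to a worker pool, for example, the three USE signals might look like this (a toy snapshot with illustrative field names):

```python
# Toy snapshot of a worker pool treated as a "resource".
pool = {"workers_busy": 9, "workers_total": 10,
        "queue_depth": 42, "task_errors": 3}

utilization = pool["workers_busy"] / pool["workers_total"]  # in use
saturation = pool["queue_depth"]                            # work waiting
errors = pool["task_errors"]                                # errors on the resource

print(f"util={utilization:.0%} sat={saturation} err={errors}")  # util=90% sat=42 err=3
```

Note that saturation (42 tasks queued) signals trouble even while utilization looks survivable: the queue is where latency hides.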

Alerting Philosophy

Alert on Symptoms, Not Causes

# BAD: Alert on cause (too many alerts, not actionable)
- alert: HighCPU
  expr: node_cpu_usage > 80

# GOOD: Alert on symptom (user-facing impact)
- alert: HighLatency
  expr: histogram_quantile(0.99, rate(http_duration_seconds_bucket[5m])) > 2
  for: 5m
  annotations:
    summary: "P99 latency exceeds 2 seconds"

If CPU is 90% but latency is normal and error rate is zero, nothing is broken. Do not wake someone up.
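That decision rule can be written down directly. A sketch (thresholds and parameter names are illustrative; note that CPU is deliberately ignored):

```python
def should_page(p99_latency_s: float, error_rate: float,
                cpu_percent: float) -> bool:
    """Page only on user-facing symptoms. cpu_percent is accepted but
    never consulted: a cause metric alone does not justify a page."""
    return p99_latency_s > 2.0 or error_rate > 0.01

should_page(p99_latency_s=0.4, error_rate=0.0, cpu_percent=92)  # False: CPU hot, users fine
should_page(p99_latency_s=3.1, error_rate=0.0, cpu_percent=35)  # True: users are waiting
```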

Alert Severity

Critical: Customer-facing impact NOW → Page on-call
Warning:  Degradation detected, may become critical → Slack notification
Info:     Noteworthy but not actionable → Dashboard annotation

Building Observability Culture

Technology is necessary but not sufficient. The culture shift is harder:

  1. Developers own production observability — not the SRE team, not the monitoring team
  2. Every feature ships with instrumentation — part of the definition of done
  3. Runbooks are living documents — updated after every incident
  4. Dashboards are team-specific — each team maintains their service dashboards
  5. Observability debt is tracked — gaps in instrumentation are logged and prioritized

Anti-Patterns

Anti-Pattern                     | Impact                                             | Fix
Monitoring without observability | Can only debug known failure modes                 | Add traces and structured logs
Unstructured logs                | Cannot query or correlate                          | JSON structured logging everywhere
Alerting on every metric         | Alert fatigue, nothing gets investigated           | Alert on symptoms, not causes
Centralized observability team   | Teams cannot debug their own services              | Embed observability in every team
No business metrics              | Cannot connect technical issues to revenue impact  | Instrument business events

Observability is not a tool you buy — it is a practice you build. The tools (Prometheus, Grafana, Jaeger, Loki) are commodities. The practice — asking good questions, instrumenting the right things, reducing time-to-understanding — is what separates teams that resolve incidents in minutes from teams that resolve them in hours.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
