
Observability Stack: Logs, Metrics, Traces Unified

Design a complete observability stack. Covers the three pillars (logs, metrics, traces), tool selection (Grafana, Datadog, OpenTelemetry), SLOs, alerting, and dashboard design for production systems.

You can’t fix what you can’t see. Observability gives you the ability to understand why a system is behaving a certain way from its external outputs — logs, metrics, and traces. Monitoring tells you when something is broken; observability helps you figure out why. This guide covers how to build an observability stack that actually helps you debug production issues at 3am instead of generating alert fatigue.

The difference between monitoring and observability is the difference between “is the system up?” and “why is this specific request timing out for users in Europe?”


The Three Pillars

| Pillar | What | Answers | Example Tools | Data Shape |
|---|---|---|---|---|
| Logs | Discrete events with context | "What happened?" | Loki, Elasticsearch, CloudWatch | `{timestamp, level, message, metadata}` |
| Metrics | Numeric measurements over time | "How much? How fast?" | Prometheus, Datadog, CloudWatch | `metric_name{labels} = value @ time` |
| Traces | Request flow across services | "Where is the bottleneck?" | Jaeger, Tempo, X-Ray | `span{trace_id, parent_id, duration}` |

How They Work Together

User reports: "The checkout page is slow"

1. METRICS → p99 latency spiked from 200ms to 2s at 14:30
2. TRACES → Slow requests are spending 1.8s in the payment service
3. LOGS → Payment service logs show "connection pool exhausted" at 14:28
4. ROOT CAUSE → Database connection pool maxed out due to a slow query
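The walkthrough above is essentially a join across the pillars on `trace_id`. A toy in-memory sketch (the data is purely illustrative — in practice Grafana, Datadog, etc. do this correlation for you):

```python
# Hypothetical trace and log records sharing a trace_id for correlation.
traces = [
    {"trace_id": "t1", "service": "payments", "duration_ms": 1800},
    {"trace_id": "t2", "service": "payments", "duration_ms": 40},
]
logs = [
    {"trace_id": "t1", "level": "error", "message": "connection pool exhausted"},
    {"trace_id": "t2", "level": "info", "message": "charge ok"},
]

# Step 2: metrics flagged a latency spike, so pull the slow traces...
slow = [t for t in traces if t["duration_ms"] > 1000]

# Step 3: ...then fetch the logs emitted inside each slow trace.
for t in slow:
    related = [l["message"] for l in logs if l["trace_id"] == t["trace_id"]]
    print(t["trace_id"], "->", related)
```

This is why propagating `trace_id` into every log line matters: without it, step 3 becomes grepping by timestamp.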

OpenTelemetry (The Standard)

OpenTelemetry (OTel) is the vendor-neutral CNCF standard for instrumenting applications. Instrument once, send to any backend.

# Python: Auto-instrumentation with OpenTelemetry
from opentelemetry import trace, metrics
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup tracing
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)

# Auto-instrument frameworks (zero code changes needed)
FastAPIInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()
RequestsInstrumentor().instrument()

# Custom spans for business logic
tracer = trace.get_tracer(__name__)

@app.post("/orders")
async def create_order(order: OrderRequest):
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("order.customer_id", order.customer_id)
        span.set_attribute("order.total", order.total)
        span.set_attribute("order.item_count", len(order.items))

        with tracer.start_as_current_span("validate_inventory"):
            check_inventory(order.items)

        with tracer.start_as_current_span("process_payment"):
            charge_card(order.payment)

        return {"order_id": order.id}

OTel Collector Architecture

Service A ──┐                    ┌──→ Prometheus (metrics)
Service B ──┼──→ OTel Collector ─┼──→ Loki (logs)
Service C ──┘    (pipeline)      └──→ Tempo/Jaeger (traces)

Benefits:
- Single exporter endpoint for all services
- Vendor-agnostic (switch backends without code changes)
- Batching, retry, sampling built-in
- Can transform/filter data before export
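A minimal Collector configuration matching the diagram might look like the sketch below. Exporter names and endpoints are assumptions that vary by Collector distribution (the `prometheus` and `loki` exporters ship in `opentelemetry-collector-contrib`):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 5s

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889          # scraped by Prometheus
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true                # assume in-cluster traffic

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
```

Every service points at the same `otel-collector:4317` endpoint; swapping Tempo for Jaeger is a change to this file, not to application code.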

Structured Logging

import structlog

logger = structlog.get_logger()

# ✅ Structured logging — searchable, parseable, aggregatable
logger.info("order_created",
    order_id="ord-123",
    customer_id="cust-456",
    total=99.99,
    items_count=3,
    payment_method="card",
    processing_time_ms=245
)
# Output: {"event": "order_created", "order_id": "ord-123",
#          "customer_id": "cust-456", "total": 99.99, ...}

# ❌ Unstructured logging — impossible to filter/aggregate
logger.info(f"Created order ord-123 for customer cust-456, total $99.99")
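If structlog isn't an option, the same idea works with the standard library: a formatter that renders each record as one JSON object per line. A minimal sketch (the `fields` attribute name is our convention, not a stdlib one):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""
    def format(self, record):
        payload = {
            "event": record.getMessage(),
            "level": record.levelname.lower(),
            "timestamp": self.formatTime(record),
        }
        # Merge structured key/value pairs passed via `extra=`
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order_created", extra={"fields": {"order_id": "ord-123", "total": 99.99}})
```

The output is grep-able and ingestable by Loki or Elasticsearch without any parsing rules.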

Log Levels

| Level | Use For | Example | Alert? |
|---|---|---|---|
| DEBUG | Development only (never in prod) | Variable values, flow tracing | No |
| INFO | Normal operations | "Order created", "User logged in" | No |
| WARN | Degraded but working | "Retry attempt 2/3", "Cache miss" | Aggregate (> threshold) |
| ERROR | Failed operation | "Payment failed", "DB connection refused" | Yes (P2) |
| FATAL | System cannot continue | "Out of memory", "Config missing" | Yes (P1, page on-call) |

Metrics: The RED & USE Methods

RED Method (for request-driven services)

| Metric | What to Track | Alert On | Example Query (PromQL) |
|---|---|---|---|
| Rate | Requests per second | Sudden drop or spike | `rate(http_requests_total[5m])` |
| Errors | Error rate (% failing) | > 1% error rate | `rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])` |
| Duration | Request latency (p50, p95, p99) | p99 > 500ms | `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))` |
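What those three numbers actually measure can be sketched in plain Python over one window of request records (hypothetical data; in production, Prometheus computes this from counters and histogram buckets):

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int         # HTTP status code
    duration_ms: float  # wall-clock latency

def red_summary(requests, window_seconds):
    """Compute Rate, Errors, and Duration (p99) for one window."""
    rate = len(requests) / window_seconds
    errors = sum(1 for r in requests if r.status >= 500) / len(requests)
    durations = sorted(r.duration_ms for r in requests)
    p99 = durations[min(len(durations) - 1, int(0.99 * len(durations)))]
    return {"rate_rps": rate, "error_ratio": errors, "p99_ms": p99}

# 98 fast successes plus 2 slow server errors in a 10-second window
window = [Request(200, 45.0)] * 98 + [Request(500, 1800.0)] * 2
print(red_summary(window, window_seconds=10))
```

Note how two bad requests out of a hundred barely move the error ratio but completely dominate p99 — which is why you alert on tail latency, not the average.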

USE Method (for infrastructure)

| Metric | What to Track | Alert On | Example |
|---|---|---|---|
| Utilization | CPU, memory, disk, network % | > 80% sustained (15 min) | `node_cpu_seconds_total` |
| Saturation | Queue depth, thread pool exhaustion | Growing queues (> 5 min) | `thread_pool_active / thread_pool_max` |
| Errors | Hardware errors, OOM kills, disk failures | Any occurrence | `node_vmstat_oom_kill` |

SLOs (Service Level Objectives)

slos:
  - name: "API Availability"
    target: 99.9%  # 8.76 hours downtime/year budget
    indicator: "Successful requests / Total requests"
    window: 30 days

  - name: "API Latency"
    target: "95% of requests < 200ms"
    indicator: "p95 latency"
    window: 30 days

  - name: "Data Pipeline Freshness"
    target: "99% of tables updated within 2 hours"
    indicator: "Tables with freshness < 2h / Total tables"
    window: 7 days

Error Budget

SLO: 99.9% availability = 43.83 minutes downtime / month

If you've burned 30 minutes this month:
→ 13.83 minutes remaining
→ FREEZE risky deployments
→ Prioritize reliability work

If you've burned 0 minutes:
→ Full budget available
→ Ship features aggressively
→ Take calculated risks (bigger deploys, experiments)

SLO Tiers

| SLO Target | Annual Downtime | Monthly Budget | Requires |
|---|---|---|---|
| 99% | 3.65 days | 7.3 hours | Basic monitoring |
| 99.9% | 8.76 hours | 43.8 minutes | Active alerting, redundancy |
| 99.95% | 4.38 hours | 21.9 minutes | Multi-AZ, auto-failover |
| 99.99% | 52.6 minutes | 4.4 minutes | Multi-region, active-active |
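The downtime figures in these tables are simple arithmetic on the error budget; a quick sketch (using a 365-day year, so numbers match the table to rounding):

```python
def downtime_budget(slo_target):
    """Allowed downtime for a given availability target (365-day year)."""
    error_budget = 1.0 - slo_target
    annual_minutes = error_budget * 365 * 24 * 60
    return {
        "annual_minutes": annual_minutes,
        "monthly_minutes": annual_minutes / 12,
    }

for target in (0.99, 0.999, 0.9995, 0.9999):
    b = downtime_budget(target)
    print(f"{target:.2%}: {b['annual_minutes']:.0f} min/yr, {b['monthly_minutes']:.1f} min/mo")
```

Each extra nine divides the budget by ten, which is why each tier demands a qualitatively different architecture, not just better monitoring.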

Tool Selection

| Stack | Best For | Monthly Cost (50 hosts) | Complexity |
|---|---|---|---|
| Grafana + Prometheus + Loki + Tempo | Full control, cost-efficient, OSS | $0 (self-hosted infra) | High (operate yourself) |
| Grafana Cloud | OSS stack, managed | $200-$2,000 | Medium |
| Datadog | All-in-one, great UX, fast setup | $2,000-$10,000 | Low |
| New Relic | APM-focused, simple pricing | $1,000-$5,000 | Low |
| AWS CloudWatch + X-Ray | AWS-native, no extra infra | $500-$3,000 | Medium |
| Elastic (ELK) | Log-heavy workloads | $1,000-$5,000 | High (self-hosted) |

Alerting Best Practices

| Rule | Why | Example |
|---|---|---|
| Alert on symptoms, not causes | Users care about symptoms | "API error rate > 5%", not "CPU > 80%" |
| Every alert must have a runbook link | At 3am you won't remember | Alert includes link to `runbooks/api-errors.md` |
| Use severity levels | Not everything is a page | P1 (page), P2 (Slack), P3 (ticket) |
| Deduplicate and group alerts | 100 alerts for 1 incident is noise | Group by service + error type |
| Review alert fatigue monthly | Ignored alerts = no alerts | Track alert-to-action ratio |
| Set up escalation chains | Pages can go unacknowledged | If not ack'd in 10 min: on-call → backup → team lead → eng manager |
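The deduplicate-and-group rule is mechanically simple; a minimal sketch of collapsing raw alerts into incidents keyed by service and error type (the alert schema here is an assumption — Alertmanager and Datadog do this natively via grouping rules):

```python
from collections import defaultdict

def group_alerts(alerts):
    """Collapse raw alerts into one incident per (service, error_type)."""
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[(alert["service"], alert["error_type"])].append(alert)
    return incidents

# 50 hosts firing the same payments timeout + 1 unrelated checkout error
raw = [
    {"service": "payments", "error_type": "timeout", "host": f"host-{i}"}
    for i in range(50)
] + [{"service": "checkout", "error_type": "5xx", "host": "host-0"}]

grouped = group_alerts(raw)
print(f"{len(raw)} raw alerts -> {len(grouped)} incidents to page on")
```

Fifty-one notifications become two pages; the per-host detail is still there inside each incident for debugging.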

Alert Anti-Pattern: The Boy Who Cried Wolf

Month 1: 200 alerts → team responds to all
Month 2: 200 alerts → team responds to 150
Month 3: 200 alerts → team ignores most
Month 4: Real outage buried in noise → 2-hour delayed response

Fix: Reduce to < 30 actionable alerts/month
If an alert never leads to action, delete it.

Dashboard Design

The Four Golden Signals Dashboard

Every service should have a dashboard showing:

  1. Latency: p50, p95, p99 over time
  2. Traffic: Requests per second
  3. Errors: Error rate (%) and error count by type
  4. Saturation: CPU, memory, connection pool utilization
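The latency panel's p50/p95/p99 lines all reduce to a percentile over raw samples. A nearest-rank sketch for intuition (illustrative only — production systems approximate this from histogram buckets, as in the `histogram_quantile` query above):

```python
def percentile(values, q):
    """Nearest-rank percentile (q in [0, 1]) over raw latency samples."""
    ordered = sorted(values)
    rank = max(0, min(len(ordered) - 1, round(q * len(ordered)) - 1))
    return ordered[rank]

latencies_ms = [12, 15, 18, 22, 25, 30, 45, 60, 120, 950]
for q in (0.50, 0.95, 0.99):
    print(f"p{int(q * 100)} = {percentile(latencies_ms, q)} ms")
```

With one 950ms outlier in ten samples, p50 stays healthy while p95 and p99 both jump to 950ms — the tail panels surface problems the median hides.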

Dashboard Hierarchy

| Level | Audience | Content |
|---|---|---|
| Executive | VP/CTO | SLO status (green/yellow/red), uptime, error budget |
| Service | On-call engineer | Golden signals per service, dependency health |
| Debug | Investigating engineer | Detailed traces, log correlation, resource metrics |

Stack Selection by Company Stage

| Company Stage | Recommended Stack | Monthly Budget | Why |
|---|---|---|---|
| Startup (under 10 eng) | Grafana Cloud free tier + Sentry | $0-$100 | Maximum value, minimal ops |
| Growth (10-50 eng) | Datadog or Grafana Cloud Pro | $500-$5K | Unified platform, less toolchain management |
| Scale (50-200 eng) | Datadog Enterprise or self-hosted Grafana stack | $5K-$50K | Custom dashboards, high cardinality |
| Enterprise (200+ eng) | Splunk, Dynatrace, or full self-hosted stack | $50K+ | Compliance, data sovereignty, scale |

Alert Fatigue Prevention

Alert fatigue is the number one observability failure mode. Prevent it by:

  • Alert on symptoms, not causes — alert on error rate > 5%, not CPU > 80%
  • Require runbooks — Every alert must have a linked runbook explaining what to do
  • Review weekly — Delete alerts that have not fired in 90 days or fire too frequently
  • Use severity levels — P1 (pages), P2 (Slack), P3 (ticket), P4 (dashboard only)
  • Target under 5 pages per week — More than this and on-call engineers stop responding

Checklist

  • OpenTelemetry SDK integrated (auto + custom instrumentation)
  • OTel Collector deployed (vendor-neutral pipeline)
  • Structured logging with consistent JSON schema
  • RED metrics on all request-driven services
  • USE metrics on all infrastructure
  • SLOs defined for critical services with error budgets
  • Error budgets tracked and visible to engineering leadership
  • Golden signals dashboard for each service
  • Alerting with severity levels, runbook links, and escalation
  • Distributed tracing across service boundaries
  • Monthly alert review to reduce fatigue (< 30 actionable/month)
  • Dashboard hierarchy (executive → service → debug)

:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For observability consulting, visit garnetgrid.com. :::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.