
How to Implement Observability: Traces, Metrics, and Logs at Scale

Build a production observability stack. Covers OpenTelemetry instrumentation, Prometheus metrics, distributed tracing, log aggregation, and alerting strategies.

Monitoring tells you something is broken. Observability tells you why. In distributed systems, you can’t debug with console.log. You need traces to follow requests across services, metrics to spot trends, and logs for the details. This guide walks through building a complete observability stack from instrumentation to alerting, using OpenTelemetry as the universal standard.

The key distinction: monitoring is about known-unknowns (“alert me when CPU > 80%”), while observability is about unknown-unknowns (“why are 2% of requests for European users timing out on Tuesdays?”). You need both.


The Three Pillars

┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│    TRACES     │   │    METRICS    │   │     LOGS      │
│               │   │               │   │               │
│ Request flow  │   │ Aggregated    │   │ Individual    │
│ across        │   │ time-series   │   │ event         │
│ services      │   │ data          │   │ records       │
│               │   │               │   │               │
│ "What path?"  │   │ "What trend?" │   │"What detail?" │
└───────────────┘   └───────────────┘   └───────────────┘

When to Use Which

| Signal | Use When | Example |
|---|---|---|
| Traces | Debugging slow requests, understanding service dependencies | "This request spent 1.2s in the payment service" |
| Metrics | Setting alerts, tracking SLOs, capacity planning | "Error rate is 3.2%, latency p99 is 450ms" |
| Logs | Investigating specific events, audit trails, debugging | "User auth failed: invalid token at 14:32:05" |
| All three | Production incident investigation | Metric spikes → trace the slow requests → read the error logs |

Step 1: Instrument with OpenTelemetry

1.1 Node.js Auto-Instrumentation

// tracing.js — load BEFORE your application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-http');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  serviceName: 'api-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  // NodeSDK expects a metric reader, not a bare metric exporter
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4318/v1/metrics',
    }),
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();

# Run your app with tracing loaded first
node --require ./tracing.js app.js

1.2 Python Auto-Instrumentation

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Configure the trace provider and export spans in batches over gRPC
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)

# Auto-instrument frameworks (zero code changes to your routes)
FlaskInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()
RequestsInstrumentor().instrument()

1.3 Custom Instrumentation Best Practices

tracer = trace.get_tracer(__name__)

@app.route("/api/orders", methods=["POST"])
def create_order():
    with tracer.start_as_current_span("create_order") as span:
        # DO: Add business-relevant attributes
        span.set_attribute("order.customer_id", customer_id)
        span.set_attribute("order.total", total)
        span.set_attribute("order.item_count", len(items))

        # DO: Create child spans for significant operations
        with tracer.start_as_current_span("validate_inventory"):
            check_inventory(items)

        with tracer.start_as_current_span("process_payment"):
            charge_card(payment)

        # DON'T: Instrument every single function call
        # DON'T: Put sensitive data (PII, passwords) in spans
        # DON'T: Create spans inside tight loops

Step 2: Deploy the Collector Stack

# docker-compose.yml — Observability Stack
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    volumes:
      - ./otel-config.yaml:/etc/otel/config.yaml
    command: ["--config=/etc/otel/config.yaml"]

  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # UI
    # recent all-in-one images also accept OTLP on 4317/4318; the collector
    # reaches Jaeger over the compose network, so no host mapping is needed

  loki:
    image: grafana/loki:latest
    ports: ["3100:3100"]

# named volumes must be declared at the top level
volumes:
  grafana_data:
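The compose file mounts a `prometheus.yml` that is not shown. A minimal sketch that scrapes the collector's Prometheus exporter (the `0.0.0.0:8889` endpoint configured in the collector config) could look like this; the job names and interval are illustrative:

```yaml
# prometheus.yml — scrape the collector's Prometheus exporter
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "otel-collector"
    static_configs:
      - targets: ["otel-collector:8889"]   # metrics the collector exports
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]        # Prometheus self-monitoring
```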

Collector Configuration

# otel-config.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: "0.0.0.0:4317" }
      http: { endpoint: "0.0.0.0:4318" }

processors:
  # Memory limiter prevents OOM; list it first in every pipeline
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  batch:
    timeout: 5s
    send_batch_size: 1024

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  # the dedicated `jaeger` exporter was removed from recent collector releases;
  # export OTLP directly to Jaeger's OTLP receiver instead
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls: { insecure: true }
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

Step 3: Define Key Metrics (RED + USE)

RED Method (Request-oriented)

| Metric | What It Measures | PromQL Example | Alert Threshold |
|---|---|---|---|
| Rate | Requests per second | `rate(http_requests_total[5m])` | > 50% drop from baseline |
| Errors | Error rate percentage | `rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])` | > 1% for 5 minutes |
| Duration | Latency percentiles | `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))` | p99 > 2s for 5 minutes |
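The `histogram_quantile` call in the Duration row estimates a percentile by linear interpolation over cumulative bucket counts. A minimal Python sketch of that estimation (simplified from Prometheus's real implementation) helps build intuition for how the number is derived:

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs sorted by bound,
    ending with (inf, total) — the shape of a Prometheus histogram.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Prometheus caps the estimate at the highest finite bound
                return prev_bound
            # linearly interpolate inside the bucket containing the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# 100 requests: 50 under 100ms, 40 more under 500ms, 10 more under 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(round(histogram_quantile(0.95, buckets), 3))  # → 0.75
```

The estimate is only as good as the bucket layout: a p99 that lands in a wide bucket is mostly interpolation, which is why placing bucket boundaries near your SLO thresholds matters.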

USE Method (Resource-oriented)

| Metric | What It Measures | PromQL Example | Alert Threshold |
|---|---|---|---|
| Utilization | How busy is the resource? | `avg(rate(node_cpu_seconds_total{mode!="idle"}[5m]))` | > 80% sustained for 15 minutes |
| Saturation | How much work is queued? | `node_load1 / count(node_cpu_seconds_total{mode="idle"})` | > 2.0 for 10 minutes |
| Errors | How often does it fail? | `rate(node_network_receive_errs_total[5m])` | Any sustained nonzero rate |
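The Utilization expression above is just "non-idle CPU seconds over total CPU seconds" computed from counters. The same arithmetic on two hypothetical readings of `node_cpu_seconds_total`-style counters (the numbers are made up for illustration):

```python
def cpu_utilization(prev, curr):
    """Fraction of CPU time spent non-idle between two counter readings.

    prev/curr: dicts of CPU mode -> cumulative seconds, shaped like
    node_exporter's node_cpu_seconds_total.
    """
    total_delta = sum(curr.values()) - sum(prev.values())
    idle_delta = curr["idle"] - prev["idle"]
    # guard against a zero interval between scrapes
    return 1 - idle_delta / total_delta if total_delta else 0.0

prev = {"idle": 100.0, "user": 50.0, "system": 10.0}
curr = {"idle": 116.0, "user": 82.0, "system": 26.0}
print(cpu_utilization(prev, curr))  # → 0.75
```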

Step 4: Build Custom Metrics

import time
from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

meter = metrics.get_meter("api-service")

# Counter — monotonically increasing (requests, errors, events)
request_counter = meter.create_counter(
    "api.requests",
    description="Total API requests",
    unit="1",
)

# Histogram — distribution of values (latency, response size)
latency_histogram = meter.create_histogram(
    "api.latency",
    description="Request latency in milliseconds",
    unit="ms",
)

# Observable Gauge — current state (connections, queue depth, cache size);
# callbacks yield Observation objects on each collection cycle
def get_queue_depth(options: CallbackOptions):
    yield Observation(queue.qsize(), {"queue": "main"})

meter.create_observable_gauge(
    "api.queue_depth",
    callbacks=[get_queue_depth],
    description="Current queue depth",
)

# Usage in a request handler
@app.route("/api/customers")
def list_customers():
    start = time.time()
    try:
        result = db.query("SELECT * FROM customers")
        request_counter.add(1, {"method": "GET", "endpoint": "/customers", "status": "200"})
        return jsonify(result)
    except Exception:
        request_counter.add(1, {"method": "GET", "endpoint": "/customers", "status": "500"})
        raise
    finally:
        latency_histogram.record(
            (time.time() - start) * 1000,
            {"method": "GET", "endpoint": "/customers"},
        )

Metric Naming Conventions

| Pattern | Example | Description |
|---|---|---|
| `<namespace>.<metric>` | `api.requests` | Simple counter |
| `<namespace>.<metric>_total` | `http_requests_total` | Prometheus convention for counters |
| `<namespace>.<metric>_seconds` | `http_request_duration_seconds` | Duration in seconds |
| `<namespace>.<metric>_bytes` | `http_response_size_bytes` | Size in bytes |
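Prometheus only accepts metric names matching `[a-zA-Z_:][a-zA-Z0-9_:]*`, and dotted OpenTelemetry names like `api.requests` are exported with dots translated to underscores. A small, illustrative pre-flight check (the helper names here are not from any library) can catch bad names before they reach the collector:

```python
import re

# Prometheus metric-name rule
PROM_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")

def to_prometheus_name(otel_name: str) -> str:
    """Translate a dotted OTel metric name the way exporters do (dots -> underscores)."""
    return otel_name.replace(".", "_")

def is_valid_prometheus_name(name: str) -> bool:
    return bool(PROM_NAME_RE.match(name))

print(to_prometheus_name("api.requests"))               # → api_requests
print(is_valid_prometheus_name("http_requests_total"))  # → True
print(is_valid_prometheus_name("2xx-responses"))        # → False (leading digit, hyphen)
```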

Step 5: Configure Alerting

Alert Rules (Prometheus)

# alerts.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
          / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.service }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2s for {{ $labels.service }}"
          runbook: "https://wiki.internal/runbooks/high-latency"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          runbook: "https://wiki.internal/runbooks/service-down"

Alerting Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Alert on every metric | Alert fatigue; the team ignores alerts | Alert only on symptoms (error rate, latency), not causes (CPU) |
| No severity levels | Everything is a page at 3 a.m. | P1 = page on-call, P2 = Slack, P3 = ticket |
| No runbook linked | On-call doesn't know what to do | Every alert must link to a runbook |
| A `for:` of only 30 seconds | Flapping alerts from transient spikes | Require `for: 5m` at minimum |
| No escalation path | Pages go unanswered | On-call → backup → team lead → manager |
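The `for:` clause in the rules above means the expression must stay true across consecutive evaluations before the alert fires. A toy simulation (not Prometheus code, just the idea) shows how it suppresses flapping:

```python
class PendingAlert:
    """Fire only after the condition has held for `for_evals` consecutive checks."""

    def __init__(self, for_evals: int):
        self.for_evals = for_evals
        self.consecutive = 0

    def evaluate(self, condition_true: bool) -> bool:
        # any false evaluation resets the pending window
        self.consecutive = self.consecutive + 1 if condition_true else 0
        return self.consecutive >= self.for_evals

# for: 5m at a 1m evaluation interval = 5 consecutive true evaluations
alert = PendingAlert(for_evals=5)
samples = [True, True, False, True, True, True, True, True]  # one transient dip
fired = [alert.evaluate(s) for s in samples]
print(fired)  # → [False, False, False, False, False, False, False, True]
```

The transient dip at the third evaluation resets the window, so the alert only fires once the condition has been continuously true for the full duration.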

Debugging with Correlated Signals

The real power of observability comes from correlating traces, metrics, and logs:

1. METRIC ALERT: Error rate > 5% on payment-service
2. TRACE SEARCH: Find traces with errors in payment-service
   → Trace ID: abc-123-def shows 3.2s latency, error in DB call
3. LOG SEARCH: Filter logs by trace_id = abc-123-def
   → "Connection pool exhausted: max connections (25) reached"
4. ROOT CAUSE: Database connection pool too small for traffic spike

Ensure all three signals share common identifiers: trace_id, span_id, service_name.
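One way to guarantee logs carry those identifiers is a structured JSON formatter that lifts `trace_id` and `span_id` off each log record. This stdlib-only sketch uses illustrative field names; in practice an OpenTelemetry logging integration injects the IDs for you:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so Loki/ELK can index the fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            "service_name": "api-service",
            # correlation IDs, attached at the call site via `extra=`
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(payload)

logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "Connection pool exhausted: max connections (25) reached",
    extra={"trace_id": "abc-123-def", "span_id": "0042"},
)
```

With every line shaped like this, the "filter logs by trace_id" step of the investigation above becomes a single indexed query.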


Observability Maturity Model

| Level | Characteristics | Tools Typically Used |
|---|---|---|
| Level 0: Reactive | Check logs only after incidents; no dashboards | SSH + grep, basic CloudWatch |
| Level 1: Monitoring | Dashboards for key metrics, basic alerting | Grafana, PagerDuty, basic APM |
| Level 2: Observability | Distributed tracing, structured logging, SLOs defined | Datadog or New Relic, Jaeger, ELK |
| Level 3: Proactive | Anomaly detection, automated runbooks, error budgets | ML-based alerting, runbook automation |
| Level 4: Predictive | Capacity forecasting, chaos engineering, AIOps | Gremlin, custom ML models, full SRE practice |

Instrumentation Priority Order

When adding observability to an existing system, instrument in this order for maximum impact:

  1. Request latency (P50, P95, P99) — The most universal health signal
  2. Error rates (5xx, 4xx by endpoint) — Detect failures users experience
  3. Throughput (requests per sec) — Detect traffic anomalies
  4. Saturation (CPU, memory, disk, connections) — Predict capacity issues
  5. Dependencies (database latency, external API latency) — Find bottlenecks
  6. Business metrics (orders per min, signups per day) — Connect infra to revenue

Observability Checklist

  • OpenTelemetry SDK integrated in all services (auto + custom instrumentation)
  • Custom instrumentation follows best practices (business attributes, no PII)
  • Collector deployed with memory limiter and batch processor
  • Prometheus scraping metrics from all services
  • Jaeger/Tempo receiving traces with service-to-service correlation
  • Loki/ELK aggregating structured logs
  • Grafana dashboards for RED + USE metrics on every service
  • Alert rules cover error rate, latency, and availability
  • Every alert has a severity level, runbook link, and escalation path
  • Trace-metric-log correlation verified (search by trace_id)
  • On-call rotation established with escalation paths
  • Monthly alert review to reduce fatigue

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For infrastructure audits, visit garnetgrid.com.
:::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
