How to Implement Observability: Traces, Metrics, and Logs at Scale
Build a production observability stack. Covers OpenTelemetry instrumentation, Prometheus metrics, distributed tracing, log aggregation, and alerting strategies.
Monitoring tells you something is broken. Observability tells you why. In distributed systems, you can’t debug with console.log. You need traces to follow requests across services, metrics to spot trends, and logs for the details. This guide walks through building a complete observability stack from instrumentation to alerting, using OpenTelemetry as the universal standard.
The key distinction: monitoring is about known-unknowns (“alert me when CPU > 80%”), while observability is about unknown-unknowns (“why are 2% of requests for European users timing out on Tuesdays?”). You need both.
The Three Pillars
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│    TRACES    │   │   METRICS    │   │     LOGS     │
│              │   │              │   │              │
│ Request flow │   │ Aggregated   │   │ Individual   │
│ across       │   │ time-series  │   │ event        │
│ services     │   │ data         │   │ records      │
│              │   │              │   │              │
│ "What path?" │   │ "What trend?"│   │"What detail?"│
└──────────────┘   └──────────────┘   └──────────────┘
When to Use Which
| Signal | Use When | Example |
|---|---|---|
| Traces | Debugging slow requests, understanding service dependencies | “This request spent 1.2s in the payment service” |
| Metrics | Setting alerts, tracking SLOs, capacity planning | “Error rate is 3.2%, latency p99 is 450ms” |
| Logs | Investigating specific events, audit trails, debugging | “User auth failed: invalid token at 14:32:05” |
| All three | Production incident investigation | Metric spikes → trace the slow requests → read the error logs |
Step 1: Instrument with OpenTelemetry
1.1 Node.js Auto-Instrumentation
// tracing.js — load BEFORE your application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-http');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  serviceName: 'api-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  // NodeSDK takes a metric *reader*, not a bare exporter
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4318/v1/metrics',
    }),
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();
# Run your app with tracing
node --require ./tracing.js app.js
1.2 Python Auto-Instrumentation
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Configure trace provider
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)

# Auto-instrument frameworks (zero code changes to your routes)
FlaskInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()
RequestsInstrumentor().instrument()
1.3 Custom Instrumentation Best Practices
tracer = trace.get_tracer(__name__)

@app.route("/api/orders")
def create_order():
    with tracer.start_as_current_span("create_order") as span:
        # DO: Add business-relevant attributes
        span.set_attribute("order.customer_id", customer_id)
        span.set_attribute("order.total", total)
        span.set_attribute("order.item_count", len(items))

        # DO: Create child spans for significant operations
        with tracer.start_as_current_span("validate_inventory"):
            check_inventory(items)
        with tracer.start_as_current_span("process_payment"):
            charge_card(payment)

# DON'T: Instrument every single function call
# DON'T: Put sensitive data (PII, passwords) in spans
# DON'T: Create spans inside tight loops
Step 2: Deploy the Collector Stack
# docker-compose.yml — Observability Stack
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    volumes:
      - ./otel-config.yaml:/etc/otel/config.yaml
    command: ["--config=/etc/otel/config.yaml"]

  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana

  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - COLLECTOR_OTLP_ENABLED=true   # accept OTLP from the collector (default in recent versions)
    ports:
      - "16686:16686"   # UI
      - "14268:14268"   # legacy Thrift ingest

  loki:
    image: grafana/loki:latest
    ports: ["3100:3100"]

volumes:
  grafana_data:   # named volume referenced by grafana must be declared
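The compose file mounts a `./prometheus.yml` that has to exist; a minimal scrape config is sketched below. The job names are illustrative, and the hostnames assume the default compose network, where services resolve by name:

```yaml
# prometheus.yml — scrape the collector's Prometheus exporter
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "otel-collector"
    static_configs:
      - targets: ["otel-collector:8889"]   # metrics the collector re-exports
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]        # Prometheus self-monitoring
```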
Collector Configuration
# otel-config.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: "0.0.0.0:4317" }
      http: { endpoint: "0.0.0.0:4318" }

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  # Memory limiter prevents OOM
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  # Recent collector-contrib releases removed the dedicated jaeger exporter;
  # send OTLP straight to Jaeger, which ingests it natively
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls: { insecure: true }
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
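At scale, exporting every trace gets expensive fast. One common lever is head sampling at the collector; a hedged sketch of the `probabilistic_sampler` processor from collector-contrib (the percentage is illustrative):

```yaml
# Optional once trace volume grows: sample at the collector.
processors:
  probabilistic_sampler:
    sampling_percentage: 10   # keep roughly 10% of traces

# ...then reference it in the traces pipeline, after memory_limiter:
#   processors: [memory_limiter, probabilistic_sampler, batch]
```

If dropping 90% of traces blindly is unacceptable, the contrib `tail_sampling` processor can instead keep all error and high-latency traces while sampling the healthy ones.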
Step 3: Define Key Metrics (RED + USE)
RED Method (Request-oriented)
| Metric | What It Measures | PromQL Example | Alert Threshold |
|---|---|---|---|
| Rate | Requests per second | rate(http_requests_total[5m]) | > 50% drop from baseline |
| Errors | Error rate percentage | rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) | > 1% for 5 minutes |
| Duration | Latency percentiles | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) | p99 > 2s for 5 minutes |
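It helps to know what `histogram_quantile` actually computes: Prometheus histograms store cumulative bucket counts, and the quantile is linearly interpolated inside the bucket where the target rank falls. A simplified standalone sketch of that math (ignoring PromQL's special-casing of the highest bucket):

```python
def histogram_quantile(q, buckets):
    """Approximate a quantile from cumulative histogram buckets, the way
    PromQL's histogram_quantile() does: locate the bucket containing the
    target rank, then interpolate linearly within it.

    buckets: sorted list of (upper_bound, cumulative_count).
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:          # empty bucket, no interpolation
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# le="0.1": 50 requests, le="0.5": 90, le="2.0": 100 (cumulative counts)
buckets = [(0.1, 50), (0.5, 90), (2.0, 100)]
p99 = histogram_quantile(0.99, buckets)   # falls in the (0.5, 2.0] bucket
```

This is also why bucket boundaries matter: a p99 alert at 2s is meaningless if your largest finite bucket is `le="1"`.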
USE Method (Resource-oriented)
| Metric | What It Measures | PromQL Example | Alert Threshold |
|---|---|---|---|
| Utilization | How busy is the resource? | avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) | > 80% sustained 15 min |
| Saturation | How much does it queue? | node_load1 / count(node_cpu_seconds_total{mode="idle"}) | > 2.0 for 10 minutes |
| Errors | How often does it fail? | rate(node_network_receive_errs_total[5m]) | Any sustained nonzero rate |
Step 4: Build Custom Metrics
import time

from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

meter = metrics.get_meter("api-service")

# Counter — monotonically increasing (requests, errors, events)
request_counter = meter.create_counter(
    "api.requests",
    description="Total API requests",
    unit="1",
)

# Histogram — distribution of values (latency, response size)
latency_histogram = meter.create_histogram(
    "api.latency",
    description="Request latency in milliseconds",
    unit="ms",
)

# Observable Gauge — current state (connections, queue depth, cache size).
# Callbacks receive CallbackOptions and yield Observation objects.
def get_queue_depth(options: CallbackOptions):
    yield Observation(queue.qsize(), {"queue": "main"})

meter.create_observable_gauge(
    "api.queue_depth",
    callbacks=[get_queue_depth],
    description="Current queue depth",
)

# Usage in request handler
@app.route("/api/customers")
def list_customers():
    start = time.time()
    try:
        result = db.query("SELECT * FROM customers")
        request_counter.add(1, {"method": "GET", "endpoint": "/customers", "status": "200"})
        return jsonify(result)
    except Exception:
        request_counter.add(1, {"method": "GET", "endpoint": "/customers", "status": "500"})
        raise
    finally:
        latency_histogram.record(
            (time.time() - start) * 1000,
            {"method": "GET", "endpoint": "/customers"},
        )
Metric Naming Conventions
| Pattern | Example | Description |
|---|---|---|
| `<namespace>.<metric>` | `api.requests` | OTel-style dotted name for a simple counter |
| `<metric>_total` | `http_requests_total` | Prometheus convention for counters |
| `<metric>_seconds` | `http_request_duration_seconds` | Duration, always in base seconds |
| `<metric>_bytes` | `http_response_size_bytes` | Size in bytes |
Step 5: Configure Alerting
Alert Rules (Prometheus)
# alerts.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.service }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2s for {{ $labels.service }}"
          runbook: "https://wiki.internal/runbooks/high-latency"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          runbook: "https://wiki.internal/runbooks/service-down"
Alerting Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Alert on every metric | Alert fatigue, team ignores alerts | Alert only on symptoms (error rate, latency), not causes (CPU) |
| No severity levels | Everything is a page at 3am | P1 = page on-call, P2 = Slack, P3 = ticket |
| No runbook linked | On-call doesn’t know what to do | Every alert must link to a runbook |
| Alert for 30 seconds | Flapping alerts from transient spikes | Require for: 5m minimum |
| No escalation path | Page goes unanswered | On-call → backup → team lead → manager |
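The severity labels above only matter if something routes on them. A hedged Alertmanager sketch mapping severities to the P1/P2/P3 channels from the table; receiver names, keys, and URLs are all illustrative placeholders:

```yaml
# alertmanager.yml — route by severity so only P1s page
route:
  receiver: ticket-queue            # default: lowest-urgency path (P3)
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty-oncall    # P1: page the on-call
    - matchers: ['severity="warning"']
      receiver: team-slack          # P2: Slack channel

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"
  - name: team-slack
    slack_configs:
      - channel: "#api-alerts"
        api_url: "https://hooks.slack.com/services/..."
  - name: ticket-queue
    webhook_configs:
      - url: "http://ticketing.internal/api/alerts"
```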
Debugging with Correlated Signals
The real power of observability comes from correlating traces, metrics, and logs:
1. METRIC ALERT: Error rate > 5% on payment-service
2. TRACE SEARCH: Find traces with errors in payment-service
→ Trace ID: abc-123-def shows 3.2s latency, error in DB call
3. LOG SEARCH: Filter logs by trace_id = abc-123-def
→ "Connection pool exhausted: max connections (25) reached"
4. ROOT CAUSE: Database connection pool too small for traffic spike
Ensure all three signals share common identifiers: trace_id, span_id, service_name.
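A sketch of how a `trace_id` ends up on every log line, using a plain `contextvars` stand-in so it runs standalone; a real setup would read the active span's context from the tracing SDK instead of setting the variable by hand:

```python
import contextvars
import json
import logging

# Stand-in for the tracing SDK's current span context
current_trace_id = contextvars.ContextVar("trace_id", default=None)

class CorrelatedJsonFormatter(logging.Formatter):
    """Emit JSON logs carrying the active trace_id so a log search can
    pivot from a trace — and a trace search can pivot from a log line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "service_name": "payment-service",
            "trace_id": current_trace_id.get(),
        })

handler = logging.StreamHandler()
handler.setFormatter(CorrelatedJsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

current_trace_id.set("abc-123-def")
line = handler.format(logging.LogRecord(
    "payment-service", logging.ERROR, __file__, 0,
    "Connection pool exhausted: max connections (25) reached", None, None))
```

Once every line is structured JSON with a `trace_id` field, step 3 of the investigation above becomes a single filtered query in Loki or ELK.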
Observability Maturity Model
| Level | Characteristics | Tools Typically Used |
|---|---|---|
| Level 0: Reactive | Check logs only after incidents; no dashboards | SSH + grep, basic CloudWatch |
| Level 1: Monitoring | Dashboards for key metrics, basic alerting | Grafana, PagerDuty, basic APM |
| Level 2: Observability | Distributed tracing, structured logging, SLOs defined | Datadog or New Relic, Jaeger, ELK |
| Level 3: Proactive | Anomaly detection, automated runbooks, error budgets | ML-based alerting, Runbook automation |
| Level 4: Predictive | Capacity forecasting, chaos engineering, AIOps | Gremlin, custom ML models, full SRE practice |
Instrumentation Priority Order
When adding observability to an existing system, instrument in this order for maximum impact:
1. Request latency (P50, P95, P99) — the most universal health signal
2. Error rates (5xx, 4xx by endpoint) — detect failures users experience
3. Throughput (requests per sec) — detect traffic anomalies
4. Saturation (CPU, memory, disk, connections) — predict capacity issues
5. Dependencies (database latency, external API latency) — find bottlenecks
6. Business metrics (orders per min, signups per day) — connect infra to revenue
Observability Checklist
- OpenTelemetry SDK integrated in all services (auto + custom instrumentation)
- Custom instrumentation follows best practices (business attributes, no PII)
- Collector deployed with memory limiter and batch processor
- Prometheus scraping metrics from all services
- Jaeger/Tempo receiving traces with service-to-service correlation
- Loki/ELK aggregating structured logs
- Grafana dashboards for RED + USE metrics on every service
- Alert rules cover error rate, latency, and availability
- Every alert has a severity level, runbook link, and escalation path
- Trace-metric-log correlation verified (search by trace_id)
- On-call rotation established with escalation paths
- Monthly alert review to reduce fatigue
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For infrastructure audits, visit garnetgrid.com. :::