How to Implement Observability: Traces, Metrics, and Logs at Scale
Build a production observability stack. Covers OpenTelemetry instrumentation, Prometheus metrics, distributed tracing, log aggregation, and alerting strategies.
Monitoring tells you something is broken. Observability tells you why. In distributed systems, you can’t debug with console.log. You need traces to follow requests across services, metrics to spot trends, and logs for the details. This guide walks through building a complete observability stack from instrumentation to alerting, using OpenTelemetry as the universal standard.
The key distinction: monitoring is about known-unknowns (“alert me when CPU > 80%”), while observability is about unknown-unknowns (“why are 2% of requests for European users timing out on Tuesdays?”). You need both.
The Three Pillars
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│    TRACES    │   │   METRICS    │   │     LOGS     │
│              │   │              │   │              │
│ Request flow │   │ Aggregated   │   │ Individual   │
│ across       │   │ time-series  │   │ event        │
│ services     │   │ data         │   │ records      │
│              │   │              │   │              │
│ "What path?" │   │ "What trend?"│   │"What detail?"│
└──────────────┘   └──────────────┘   └──────────────┘
When to Use Which
| Signal | Use When | Example |
|---|---|---|
| Traces | Debugging slow requests, understanding service dependencies | “This request spent 1.2s in the payment service” |
| Metrics | Setting alerts, tracking SLOs, capacity planning | “Error rate is 3.2%, latency p99 is 450ms” |
| Logs | Investigating specific events, audit trails, debugging | “User auth failed: invalid token at 14:32:05” |
| All three | Production incident investigation | Metric spikes → trace the slow requests → read the error logs |
Step 1: Instrument with OpenTelemetry
1.1 Node.js Auto-Instrumentation
// tracing.js — load BEFORE your application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-http');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  serviceName: 'api-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  // NodeSDK takes a metric *reader*, not a bare exporter
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4318/v1/metrics',
    }),
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();
# Run your app with tracing
node --require ./tracing.js app.js
1.2 Python Auto-Instrumentation
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Configure trace provider
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)

# Auto-instrument frameworks (zero code changes to your routes)
FlaskInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()
RequestsInstrumentor().instrument()
1.3 Custom Instrumentation Best Practices
tracer = trace.get_tracer(__name__)

@app.route("/api/orders")
def create_order():
    with tracer.start_as_current_span("create_order") as span:
        # DO: Add business-relevant attributes
        span.set_attribute("order.customer_id", customer_id)
        span.set_attribute("order.total", total)
        span.set_attribute("order.item_count", len(items))

        # DO: Create child spans for significant operations
        with tracer.start_as_current_span("validate_inventory"):
            check_inventory(items)
        with tracer.start_as_current_span("process_payment"):
            charge_card(payment)

# DON'T: Instrument every single function call
# DON'T: Put sensitive data (PII, passwords) in spans
# DON'T: Create spans inside tight loops
Step 2: Deploy the Collector Stack
# docker-compose.yml — Observability Stack
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    volumes:
      - ./otel-config.yaml:/etc/otel/config.yaml
    command: ["--config=/etc/otel/config.yaml"]

  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana

  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - COLLECTOR_OTLP_ENABLED=true   # accept OTLP from the collector (default in recent versions)
    ports:
      - "16686:16686"   # UI
      - "14268:14268"   # legacy Thrift ingest

  loki:
    image: grafana/loki:latest
    ports: ["3100:3100"]

volumes:
  grafana_data:   # named volume referenced by grafana must be declared
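The compose file mounts a `./prometheus.yml` that has to exist; a minimal scrape config is sketched below. The job names are illustrative, and the hostnames assume the default compose network, where services resolve by name:

```yaml
# prometheus.yml — scrape the collector's Prometheus exporter
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "otel-collector"
    static_configs:
      - targets: ["otel-collector:8889"]   # metrics the collector re-exports
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]        # Prometheus self-monitoring
```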
Collector Configuration
# otel-config.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: "0.0.0.0:4317" }
      http: { endpoint: "0.0.0.0:4318" }

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  # Memory limiter prevents OOM
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  # Recent collector-contrib releases removed the dedicated jaeger exporter;
  # send OTLP straight to Jaeger, which ingests it natively
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls: { insecure: true }
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
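At scale, exporting every trace gets expensive fast. One common lever is head sampling at the collector; a hedged sketch of the `probabilistic_sampler` processor from collector-contrib (the percentage is illustrative):

```yaml
# Optional once trace volume grows: sample at the collector.
processors:
  probabilistic_sampler:
    sampling_percentage: 10   # keep roughly 10% of traces

# ...then reference it in the traces pipeline, after memory_limiter:
#   processors: [memory_limiter, probabilistic_sampler, batch]
```

If dropping 90% of traces blindly is unacceptable, the contrib `tail_sampling` processor can instead keep all error and high-latency traces while sampling the healthy ones.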
Step 3: Define Key Metrics (RED + USE)
RED Method (Request-oriented)
| Metric | What It Measures | PromQL Example | Alert Threshold |
|---|---|---|---|
| Rate | Requests per second | rate(http_requests_total[5m]) | > 50% drop from baseline |
| Errors | Error rate percentage | rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) | > 1% for 5 minutes |
| Duration | Latency percentiles | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) | p99 > 2s for 5 minutes |
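It helps to know what `histogram_quantile` actually computes: Prometheus histograms store cumulative bucket counts, and the quantile is linearly interpolated inside the bucket where the target rank falls. A simplified standalone sketch of that math (ignoring PromQL's special-casing of the highest bucket):

```python
def histogram_quantile(q, buckets):
    """Approximate a quantile from cumulative histogram buckets, the way
    PromQL's histogram_quantile() does: locate the bucket containing the
    target rank, then interpolate linearly within it.

    buckets: sorted list of (upper_bound, cumulative_count).
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:          # empty bucket, no interpolation
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# le="0.1": 50 requests, le="0.5": 90, le="2.0": 100 (cumulative counts)
buckets = [(0.1, 50), (0.5, 90), (2.0, 100)]
p99 = histogram_quantile(0.99, buckets)   # falls in the (0.5, 2.0] bucket
```

This is also why bucket boundaries matter: a p99 alert at 2s is meaningless if your largest finite bucket is `le="1"`.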
USE Method (Resource-oriented)
| Metric | What It Measures | PromQL Example | Alert Threshold |
|---|---|---|---|
| Utilization | How busy is the resource? | avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) | > 80% sustained 15 min |
| Saturation | How much does it queue? | node_load1 / count(node_cpu_seconds_total{mode="idle"}) | > 2.0 for 10 minutes |
| Errors | How often does it fail? | rate(node_network_receive_errs_total[5m]) | Any sustained nonzero rate |
Step 4: Build Custom Metrics
import time

from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

meter = metrics.get_meter("api-service")

# Counter — monotonically increasing (requests, errors, events)
request_counter = meter.create_counter(
    "api.requests",
    description="Total API requests",
    unit="1",
)

# Histogram — distribution of values (latency, response size)
latency_histogram = meter.create_histogram(
    "api.latency",
    description="Request latency in milliseconds",
    unit="ms",
)

# Observable Gauge — current state (connections, queue depth, cache size).
# Callbacks receive CallbackOptions and yield Observation objects.
def get_queue_depth(options: CallbackOptions):
    yield Observation(queue.qsize(), {"queue": "main"})

meter.create_observable_gauge(
    "api.queue_depth",
    callbacks=[get_queue_depth],
    description="Current queue depth",
)

# Usage in request handler
@app.route("/api/customers")
def list_customers():
    start = time.time()
    try:
        result = db.query("SELECT * FROM customers")
        request_counter.add(1, {"method": "GET", "endpoint": "/customers", "status": "200"})
        return jsonify(result)
    except Exception:
        request_counter.add(1, {"method": "GET", "endpoint": "/customers", "status": "500"})
        raise
    finally:
        latency_histogram.record(
            (time.time() - start) * 1000,
            {"method": "GET", "endpoint": "/customers"},
        )
Metric Naming Conventions
| Pattern | Example | Description |
|---|---|---|
| `<namespace>.<metric>` | `api.requests` | OTel-style dotted name for a simple counter |
| `<metric>_total` | `http_requests_total` | Prometheus convention for counters |
| `<metric>_seconds` | `http_request_duration_seconds` | Duration, always in base seconds |
| `<metric>_bytes` | `http_response_size_bytes` | Size in bytes |
Step 5: Configure Alerting
Alert Rules (Prometheus)
# alerts.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.service }}"
          runbook: "https://wiki.internal/runbooks/high-error-rate"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2s for {{ $labels.service }}"
          runbook: "https://wiki.internal/runbooks/high-latency"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          runbook: "https://wiki.internal/runbooks/service-down"
Alerting Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Alert on every metric | Alert fatigue, team ignores alerts | Alert only on symptoms (error rate, latency), not causes (CPU) |
| No severity levels | Everything is a page at 3am | P1 = page on-call, P2 = Slack, P3 = ticket |
| No runbook linked | On-call doesn’t know what to do | Every alert must link to a runbook |
| Alert for 30 seconds | Flapping alerts from transient spikes | Require for: 5m minimum |
| No escalation path | Page goes unanswered | On-call → backup → team lead → manager |
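The severity labels above only matter if something routes on them. A hedged Alertmanager sketch mapping severities to the P1/P2/P3 channels from the table; receiver names, keys, and URLs are all illustrative placeholders:

```yaml
# alertmanager.yml — route by severity so only P1s page
route:
  receiver: ticket-queue            # default: lowest-urgency path (P3)
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty-oncall    # P1: page the on-call
    - matchers: ['severity="warning"']
      receiver: team-slack          # P2: Slack channel

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"
  - name: team-slack
    slack_configs:
      - channel: "#api-alerts"
        api_url: "https://hooks.slack.com/services/..."
  - name: ticket-queue
    webhook_configs:
      - url: "http://ticketing.internal/api/alerts"
```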
Debugging with Correlated Signals
The real power of observability comes from correlating traces, metrics, and logs:
1. METRIC ALERT: Error rate > 5% on payment-service
2. TRACE SEARCH: Find traces with errors in payment-service
→ Trace ID: abc-123-def shows 3.2s latency, error in DB call
3. LOG SEARCH: Filter logs by trace_id = abc-123-def
→ "Connection pool exhausted: max connections (25) reached"
4. ROOT CAUSE: Database connection pool too small for traffic spike
Ensure all three signals share common identifiers: trace_id, span_id, service_name.
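A sketch of how a `trace_id` ends up on every log line, using a plain `contextvars` stand-in so it runs standalone; a real setup would read the active span's context from the tracing SDK instead of setting the variable by hand:

```python
import contextvars
import json
import logging

# Stand-in for the tracing SDK's current span context
current_trace_id = contextvars.ContextVar("trace_id", default=None)

class CorrelatedJsonFormatter(logging.Formatter):
    """Emit JSON logs carrying the active trace_id so a log search can
    pivot from a trace — and a trace search can pivot from a log line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "service_name": "payment-service",
            "trace_id": current_trace_id.get(),
        })

handler = logging.StreamHandler()
handler.setFormatter(CorrelatedJsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

current_trace_id.set("abc-123-def")
line = handler.format(logging.LogRecord(
    "payment-service", logging.ERROR, __file__, 0,
    "Connection pool exhausted: max connections (25) reached", None, None))
```

Once every line is structured JSON with a `trace_id` field, step 3 of the investigation above becomes a single filtered query in Loki or ELK.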
Observability Maturity Model
| Level | Characteristics | Tools Typically Used |
|---|---|---|
| Level 0: Reactive | Check logs only after incidents; no dashboards | SSH + grep, basic CloudWatch |
| Level 1: Monitoring | Dashboards for key metrics, basic alerting | Grafana, PagerDuty, basic APM |
| Level 2: Observability | Distributed tracing, structured logging, SLOs defined | Datadog or New Relic, Jaeger, ELK |
| Level 3: Proactive | Anomaly detection, automated runbooks, error budgets | ML-based alerting, Runbook automation |
| Level 4: Predictive | Capacity forecasting, chaos engineering, AIOps | Gremlin, custom ML models, full SRE practice |
Instrumentation Priority Order
When adding observability to an existing system, instrument in this order for maximum impact:
1. Request latency (P50, P95, P99) — the most universal health signal
2. Error rates (5xx, 4xx by endpoint) — detect failures users experience
3. Throughput (requests per sec) — detect traffic anomalies
4. Saturation (CPU, memory, disk, connections) — predict capacity issues
5. Dependencies (database latency, external API latency) — find bottlenecks
6. Business metrics (orders per min, signups per day) — connect infra to revenue
Observability Checklist
- OpenTelemetry SDK integrated in all services (auto + custom instrumentation)
- Custom instrumentation follows best practices (business attributes, no PII)
- Collector deployed with memory limiter and batch processor
- Prometheus scraping metrics from all services
- Jaeger/Tempo receiving traces with service-to-service correlation
- Loki/ELK aggregating structured logs
- Grafana dashboards for RED + USE metrics on every service
- Alert rules cover error rate, latency, and availability
- Every alert has a severity level, runbook link, and escalation path
- Trace-metric-log correlation verified (search by trace_id)
- On-call rotation established with escalation paths
- Monthly alert review to reduce fatigue
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For infrastructure audits, visit garnetgrid.com. :::