Distributed Tracing
Trace requests across microservices to find performance bottlenecks and debug failures. Covers OpenTelemetry, trace propagation, span attributes, sampling strategies, trace analysis, and the patterns that make distributed systems debuggable.
In a microservice architecture, a single user request touches 5-20 services. When something is slow or broken, you need to know which service, which call, and which line of code is responsible. Distributed tracing follows a request through every service, creating a timeline of everything that happened.
What Is a Trace
User Request: GET /api/orders/123

Trace (complete journey through all services):

```
Trace ID: abc-123

[API Gateway] ─────────────────── 250ms
  ├─ [Auth Service] ──────────── 15ms
  ├─ [Order Service] ─────────── 200ms
  │    ├─ [Order DB Query] ───── 50ms
  │    ├─ [User Service] ─────── 80ms
  │    │    └─ [User DB] ─────── 20ms
  │    └─ [Product Service] ──── 60ms
  │         └─ [Redis Cache] ─── 2ms
  └─ [Response] ──────────────── 5ms
```
Bottleneck found: User Service (80ms) → User DB (20ms)
The remaining 60ms is in User Service business logic
OpenTelemetry Setup
```python
# OpenTelemetry: the standard for distributed tracing
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

# Configure the tracer; naming the service makes spans attributable
provider = TracerProvider(resource=Resource.create({"service.name": "order-service"}))
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Auto-instrument libraries
FlaskInstrumentor().instrument()       # HTTP server
RequestsInstrumentor().instrument()    # Outbound HTTP calls
SQLAlchemyInstrumentor().instrument()  # Database queries

# Manual spans for business logic
tracer = trace.get_tracer("order-service")

def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        with tracer.start_as_current_span("validate_order"):
            validate(order_id)

        with tracer.start_as_current_span("calculate_total") as calc_span:
            total = calculate_total(order_id)
            calc_span.set_attribute("order.total", float(total))

        with tracer.start_as_current_span("charge_payment"):
            charge(order_id, total)
```
Trace Propagation
Context propagation is how the trace ID flows between services. When Service A makes an HTTP call to Service B, OpenTelemetry automatically adds W3C Trace Context headers (IDs shortened here for readability):

```
traceparent: 00-abc123-def456-01
tracestate:  vendor=value

Header format (W3C Trace Context):
traceparent: {version}-{trace-id}-{parent-span-id}-{trace-flags}

trace-id:       32 hex chars, unique per trace
parent-span-id: 16 hex chars, identifies the calling span
trace-flags:    01 = sampled
```

Service B reads these headers and continues the trace. Same trace-id → same trace → correlated across services.
Sampling Strategies
Head-based sampling decides at trace start. The sampler must actually be passed to the `TracerProvider`, and wrapping it in `ParentBased` makes the service honor the caller's sampling decision:

```python
# Head-based sampling: decide at trace start, sample 10% of all traces
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.10)))
```

Tail-based sampling decides after the trace completes: sample 100% of slow/error traces and 1% of normal ones. This requires a collector that buffers complete traces, such as the OpenTelemetry Collector with the `tail_sampling` processor:

```yaml
processors:
  tail_sampling:
    policies:
      - name: errors              # 100% of error traces
        type: status_code
        status_code: {status_codes: [ERROR]}

      - name: slow_traces         # 100% of traces > 1 second
        type: latency
        latency: {threshold_ms: 1000}

      - name: default             # 1% of normal traces
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
```
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No trace context propagation | Broken traces, isolated spans | Use OpenTelemetry auto-instrumentation |
| Sample everything (100%) | Storage costs explode | Tail-based sampling (errors + slow + 1%) |
| No span attributes | Traces exist but aren’t searchable | Add business context (user_id, order_id) |
| Tracing only HTTP calls | Miss database, cache, queue spans | Instrument all I/O |
| No trace-to-log correlation | Cannot connect trace to logs | Include trace_id in log context |
Distributed tracing is the debugging tool for microservices. Without it, debugging a production issue across 20 services is like finding a needle in 20 haystacks simultaneously.