Distributed Tracing
Trace requests across microservices to find performance bottlenecks and debug failures. Covers OpenTelemetry, trace propagation, span attributes, sampling strategies, trace analysis, and the patterns that make distributed systems debuggable.
In a microservice architecture, a single user request touches 5-20 services. When something is slow or broken, you need to know which service, which call, and which line of code is responsible. Distributed tracing follows a request through every service, creating a timeline of everything that happened.
What Is a Trace
User Request: GET /api/orders/123

Trace (complete journey through all services):

```
Trace ID: abc-123

[API Gateway] ─────────────────── 250ms
  ├─ [Auth Service] ──────────── 15ms
  ├─ [Order Service] ─────────── 200ms
  │    ├─ [Order DB Query] ───── 50ms
  │    ├─ [User Service] ─────── 80ms
  │    │    └─ [User DB] ─────── 20ms
  │    └─ [Product Service] ──── 60ms
  │         └─ [Redis Cache] ─── 2ms
  └─ [Response] ──────────────── 5ms
```
Bottleneck found: User Service (80ms) → User DB (20ms)
The remaining 60ms is in User Service business logic
OpenTelemetry Setup
```python
# OpenTelemetry: the standard for distributed tracing
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

# Configure the tracer; naming the service makes spans attributable
provider = TracerProvider(resource=Resource.create({"service.name": "order-service"}))
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Auto-instrument libraries
FlaskInstrumentor().instrument()       # HTTP server
RequestsInstrumentor().instrument()    # Outbound HTTP calls
SQLAlchemyInstrumentor().instrument()  # Database queries

# Manual spans for business logic
tracer = trace.get_tracer("order-service")

def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        with tracer.start_as_current_span("validate_order"):
            validate(order_id)

        with tracer.start_as_current_span("calculate_total") as calc_span:
            total = calculate_total(order_id)
            calc_span.set_attribute("order.total", float(total))

        with tracer.start_as_current_span("charge_payment"):
            charge(order_id, total)
```
Trace Propagation
Context propagation is how the trace ID flows between services. When Service A makes an HTTP call to Service B, OpenTelemetry automatically adds W3C Trace Context headers (IDs shortened here for readability):

```
traceparent: 00-abc123-def456-01
tracestate:  vendor=value

Header format (W3C Trace Context):
traceparent: {version}-{trace-id}-{parent-span-id}-{trace-flags}

trace-id:       32 hex chars, unique per trace
parent-span-id: 16 hex chars, identifies the calling span
trace-flags:    01 = sampled
```

Service B reads these headers and continues the trace. Same trace-id → same trace → correlated across services.
Sampling Strategies
Head-based sampling decides at trace start. The sampler must actually be passed to the `TracerProvider`, and wrapping it in `ParentBased` makes the service honor the caller's sampling decision:

```python
# Head-based sampling: decide at trace start, sample 10% of all traces
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.10)))
```

Tail-based sampling decides after the trace completes: sample 100% of slow/error traces and 1% of normal ones. This requires a collector that buffers complete traces, such as the OpenTelemetry Collector with the `tail_sampling` processor:

```yaml
processors:
  tail_sampling:
    policies:
      - name: errors              # 100% of error traces
        type: status_code
        status_code: {status_codes: [ERROR]}

      - name: slow_traces         # 100% of traces > 1 second
        type: latency
        latency: {threshold_ms: 1000}

      - name: default             # 1% of normal traces
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
```
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No trace context propagation | Broken traces, isolated spans | Use OpenTelemetry auto-instrumentation |
| Sample everything (100%) | Storage costs explode | Tail-based sampling (errors + slow + 1%) |
| No span attributes | Traces exist but aren’t searchable | Add business context (user_id, order_id) |
| Tracing only HTTP calls | Miss database, cache, queue spans | Instrument all I/O |
| No trace-to-log correlation | Cannot connect trace to logs | Include trace_id in log context |
Distributed tracing is the debugging tool for microservices. Without it, debugging a production issue across 20 services is like finding a needle in 20 haystacks simultaneously.