Distributed Tracing

Trace requests across microservices to find performance bottlenecks and debug failures. Covers OpenTelemetry, trace propagation, span attributes, sampling strategies, trace analysis, and the patterns that make distributed systems debuggable.

In a microservice architecture, a single user request touches 5-20 services. When something is slow or broken, you need to know which service, which call, and which line of code is responsible. Distributed tracing follows a request through every service, creating a timeline of everything that happened.


What Is a Trace

User Request: GET /api/orders/123

Trace (complete journey through all services):
  ┌─────────────────────────────────────────────────────────┐
  │ Trace ID: abc-123                                        │
  │                                                          │
  │ [API Gateway]──────────────── 250ms ──────────────────── │
  │   ├─ [Auth Service]─── 15ms                              │
  │   ├─ [Order Service]──────── 200ms ──────────────        │
  │   │   ├─ [Order DB Query]── 50ms                         │
  │   │   ├─ [User Service]──── 80ms                         │
  │   │   │   └─ [User DB]──── 20ms                          │
  │   │   └─ [Product Service]─ 60ms                         │
  │   │       └─ [Redis Cache]─ 2ms                          │
  │   └─ [Response]── 5ms                                    │
  └─────────────────────────────────────────────────────────┘

Bottleneck found: User Service (80ms) → User DB (20ms)
The remaining 60ms is in User Service business logic

OpenTelemetry Setup

# OpenTelemetry: The standard for distributed tracing
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

# Configure tracer
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Auto-instrument libraries
FlaskInstrumentor().instrument()       # HTTP server
RequestsInstrumentor().instrument()    # Outbound HTTP calls
SQLAlchemyInstrumentor().instrument()  # Database queries

# Manual spans for business logic
tracer = trace.get_tracer("order-service")

def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        
        with tracer.start_as_current_span("validate_order"):
            validate(order_id)
        
        with tracer.start_as_current_span("calculate_total") as calc_span:
            total = calculate_total(order_id)
            calc_span.set_attribute("order.total", float(total))
        
        with tracer.start_as_current_span("charge_payment"):
            charge(order_id, total)
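
Spans can also capture failures, which is half the point of tracing. A minimal sketch, assuming the same hypothetical charge() helper as above, that records the exception and marks the span as an error so failed traces are easy to filter:

# Record failures on spans so error traces are searchable
from opentelemetry.trace import Status, StatusCode

def charge_payment(order_id, total):
    with tracer.start_as_current_span("charge_payment") as span:
        span.set_attribute("order.id", order_id)
        try:
            charge(order_id, total)  # hypothetical payment helper from the example above
        except Exception as exc:
            span.record_exception(exc)                            # stack trace as a span event
            span.set_status(Status(StatusCode.ERROR, str(exc)))   # mark the span as failed
            raise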

Trace Propagation

# Context propagation: How trace ID flows between services

# Service A makes HTTP call to Service B
# OpenTelemetry automatically adds headers:
#   traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
#   tracestate: vendor=value

# The header format (W3C Trace Context):
# traceparent: {version}-{trace-id}-{parent-span-id}-{trace-flags}
# 
# trace-id: 32 hex chars, unique per trace
# parent-span-id: 16 hex chars, identifies calling span
# trace-flags: 01 = sampled

# Service B reads these headers and continues the trace
# Same trace-id → same trace → correlated across services
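
Auto-instrumentation injects and reads these headers for HTTP automatically. When a request hops over something that is not instrumented, such as a message queue, you can carry the context by hand. A minimal sketch using OpenTelemetry's propagation API; the message dict and its "headers" key are assumptions, not a real queue client:

# Manual context propagation for a queue (sketch)
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("order-service")

# Producer: copy the current trace context into the outgoing message
def publish(message: dict):
    message.setdefault("headers", {})
    inject(message["headers"])         # writes traceparent/tracestate into the dict
    # ... hand the message to your queue client here ...

# Consumer: continue the same trace from the incoming headers
def consume(message: dict):
    ctx = extract(message.get("headers", {}))   # rebuild the remote context
    with tracer.start_as_current_span("handle_message", context=ctx):
        ...  # spans here share the producer's trace-id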

Sampling Strategies

# Head-based sampling: Decide at trace start
# Sample 10% of all traces
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
sampler = TraceIdRatioBased(0.10)
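
# Wiring the sampler into the provider (a sketch, not from the original snippet).
# ParentBased makes downstream services follow the caller's sampling decision
# instead of re-sampling, so traces are never half-collected.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.10)))
trace.set_tracer_provider(provider)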

# Tail-based sampling: Decide after trace completes
# Sample 100% of slow/error traces, 1% of normal
# Requires a collector that buffers traces
# OpenTelemetry Collector with tail_sampling processor:
#
# processors:
#   tail_sampling:
#     policies:
#       - name: errors
#         type: status_code
#         status_code: {status_codes: [ERROR]}
#         # 100% of error traces sampled
#       
#       - name: slow_traces
#         type: latency
#         latency: {threshold_ms: 1000}
#         # 100% of traces > 1 second
#       
#       - name: default
#         type: probabilistic
#         probabilistic: {sampling_percentage: 1}
#         # 1% of normal traces

Anti-Patterns

| Anti-Pattern                  | Consequence                        | Fix                                      |
|-------------------------------|------------------------------------|------------------------------------------|
| No trace context propagation  | Broken traces, isolated spans      | Use OpenTelemetry auto-instrumentation   |
| Sample everything (100%)      | Storage costs explode              | Tail-based sampling (errors + slow + 1%) |
| No span attributes            | Traces exist but aren't searchable | Add business context (user_id, order_id) |
| Tracing only HTTP calls       | Miss database, cache, queue spans  | Instrument all I/O                       |
| No trace-to-log correlation   | Cannot connect trace to logs       | Include trace_id in log context          |
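
The last row deserves a concrete example. A minimal sketch of trace-to-log correlation using the standard logging module: pull the current trace ID off the active span and stamp it on every log line, so a log search can pivot straight into the corresponding trace.

# Trace-to-log correlation (sketch): put trace_id on every log record
import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        # format_trace_id renders the 32-hex-char W3C trace id; 0 means no active trace
        record.trace_id = trace.format_trace_id(ctx.trace_id) if ctx.trace_id else "-"
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())
logging.getLogger().addHandler(handler)

OpenTelemetry also ships a logging instrumentation (opentelemetry-instrumentation-logging) that can inject these fields automatically if you prefer not to write the filter yourself.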

Distributed tracing is the debugging tool for microservices. Without it, debugging a production issue across 20 services is like finding a needle in 20 haystacks simultaneously.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
