
Distributed Tracing: Following Requests Across Service Boundaries

Implement distributed tracing to debug latency, identify bottlenecks, and understand request flow across microservices. Covers OpenTelemetry, trace propagation, span design, sampling strategies, and integrating traces with logs and metrics for full observability.

Every monolith-to-microservices migration hits the same wall. A request that used to be a single stack trace now crosses five services, three databases, and two message queues. When that request takes 4 seconds instead of 400 milliseconds, nobody knows where the time went.

Distributed tracing solves this by assigning a unique trace ID to every request at the edge and propagating it through every service, queue, and database call. Each unit of work becomes a span — a timed operation with metadata — and the collection of spans for a single request becomes a trace.

The result is a timeline view showing exactly where your time goes.


Core Concepts

Traces, Spans, and Context

A trace represents the entire journey of a request through your system. It is composed of spans, each representing a discrete operation:

Trace: abc123
├── Span: API Gateway (12ms)
│   ├── Span: Auth Service (3ms)
│   └── Span: Order Service (45ms)
│       ├── Span: Inventory Check (8ms)
│       ├── Span: Payment Service (22ms)
│       │   └── Span: Stripe API Call (18ms)
│       └── Span: Database Write (6ms)

Each span carries:

  • Trace ID: Shared across all spans in the trace
  • Span ID: Unique to this span
  • Parent Span ID: Links child to parent
  • Start/End timestamps: For duration calculation
  • Attributes: Key-value metadata (user ID, HTTP status, etc.)
  • Events: Timestamped annotations within the span
  • Status: OK, ERROR, or UNSET

Context Propagation

Context propagation is the mechanism that carries trace identity across process boundaries. Without it, each service creates isolated traces that cannot be correlated.

The W3C Trace Context standard defines two HTTP headers:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: vendor1=value1,vendor2=value2

The traceparent header encodes version, trace ID, parent span ID, and trace flags. Every service extracts this header, creates a child span with the same trace ID, and injects the updated header into outgoing requests.
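
A minimal sketch of doing both steps by hand with the OpenTelemetry Python propagator (the SDK is introduced in the next section, and auto-instrumentation normally handles this for you; the service name and downstream call here are illustrative):

from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

propagator = TraceContextTextMapPropagator()
tracer = trace.get_tracer("checkout-service")  # service name is illustrative

def handle_request(incoming_headers: dict, outgoing_headers: dict):
    # Extract the caller's trace identity from the incoming headers
    ctx = propagator.extract(carrier=incoming_headers)
    # Start a child span that shares the caller's trace ID
    with tracer.start_as_current_span("handle_request", context=ctx):
        # Inject the updated traceparent for the outgoing request
        propagator.inject(carrier=outgoing_headers)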


OpenTelemetry Implementation

OpenTelemetry has become the standard for instrumentation. It provides vendor-neutral SDKs for generating traces, metrics, and logs.
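
None of the snippets in this section export anything until a tracer provider is wired to an exporter. A minimal setup, assuming an OTLP-capable collector or backend listening on localhost:4317:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "order-service"})
)
# Batch finished spans and ship them over OTLP/gRPC
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)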

Auto-Instrumentation

Most frameworks have auto-instrumentation libraries that capture HTTP, database, and messaging spans without code changes:

# Python: Auto-instrument Flask + SQLAlchemy + requests
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

FlaskInstrumentor().instrument_app(app)             # app: your Flask app
SQLAlchemyInstrumentor().instrument(engine=engine)  # engine: your SQLAlchemy engine
RequestsInstrumentor().instrument()                 # patches outgoing requests calls

This captures every incoming HTTP request, every SQL query, and every outgoing HTTP call as spans — automatically linked into traces.

Manual Instrumentation

Auto-instrumentation captures infrastructure. Manual instrumentation captures business logic:

from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("order-service")

def process_order(order_id: str, items: list):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.item_count", len(items))

        with tracer.start_as_current_span("validate_inventory") as inv_span:
            available = check_inventory(items)
            if not available:
                # Mark the failing span, not its parent
                inv_span.set_status(StatusCode.ERROR, "Insufficient inventory")
                inv_span.add_event("inventory_check_failed", {
                    "missing_items": str(get_missing(items))
                })
                raise InsufficientInventoryError()

        total = calculate_total(items)  # application-level pricing helper
        with tracer.start_as_current_span("charge_payment") as pay_span:
            payment_service.charge(order_id, total)
            pay_span.add_event("payment_processed", {
                "amount": str(total),
                "provider": "stripe"
            })

Span Design Principles

What Deserves a Span

Create spans for operations that:

  • Cross a network boundary (HTTP, gRPC, message queue)
  • Involve I/O (database queries, file reads, external APIs)
  • Represent significant business logic (order processing, fraud checks)
  • Are potential latency sources you need to measure

Do not create spans for:

  • Pure computation under 1ms
  • Simple variable assignments or transformations
  • Every function call (this creates “span explosion”)

Attribute Standards

Standardize attributes across services using semantic conventions:

# HTTP spans
http.method: GET
http.url: /api/orders/123
http.status_code: 200
http.response_content_length: 4523

# Database spans
db.system: postgresql
db.statement: SELECT * FROM orders WHERE id = ?
db.operation: SELECT

# Custom business attributes
order.id: ORD-2026-0456
order.total: 2499.99
customer.tier: enterprise
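
In Python, the opentelemetry-semantic-conventions package ships the standard keys as constants, which avoids typo'd attribute names. A minimal sketch using the long-established HTTP constants (convention names have shifted across recent releases; the business keys are illustrative):

from opentelemetry import trace
from opentelemetry.semconv.trace import SpanAttributes

span = trace.get_current_span()
# Standard keys come from the shared conventions package
span.set_attribute(SpanAttributes.HTTP_METHOD, "GET")
span.set_attribute(SpanAttributes.HTTP_STATUS_CODE, 200)
# Business attributes live under your own stable namespace
span.set_attribute("order.id", "ORD-2026-0456")
span.set_attribute("customer.tier", "enterprise")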

Sampling Strategies

At scale, tracing every request is prohibitively expensive. A service handling 10,000 RPS and generating 15 spans per request produces 150,000 spans per second. At a few hundred bytes per span, that is roughly 13 billion spans and several terabytes of trace data per day.

Head-Based Sampling

The sampling decision is made at the edge, before the trace starts:

  • Probabilistic: Sample 10% of traces randomly
  • Rate-limiting: Sample at most 100 traces per second
  • Rule-based: Sample all traces matching specific criteria (e.g., admin users, high-value orders)

The advantage is simplicity. The disadvantage is that you might miss the exact trace showing a rare error.
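
In the OpenTelemetry Python SDK, probabilistic head sampling is a few lines of configuration. A minimal sketch with an illustrative 10% ratio:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of new root traces; child spans follow the parent's
# decision so a trace is never half-recorded across services.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))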

Tail-Based Sampling

The decision is deferred until the trace is complete. A collector buffers all spans and only exports traces meeting certain criteria:

  • Duration exceeds threshold (>2s)
  • Contains error status codes
  • Matches attribute filters (specific customer, specific endpoint)

Tail-based sampling captures exactly the traces you care about — errors and slow requests — but requires buffering infrastructure and adds latency to trace visibility.
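
Tail-based sampling usually lives in the OpenTelemetry Collector rather than in application code. A sketch of the collector-contrib tail_sampling processor matching the criteria above (policy names and thresholds are illustrative; tune to your traffic):

processors:
  tail_sampling:
    decision_wait: 10s        # buffer spans this long before deciding
    policies:
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 2000
      - name: errored-requests
        type: status_code
        status_code:
          status_codes: [ERROR]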

Practical Recommendation

Use a combination: head-based sampling at 5-10% for baseline, plus tail-based sampling for errors and latency outliers. This gives you statistical coverage and guaranteed visibility of problems.


Integrating Traces with Logs and Metrics

Traces become dramatically more powerful when correlated with logs and metrics.

Trace-Log Correlation

Inject the trace ID into every log line:

import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, '032x') if ctx.trace_id else ''
        record.span_id = format(ctx.span_id, '016x') if ctx.span_id else ''
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '%(asctime)s [%(trace_id)s:%(span_id)s] %(levelname)s %(message)s'
))
logger = logging.getLogger(__name__)
logger.addFilter(TraceIdFilter())
logger.addHandler(handler)

Now every log line carries its trace ID. When investigating a slow trace, you can query logs by trace ID and see every log message emitted during that request — across all services.

Trace-Metric Correlation (Exemplars)

Metrics tell you that P99 latency spiked. Traces tell you why. Exemplars link the two by attaching a trace ID to a metric data point.
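
Exemplar support in the OpenTelemetry SDKs is still maturing, but the Prometheus Python client accepts exemplars directly. A minimal sketch, assuming your /metrics endpoint serves the OpenMetrics exposition format:

from prometheus_client import Histogram
from opentelemetry import trace

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds"
)

def record_latency(seconds: float):
    ctx = trace.get_current_span().get_span_context()
    exemplar = None
    if ctx.is_valid and ctx.trace_flags.sampled:
        # Attach the current trace ID so dashboards can deep-link to the trace
        exemplar = {"trace_id": format(ctx.trace_id, "032x")}
    REQUEST_LATENCY.observe(seconds, exemplar=exemplar)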

When your dashboard shows a latency spike, clicking the exemplar takes you directly to the trace causing it.


Common Anti-Patterns

Span Explosion

Creating spans for every function call overwhelms your tracing backend and makes traces unreadable. A 500-span trace is noise. A 15-span trace with well-chosen boundaries tells a clear story.

Missing Context Propagation

If even one service in your chain fails to propagate context, the trace breaks into disconnected fragments. Check every service — including proxies, load balancers, and message brokers — for proper header forwarding.

Sensitive Data in Attributes

SQL statements, request bodies, and user emails in span attributes create compliance risks. Scrub sensitive fields before export, or use attribute allowlists.
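
A lightweight option is to route attribute writes through a helper so denylisted values never reach the exporter; a sketch with an illustrative key list (the collector's attribute processors are a heavier-duty alternative):

SENSITIVE_KEYS = {"user.email", "enduser.id", "db.statement"}  # illustrative denylist

def set_scrubbed_attribute(span, key: str, value):
    # Redact denylisted keys before they are recorded on the span
    if key in SENSITIVE_KEYS:
        value = "[REDACTED]"
    span.set_attribute(key, value)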

Ignoring Asynchronous Flows

Message queue consumers need explicit context extraction:

from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer("worker-service")

def process_message(message):
    # Extract the producer's trace context from the message headers
    ctx = extract(message.headers)
    with tracer.start_as_current_span("process_message", context=ctx):
        handle(message.body)

Without this, async work appears as disconnected traces.
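
The producer has the mirror-image duty: inject the current context into the message headers before publishing. A sketch, where queue.publish stands in for your broker client:

from opentelemetry.propagate import inject

def publish_message(queue, body: bytes):
    headers = {}
    inject(headers)  # writes traceparent/tracestate into the carrier dict
    queue.publish(headers=headers, body=body)  # hypothetical broker API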


Choosing a Backend

Backend          | Strengths                                     | Considerations
Jaeger           | Open source, battle-tested                    | Requires infrastructure management
Tempo (Grafana)  | Integrates with Grafana stack, cost-effective | Log-based storage can be slow
Honeycomb        | Best query/exploration UI                     | SaaS pricing at scale
Datadog APM      | Full observability platform                   | Expensive per-host licensing
AWS X-Ray        | Native AWS integration                        | Limited cross-cloud support

For most teams starting out, the OpenTelemetry Collector exporting to Jaeger or Tempo provides the best balance of capability and cost.


Getting Started Checklist

  1. Add OpenTelemetry SDK and auto-instrumentation to your most critical service
  2. Export to a local Jaeger instance for development
  3. Verify spans appear and traces connect across at least two services
  4. Add manual spans for key business operations
  5. Configure head-based sampling at 10%
  6. Inject trace IDs into structured logs
  7. Roll out to remaining services one at a time
  8. Add tail-based sampling once volume warrants it

Start with one service. Prove the value. Then expand. Tracing that covers 80% of your request flow is infinitely more valuable than a perfect instrumentation plan that never ships.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
