
Distributed Tracing: Following Requests Across Service Boundaries

Implement distributed tracing to debug latency, identify bottlenecks, and understand request flow across microservices. Covers OpenTelemetry, trace propagation, span design, sampling strategies, and integrating traces with logs and metrics for full observability.

Every monolith-to-microservices migration hits the same wall. A request that used to be a single stack trace now crosses five services, three databases, and two message queues. When that request takes 4 seconds instead of 400 milliseconds, nobody knows where the time went.

Distributed tracing solves this by assigning a unique trace ID to every request at the edge and propagating it through every service, queue, and database call. Each unit of work becomes a span — a timed operation with metadata — and the collection of spans for a single request becomes a trace.

The result is a timeline view showing exactly where your time goes.


Core Concepts

Traces, Spans, and Context

A trace represents the entire journey of a request through your system. It is composed of spans, each representing a discrete operation:

Trace: abc123
├── Span: API Gateway (12ms)
│   ├── Span: Auth Service (3ms)
│   └── Span: Order Service (45ms)
│       ├── Span: Inventory Check (8ms)
│       ├── Span: Payment Service (22ms)
│       │   └── Span: Stripe API Call (18ms)
│       └── Span: Database Write (6ms)

Each span carries:

  • Trace ID: Shared across all spans in the trace
  • Span ID: Unique to this span
  • Parent Span ID: Links child to parent
  • Start/End timestamps: For duration calculation
  • Attributes: Key-value metadata (user ID, HTTP status, etc.)
  • Events: Timestamped annotations within the span
  • Status: OK, ERROR, or UNSET

Context Propagation

Context propagation is the mechanism that carries trace identity across process boundaries. Without it, each service creates isolated traces that cannot be correlated.

The W3C Trace Context standard defines two HTTP headers:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: vendor1=value1,vendor2=value2

The traceparent header encodes version, trace ID, parent span ID, and trace flags. Every service extracts this header, creates a child span with the same trace ID, and injects the updated header into outgoing requests.
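
A minimal sketch of doing both steps by hand with the OpenTelemetry Python propagator (the SDK is introduced in the next section, and auto-instrumentation normally handles this for you; the service name and downstream call here are illustrative):

from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

propagator = TraceContextTextMapPropagator()
tracer = trace.get_tracer("checkout-service")  # service name is illustrative

def handle_request(incoming_headers: dict, outgoing_headers: dict):
    # Extract the caller's trace identity from the incoming headers
    ctx = propagator.extract(carrier=incoming_headers)
    # Start a child span that shares the caller's trace ID
    with tracer.start_as_current_span("handle_request", context=ctx):
        # Inject the updated traceparent for the outgoing request
        propagator.inject(carrier=outgoing_headers)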


OpenTelemetry Implementation

OpenTelemetry has become the standard for instrumentation. It provides vendor-neutral SDKs for generating traces, metrics, and logs.
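
None of the snippets in this section export anything until a tracer provider is wired to an exporter. A minimal setup, assuming an OTLP-capable collector or backend listening on localhost:4317:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "order-service"})
)
# Batch finished spans and ship them over OTLP/gRPC
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)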

Auto-Instrumentation

Most frameworks have auto-instrumentation libraries that capture HTTP, database, and messaging spans without code changes:

# Python: Auto-instrument Flask + SQLAlchemy + requests
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

FlaskInstrumentor().instrument_app(app)             # app: your Flask app
SQLAlchemyInstrumentor().instrument(engine=engine)  # engine: your SQLAlchemy engine
RequestsInstrumentor().instrument()                 # patches outgoing requests calls

This captures every incoming HTTP request, every SQL query, and every outgoing HTTP call as spans — automatically linked into traces.

Manual Instrumentation

Auto-instrumentation captures infrastructure. Manual instrumentation captures business logic:

from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("order-service")

def process_order(order_id: str, items: list):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.item_count", len(items))

        with tracer.start_as_current_span("validate_inventory") as inv_span:
            available = check_inventory(items)
            if not available:
                # Mark the failing span, not its parent
                inv_span.set_status(StatusCode.ERROR, "Insufficient inventory")
                inv_span.add_event("inventory_check_failed", {
                    "missing_items": str(get_missing(items))
                })
                raise InsufficientInventoryError()

        total = calculate_total(items)  # application-level pricing helper
        with tracer.start_as_current_span("charge_payment") as pay_span:
            payment_service.charge(order_id, total)
            pay_span.add_event("payment_processed", {
                "amount": str(total),
                "provider": "stripe"
            })

Span Design Principles

What Deserves a Span

Create spans for operations that:

  • Cross a network boundary (HTTP, gRPC, message queue)
  • Involve I/O (database queries, file reads, external APIs)
  • Represent significant business logic (order processing, fraud checks)
  • Are potential latency sources you need to measure

Do not create spans for:

  • Pure computation under 1ms
  • Simple variable assignments or transformations
  • Every function call (this creates “span explosion”)

Attribute Standards

Standardize attributes across services using semantic conventions:

# HTTP spans
http.method: GET
http.url: /api/orders/123
http.status_code: 200
http.response_content_length: 4523

# Database spans
db.system: postgresql
db.statement: SELECT * FROM orders WHERE id = ?
db.operation: SELECT

# Custom business attributes
order.id: ORD-2026-0456
order.total: 2499.99
customer.tier: enterprise
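
In Python, the opentelemetry-semantic-conventions package ships the standard keys as constants, which avoids typo'd attribute names. A minimal sketch using the long-established HTTP constants (convention names have shifted across recent releases; the business keys are illustrative):

from opentelemetry import trace
from opentelemetry.semconv.trace import SpanAttributes

span = trace.get_current_span()
# Standard keys come from the shared conventions package
span.set_attribute(SpanAttributes.HTTP_METHOD, "GET")
span.set_attribute(SpanAttributes.HTTP_STATUS_CODE, 200)
# Business attributes live under your own stable namespace
span.set_attribute("order.id", "ORD-2026-0456")
span.set_attribute("customer.tier", "enterprise")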

Sampling Strategies

At scale, tracing every request is prohibitively expensive. A service handling 10,000 RPS and generating 15 spans per request produces 150,000 spans per second. At a few hundred bytes per span, that is roughly 13 billion spans and several terabytes of trace data per day.

Head-Based Sampling

The sampling decision is made at the edge, before the trace starts:

  • Probabilistic: Sample 10% of traces randomly
  • Rate-limiting: Sample at most 100 traces per second
  • Rule-based: Sample all traces matching specific criteria (e.g., admin users, high-value orders)

The advantage is simplicity. The disadvantage is that you might miss the exact trace showing a rare error.
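
In the OpenTelemetry Python SDK, probabilistic head sampling is a few lines of configuration. A minimal sketch with an illustrative 10% ratio:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of new root traces; child spans follow the parent's
# decision so a trace is never half-recorded across services.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))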

Tail-Based Sampling

The decision is deferred until the trace is complete. A collector buffers all spans and only exports traces meeting certain criteria:

  • Duration exceeds threshold (>2s)
  • Contains error status codes
  • Matches attribute filters (specific customer, specific endpoint)

Tail-based sampling captures exactly the traces you care about — errors and slow requests — but requires buffering infrastructure and adds latency to trace visibility.
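
Tail-based sampling usually lives in the OpenTelemetry Collector rather than in application code. A sketch of the collector-contrib tail_sampling processor matching the criteria above (policy names and thresholds are illustrative; tune to your traffic):

processors:
  tail_sampling:
    decision_wait: 10s        # buffer spans this long before deciding
    policies:
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 2000
      - name: errored-requests
        type: status_code
        status_code:
          status_codes: [ERROR]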

Practical Recommendation

Use a combination: head-based sampling at 5-10% for baseline, plus tail-based sampling for errors and latency outliers. This gives you statistical coverage and guaranteed visibility of problems.


Integrating Traces with Logs and Metrics

Traces become dramatically more powerful when correlated with logs and metrics.

Trace-Log Correlation

Inject the trace ID into every log line:

import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, '032x') if ctx.trace_id else ''
        record.span_id = format(ctx.span_id, '016x') if ctx.span_id else ''
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '%(asctime)s [%(trace_id)s:%(span_id)s] %(levelname)s %(message)s'
))
logger = logging.getLogger(__name__)
logger.addFilter(TraceIdFilter())
logger.addHandler(handler)

Now every log line carries its trace ID. When investigating a slow trace, you can query logs by trace ID and see every log message emitted during that request — across all services.

Trace-Metric Correlation (Exemplars)

Metrics tell you that P99 latency spiked. Traces tell you why. Exemplars link the two by attaching a trace ID to a metric data point.
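
Exemplar support in the OpenTelemetry SDKs is still maturing, but the Prometheus Python client accepts exemplars directly. A minimal sketch, assuming your /metrics endpoint serves the OpenMetrics exposition format:

from prometheus_client import Histogram
from opentelemetry import trace

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds"
)

def record_latency(seconds: float):
    ctx = trace.get_current_span().get_span_context()
    exemplar = None
    if ctx.is_valid and ctx.trace_flags.sampled:
        # Attach the current trace ID so dashboards can deep-link to the trace
        exemplar = {"trace_id": format(ctx.trace_id, "032x")}
    REQUEST_LATENCY.observe(seconds, exemplar=exemplar)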

When your dashboard shows a latency spike, clicking the exemplar takes you directly to the trace causing it.


Common Anti-Patterns

Span Explosion

Creating spans for every function call overwhelms your tracing backend and makes traces unreadable. A 500-span trace is noise. A 15-span trace with well-chosen boundaries tells a clear story.

Missing Context Propagation

If even one service in your chain fails to propagate context, the trace breaks into disconnected fragments. Check every service — including proxies, load balancers, and message brokers — for proper header forwarding.

Sensitive Data in Attributes

SQL statements, request bodies, and user emails in span attributes create compliance risks. Scrub sensitive fields before export, or use attribute allowlists.
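
A lightweight option is to route attribute writes through a helper so denylisted values never reach the exporter; a sketch with an illustrative key list (the collector's attribute processors are a heavier-duty alternative):

SENSITIVE_KEYS = {"user.email", "enduser.id", "db.statement"}  # illustrative denylist

def set_scrubbed_attribute(span, key: str, value):
    # Redact denylisted keys before they are recorded on the span
    if key in SENSITIVE_KEYS:
        value = "[REDACTED]"
    span.set_attribute(key, value)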

Ignoring Asynchronous Flows

Message queue consumers need explicit context extraction:

from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer("worker-service")

def process_message(message):
    # Extract the producer's trace context from the message headers
    ctx = extract(message.headers)
    with tracer.start_as_current_span("process_message", context=ctx):
        handle(message.body)

Without this, async work appears as disconnected traces.
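
The producer has the mirror-image duty: inject the current context into the message headers before publishing. A sketch, where queue.publish stands in for your broker client:

from opentelemetry.propagate import inject

def publish_message(queue, body: bytes):
    headers = {}
    inject(headers)  # writes traceparent/tracestate into the carrier dict
    queue.publish(headers=headers, body=body)  # hypothetical broker API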


Choosing a Backend

Backend          | Strengths                                     | Considerations
Jaeger           | Open source, battle-tested                    | Requires infrastructure management
Tempo (Grafana)  | Integrates with Grafana stack, cost-effective | Log-based storage can be slow
Honeycomb        | Best query/exploration UI                     | SaaS pricing at scale
Datadog APM      | Full observability platform                   | Expensive per-host licensing
AWS X-Ray        | Native AWS integration                        | Limited cross-cloud support

For most teams starting out, the OpenTelemetry Collector exporting to Jaeger or Tempo provides the best balance of capability and cost.


Getting Started Checklist

  1. Add OpenTelemetry SDK and auto-instrumentation to your most critical service
  2. Export to a local Jaeger instance for development
  3. Verify spans appear and traces connect across at least two services
  4. Add manual spans for key business operations
  5. Configure head-based sampling at 10%
  6. Inject trace IDs into structured logs
  7. Roll out to remaining services one at a time
  8. Add tail-based sampling once volume warrants it

Start with one service. Prove the value. Then expand. Tracing that covers 80% of your request flow is infinitely more valuable than a perfect instrumentation plan that never ships.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
