
Log Aggregation and Analysis: Making Sense of Distributed Systems

Build a log aggregation pipeline that collects, processes, and makes searchable the millions of log lines your distributed system produces. Covers structured logging standards, collection architectures, storage strategies, query patterns, retention policies, and the pipeline that turns noise into signal.

A single request in a modern distributed system can generate log lines from an API gateway, a load balancer, an authentication service, a business logic service, a database proxy, and a cache layer. The information you need to debug a problem is scattered across all of them.

Log aggregation collects these scattered log lines into a single searchable system where you can correlate events across services, trace a request end-to-end, and find the needle in a haystack of millions of log lines.


Structured Logging Standard

❌ Unstructured (grep-hostile):
  2024-03-15 14:23:45 INFO User john@example.com logged in from 192.168.1.1

✅ Structured (query-friendly):
  {
    "timestamp": "2024-03-15T14:23:45.123Z",
    "level": "info",
    "service": "auth-api",
    "message": "User login successful",
    "user_id": "usr_abc123",
    "email": "john@example.com",
    "ip": "192.168.1.1",
    "trace_id": "abc-def-123",
    "request_id": "req-789",
    "duration_ms": 145
  }

Required Fields

Field        Purpose                           Example
timestamp    When (ISO 8601, UTC)              2024-03-15T14:23:45.123Z
level        Severity                          debug, info, warn, error, fatal
service      Which service                     checkout-api, user-service
message      What happened (human-readable)    Payment processed successfully
trace_id     Distributed trace correlation     abc-def-123
request_id   Single request identifier         req-789

Optional But Valuable

Field         Purpose
user_id       Who triggered this action
duration_ms   How long the operation took
error_code    Specific error classification
stack_trace   For error logs, the full stack trace
environment   production, staging, development
version       Application version / commit hash
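
The required fields above can be emitted with a small helper built on nothing but the standard library. This is a minimal sketch, not a production logger; the `log` function name, the default `service` value, and the auto-generated `request_id` format are illustrative choices:

```python
import json
import sys
import uuid
from datetime import datetime, timezone

def log(level: str, message: str, service: str = "auth-api", **fields) -> None:
    """Emit one structured JSON log line to stdout with the required fields."""
    record = {
        "timestamp": datetime.now(timezone.utc)
                     .isoformat(timespec="milliseconds")
                     .replace("+00:00", "Z"),
        "level": level,
        "service": service,
        "message": message,
        # Generate a request_id when the caller does not supply one.
        "request_id": fields.pop("request_id", "req-" + uuid.uuid4().hex[:8]),
        **fields,  # optional fields: trace_id, user_id, duration_ms, ...
    }
    sys.stdout.write(json.dumps(record) + "\n")

log("info", "User login successful", trace_id="abc-def-123", duration_ms=145)
```

Writing one JSON object per line to stdout keeps the application ignorant of where logs end up, which is exactly what the collection architecture below relies on.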

Collection Architecture

Application Pods                  Collection               Storage & Search
┌─────────────┐
│ App + Logger │──→ stdout ──→ ┌──────────────┐     ┌───────────────────┐
└─────────────┘                │              │     │                   │
┌─────────────┐                │  Log Agent   │────→│  Central Store    │
│ App + Logger │──→ stdout ──→ │  (Fluent Bit,│     │  (Elasticsearch,  │
└─────────────┘                │   Vector,    │     │   Loki, Datadog,  │
┌─────────────┐                │   Fluentd)   │     │   CloudWatch)     │
│ App + Logger │──→ stdout ──→ │              │     │                   │
└─────────────┘                └──────────────┘     └───────────────────┘

                               Enrichment:
                               - Add K8s metadata
                               - Parse JSON
                               - Sample debug logs
                               - Buffer for reliability
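
The enrichment steps the agent performs can be approximated in Python. This is a sketch of the parse / enrich / sample logic for a single raw stdout line; the metadata field names and the default 10% debug sample rate are illustrative assumptions, not agent configuration:

```python
import json
import random

def enrich(raw_line: str, pod_metadata: dict, debug_sample_rate: float = 0.1):
    """Parse one raw stdout line, attach infrastructure metadata, sample debug logs.

    Returns the enriched record, or None if the line was sampled out.
    """
    try:
        record = json.loads(raw_line)  # parse JSON
    except json.JSONDecodeError:
        # Wrap non-JSON lines instead of dropping them.
        record = {"level": "info", "message": raw_line}

    # Keep only a fraction of debug logs before they reach (and cost money in)
    # the central store; random.random() is in [0.0, 1.0).
    if record.get("level") == "debug" and random.random() >= debug_sample_rate:
        return None

    record.update(pod_metadata)  # add K8s metadata
    return record
```

In a real deployment this logic lives in the agent (Fluent Bit filters, Vector transforms), not in application code; the sketch only shows what each stage does to a record.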

Tool Comparison

Tool                                          Type                 Best For                            Cost Model
ELK Stack (Elasticsearch, Logstash, Kibana)   Self-hosted          Full control, complex queries       Infrastructure cost
Grafana Loki                                  Self-hosted / Cloud  Cost-effective, label-based         Ingestion volume
Datadog Logs                                  SaaS                 Unified observability, ease of use  Per GB ingested
AWS CloudWatch                                SaaS (AWS)           AWS-native, simple setup            Per GB ingested + stored
Splunk                                        Self-hosted / SaaS   Enterprise, complex analytics       Per GB indexed (expensive)

Log Levels: When to Use What

Level   When                                    Example                              Volume
debug   Detailed diagnostic info (off in prod)  Cache key generated: user:abc:prefs  Very high
info    Normal operations worth recording       Order ORD-123 placed successfully    Medium
warn    Something unexpected but handled        Retry 2/3 for payment API            Low
error   Something failed but system continues   Payment failed: card declined        Low
fatal   System cannot continue                  Database connection pool exhausted   Very low

Production log level: INFO and above

  debug: OFF in production (unless debugging a specific issue)
         Turn on per-service, per-instance, with a feature flag
         Auto-disable after 30 minutes

  info:  ON — the primary log level for production
  warn:  ON — review weekly for patterns
  error: ON — alert, investigate, fix
  fatal: ON — page immediately

Retention and Cost Management

Log Type      Retention   Reason
Error logs    90 days     Investigation and trend analysis
Info logs     30 days     Recent debugging, audit
Debug logs    3 days      Only enabled temporarily
Audit logs    1-7 years   Compliance requirements
Access logs   90 days     Security investigations

# Lifecycle policy: move old logs to cheaper storage
lifecycle_policy:
  hot:
    duration: 7 days
    storage: SSD (fast search)

  warm:
    duration: 23 days
    storage: HDD (slower, cheaper)

  cold:
    duration: 60 days
    storage: S3 / object storage (cheapest, slow retrieval)

  delete:
    after: 90 days (info) / 365 days (audit)

Query Patterns

# Common investigation queries:

# 1. Find all errors for a specific trace
trace_id:"abc-def-123" AND level:error

# 2. Top errors in the last hour
level:error | stats count by message | sort count desc

# 3. Slow requests (> 2 seconds)
duration_ms:>2000 AND service:checkout-api

# 4. Specific user's activity
user_id:"usr_abc123" | sort timestamp asc

# 5. Error rate by service (last 24h)
level:error | stats count by service | eval error_rate = count/total_requests
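
When logs are available as JSON lines (for example, exported from the store for offline analysis), pattern #2 reduces to a few lines of Python with `collections.Counter`; the sample records here are illustrative:

```python
import json
from collections import Counter

def top_errors(log_lines, n=5):
    """Count error-level messages and return the n most common (pattern #2)."""
    counts = Counter(
        rec["message"]
        for rec in map(json.loads, log_lines)
        if rec.get("level") == "error"
    )
    return counts.most_common(n)

lines = [
    '{"level": "error", "message": "Payment failed: card declined"}',
    '{"level": "info",  "message": "Order placed"}',
    '{"level": "error", "message": "Payment failed: card declined"}',
]
print(top_errors(lines))  # → [('Payment failed: card declined', 2)]
```

This only works because the logs are structured: grouping by `message` on free-text lines would scatter one error across dozens of slightly different strings.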

Security Considerations

Practice                                   Why
Never log passwords, tokens, or API keys   Logs are often less secured than production systems
Mask PII (emails, SSNs, credit cards)      GDPR and PCI-DSS compliance
Encrypt logs in transit and at rest        Prevent unauthorized access to log data
Restrict log access by role                Not everyone needs to see all logs
Audit who queries logs                     Detect unauthorized data access

# PII masking before logging
import re

def mask_pii(log_message: str) -> str:
    # Mask email addresses
    log_message = re.sub(
        r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
        '***@***.***', log_message
    )
    # Mask credit card numbers
    log_message = re.sub(
        r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
        '****-****-****-****', log_message
    )
    return log_message
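
Regex masking suits free-text messages; for structured logs it is often simpler to redact by field name before serialization. This sketch assumes a hypothetical `SENSITIVE_FIELDS` set as a starting point, not a complete list for any compliance regime:

```python
import re

SENSITIVE_FIELDS = {"password", "token", "api_key", "ssn", "card_number"}
EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def redact_fields(record: dict) -> dict:
    """Return a copy of a structured log record with sensitive fields redacted."""
    clean = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            clean[key] = EMAIL_RE.sub("***@***.***", value)  # mask emails anywhere
        elif isinstance(value, dict):
            clean[key] = redact_fields(value)  # recurse into nested objects
        else:
            clean[key] = value
    return clean
```

Field-level redaction drops whole values by key, so it cannot miss a password the way a pattern-based masker can; the two approaches are complementary.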

Implementation Checklist

  • Adopt structured logging (JSON) with mandatory fields: timestamp, level, service, trace_id
  • Log to stdout/stderr — let the platform handle collection
  • Deploy a log agent (Fluent Bit, Vector) that enriches with infrastructure metadata
  • Choose a central log store matching your scale and budget
  • Set production log level to INFO (debug via feature flag, auto-disable after 30 min)
  • Implement PII masking before logs leave the application
  • Define retention policies per log type (error: 90d, info: 30d, audit: 1yr+)
  • Use lifecycle policies to move old logs to cheaper storage tiers
  • Create saved queries for common investigation patterns
  • Alert on error rate spikes, not individual errors
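
The last checklist item, alerting on error rate rather than individual errors, can be sketched as a threshold check over a sliding window of request outcomes; the window size and 5% threshold below are illustrative defaults, not recommendations:

```python
from collections import deque

class ErrorRateAlert:
    """Fire when the error fraction over the last `window` requests exceeds `threshold`."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.outcomes = deque(maxlen=window)  # True = error, False = success
        self.threshold = threshold

    def record(self, is_error: bool) -> bool:
        """Record one request outcome; return True if the alert should fire."""
        self.outcomes.append(is_error)
        # Wait for a full window so a single early error cannot trip the alert.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate > self.threshold
```

Rate-based alerting tolerates the background noise of occasional failures while still paging when a deploy or dependency outage pushes errors above the baseline.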
Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
