A single request in a modern distributed system can generate log lines from an API gateway, a load balancer, an authentication service, a business logic service, a database proxy, and a cache layer. The information you need to debug a problem is scattered across all of them.
Log aggregation collects these scattered log lines into a single searchable system where you can correlate events across services, trace a request end-to-end, and find the needle in a haystack of millions of log lines.
Structured Logging Standard
❌ Unstructured (grep-hostile):
2024-03-15 14:23:45 INFO User john@example.com logged in from 192.168.1.1
✅ Structured (query-friendly):
{
"timestamp": "2024-03-15T14:23:45.123Z",
"level": "info",
"service": "auth-api",
"message": "User login successful",
"user_id": "usr_abc123",
"email": "john@example.com",
"ip": "192.168.1.1",
"trace_id": "abc-def-123",
"request_id": "req-789",
"duration_ms": 145
}
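
A minimal way to emit this shape from application code, using only Python's standard logging module, is sketched below. The formatter and the hard-coded service name are illustrative assumptions, not a prescribed library; structured-logging libraries such as structlog achieve the same result with less code.

# Sketch: JSON log lines via the standard library. The service name and the
# extra fields are illustrative assumptions; adjust to your own schema.
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc)
                         .isoformat(timespec="milliseconds")
                         .replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service": "auth-api",            # hypothetical service name
            "message": record.getMessage(),
        }
        # Copy structured context passed via logger.info(..., extra={...})
        for key in ("user_id", "email", "ip", "trace_id", "request_id", "duration_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)   # stdout, so the log agent can collect it
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("auth-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User login successful",
            extra={"user_id": "usr_abc123", "email": "john@example.com",
                   "ip": "192.168.1.1", "trace_id": "abc-def-123",
                   "request_id": "req-789", "duration_ms": 145})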
Required Fields
| Field | Purpose | Example |
|---|---|---|
| timestamp | When (ISO 8601, UTC) | 2024-03-15T14:23:45.123Z |
| level | Severity | debug, info, warn, error, fatal |
| service | Which service | checkout-api, user-service |
| message | What happened (human-readable) | Payment processed successfully |
| trace_id | Distributed trace correlation | abc-def-123 |
| request_id | Single request identifier | req-789 |
Optional But Valuable
| Field | Purpose |
|---|---|
| user_id | Who triggered this action |
| duration_ms | How long the operation took |
| error_code | Specific error classification |
| stack_trace | For error logs, the full stack trace |
| environment | production, staging, development |
| version | Application version / commit hash |
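
Threading trace_id and request_id through every call site is error-prone; a common pattern is to hold them in request-scoped context and copy them onto each record with a logging.Filter. The sketch below assumes the context is set once per request (for example in middleware) and that a JSON formatter like the one above copies the attributes out.

# Sketch: request-scoped log context via contextvars + a logging.Filter.
# The middleware hook that sets the context is assumed, not shown.
import contextvars
import logging

request_context: contextvars.ContextVar[dict] = contextvars.ContextVar(
    "request_context", default={}
)

class ContextFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Copy trace_id, request_id, user_id, ... onto every record
        for key, value in request_context.get().items():
            setattr(record, key, value)
        return True

# Set once at the start of each request (e.g. in middleware):
request_context.set({"trace_id": "abc-def-123", "request_id": "req-789",
                     "user_id": "usr_abc123"})

logger = logging.getLogger("checkout-api")
logger.addFilter(ContextFilter())
logger.info("Order ORD-123 placed successfully")   # carries the context fields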
Collection Architecture
Application Pods                    Collection           Storage & Search

┌───────────────┐
│ App + Logger  │──→ stdout ──→ ┌───────────────┐     ┌────────────────────┐
└───────────────┘               │               │     │                    │
┌───────────────┐               │   Log Agent   │────→│   Central Store    │
│ App + Logger  │──→ stdout ──→ │  (Fluent Bit, │     │  (Elasticsearch,   │
└───────────────┘               │    Vector,    │     │   Loki, Datadog,   │
┌───────────────┐               │    Fluentd)   │     │    CloudWatch)     │
│ App + Logger  │──→ stdout ──→ │               │     │                    │
└───────────────┘               └───────────────┘     └────────────────────┘
                                        │
                                  Enrichment:
                                  - Add K8s metadata
                                  - Parse JSON
                                  - Sample debug logs
                                  - Buffer for reliability
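
The enrichment step is normally configuration in Fluent Bit, Vector, or Fluentd rather than custom code, but the per-line logic it applies looks roughly like the sketch below. The pod metadata, sample rate, batch size, and flush_to_central_store function are all made-up stand-ins.

# Illustrative sketch of the per-line pipeline a log agent runs:
# parse JSON, add K8s metadata, sample debug logs, buffer before shipping.
# In practice this is Fluent Bit / Vector / Fluentd configuration, not code;
# the metadata, sample rate, batch size, and shipping function are made up.
import json
import random

POD_METADATA = {"k8s_namespace": "prod", "k8s_pod": "checkout-api-7d9f"}
DEBUG_SAMPLE_RATE = 0.1          # keep ~10% of debug lines
BATCH_SIZE = 500
buffer: list[dict] = []

def flush_to_central_store(batch: list[dict]) -> None:
    # Stand-in for a batched HTTP POST to Elasticsearch / Loki / etc.
    print(f"shipping {len(batch)} records")

def process_line(raw_line: str) -> None:
    try:
        record = json.loads(raw_line)        # parse structured JSON
    except json.JSONDecodeError:
        record = {"message": raw_line}       # fall back to raw text
    record.update(POD_METADATA)              # enrich with K8s metadata
    if record.get("level") == "debug" and random.random() > DEBUG_SAMPLE_RATE:
        return                               # sample out most debug lines
    buffer.append(record)                    # buffer for reliability
    if len(buffer) >= BATCH_SIZE:
        flush_to_central_store(buffer)
        buffer.clear()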
| Tool | Type | Best For | Cost Model |
|---|---|---|---|
| ELK Stack (Elastic, Logstash, Kibana) | Self-hosted | Full control, complex queries | Infrastructure cost |
| Grafana Loki | Self-hosted / Cloud | Cost-effective, label-based | Ingestion volume |
| Datadog Logs | SaaS | Unified observability, ease of use | Per GB ingested |
| AWS CloudWatch | SaaS (AWS) | AWS-native, simple setup | Per GB ingested + stored |
| Splunk | Self-hosted / SaaS | Enterprise, complex analytics | Per GB indexed (expensive) |
Log Levels: When to Use What
| Level | When | Example | Volume |
|---|---|---|---|
| debug | Detailed diagnostic info (dev only; off in prod) | Cache key generated: user:abc:prefs | Very high |
| info | Normal operations worth recording | Order ORD-123 placed successfully | Medium |
| warn | Something unexpected but handled | Retry 2/3 for payment API | Low |
| error | Something failed but system continues | Payment failed: card declined | Low |
| fatal | System cannot continue | Database connection pool exhausted | Very low |
Production log level: INFO and above

- debug: OFF in production (unless debugging a specific issue)
  - Turn on per-service, per-instance, with a feature flag (see the sketch after this list)
  - Auto-disable after 30 minutes
- info: ON — the primary log level for production
- warn: ON — review weekly for patterns
- error: ON — alert, investigate, fix
- fatal: ON — page immediately
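
A sketch of that policy with the standard logging module: default to INFO, and let a feature flag raise one service instance to DEBUG with an automatic revert. The flag check is a hypothetical placeholder for whatever flag system is in use.

# Sketch: INFO by default in production, with a temporary per-instance DEBUG
# override that reverts automatically after 30 minutes. The feature-flag check
# is a hypothetical placeholder.
import logging
import threading

logger = logging.getLogger("checkout-api")
logger.setLevel(logging.INFO)                 # production default

def enable_debug_temporarily(minutes: int = 30) -> None:
    logger.setLevel(logging.DEBUG)
    logger.warning("DEBUG logging enabled for %d minutes", minutes)
    # Revert automatically so a debugging session can't inflate volume forever
    timer = threading.Timer(minutes * 60, lambda: logger.setLevel(logging.INFO))
    timer.daemon = True
    timer.start()

# if feature_flags.is_enabled("checkout-api-debug-logs"):   # hypothetical flag API
#     enable_debug_temporarily(30)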
Retention and Cost Management
| Log Type | Retention | Reason |
|---|---|---|
| Error logs | 90 days | Investigation and trend analysis |
| Info logs | 30 days | Recent debugging, audit |
| Debug logs | 3 days | Only enabled temporarily |
| Audit logs | 1-7 years | Compliance requirements |
| Access logs | 90 days | Security investigations |
# Lifecycle policy: move old logs to cheaper storage
lifecycle_policy:
  hot:
    duration: 7 days
    storage: SSD (fast search)
  warm:
    duration: 23 days
    storage: HDD (slower, cheaper)
  cold:
    duration: 60 days
    storage: S3 / object storage (cheapest, slow retrieval)
  delete:
    after: 90 days (info) / 365 days (audit)
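
If logs land in CloudWatch, the retention table maps to a per-log-group retention policy. A sketch with boto3 follows; the log group names are assumptions, and the audit value would be longer where compliance requires it.

# Sketch: apply the retention table to CloudWatch log groups with boto3.
# Log group names are assumptions; adjust to your naming scheme.
import boto3

logs = boto3.client("logs")

RETENTION_DAYS = {
    "/prod/error-logs": 90,
    "/prod/info-logs": 30,
    "/prod/debug-logs": 3,
    "/prod/audit-logs": 365,     # or longer where compliance requires it
    "/prod/access-logs": 90,
}

for group, days in RETENTION_DAYS.items():
    logs.put_retention_policy(logGroupName=group, retentionInDays=days)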
Query Patterns
# Common investigation queries:
# 1. Find all errors for a specific trace
trace_id:"abc-def-123" AND level:error
# 2. Top errors in the last hour
level:error | stats count by message | sort count desc
# 3. Slow requests (> 2 seconds)
duration_ms:>2000 AND service:checkout-api
# 4. Specific user's activity
user_id:"usr_abc123" | sort timestamp asc
# 5. Error rate by service (last 24h)
level:error | stats count by service | eval error_rate = count/total_requests
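
The same investigations can be run programmatically. Below is a sketch of queries 1 and 3 expressed as Elasticsearch query DSL via the official Python client (8.x style); the index pattern logs-* is an assumption.

# Sketch: queries 1 and 3 above as Elasticsearch query DSL, using the 8.x
# Python client. The index pattern "logs-*" is an assumption.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1. All errors for a specific trace
errors_for_trace = es.search(index="logs-*", query={
    "bool": {"filter": [
        {"term": {"trace_id": "abc-def-123"}},
        {"term": {"level": "error"}},
    ]}
})

# 3. Slow checkout requests (> 2 seconds)
slow_requests = es.search(index="logs-*", query={
    "bool": {"filter": [
        {"range": {"duration_ms": {"gt": 2000}}},
        {"term": {"service": "checkout-api"}},
    ]}
})

for hit in errors_for_trace["hits"]["hits"]:
    print(hit["_source"]["message"])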
Security Considerations
| Practice | Why |
|---|---|
| Never log passwords, tokens, or API keys | Logs are often less secured than production systems |
| Mask PII (emails, SSNs, credit cards) | GDPR, PCI-DSS compliance |
| Encrypt logs in transit and at rest | Prevent unauthorized access to log data |
| Restrict log access by role | Not everyone needs to see all logs |
| Audit who queries logs | Detect unauthorized data access |
# PII masking before logging
import re
def mask_pii(log_message: str) -> str:
    # Mask email addresses
    log_message = re.sub(
        r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
        '***@***.***', log_message
    )
    # Mask credit card numbers
    log_message = re.sub(
        r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
        '****-****-****-****', log_message
    )
    return log_message
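
Rather than calling mask_pii at every call site, one option is to attach it as a logging.Filter so every record is masked before it reaches a handler. The sketch below reuses mask_pii from the snippet above.

# Sketch: run every log message through mask_pii() automatically by attaching
# it as a logging.Filter, so no call site can forget to mask.
import logging

class PiiMaskingFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = mask_pii(record.getMessage())
        record.args = ()   # message is already fully rendered; drop lazy args
        return True

logging.basicConfig(level=logging.INFO)
logging.getLogger().addFilter(PiiMaskingFilter())
logging.warning("Card 4111-1111-1111-1111 declined for john@example.com")
# logged as: Card ****-****-****-**** declined for ***@***.***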
Implementation Checklist