# Log Management & Centralized Logging
Build a centralized logging system. Covers structured logging, log aggregation with ELK/Loki, log levels, retention policies, and searching logs effectively in production.
Centralized logging is the difference between debugging in production and guessing in production. When a user reports “my order disappeared,” you need to search across 20 services to trace what happened. Without centralized logging, that means SSH-ing into 20 servers and grepping log files. With centralized logging, it’s a single query.
## Centralized Logging Architecture

```
Application Pods                Log Agent   Log Store             UI
┌───────────┐                 ┌───────────┐      ┌───────────────┐    ┌───────────┐
│ Service A ├─ stdout/err ──▶ │           │      │               │    │           │
│ Service B ├─ stdout/err ──▶ │  Fluent   │ ───▶ │ Elasticsearch │ ─▶ │  Kibana   │
│ Service C ├─ stdout/err ──▶ │  Bit /    │      │   (or Loki)   │    │  (or      │
│ Service D ├─ stdout/err ──▶ │  Fluentd  │      │               │    │  Grafana) │
└───────────┘                 └───────────┘      └───────────────┘    └───────────┘
```
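To make the agent hop concrete, here is a minimal Fluent Bit sketch in its classic config mode: tail container logs on a node, enrich with Kubernetes metadata, ship to Elasticsearch. The host, index, and path values are illustrative placeholders, not values prescribed by this guide.

```ini
# Minimal Fluent Bit sketch (classic config mode); values are placeholders.
[INPUT]
    Name    tail
    Path    /var/log/containers/*.log
    Tag     kube.*

[FILTER]
    Name    kubernetes               # enrich records with pod/namespace metadata
    Match   kube.*

[OUTPUT]
    Name    es
    Match   *
    Host    elasticsearch.logging.svc
    Port    9200
    Index   app-logs
```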
## Structured Logging

```python
# BAD: unstructured logging
logger.info(f"Order {order_id} created by user {user_id} for ${amount}")
# Output: "Order 12345 created by user 789 for $99.99"
# Problem: can't search, filter, or aggregate by order_id, user_id, or amount

# GOOD: structured logging (JSON)
# Note: with Python's stdlib logging, fields passed via `extra` only show up
# in the output if a JSON formatter renders them (see the sketch below).
logger.info("order_created", extra={
    "order_id": order_id,
    "user_id": user_id,
    "amount": amount,
    "currency": "USD",
    "items_count": len(items),
    "payment_method": "credit_card",
})
# Output: {"message": "order_created", "order_id": 12345,
#          "user_id": 789, "amount": 99.99, ...}
# Benefit: search by any field, aggregate, alert
```
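The JSON output above presupposes a JSON formatter on the handler. A minimal stdlib-only sketch follows (a library such as python-json-logger works equally well):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line; `extra` fields become keys."""
    # Attributes every LogRecord has by default; anything else came via `extra`.
    _RESERVED = set(vars(logging.makeLogRecord({})))

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge the structured fields passed via logger.info(..., extra={...})
        payload.update({k: v for k, v in vars(record).items()
                        if k not in self._RESERVED})
        return json.dumps(payload, default=str)

handler = logging.StreamHandler()  # stdout/err, matching the architecture above
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])
```

With this in place, the `extra` fields from the GOOD example land as top-level JSON keys that the log store can index.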
## Log Levels
| Level | When to Use | Example |
|---|---|---|
| ERROR | Something broke, needs attention | Database connection failed |
| WARN | Something unexpected, not broken yet | Retry succeeded after failure |
| INFO | Normal business events | Order created, user logged in |
| DEBUG | Development/troubleshooting details | Query parameters, cache hit/miss |
| TRACE | Very verbose, rarely needed | Full request/response bodies |
### Production Log Level Strategy

```yaml
production:
  default_level: INFO
  per_service_override:
    order-service: INFO
    payment-service: INFO          # keep INFO for compliance audit trail
    recommendation-service: WARN   # high volume, reduce noise
  temporary_debug:
    # Enable DEBUG for a specific user/request while troubleshooting
    mechanism: feature_flag
    example: "debug_logging_user_789 = true"
    auto_expire: "30 minutes"
```
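One way to realize the `temporary_debug` mechanism above is a logging filter keyed on a feature flag. In this sketch the flag store is a hypothetical in-process dict standing in for a real feature-flag service, with auto-expiry handled by a timestamp:

```python
import logging
import time

# Hypothetical in-process flag store; in production this would be your
# feature-flag service or config store. Value = expiry timestamp.
_DEBUG_FLAGS = {"debug_logging_user_789": time.time() + 30 * 60}

class PerUserDebugFilter(logging.Filter):
    """Pass DEBUG records only for users with an unexpired debug flag."""
    def filter(self, record):
        if record.levelno >= logging.INFO:
            return True  # INFO and above always pass
        user_id = getattr(record, "user_id", None)
        expiry = _DEBUG_FLAGS.get(f"debug_logging_user_{user_id}", 0)
        return time.time() < expiry

# The logger itself must allow DEBUG so the filter gets to decide:
logger = logging.getLogger("order-service")
logger.setLevel(logging.DEBUG)
logger.addFilter(PerUserDebugFilter())
logger.debug("cache_miss", extra={"user_id": 789})  # emitted only while flagged
```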
## Platform Comparison
| Feature | ELK Stack | Grafana Loki | Datadog Logs | CloudWatch |
|---|---|---|---|---|
| Cost model | Self-hosted + storage | Low (labels, not full-text) | Per GB ingested | Per GB ingested |
| Full-text search | Excellent | Limited (label-based) | Good | Basic |
| Scalability | Complex to scale | Simple (S3 backend) | Managed | Managed |
| Best for | Large-scale search | Kubernetes + Grafana | All-in-one platform | AWS-native |
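To make the comparison concrete, here is the intro's "single query" against an ELK-style store, sketched with the official elasticsearch Python client (8.x-style API). The index name and field names follow the structured-logging example above and are assumptions:

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://elasticsearch.logging.svc:9200")

# Find every order_created event for user 789, across all services at once.
resp = es.search(
    index="app-logs",
    query={"bool": {"filter": [
        {"term": {"user_id": 789}},
        {"match": {"message": "order_created"}},
    ]}},
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```

In Loki the equivalent is roughly the one-line LogQL query `{app="order-service"} | json | user_id="789"`, filtering on extracted labels rather than full-text.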
## Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Unstructured log messages | Can’t search or aggregate | Structured JSON logging |
| DEBUG level in production | Storage explosion, noise | INFO default, DEBUG via feature flag |
| Logging PII | Compliance violation | Sanitize PII before logging |
| No correlation ID | Can’t trace request across services | Inject trace ID in every log entry (see the sketch after this table) |
| No log retention policy | Storage grows forever | Retention: hot (7 days), warm (30 days), cold (90 days) |
| Alerting on log patterns | Fragile, breaks on message changes | Alert on structured fields, not text patterns |
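For the correlation-ID fix referenced above, here is a minimal sketch using Python's `contextvars`; the middleware hook and the `X-Request-ID` header name are assumptions about your HTTP stack:

```python
import contextvars
import logging
import uuid

# Request-scoped trace ID; middleware sets it once per incoming request.
_trace_id = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Stamp every record with the current request's trace ID."""
    def filter(self, record):
        record.trace_id = _trace_id.get()
        return True  # enriches records, never drops them

def on_request_start(headers: dict) -> None:
    """Call from your framework's request middleware (framework-specific)."""
    _trace_id.set(headers.get("X-Request-ID", uuid.uuid4().hex))

# Attach to the shared handler so every logger's records get stamped, e.g.:
# handler.addFilter(TraceIdFilter())
```

Combined with the JSON formatter above, `trace_id` becomes a searchable field in every log entry, which is exactly what tracing a request across 20 services requires.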
## Checklist
- Centralized logging platform deployed (ELK, Loki, or managed)
- All services: structured JSON logging to stdout
- Correlation ID (trace ID) in every log entry
- Log levels: INFO in production, DEBUG via feature flag
- PII sanitized from all log output (see the sketch after this checklist)
- Retention policy: hot/warm/cold tiers
- Dashboards: error rates, log volume by service
- Alerts: structured field-based, not text pattern matching
- Log access controls: who can see which logs
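For the PII item above, a minimal redaction sketch; the sensitive-key list and regex are illustrative assumptions to extend for your own data inventory:

```python
import logging
import re

# Illustrative lists; extend for your own PII inventory.
_SENSITIVE_KEYS = {"email", "ssn", "card_number", "password"}
_EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

class PiiRedactingFilter(logging.Filter):
    """Mask known-sensitive structured fields and obvious patterns in messages."""
    def filter(self, record):
        # Redact structured fields that arrived via `extra`
        for key in _SENSITIVE_KEYS & set(vars(record)):
            setattr(record, key, "[REDACTED]")
        # Catch obvious PII embedded in the message text itself
        record.msg = _EMAIL_RE.sub("[REDACTED_EMAIL]", str(record.msg))
        return True
```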
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For logging architecture consulting, visit garnetgrid.com.
:::