# Log Management & Centralized Logging
Build a centralized logging system. Covers structured logging, log aggregation with ELK/Loki, log levels, retention policies, and searching logs effectively in production.
Centralized logging is the difference between debugging in production and guessing in production. When a user reports “my order disappeared,” you need to search across 20 services to trace what happened. Without centralized logging, that means SSH-ing into 20 servers and grepping log files. With centralized logging, it’s a single query.
## Centralized Logging Architecture

```
Application Pods                Log Agent   Log Store             UI
┌───────────┐                 ┌───────────┐      ┌───────────────┐    ┌───────────┐
│ Service A ├─ stdout/err ──▶ │           │      │               │    │           │
│ Service B ├─ stdout/err ──▶ │  Fluent   │ ───▶ │ Elasticsearch │ ─▶ │  Kibana   │
│ Service C ├─ stdout/err ──▶ │  Bit /    │      │   (or Loki)   │    │  (or      │
│ Service D ├─ stdout/err ──▶ │  Fluentd  │      │               │    │  Grafana) │
└───────────┘                 └───────────┘      └───────────────┘    └───────────┘
```
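To make the agent hop concrete, here is a minimal Fluent Bit sketch in its classic config mode: tail container logs on a node, enrich with Kubernetes metadata, ship to Elasticsearch. The host, index, and path values are illustrative placeholders, not values prescribed by this guide.

```ini
# Minimal Fluent Bit sketch (classic config mode); values are placeholders.
[INPUT]
    Name    tail
    Path    /var/log/containers/*.log
    Tag     kube.*

[FILTER]
    Name    kubernetes               # enrich records with pod/namespace metadata
    Match   kube.*

[OUTPUT]
    Name    es
    Match   *
    Host    elasticsearch.logging.svc
    Port    9200
    Index   app-logs
```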
## Structured Logging

```python
# BAD: unstructured logging
logger.info(f"Order {order_id} created by user {user_id} for ${amount}")
# Output: "Order 12345 created by user 789 for $99.99"
# Problem: can't search, filter, or aggregate by order_id, user_id, or amount

# GOOD: structured logging (JSON)
# Note: with Python's stdlib logging, fields passed via `extra` only show up
# in the output if a JSON formatter renders them (see the sketch below).
logger.info("order_created", extra={
    "order_id": order_id,
    "user_id": user_id,
    "amount": amount,
    "currency": "USD",
    "items_count": len(items),
    "payment_method": "credit_card",
})
# Output: {"message": "order_created", "order_id": 12345,
#          "user_id": 789, "amount": 99.99, ...}
# Benefit: search by any field, aggregate, alert
```
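The JSON output above presupposes a JSON formatter on the handler. A minimal stdlib-only sketch follows (a library such as python-json-logger works equally well):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line; `extra` fields become keys."""
    # Attributes every LogRecord has by default; anything else came via `extra`.
    _RESERVED = set(vars(logging.makeLogRecord({})))

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge the structured fields passed via logger.info(..., extra={...})
        payload.update({k: v for k, v in vars(record).items()
                        if k not in self._RESERVED})
        return json.dumps(payload, default=str)

handler = logging.StreamHandler()  # stdout/err, matching the architecture above
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])
```

With this in place, the `extra` fields from the GOOD example land as top-level JSON keys that the log store can index.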
## Log Levels
| Level | When to Use | Example |
|---|---|---|
| ERROR | Something broke, needs attention | Database connection failed |
| WARN | Something unexpected, not broken yet | Retry succeeded after failure |
| INFO | Normal business events | Order created, user logged in |
| DEBUG | Development/troubleshooting details | Query parameters, cache hit/miss |
| TRACE | Very verbose, rarely needed | Full request/response bodies |
### Production Log Level Strategy

```yaml
production:
  default_level: INFO
  per_service_override:
    order-service: INFO
    payment-service: INFO          # keep INFO for compliance audit trail
    recommendation-service: WARN   # high volume, reduce noise
  temporary_debug:
    # Enable DEBUG for a specific user/request while troubleshooting
    mechanism: feature_flag
    example: "debug_logging_user_789 = true"
    auto_expire: "30 minutes"
```
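One way to realize the `temporary_debug` mechanism above is a logging filter keyed on a feature flag. In this sketch the flag store is a hypothetical in-process dict standing in for a real feature-flag service, with auto-expiry handled by a timestamp:

```python
import logging
import time

# Hypothetical in-process flag store; in production this would be your
# feature-flag service or config store. Value = expiry timestamp.
_DEBUG_FLAGS = {"debug_logging_user_789": time.time() + 30 * 60}

class PerUserDebugFilter(logging.Filter):
    """Pass DEBUG records only for users with an unexpired debug flag."""
    def filter(self, record):
        if record.levelno >= logging.INFO:
            return True  # INFO and above always pass
        user_id = getattr(record, "user_id", None)
        expiry = _DEBUG_FLAGS.get(f"debug_logging_user_{user_id}", 0)
        return time.time() < expiry

# The logger itself must allow DEBUG so the filter gets to decide:
logger = logging.getLogger("order-service")
logger.setLevel(logging.DEBUG)
logger.addFilter(PerUserDebugFilter())
logger.debug("cache_miss", extra={"user_id": 789})  # emitted only while flagged
```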
## Platform Comparison
| Feature | ELK Stack | Grafana Loki | Datadog Logs | CloudWatch |
|---|---|---|---|---|
| Cost model | Self-hosted + storage | Low (labels, not full-text) | Per GB ingested | Per GB ingested |
| Full-text search | Excellent | Limited (label-based) | Good | Basic |
| Scalability | Complex to scale | Simple (S3 backend) | Managed | Managed |
| Best for | Large-scale search | Kubernetes + Grafana | All-in-one platform | AWS-native |
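To make the comparison concrete, here is the intro's "single query" against an ELK-style store, sketched with the official elasticsearch Python client (8.x-style API). The index name and field names follow the structured-logging example above and are assumptions:

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://elasticsearch.logging.svc:9200")

# Find every order_created event for user 789, across all services at once.
resp = es.search(
    index="app-logs",
    query={"bool": {"filter": [
        {"term": {"user_id": 789}},
        {"match": {"message": "order_created"}},
    ]}},
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```

In Loki the equivalent is roughly the one-line LogQL query `{app="order-service"} | json | user_id="789"`, filtering on extracted labels rather than full-text.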
## Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Unstructured log messages | Can’t search or aggregate | Structured JSON logging |
| DEBUG level in production | Storage explosion, noise | INFO default, DEBUG via feature flag |
| Logging PII | Compliance violation | Sanitize PII before logging |
| No correlation ID | Can’t trace request across services | Inject trace ID in every log entry (see the sketch after this table) |
| No log retention policy | Storage grows forever | Retention: hot (7 days), warm (30 days), cold (90 days) |
| Alerting on log patterns | Fragile, breaks on message changes | Alert on structured fields, not text patterns |
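For the correlation-ID fix referenced above, here is a minimal sketch using Python's `contextvars`; the middleware hook and the `X-Request-ID` header name are assumptions about your HTTP stack:

```python
import contextvars
import logging
import uuid

# Request-scoped trace ID; middleware sets it once per incoming request.
_trace_id = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Stamp every record with the current request's trace ID."""
    def filter(self, record):
        record.trace_id = _trace_id.get()
        return True  # enriches records, never drops them

def on_request_start(headers: dict) -> None:
    """Call from your framework's request middleware (framework-specific)."""
    _trace_id.set(headers.get("X-Request-ID", uuid.uuid4().hex))

# Attach to the shared handler so every logger's records get stamped, e.g.:
# handler.addFilter(TraceIdFilter())
```

Combined with the JSON formatter above, `trace_id` becomes a searchable field in every log entry, which is exactly what tracing a request across 20 services requires.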
## Checklist
- Centralized logging platform deployed (ELK, Loki, or managed)
- All services: structured JSON logging to stdout
- Correlation ID (trace ID) in every log entry
- Log levels: INFO in production, DEBUG via feature flag
- PII sanitized from all log output (see the sketch after this checklist)
- Retention policy: hot/warm/cold tiers
- Dashboards: error rates, log volume by service
- Alerts: structured field-based, not text pattern matching
- Log access controls: who can see which logs
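For the PII item above, a minimal redaction sketch; the sensitive-key list and regex are illustrative assumptions to extend for your own data inventory:

```python
import logging
import re

# Illustrative lists; extend for your own PII inventory.
_SENSITIVE_KEYS = {"email", "ssn", "card_number", "password"}
_EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

class PiiRedactingFilter(logging.Filter):
    """Mask known-sensitive structured fields and obvious patterns in messages."""
    def filter(self, record):
        # Redact structured fields that arrived via `extra`
        for key in _SENSITIVE_KEYS & set(vars(record)):
            setattr(record, key, "[REDACTED]")
        # Catch obvious PII embedded in the message text itself
        record.msg = _EMAIL_RE.sub("[REDACTED_EMAIL]", str(record.msg))
        return True
```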
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For logging architecture consulting, visit garnetgrid.com.
:::