A single request in a modern distributed system can generate log lines from an API gateway, a load balancer, an authentication service, a business logic service, a database proxy, and a cache layer. The information you need to debug a problem is scattered across all of them.
Log aggregation collects these scattered log lines into a single searchable system where you can correlate events across services, trace a request end-to-end, and find the needle in a haystack of millions of log lines.
Structured Logging Standard
❌ Unstructured (grep-hostile):
2024-03-15 14:23:45 INFO User john@example.com logged in from 192.168.1.1
✅ Structured (query-friendly):
{
"timestamp": "2024-03-15T14:23:45.123Z",
"level": "info",
"service": "auth-api",
"message": "User login successful",
"user_id": "usr_abc123",
"email": "john@example.com",
"ip": "192.168.1.1",
"trace_id": "abc-def-123",
"request_id": "req-789",
"duration_ms": 145
}
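
A minimal way to emit this shape from application code, using only Python's standard logging module, is sketched below. The formatter and the hard-coded service name are illustrative assumptions, not a prescribed library; structured-logging libraries such as structlog achieve the same result with less code.

# Sketch: JSON log lines via the standard library. The service name and the
# extra fields are illustrative assumptions; adjust to your own schema.
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc)
                         .isoformat(timespec="milliseconds")
                         .replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service": "auth-api",            # hypothetical service name
            "message": record.getMessage(),
        }
        # Copy structured context passed via logger.info(..., extra={...})
        for key in ("user_id", "email", "ip", "trace_id", "request_id", "duration_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)   # stdout, so the log agent can collect it
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("auth-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User login successful",
            extra={"user_id": "usr_abc123", "email": "john@example.com",
                   "ip": "192.168.1.1", "trace_id": "abc-def-123",
                   "request_id": "req-789", "duration_ms": 145})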
Required Fields
| Field | Purpose | Example |
|---|---|---|
| timestamp | When (ISO 8601, UTC) | 2024-03-15T14:23:45.123Z |
| level | Severity | debug, info, warn, error, fatal |
| service | Which service | checkout-api, user-service |
| message | What happened (human-readable) | Payment processed successfully |
| trace_id | Distributed trace correlation | abc-def-123 |
| request_id | Single request identifier | req-789 |
Optional But Valuable
| Field | Purpose |
|---|---|
| user_id | Who triggered this action |
| duration_ms | How long the operation took |
| error_code | Specific error classification |
| stack_trace | For error logs, the full stack trace |
| environment | production, staging, development |
| version | Application version / commit hash |
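
Threading trace_id and request_id through every call site is error-prone; a common pattern is to hold them in request-scoped context and copy them onto each record with a logging.Filter. The sketch below assumes the context is set once per request (for example in middleware) and that a JSON formatter like the one above copies the attributes out.

# Sketch: request-scoped log context via contextvars + a logging.Filter.
# The middleware hook that sets the context is assumed, not shown.
import contextvars
import logging

request_context: contextvars.ContextVar[dict] = contextvars.ContextVar(
    "request_context", default={}
)

class ContextFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Copy trace_id, request_id, user_id, ... onto every record
        for key, value in request_context.get().items():
            setattr(record, key, value)
        return True

# Set once at the start of each request (e.g. in middleware):
request_context.set({"trace_id": "abc-def-123", "request_id": "req-789",
                     "user_id": "usr_abc123"})

logger = logging.getLogger("checkout-api")
logger.addFilter(ContextFilter())
logger.info("Order ORD-123 placed successfully")   # carries the context fields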
Collection Architecture
Application Pods                    Collection           Storage & Search

┌───────────────┐
│ App + Logger  │──→ stdout ──→ ┌───────────────┐     ┌────────────────────┐
└───────────────┘               │               │     │                    │
┌───────────────┐               │   Log Agent   │────→│   Central Store    │
│ App + Logger  │──→ stdout ──→ │  (Fluent Bit, │     │  (Elasticsearch,   │
└───────────────┘               │    Vector,    │     │   Loki, Datadog,   │
┌───────────────┐               │    Fluentd)   │     │    CloudWatch)     │
│ App + Logger  │──→ stdout ──→ │               │     │                    │
└───────────────┘               └───────────────┘     └────────────────────┘
                                        │
                                  Enrichment:
                                  - Add K8s metadata
                                  - Parse JSON
                                  - Sample debug logs
                                  - Buffer for reliability
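
The enrichment step is normally configuration in Fluent Bit, Vector, or Fluentd rather than custom code, but the per-line logic it applies looks roughly like the sketch below. The pod metadata, sample rate, batch size, and flush_to_central_store function are all made-up stand-ins.

# Illustrative sketch of the per-line pipeline a log agent runs:
# parse JSON, add K8s metadata, sample debug logs, buffer before shipping.
# In practice this is Fluent Bit / Vector / Fluentd configuration, not code;
# the metadata, sample rate, batch size, and shipping function are made up.
import json
import random

POD_METADATA = {"k8s_namespace": "prod", "k8s_pod": "checkout-api-7d9f"}
DEBUG_SAMPLE_RATE = 0.1          # keep ~10% of debug lines
BATCH_SIZE = 500
buffer: list[dict] = []

def flush_to_central_store(batch: list[dict]) -> None:
    # Stand-in for a batched HTTP POST to Elasticsearch / Loki / etc.
    print(f"shipping {len(batch)} records")

def process_line(raw_line: str) -> None:
    try:
        record = json.loads(raw_line)        # parse structured JSON
    except json.JSONDecodeError:
        record = {"message": raw_line}       # fall back to raw text
    record.update(POD_METADATA)              # enrich with K8s metadata
    if record.get("level") == "debug" and random.random() > DEBUG_SAMPLE_RATE:
        return                               # sample out most debug lines
    buffer.append(record)                    # buffer for reliability
    if len(buffer) >= BATCH_SIZE:
        flush_to_central_store(buffer)
        buffer.clear()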
| Tool | Type | Best For | Cost Model |
|---|---|---|---|
| ELK Stack (Elastic, Logstash, Kibana) | Self-hosted | Full control, complex queries | Infrastructure cost |
| Grafana Loki | Self-hosted / Cloud | Cost-effective, label-based | Ingestion volume |
| Datadog Logs | SaaS | Unified observability, ease of use | Per GB ingested |
| AWS CloudWatch | SaaS (AWS) | AWS-native, simple setup | Per GB ingested + stored |
| Splunk | Self-hosted / SaaS | Enterprise, complex analytics | Per GB indexed (expensive) |
Log Levels: When to Use What
| Level | When | Example | Volume |
|---|---|---|---|
| debug | Detailed diagnostic info (dev only; off in prod) | Cache key generated: user:abc:prefs | Very high |
| info | Normal operations worth recording | Order ORD-123 placed successfully | Medium |
| warn | Something unexpected but handled | Retry 2/3 for payment API | Low |
| error | Something failed but system continues | Payment failed: card declined | Low |
| fatal | System cannot continue | Database connection pool exhausted | Very low |
Production log level: INFO and above

- debug: OFF in production (unless debugging a specific issue)
  - Turn on per-service, per-instance, with a feature flag (see the sketch after this list)
  - Auto-disable after 30 minutes
- info: ON — the primary log level for production
- warn: ON — review weekly for patterns
- error: ON — alert, investigate, fix
- fatal: ON — page immediately
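
A sketch of that policy with the standard logging module: default to INFO, and let a feature flag raise one service instance to DEBUG with an automatic revert. The flag check is a hypothetical placeholder for whatever flag system is in use.

# Sketch: INFO by default in production, with a temporary per-instance DEBUG
# override that reverts automatically after 30 minutes. The feature-flag check
# is a hypothetical placeholder.
import logging
import threading

logger = logging.getLogger("checkout-api")
logger.setLevel(logging.INFO)                 # production default

def enable_debug_temporarily(minutes: int = 30) -> None:
    logger.setLevel(logging.DEBUG)
    logger.warning("DEBUG logging enabled for %d minutes", minutes)
    # Revert automatically so a debugging session can't inflate volume forever
    timer = threading.Timer(minutes * 60, lambda: logger.setLevel(logging.INFO))
    timer.daemon = True
    timer.start()

# if feature_flags.is_enabled("checkout-api-debug-logs"):   # hypothetical flag API
#     enable_debug_temporarily(30)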
Retention and Cost Management
| Log Type | Retention | Reason |
|---|---|---|
| Error logs | 90 days | Investigation and trend analysis |
| Info logs | 30 days | Recent debugging, audit |
| Debug logs | 3 days | Only enabled temporarily |
| Audit logs | 1-7 years | Compliance requirements |
| Access logs | 90 days | Security investigations |
# Lifecycle policy: move old logs to cheaper storage
lifecycle_policy:
  hot:
    duration: 7 days
    storage: SSD (fast search)
  warm:
    duration: 23 days
    storage: HDD (slower, cheaper)
  cold:
    duration: 60 days
    storage: S3 / object storage (cheapest, slow retrieval)
  delete:
    after: 90 days (info) / 365 days (audit)
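
If logs land in CloudWatch, the retention table maps to a per-log-group retention policy. A sketch with boto3 follows; the log group names are assumptions, and the audit value would be longer where compliance requires it.

# Sketch: apply the retention table to CloudWatch log groups with boto3.
# Log group names are assumptions; adjust to your naming scheme.
import boto3

logs = boto3.client("logs")

RETENTION_DAYS = {
    "/prod/error-logs": 90,
    "/prod/info-logs": 30,
    "/prod/debug-logs": 3,
    "/prod/audit-logs": 365,     # or longer where compliance requires it
    "/prod/access-logs": 90,
}

for group, days in RETENTION_DAYS.items():
    logs.put_retention_policy(logGroupName=group, retentionInDays=days)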
Query Patterns
# Common investigation queries:
# 1. Find all errors for a specific trace
trace_id:"abc-def-123" AND level:error
# 2. Top errors in the last hour
level:error | stats count by message | sort count desc
# 3. Slow requests (> 2 seconds)
duration_ms:>2000 AND service:checkout-api
# 4. Specific user's activity
user_id:"usr_abc123" | sort timestamp asc
# 5. Error rate by service (last 24h)
level:error | stats count by service | eval error_rate = count/total_requests
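
The same investigations can be run programmatically. Below is a sketch of queries 1 and 3 expressed as Elasticsearch query DSL via the official Python client (8.x style); the index pattern logs-* is an assumption.

# Sketch: queries 1 and 3 above as Elasticsearch query DSL, using the 8.x
# Python client. The index pattern "logs-*" is an assumption.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1. All errors for a specific trace
errors_for_trace = es.search(index="logs-*", query={
    "bool": {"filter": [
        {"term": {"trace_id": "abc-def-123"}},
        {"term": {"level": "error"}},
    ]}
})

# 3. Slow checkout requests (> 2 seconds)
slow_requests = es.search(index="logs-*", query={
    "bool": {"filter": [
        {"range": {"duration_ms": {"gt": 2000}}},
        {"term": {"service": "checkout-api"}},
    ]}
})

for hit in errors_for_trace["hits"]["hits"]:
    print(hit["_source"]["message"])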
Security Considerations
| Practice | Why |
|---|---|
| Never log passwords, tokens, or API keys | Logs are often less secured than production systems |
| Mask PII (emails, SSNs, credit cards) | GDPR, PCI-DSS compliance |
| Encrypt logs in transit and at rest | Prevent unauthorized access to log data |
| Restrict log access by role | Not everyone needs to see all logs |
| Audit who queries logs | Detect unauthorized data access |
# PII masking before logging
import re
def mask_pii(log_message: str) -> str:
    # Mask email addresses
    log_message = re.sub(
        r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
        '***@***.***', log_message
    )
    # Mask credit card numbers
    log_message = re.sub(
        r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
        '****-****-****-****', log_message
    )
    return log_message
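
Rather than calling mask_pii at every call site, one option is to attach it as a logging.Filter so every record is masked before it reaches a handler. The sketch below reuses mask_pii from the snippet above.

# Sketch: run every log message through mask_pii() automatically by attaching
# it as a logging.Filter, so no call site can forget to mask.
import logging

class PiiMaskingFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = mask_pii(record.getMessage())
        record.args = ()   # message is already fully rendered; drop lazy args
        return True

logging.basicConfig(level=logging.INFO)
logging.getLogger().addFilter(PiiMaskingFilter())
logging.warning("Card 4111-1111-1111-1111 declined for john@example.com")
# logged as: Card ****-****-****-**** declined for ***@***.***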
Implementation Checklist