Chaos Engineering
Build confidence in system resilience by intentionally injecting failures. Covers chaos experiment design, blast radius control, game days, chaos in CI/CD, and building an organizational culture that embraces controlled failure.
Chaos engineering is not breaking things for fun. It is the disciplined practice of injecting controlled failures into production systems to discover weaknesses before they cause uncontrolled outages. If you do not know how your system behaves when a database connection fails, when a cloud region goes down, or when a dependency times out, you will find out during an incident — at 3 AM, under pressure.
Principles
1. Build a hypothesis about steady state
   Example: "Order throughput remains above 100 req/s during the experiment."
2. Vary real-world events
   Example: "Kill 1 of 3 database replicas" or "Add 200ms of latency to the payment API."
3. Run experiments in production
   (with appropriate blast radius controls)
4. Automate experiments to run continuously
5. Minimize blast radius
   Start small and expand as confidence grows.
Experiment Design
Template
```yaml
experiment:
  name: "Database failover resilience"
  hypothesis: >
    When the primary database fails over to a replica,
    the application recovers within 30 seconds with
    < 1% error rate increase.
  steady_state:
    - metric: error_rate
      condition: "< 0.1%"
    - metric: p99_latency
      condition: "< 500ms"
    - metric: throughput
      condition: "> 100 req/s"
  method:
    action: "Trigger database failover"
    tool: "AWS RDS failover API"
    duration: "30 seconds"
  rollback:
    automatic: true
    trigger: "error_rate > 5% for 60 seconds"
    action: "Abort experiment, fail back to original primary"
  blast_radius:
    scope: "order-service only"
    traffic_percentage: 100  # Affects all traffic to order-service
    environment: production
```
Common Experiments
Infrastructure Failures
Kill a container/pod
What happens when a service instance dies?
Expected: Kubernetes restarts it, traffic routes to healthy instances
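As a sketch, this experiment might look like the following Chaos Mesh manifest (Chaos Mesh is listed under Chaos Tools below); the `chaos-testing` namespace, `orders` namespace, and `app: order-service` label are hypothetical placeholders:

```yaml
# PodChaos: kill one randomly selected order-service pod.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: order-service-pod-kill
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one                  # affect a single randomly chosen pod
  selector:
    namespaces:
      - orders
    labelSelectors:
      app: order-service
```

Apply it with `kubectl apply -f pod-kill.yaml` and watch the deployment's ready-replica count and error rate while the pod dies and restarts.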
Kill a node
What happens when an entire server dies?
Expected: Pods reschedule to other nodes within 60 seconds
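Chaos Mesh cannot power off a node directly, but its cloud-provider types can stop the machine behind it. A hedged sketch using AWSChaos to stop the EC2 instance backing a node; the instance ID, region, and credentials secret are placeholders, and the field names should be checked against your Chaos Mesh version:

```yaml
# AWSChaos: stop the EC2 instance backing one node for 5 minutes.
apiVersion: chaos-mesh.org/v1alpha1
kind: AWSChaos
metadata:
  name: node-stop
  namespace: chaos-testing
spec:
  action: ec2-stop
  awsRegion: us-east-1
  ec2Instance: i-0123456789abcdef0  # instance behind the target node
  secretName: aws-credentials       # Kubernetes secret holding AWS keys
  duration: "5m"
```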
Network partition
What happens when services cannot reach the database?
Expected: Circuit breaker opens, graceful degradation
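A sketch of this partition with Chaos Mesh's NetworkChaos, assuming the database runs in-cluster behind a hypothetical `app: postgres` label (for an external database, look at the `externalTargets` field instead):

```yaml
# NetworkChaos: drop all traffic between order-service and the database.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: db-partition
  namespace: chaos-testing
spec:
  action: partition
  mode: all
  direction: both            # block traffic in both directions
  selector:
    labelSelectors:
      app: order-service
  target:
    mode: all
    selector:
      labelSelectors:
        app: postgres
  duration: "2m"             # the experiment auto-recovers after 2 minutes
```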
DNS failure
What happens when DNS resolution fails?
Expected: Cached DNS entries used, alert fires
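DNS failures can be simulated with Chaos Mesh's DNSChaos, which requires its optional DNS server component to be installed; the hostname pattern and labels below are hypothetical:

```yaml
# DNSChaos: make lookups for the payment hostname return errors.
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: payment-dns-error
  namespace: chaos-testing
spec:
  action: error              # return DNS errors (alternative: random IPs)
  mode: all
  patterns:
    - payments.example.com
  selector:
    labelSelectors:
      app: order-service
  duration: "1m"
```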
Application Failures
Memory pressure
What happens when memory usage reaches 90%?
Expected: OOM killed, Kubernetes restarts, no data loss
CPU saturation
What happens when CPU is maxed out?
Expected: Auto-scaling triggers, latency increases but no errors
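Both of the preceding failure modes can be induced with a single Chaos Mesh StressChaos resource; a sketch with hypothetical labels, where the memory size should be tuned relative to your pod's actual memory limit:

```yaml
# StressChaos: stress memory and CPU inside one order-service pod.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: order-service-stress
  namespace: chaos-testing
spec:
  mode: one
  selector:
    labelSelectors:
      app: order-service
  stressors:
    memory:
      workers: 1
      size: "256MB"          # pick a size that pushes the pod toward its limit
    cpu:
      workers: 2
      load: 100              # each worker runs at 100% load
  duration: "5m"
```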
Clock skew
What happens when server clock drifts by 5 minutes?
Expected: TLS errors are handled gracefully, time-based logic tolerates the drift, and a clock-skew alert fires
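Clock drift can be injected per-pod with Chaos Mesh's TimeChaos, which shifts the time the target process observes without touching the node clock; a sketch with hypothetical names:

```yaml
# TimeChaos: make one order-service pod see a clock 5 minutes in the past.
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: clock-skew
  namespace: chaos-testing
spec:
  mode: one
  selector:
    labelSelectors:
      app: order-service
  timeOffset: "-5m"          # shift observed time back by 5 minutes
  duration: "10m"
```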
Dependency Failures
Add latency to external API
What happens when the payment API takes 5 seconds instead of 200ms?
Expected: Timeout after 3 seconds, retry once, circuit breaker after 5 failures
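A sketch of this latency injection with NetworkChaos, assuming the payment API is outside the cluster and reachable at the hypothetical hostname below (verify the `externalTargets` field against your Chaos Mesh version):

```yaml
# NetworkChaos: add 5s of latency on traffic to the external payment API.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-latency
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    labelSelectors:
      app: order-service
  delay:
    latency: "5s"
    jitter: "100ms"
  externalTargets:
    - api.payments.example.com   # hypothetical external payment endpoint
  duration: "2m"
```

If the timeout and retry policy behave as expected, p99 latency should plateau near the 3-second timeout rather than the injected 5 seconds.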
Return errors from external API
What happens when the payment API returns 500 errors?
Expected: Retry 3 times, then fail order with clear error message
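HTTP-level faults can be injected with Chaos Mesh's HTTPChaos, which proxies the target pod's HTTP traffic. A sketch that rewrites responses to 500s, assuming an in-cluster payment gateway on port 8080; the names, port, and the exact `replace` fields should all be verified against your Chaos Mesh version:

```yaml
# HTTPChaos: rewrite payment-gateway responses to HTTP 500 for 2 minutes.
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: payment-500s
  namespace: chaos-testing
spec:
  mode: all
  selector:
    labelSelectors:
      app: payment-gateway
  target: Response           # act on responses, not requests
  port: 8080
  replace:
    code: 500                # overwrite the status code
  duration: "2m"
```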
Chaos Tools
| Tool | Type | Platform |
|---|---|---|
| Chaos Monkey | Instance termination | Cloud |
| Litmus | Kubernetes chaos | Kubernetes |
| Gremlin | Full chaos platform | Any |
| Chaos Mesh | Kubernetes chaos | Kubernetes |
| toxiproxy | Network fault injection | Any |
| tc (traffic control) | Network simulation | Linux |
Game Days
Scheduled, team-wide chaos exercises:
Game Day Agenda (2-4 hours):

Pre-game (30 min):
- Review the experiment plan
- Verify monitoring dashboards
- Confirm rollback procedures
- Brief all participants

Experiments (1-2 hours):
- Run 3-5 planned experiments
- Document observations in real time
- Practice incident response procedures

Debrief (30-60 min):
- What failed unexpectedly?
- What worked better than expected?
- What monitoring gaps did we discover?
- Action items for resilience improvements
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Chaos without observability | Cannot measure impact | Observability first, then chaos |
| No rollback plan | Experiment becomes an incident | Automated rollback on thresholds |
| Starting in production at full scale | Unnecessary blast radius | Start in staging, then small production scope |
| One-off experiments | Regression goes undetected | Continuous automated chaos |
| No organizational buy-in | Chaos seen as reckless | Start with game days, build culture |
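To make the "continuous automated chaos" fix concrete: Chaos Mesh can run an experiment on a cron via its Schedule resource. A sketch with a hypothetical schedule and target; running on weekday mornings keeps humans around to respond:

```yaml
# Schedule: kill one order-service pod every weekday morning.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: continuous-pod-kill
  namespace: chaos-testing
spec:
  schedule: "0 10 * * 1-5"   # 10:00 on weekdays, while engineers are online
  type: PodChaos
  concurrencyPolicy: Forbid  # never let runs overlap
  historyLimit: 5
  podChaos:
    action: pod-kill
    mode: one
    selector:
      labelSelectors:
        app: order-service
```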
Chaos engineering is about building confidence. Every experiment that passes validates a resilience claim; every experiment that fails reveals a weakness you can fix before it causes a real outage.