Chaos Engineering: Breaking Things on Purpose to Build Confidence
Implement chaos engineering practices that strengthen your systems without causing real outages. Covers experiment design, blast radius control, steady state hypothesis, game day facilitation, and the maturity model from ad-hoc chaos to continuous resilience verification.
Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions. It is not breaking things randomly. It is asking specific questions — “what happens when this database becomes slow?” — and running controlled experiments to find out.
The distinction matters because chaos without discipline is just bad operations. Chaos with discipline reveals the failures that are waiting to happen at 3 AM on Black Friday.
The Chaos Engineering Process
1. Define steady state
   "Our checkout flow processes 500 requests/min with p99 latency < 2s
   and error rate < 0.1%"
2. Form hypothesis
   "If we lose 30% of checkout-api pods, the system will
   auto-scale and maintain steady state within 2 minutes"
3. Design experiment
   - What: Kill 30% of checkout-api pods
   - Where: Staging environment first, then production
   - When: Business hours, low-traffic period
   - Duration: 5 minutes
   - Abort criteria: Error rate > 5% or latency > 10s
4. Run experiment
   Execute in a controlled manner with the team observing.
5. Analyze results
   Did steady state hold? If not, what broke and why?
6. Fix and re-test
   Address the weaknesses found, then re-run the experiment to verify the fix.
Steady State Hypothesis
The steady state is the normal operating behavior of your system. You must define this before running any experiment, or you will not know whether the system degraded.
| Service | Steady State Metric | Acceptable Deviation |
|---|---|---|
| API Gateway | p99 latency < 500ms | Up to 800ms during experiment |
| Checkout | Success rate > 99.9% | No lower than 99.5% |
| Search | Results returned in < 1s | Up to 3s during experiment |
| Payment processing | Transaction success > 99.95% | No deviation acceptable (abort) |
```yaml
# Experiment definition
experiment:
  name: "Pod failure resilience - checkout API"
  hypothesis: >-
    Killing 30% of checkout-api pods will not degrade
    checkout success rate below 99.5%
  steady_state:
    metrics:
      - name: "checkout_success_rate"
        query: "sum(rate(checkout_success_total[5m])) / sum(rate(checkout_total[5m]))"
        threshold: ">= 0.995"
      - name: "checkout_p99_latency"
        query: "histogram_quantile(0.99, rate(checkout_duration_seconds_bucket[5m]))"
        threshold: "<= 2.0"
  action:
    type: "pod-kill"
    target: "deployment/checkout-api"
    percentage: 30
    namespace: "production"
    duration: "5m"
  abort_conditions:
    - "checkout_success_rate < 0.95"
    - "checkout_p99_latency > 10.0"
  rollback:
    automatic: true
    on: "abort_condition_met"
```
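The definition above is tool-agnostic. If you run chaos with Chaos Mesh, the same pod-kill action could be expressed roughly as the manifest below; the resource name, namespaces, and label selectors are illustrative, and the field names should be checked against the Chaos Mesh version you have installed.

```yaml
# Sketch: Chaos Mesh PodChaos resource for the pod-kill action above.
# Namespace and labelSelectors are assumptions; adjust to your cluster.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: checkout-api-pod-kill
  namespace: chaos-testing
spec:
  action: pod-kill            # terminate the selected pods
  mode: fixed-percent         # act on a percentage of matching pods
  value: "30"                 # 30% of checkout-api pods
  selector:
    namespaces:
      - production
    labelSelectors:
      app: checkout-api
```

Note that pod-kill is a one-shot action in Chaos Mesh, so the 5-minute observation window and the abort checks still have to be driven by whatever orchestrates the experiment.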
Experiment Categories
| Category | What You Test | Example Experiments |
|---|---|---|
| Infrastructure | Node, pod, and container failures | Kill nodes, drain nodes, kill pods |
| Network | Network partitions, latency, packet loss | Add 500ms latency between services, block traffic to database |
| Application | Service degradation, dependency failure | Shut down a downstream API, corrupt cache |
| Resource | CPU/memory/disk pressure | Fill disk to 95%, spike CPU to 100% |
| State | Data corruption, clock skew | Advance system clock 24 hours, corrupt a config file |
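As a concrete example from the network category, adding 500 ms of latency between two services could look roughly like the Chaos Mesh NetworkChaos sketch below. The service labels and namespace are placeholders, and the schema should be verified against your Chaos Mesh release.

```yaml
# Sketch: inject 500ms of latency from checkout-api to payment-api (labels assumed).
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: checkout-to-payment-latency
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: checkout-api
  direction: to
  target:
    mode: all
    selector:
      namespaces:
        - staging
      labelSelectors:
        app: payment-api
  delay:
    latency: "500ms"
    jitter: "50ms"
  duration: "5m"
```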
Starting Experiments (Beginner → Advanced)
Level 1: Staging only
- Kill a single pod and verify auto-restart
- Block access to a non-critical dependency
- Simulate high CPU usage
Level 2: Production (controlled)
- Kill pods during low-traffic period
- Inject latency on internal service calls
- Disable a cache and observe database behavior
Level 3: Production (realistic)
- Simulate region failure (shift traffic off one region)
- Inject packet loss on primary database connection
- Kill a percentage of pods during peak traffic
Level 4: Automated & continuous
- Run experiments automatically on every deployment
- Continuous background chaos with automatic abort
- Full-scale game days simulating major failures
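At Level 4 the chaos itself is scheduled rather than run by hand. In Chaos Mesh this can be approximated with a Schedule resource that periodically kills a single pod; treat the example below as a sketch (the name, cadence, and selectors are assumptions) and pair it with the automated abort conditions described in the next section.

```yaml
# Sketch: continuously kill one random checkout-api pod every hour (Level 4 style chaos).
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: hourly-pod-kill-checkout
  namespace: chaos-testing
spec:
  schedule: "0 * * * *"       # once per hour
  concurrencyPolicy: Forbid   # never overlap experiments
  historyLimit: 5
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one                 # only a single pod at a time
    selector:
      namespaces:
        - production
      labelSelectors:
        app: checkout-api
```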
Blast Radius Control
The defining characteristic of chaos engineering vs “just breaking stuff” is that the blast radius is controlled and the abort conditions are defined before the experiment starts.
BEFORE running any chaos experiment:
✅ Experiment runs in staging first
✅ Team is aware and observing
✅ Abort conditions are automated
✅ Rollback is tested and immediate
✅ Blast radius is limited (one service, one region, subset of pods)
✅ Customer-facing impact is acceptable (< defined threshold)
✅ Supporting team (SRE/on-call) is available and aware
❌ NEVER:
❌ Run chaos in production without fallback
❌ Run chaos on data stores without verified backups
❌ Run chaos during peak traffic without executive approval
❌ Run chaos that affects financial transactions without safeguards
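One way to make "abort conditions are automated" concrete is to encode them as Prometheus alerting rules and have the chaos tooling (or the on-call runbook) stop the experiment when an alert fires. The rules below reuse the metrics from the experiment definition above; the group name and labels are placeholders.

```yaml
# Sketch: abort conditions expressed as Prometheus alerting rules.
groups:
  - name: chaos-abort-conditions
    rules:
      - alert: ChaosAbortCheckoutSuccessRate
        expr: |
          sum(rate(checkout_success_total[5m])) / sum(rate(checkout_total[5m])) < 0.95
        for: 1m
        labels:
          severity: critical
          chaos_action: abort     # consumed by whatever stops the experiment
        annotations:
          summary: "Checkout success rate below abort threshold during chaos experiment"
      - alert: ChaosAbortCheckoutLatency
        expr: |
          histogram_quantile(0.99, rate(checkout_duration_seconds_bucket[5m])) > 10
        for: 1m
        labels:
          severity: critical
          chaos_action: abort
        annotations:
          summary: "Checkout p99 latency above abort threshold during chaos experiment"
```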
Game Days: Structured Chaos
A game day is a scheduled event where a team deliberately introduces failures and practices responding. It is a fire drill for your systems and your people.
Game Day Agenda
30 min before:
- Brief all participants on scenario and roles
- Verify monitoring dashboards are visible
- Confirm abort and rollback procedures
- Announce game day to broader org (avoid false alarms)
During game day (2-4 hours):
- Facilitator introduces failures one at a time
- Team responds as if it is a real incident
- Scribe documents timeline, decisions, and observations
- Facilitator increases blast radius if team handles well
- Emergency abort if real customer impact exceeds threshold
After game day:
- Immediate debrief (30 min): what surprised us?
- Documented findings within 24 hours
- Action items with owners and due dates
- Share findings with broader engineering org
Game Day Scenario Examples
| Scenario | What It Tests | Expected Outcome |
|---|---|---|
| “Primary database is down” | Failover to replica, read-only mode | Auto-failover in < 30 seconds |
| “AWS us-east-1 is unavailable” | Multi-region recovery | Traffic shifts to us-west-2 in < 5 min |
| “Credentials have been leaked” | Secret rotation process | All secrets rotated in < 1 hour |
| “Payment processor is returning 503” | Graceful degradation | Users see “service temporarily unavailable”, not a crash |
| “Cache is completely empty” | Cold start / thundering herd | Database handles the load without falling over |
Tools
| Tool | Type | Best For |
|---|---|---|
| Chaos Monkey (Netflix) | Random instance termination | VM/instance-level resilience in Spinnaker-managed environments |
| Litmus | Kubernetes-native chaos | Kubernetes experiments with CRDs |
| Gremlin | Commercial platform | Enterprise teams wanting UI + support |
| Chaos Mesh | Kubernetes-native, CNCF | Comprehensive K8s chaos (network, IO, time) |
| Toxiproxy | Network chaos proxy | Simulating network conditions between services |
| tc (Linux) | Network manipulation | Low-level latency and packet loss injection |
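For comparison with the Chaos Mesh sketches above, a Litmus pod-delete experiment is declared through a ChaosEngine resource along the lines of the sketch below. The service account, labels, and environment values are assumptions, and Litmus also needs the corresponding ChaosExperiment CR and RBAC installed before this will run.

```yaml
# Sketch: Litmus ChaosEngine running the pod-delete experiment against checkout-api.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-api-chaos
  namespace: staging
spec:
  engineState: "active"
  appinfo:
    appns: "staging"
    applabel: "app=checkout-api"
    appkind: "deployment"
  chaosServiceAccount: pod-delete-sa   # assumes RBAC for pod-delete is set up
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"              # seconds of chaos
            - name: CHAOS_INTERVAL
              value: "10"              # seconds between pod deletions
            - name: FORCE
              value: "false"           # graceful deletion
```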
Implementation Checklist
- Start with steady state definition for your top 3 critical services
- Run your first experiment in staging: kill a single pod, verify auto-restart
- Document abort conditions for every experiment before running it
- Schedule your first game day (pick a low-risk scenario)
- Implement automated abort: if error rate spikes, chaos stops automatically
- Include chaos experiments in post-incident action items (test the fix)
- Progress through the maturity levels: staging → controlled production → realistic production → automated and continuous
- Train at least 3 engineers in game day facilitation
- Share game day findings organization-wide (build chaos culture)
- Integrate chaos tests into the deployment pipeline for critical services (see the pipeline sketch below)
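For that last item, one possible shape of a pipeline integration is a post-deploy job that applies a small experiment and then verifies steady state. The workflow below is a GitHub Actions sketch: the workflow name, file paths (chaos/staging-pod-kill.yaml, scripts/check_steady_state.sh), and cluster credential setup are hypothetical, not a drop-in configuration.

```yaml
# Sketch: run a small chaos experiment after each staging deploy (paths are hypothetical).
name: post-deploy-chaos
on:
  workflow_run:
    workflows: ["deploy-checkout-staging"]   # assumed name of the deploy workflow
    types: [completed]
jobs:
  chaos-smoke-test:
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Cluster credentials (kubeconfig) setup is omitted here.
      - name: Apply pod-kill experiment in staging
        run: kubectl apply -f chaos/staging-pod-kill.yaml
      - name: Wait for the experiment window
        run: sleep 300
      - name: Verify steady state held
        run: ./scripts/check_steady_state.sh   # hypothetical script querying Prometheus
      - name: Clean up experiment
        if: always()
        run: kubectl delete -f chaos/staging-pod-kill.yaml
```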