Chaos engineering proactively injects failures into production systems to discover weaknesses before they cause real outages. The logic is simple: if your system can’t handle a single server failure in a controlled experiment, it definitely can’t handle one at 3 AM during peak traffic.
## The Process
1. Define steady state
   └── "Order success rate > 99.5%, latency p99 < 500ms"
2. Hypothesize
   └── "System maintains steady state if one database replica fails"
3. Inject failure
   └── Kill one of three database replicas
4. Observe
   └── Monitor metrics: success rate, latency, error rate
5. Learn
   └── If steady state maintained: confidence increased
   └── If steady state broken: fix the weakness, re-test
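The loop above can be sketched in code. This is a minimal illustration, not a real harness: `get_metrics`, `inject_failure`, and `restore` are hypothetical callables you would wire to your own monitoring and fault-injection tooling, and the thresholds are the ones from the example steady state.

```python
def steady_state_ok(metrics):
    """Steady state from the example: success rate > 99.5%, p99 < 500 ms."""
    return metrics["success_rate"] > 0.995 and metrics["p99_ms"] < 500

def run_experiment(get_metrics, inject_failure, restore):
    """One experiment cycle: verify steady state, inject the failure,
    re-check steady state, and always restore afterwards.
    Returns True if the hypothesis held."""
    if not steady_state_ok(get_metrics()):
        raise RuntimeError("Not in steady state; abort before injecting")
    inject_failure()
    try:
        return steady_state_ok(get_metrics())
    finally:
        restore()
```

Note the guard at the top: if the system is already degraded, injecting a failure teaches you nothing and compounds the damage.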
## Experiment Types
| Category | Experiment | What You Learn |
|---|---|---|
| Infrastructure | Kill a server/pod | Failover works, auto-scaling responds |
| Network | Add 200ms latency between services | Timeouts configured, circuit breakers work |
| Dependencies | Block access to external API | Fallbacks/caches activate |
| Data | Corrupt or delay database responses | Application handles gracefully |
| Resource | Exhaust CPU/memory/disk on one node | Scheduling/eviction works correctly |
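As a toy version of the network experiment, the sketch below wraps a service call in an artificial 200 ms-style delay and checks it against a timeout budget. Real experiments would inject latency at the network layer (e.g. with a proxy); this only illustrates the principle that injected latency should trip your timeout handling.

```python
import time

def with_latency(call, delay_s):
    """Wrap a service call so every invocation is delayed by delay_s seconds."""
    def delayed(*args, **kwargs):
        time.sleep(delay_s)
        return call(*args, **kwargs)
    return delayed

def call_with_timeout(call, timeout_s):
    """Crude post-hoc budget check: raise if the call took longer than
    timeout_s. A production client would cancel the call instead."""
    start = time.monotonic()
    result = call()
    if time.monotonic() - start > timeout_s:
        raise TimeoutError("dependency exceeded timeout budget")
    return result
```

If the wrapped call blows the budget and nothing raises, you have learned that your timeouts are not actually configured — exactly the kind of weakness the table's network row is probing for.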
## Blast Radius Control
| Environment | Blast Radius | Example |
|---|---|---|
| Local/Dev | Single instance | Test failure handling in unit tests |
| Staging | Full environment | Simulate production failures safely |
| Production (canary) | Single pod/instance | Inject failure into 1 of N instances |
| Production (wide) | Availability zone | AZ failure, test multi-AZ resilience |
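A common way to cap the blast radius at "1% of instances" is deterministic bucketing: hash each instance ID into 100 buckets and only target the buckets below your percentage. This sketch assumes nothing beyond the standard library; the instance-ID format is illustrative.

```python
import hashlib

def in_blast_radius(instance_id, percent):
    """Deterministically place an instance inside an N% blast radius
    by hashing its ID into buckets 0-99. The same instance always
    lands in the same bucket, so reruns target the same subset."""
    bucket = int(hashlib.sha256(instance_id.encode()).hexdigest(), 16) % 100
    return bucket < percent
```

Determinism matters here: when an experiment surfaces a problem, you want to re-run it against the same instances, and you want to be able to say exactly which users were exposed.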
## Tools

| Tool | What It Does | Best For |
|---|---|---|
| Chaos Monkey | Kill instances randomly | EC2/cloud instances |
| Litmus | Kubernetes chaos experiments | K8s-native chaos |
| Gremlin | Enterprise chaos platform | Managed, compliance-ready |
| Chaos Mesh | K8s chaos (network, I/O, time) | Kubernetes-focused |
| AWS FIS | AWS service-level failures | AWS infrastructure |
| toxiproxy | Network-level chaos (proxy) | Network fault injection |
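The core of a Chaos Monkey-style tool is tiny: pick one random eligible instance and terminate it. The sketch below shows only the selection logic; `instances`, `protected`, and the eventual terminate call are hypothetical stand-ins for your inventory and cloud API.

```python
import random

def pick_victim(instances, protected, rng=random):
    """Chaos Monkey-style selection: choose one random instance,
    never touching the protected set (e.g. stateful primaries).
    Returns None when nothing is eligible."""
    eligible = [i for i in instances if i not in protected]
    return rng.choice(eligible) if eligible else None
```

The protected set is the point: even random-kill tools need an explicit allow/deny list so an experiment can never select, say, the last healthy replica.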
## Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Chaos without monitoring | Can’t observe the impact | Observability first, chaos second |
| No hypothesis | “Let’s see what happens” isn’t engineering | Define steady state + expected behavior |
| Starting in production | First experiment takes down prod | Start in staging, graduate to production |
| No blast radius limit | Experiment affects all users | Start with 1%, increase gradually |
| No fix-forward culture | Findings documented but never fixed | Track action items like bugs |
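The "chaos without monitoring" and "no blast radius limit" rows share one fix: an automatic abort guard that watches a metric during the experiment and rolls back the moment it crosses a threshold. A minimal sketch, where `sample_error_rate` is a hypothetical hook into your monitoring:

```python
def guarded_experiment(inject, restore, sample_error_rate,
                       abort_threshold, max_samples):
    """Run an experiment under a monitoring guard: poll the error
    rate and abort (restoring immediately) if it crosses the
    threshold. Returns "aborted" or "completed"."""
    inject()
    try:
        for _ in range(max_samples):
            if sample_error_rate() > abort_threshold:
                return "aborted"
        return "completed"
    finally:
        restore()
```

The `finally` block is deliberate: restoration must run whether the experiment completes, aborts, or crashes, so a failed experiment can't leave the fault injected.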
## Checklist

- [ ] Observability in place: success rate, latency, and error rate are all monitored
- [ ] Steady state defined and currently healthy
- [ ] Hypothesis written down before anything is injected
- [ ] Blast radius limited: start in staging, then a single canary instance, then wider
- [ ] Abort and rollback path ready before the experiment starts
- [ ] Findings tracked as action items and re-tested after fixes
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For chaos engineering consulting, visit garnetgrid.com.
:::
Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting
Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.