
Chaos Engineering in Practice

Build resilient systems through chaos engineering. Covers failure injection, blast radius control, GameDay exercises, steady-state hypothesis, and tools like Chaos Monkey and Litmus.

Chaos engineering proactively injects failures into production systems to discover weaknesses before they cause real outages. The logic is simple: if your system can’t handle a single server failure in a controlled experiment, it definitely can’t handle one at 3 AM during peak traffic.


The Process

1. Define steady state
   └── "Order success rate > 99.5%, latency p99 < 500ms"

2. Hypothesize
   └── "System maintains steady state if one database replica fails"

3. Inject failure
   └── Kill one of three database replicas

4. Observe
   └── Monitor metrics: success rate, latency, error rate

5. Learn
   └── If steady state maintained: confidence increased
       If steady state broken: fix the weakness, re-test
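
To make the loop concrete, here is a minimal sketch of steps 1, 4, and 5 in Python. It assumes a Prometheus server at a hypothetical internal URL; the metric names and thresholds are illustrative, not taken from any real dashboard.

```python
import requests

PROM_URL = "http://prometheus.internal:9090"  # hypothetical address

def query(promql: str) -> float:
    """Evaluate an instant PromQL query and return its scalar value."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    return float(resp.json()["data"]["result"][0]["value"][1])

# Step 1: steady state, encoded as (name, PromQL, predicate).
# Metric names here are invented for illustration.
STEADY_STATE = [
    ("order success rate > 99.5%",
     'sum(rate(orders_total{status="ok"}[5m])) / sum(rate(orders_total[5m]))',
     lambda v: v > 0.995),
    ("latency p99 < 500ms",
     'histogram_quantile(0.99, sum(rate(order_latency_seconds_bucket[5m])) by (le))',
     lambda v: v < 0.5),
]

def check_steady_state() -> bool:
    """Steps 4-5: observe each SLI and report whether the hypothesis held."""
    ok = True
    for name, promql, predicate in STEADY_STATE:
        value = query(promql)
        held = predicate(value)
        print(f"{name}: {value:.4f} -> {'OK' if held else 'BROKEN'}")
        ok = ok and held
    return ok
```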

Experiment Types

| Category | Experiment | What You Learn |
| --- | --- | --- |
| Infrastructure | Kill a server/pod | Failover works, auto-scaling responds |
| Network | Add 200ms latency between services | Timeouts configured, circuit breakers work |
| Dependencies | Block access to external API | Fallbacks/caches activate |
| Data | Corrupt or delay database responses | Application handles gracefully |
| Resource | Exhaust CPU/memory/disk on one node | Scheduling/eviction works correctly |
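
As one way to run the network experiment above, the sketch below adds 200ms of egress latency with Linux's netem qdisc via `tc`. It assumes a Linux host, root privileges, and an interface named eth0 (all assumptions, not part of this guide's setup).

```python
import subprocess

IFACE = "eth0"  # assumed interface name; check yours with `ip link`

def add_latency(ms: int = 200) -> None:
    """Add fixed egress latency on IFACE using the netem qdisc (needs root)."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem", "delay", f"{ms}ms"],
        check=True,
    )

def remove_latency() -> None:
    """Kill switch: remove the netem qdisc to restore normal traffic."""
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=True)

if __name__ == "__main__":
    add_latency(200)
    try:
        input("Latency active; press Enter to roll back...")
    finally:
        remove_latency()  # always restore, even if the experiment errors
```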

Blast Radius Control

| Environment | Blast Radius | Example |
| --- | --- | --- |
| Local/Dev | Single instance | Test failure handling in unit tests |
| Staging | Full environment | Simulate production failures safely |
| Production (canary) | Single pod/instance | Inject failure into 1 of N instances |
| Production (wide) | Availability zone | AZ failure, test multi-AZ resilience |
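
A sketch of the canary-scoped row, assuming a Kubernetes cluster with `kubectl` already configured; the namespace and label selector are hypothetical. It deletes exactly one pod, keeping the blast radius to a single instance out of N.

```python
import random
import subprocess

NAMESPACE = "orders"          # hypothetical namespace
LABEL = "app=order-service"   # hypothetical label selector

def kill_one_pod() -> str:
    """Canary blast radius: delete exactly one pod matching LABEL."""
    pods = subprocess.run(
        ["kubectl", "get", "pods", "-n", NAMESPACE, "-l", LABEL,
         "-o", "jsonpath={.items[*].metadata.name}"],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    victim = random.choice(pods)
    subprocess.run(["kubectl", "delete", "pod", victim, "-n", NAMESPACE], check=True)
    return victim
```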

Tools

| Tool | Scope | Best For |
| --- | --- | --- |
| Chaos Monkey | Kill instances randomly | EC2/cloud instances |
| Litmus | Kubernetes chaos experiments | K8s-native chaos |
| Gremlin | Enterprise chaos platform | Managed, compliance-ready |
| Chaos Mesh | K8s chaos (network, I/O, time) | Kubernetes-focused |
| AWS FIS | AWS service-level failures | AWS infrastructure |
| toxiproxy | Network-level chaos (proxy) | Network fault injection |
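
As an example of the last row, toxiproxy is driven through an HTTP control API (port 8474 by default). The sketch below assumes a toxiproxy server already running in front of a local Redis (the addresses are assumptions); it creates a proxy, adds a latency toxic, then removes the toxic as the rollback step.

```python
import requests

API = "http://localhost:8474"  # toxiproxy's default control port

# Route client traffic through 127.0.0.1:26379 instead of Redis directly.
requests.post(f"{API}/proxies", json={
    "name": "redis",
    "listen": "127.0.0.1:26379",
    "upstream": "127.0.0.1:6379",  # assumed Redis address
}).raise_for_status()

# Add 200ms (+/- 50ms jitter) to responses flowing back to the client.
requests.post(f"{API}/proxies/redis/toxics", json={
    "name": "slow_down",
    "type": "latency",
    "stream": "downstream",
    "attributes": {"latency": 200, "jitter": 50},
}).raise_for_status()

# Kill switch: remove the toxic to restore normal traffic.
requests.delete(f"{API}/proxies/redis/toxics/slow_down").raise_for_status()
```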

Anti-Patterns

| Anti-Pattern | Problem | Fix |
| --- | --- | --- |
| Chaos without monitoring | Can't observe the impact | Observability first, chaos second |
| No hypothesis | "Let's see what happens" isn't engineering | Define steady state + expected behavior |
| Starting in production | First experiment takes down prod | Start in staging, graduate to production |
| No blast radius limit | Experiment affects all users | Start with 1%, increase gradually |
| No fix-forward culture | Findings documented but never fixed | Track action items like bugs |
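
One way to code the blast-radius fix: ramp the fraction of affected instances upward, checking steady state at each step and aborting on the first breach. Here `inject` and `rollback` are hypothetical hooks around whatever fault you are injecting, and `check_steady_state` is the function sketched in the process section above.

```python
import time

RAMP = [0.01, 0.05, 0.25, 1.0]  # fraction of instances in scope

def run_experiment(inject, rollback, soak_seconds: int = 300) -> None:
    """Widen the blast radius step by step, aborting on any SLI breach."""
    for fraction in RAMP:
        inject(fraction)
        time.sleep(soak_seconds)  # let metrics settle before judging
        if not check_steady_state():
            rollback()
            raise RuntimeError(f"steady state broken at {fraction:.0%}; aborted")
        print(f"steady state held at {fraction:.0%}")
    rollback()
```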

Checklist

- Observability in place before starting chaos experiments
- Steady-state metrics defined (SLIs/SLOs)
- Experiments start in staging before production
- Blast radius controlled (single instance → AZ → region)
- Kill switch: ability to stop experiment immediately
- GameDay exercises scheduled quarterly
- Findings tracked as action items with owners
- Team trained on chaos engineering principles

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For chaos engineering consulting, visit garnetgrid.com.
:::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.