
Chaos Engineering

Build confidence in system resilience by intentionally injecting failures. Covers chaos experiment design, blast radius control, game days, chaos in CI/CD, and building an organizational culture that embraces controlled failure.

Chaos engineering is not breaking things for fun. It is the disciplined practice of injecting controlled failures into production systems to discover weaknesses before they cause uncontrolled outages. If you do not know how your system behaves when a database connection fails, when a cloud region goes down, or when a dependency times out, you will find out during an incident — at 3 AM, under pressure.


Principles

1. Build a hypothesis about steady state
   "Order throughput remains above 100 req/s during the experiment"

2. Vary real-world events
   "Kill 1 of 3 database replicas" or "Add 200ms latency to payment API"

3. Run experiments in production
   (With appropriate blast radius controls)

4. Automate experiments to run continuously

5. Minimize blast radius
   Start small, expand as confidence grows

Experiment Design

Template

experiment:
  name: "Database failover resilience"
  hypothesis: "When the primary database fails over to a replica, 
               the application recovers within 30 seconds with 
               < 1% error rate increase."
  
  steady_state:
    - metric: error_rate
      condition: "< 0.1%"
    - metric: p99_latency
      condition: "< 500ms"
    - metric: throughput
      condition: "> 100 req/s"
  
  method:
    action: "Trigger database failover"
    tool: "AWS RDS failover API"
    duration: "30 seconds"
  
  rollback:
    automatic: true
    trigger: "error_rate > 5% for 60 seconds"
    action: "Abort experiment, failback to original primary"
  
  blast_radius:
    scope: "order-service only"
    traffic_percentage: 100  # Affects all traffic to order-service
    environment: production
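
The rollback trigger above maps naturally onto existing alerting. Below is a minimal sketch of the abort condition as a Prometheus alerting rule; the http_requests_total metric, status label, and job name are illustrative placeholders, not part of the template:

groups:
  - name: chaos-abort
    rules:
      - alert: ChaosExperimentAbort
        # Fire when the 5xx error rate exceeds 5% for 60 seconds,
        # mirroring the rollback trigger in the template above.
        expr: |
          sum(rate(http_requests_total{job="order-service", status=~"5.."}[1m]))
            /
          sum(rate(http_requests_total{job="order-service"}[1m])) > 0.05
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Abort chaos experiment: error rate above 5% for 60 seconds"

An operator or webhook that receives this alert can stop the experiment and initiate failback, so the abort happens without a human in the loop.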

Common Experiments

Infrastructure Failures

Kill a container/pod
  What happens when a service instance dies?
  Expected: Kubernetes restarts it, traffic routes to healthy instances
  
Kill a node
  What happens when an entire server dies?
  Expected: Pods reschedule to other nodes within 60 seconds
  
Network partition
  What happens when services cannot reach the database?
  Expected: Circuit breaker opens, graceful degradation
  
DNS failure
  What happens when DNS resolution fails?
  Expected: Cached DNS entries used, alert fires
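
As a concrete illustration, the pod-kill experiment above can be declared as a Chaos Mesh manifest. This is a sketch, not a prescribed setup: the order-service label and the production and chaos-testing namespaces are placeholders for your own environment.

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: order-service-pod-kill
  namespace: chaos-testing      # placeholder namespace for chaos resources
spec:
  action: pod-kill
  mode: one                     # kill a single randomly chosen matching pod
  selector:
    namespaces:
      - production              # placeholder application namespace
    labelSelectors:
      app: order-service        # placeholder label

Verifying the expectation means watching the replacement pod come up (kubectl get pods -w) while confirming that error rate and latency stay within steady-state bounds.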

Application Failures

Memory pressure
  What happens when memory usage reaches 90%?
  Expected: OOM killed, Kubernetes restarts, no data loss
  
CPU saturation
  What happens when CPU is maxed out?
  Expected: Auto-scaling triggers, latency increases but no errors
  
Clock skew
  What happens when server clock drifts by 5 minutes?
  Expected: TLS errors handled, timeouts adjusted
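
Memory and CPU pressure can be injected on demand rather than waiting for organic load. A minimal Chaos Mesh StressChaos sketch, reusing the placeholder app: order-service label from the previous example:

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: order-service-resource-pressure
  namespace: chaos-testing
spec:
  mode: one
  selector:
    labelSelectors:
      app: order-service        # placeholder label
  stressors:
    memory:
      workers: 1
      size: "256MB"             # hold 256 MB of memory inside the target pod
    cpu:
      workers: 2
      load: 90                  # drive two workers at roughly 90% CPU
  duration: "5m"

The interesting observation is rarely the OOM kill or throttling itself, but what happens around it: whether Kubernetes restarts the pod cleanly, whether autoscaling reacts, and whether any in-flight work is lost.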

Dependency Failures

Add latency to external API
  What happens when payment API takes 5 seconds instead of 200ms?
  Expected: Timeout after 3 seconds, retry once, circuit breaker after 5 failures
  
Return errors from external API
  What happens when payment API returns 500 errors?
  Expected: Retry 3 times, then fail order with clear error message
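
Dependency latency can be injected the same way. A sketch using Chaos Mesh NetworkChaos to delay egress traffic from order-service to a payment API; payments.example.com stands in for the real dependency host:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-api-latency
  namespace: chaos-testing
spec:
  action: delay
  mode: all                     # affect every matching pod
  selector:
    labelSelectors:
      app: order-service        # placeholder label
  direction: to                 # only outbound traffic toward the target
  externalTargets:
    - payments.example.com      # stand-in for the real payment API host
  delay:
    latency: "5s"               # simulate the 5-second slow dependency
    jitter: "100ms"
  duration: "10m"

Outside Kubernetes, toxiproxy achieves the same effect by sitting between the application and the dependency as a TCP proxy with configurable latency and timeout toxics.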

Chaos Tools

Tool                  Type                     Platform
Chaos Monkey          Instance termination     Cloud
Litmus                Kubernetes chaos         Kubernetes
Gremlin               Full chaos platform      Any
Chaos Mesh            Kubernetes chaos         Kubernetes
toxiproxy             Network fault injection  Any
tc (traffic control)  Network simulation       Linux

Game Days

Scheduled, team-wide chaos exercises:

Game Day Agenda (2-4 hours):
  
  Pre-game (30 min):
    - Review experiment plan
    - Verify monitoring dashboards
    - Confirm rollback procedures
    - Brief all participants
  
  Experiments (1-2 hours):
    - Run 3-5 planned experiments
    - Document observations in real-time
    - Practice incident response procedures
  
  Debrief (30-60 min):
    - What failed unexpectedly?
    - What worked better than expected?
    - What monitoring gaps did we discover?
    - Action items for resilience improvements

Anti-Patterns

Anti-Pattern                          Consequence                     Fix
Chaos without observability           Cannot measure impact           Observability first, then chaos
No rollback plan                      Experiment becomes an incident  Automated rollback on thresholds
Starting in production at full scale  Unnecessary blast radius        Start in staging, then small production scope
One-off experiments                   Regression goes undetected      Continuous automated chaos
No organizational buy-in              Chaos seen as reckless          Start with game days, build culture
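
The fix for one-off experiments is scheduling. A sketch of continuous chaos using a Chaos Mesh Schedule resource that reruns the earlier pod-kill experiment every weekday morning; the cron expression and selector are assumptions to adapt:

apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: order-service-continuous-pod-kill
  namespace: chaos-testing
spec:
  schedule: "0 10 * * 1-5"      # 10:00 on weekdays, while teams are watching
  type: PodChaos
  historyLimit: 5               # keep the last five runs for review
  concurrencyPolicy: Forbid     # never let experiments overlap
  podChaos:
    action: pod-kill
    mode: one
    selector:
      labelSelectors:
        app: order-service      # placeholder label

Running the same small experiment on a schedule turns a one-time proof into a regression test for resilience.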

Chaos engineering is about building confidence. Every experiment that passes proves a resilience claim. Every experiment that fails reveals a weakness to fix before it causes a real outage.
