Chaos Engineering
Build confidence in system resilience by intentionally injecting failures. Covers chaos experiment design, blast radius control, game days, chaos in CI/CD, and building an organizational culture that embraces controlled failure.
Chaos engineering is not breaking things for fun. It is the disciplined practice of injecting controlled failures into production systems to discover weaknesses before they cause uncontrolled outages. If you do not know how your system behaves when a database connection fails, when a cloud region goes down, or when a dependency times out, you will find out during an incident — at 3 AM, under pressure.
Principles
1. Build a hypothesis about steady state
   Example: "Order throughput remains above 100 req/s during the experiment."
2. Vary real-world events
   Example: "Kill 1 of 3 database replicas" or "Add 200ms of latency to the payment API."
3. Run experiments in production
   (with appropriate blast radius controls)
4. Automate experiments to run continuously
5. Minimize blast radius
   Start small and expand as confidence grows.
Experiment Design
Template
```yaml
experiment:
  name: "Database failover resilience"
  hypothesis: >
    When the primary database fails over to a replica,
    the application recovers within 30 seconds with
    < 1% error rate increase.
  steady_state:
    - metric: error_rate
      condition: "< 0.1%"
    - metric: p99_latency
      condition: "< 500ms"
    - metric: throughput
      condition: "> 100 req/s"
  method:
    action: "Trigger database failover"
    tool: "AWS RDS failover API"
    duration: "30 seconds"
  rollback:
    automatic: true
    trigger: "error_rate > 5% for 60 seconds"
    action: "Abort experiment, fail back to original primary"
  blast_radius:
    scope: "order-service only"
    traffic_percentage: 100  # Affects all traffic to order-service
    environment: production
```
Common Experiments
Infrastructure Failures
Kill a container/pod
What happens when a service instance dies?
Expected: Kubernetes restarts it, traffic routes to healthy instances
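As a sketch, this experiment might look like the following Chaos Mesh manifest (Chaos Mesh is listed under Chaos Tools below); the `chaos-testing` namespace, `orders` namespace, and `app: order-service` label are hypothetical placeholders:

```yaml
# PodChaos: kill one randomly selected order-service pod.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: order-service-pod-kill
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one                  # affect a single randomly chosen pod
  selector:
    namespaces:
      - orders
    labelSelectors:
      app: order-service
```

Apply it with `kubectl apply -f pod-kill.yaml` and watch the deployment's ready-replica count and error rate while the pod dies and restarts.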
Kill a node
What happens when an entire server dies?
Expected: Pods reschedule to other nodes within 60 seconds
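Chaos Mesh cannot power off a node directly, but its cloud-provider types can stop the machine behind it. A hedged sketch using AWSChaos to stop the EC2 instance backing a node; the instance ID, region, and credentials secret are placeholders, and the field names should be checked against your Chaos Mesh version:

```yaml
# AWSChaos: stop the EC2 instance backing one node for 5 minutes.
apiVersion: chaos-mesh.org/v1alpha1
kind: AWSChaos
metadata:
  name: node-stop
  namespace: chaos-testing
spec:
  action: ec2-stop
  awsRegion: us-east-1
  ec2Instance: i-0123456789abcdef0  # instance behind the target node
  secretName: aws-credentials       # Kubernetes secret holding AWS keys
  duration: "5m"
```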
Network partition
What happens when services cannot reach the database?
Expected: Circuit breaker opens, graceful degradation
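A sketch of this partition with Chaos Mesh's NetworkChaos, assuming the database runs in-cluster behind a hypothetical `app: postgres` label (for an external database, look at the `externalTargets` field instead):

```yaml
# NetworkChaos: drop all traffic between order-service and the database.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: db-partition
  namespace: chaos-testing
spec:
  action: partition
  mode: all
  direction: both            # block traffic in both directions
  selector:
    labelSelectors:
      app: order-service
  target:
    mode: all
    selector:
      labelSelectors:
        app: postgres
  duration: "2m"             # the experiment auto-recovers after 2 minutes
```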
DNS failure
What happens when DNS resolution fails?
Expected: Cached DNS entries used, alert fires
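DNS failures can be simulated with Chaos Mesh's DNSChaos, which requires its optional DNS server component to be installed; the hostname pattern and labels below are hypothetical:

```yaml
# DNSChaos: make lookups for the payment hostname return errors.
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: payment-dns-error
  namespace: chaos-testing
spec:
  action: error              # return DNS errors (alternative: random IPs)
  mode: all
  patterns:
    - payments.example.com
  selector:
    labelSelectors:
      app: order-service
  duration: "1m"
```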
Application Failures
Memory pressure
What happens when memory usage reaches 90%?
Expected: OOM killed, Kubernetes restarts, no data loss
CPU saturation
What happens when CPU is maxed out?
Expected: Auto-scaling triggers, latency increases but no errors
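Both of the preceding failure modes can be induced with a single Chaos Mesh StressChaos resource; a sketch with hypothetical labels, where the memory size should be tuned relative to your pod's actual memory limit:

```yaml
# StressChaos: stress memory and CPU inside one order-service pod.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: order-service-stress
  namespace: chaos-testing
spec:
  mode: one
  selector:
    labelSelectors:
      app: order-service
  stressors:
    memory:
      workers: 1
      size: "256MB"          # pick a size that pushes the pod toward its limit
    cpu:
      workers: 2
      load: 100              # each worker runs at 100% load
  duration: "5m"
```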
Clock skew
What happens when server clock drifts by 5 minutes?
Expected: TLS errors are handled gracefully, time-based logic tolerates the drift, and a clock-skew alert fires
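Clock drift can be injected per-pod with Chaos Mesh's TimeChaos, which shifts the time the target process observes without touching the node clock; a sketch with hypothetical names:

```yaml
# TimeChaos: make one order-service pod see a clock 5 minutes in the past.
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: clock-skew
  namespace: chaos-testing
spec:
  mode: one
  selector:
    labelSelectors:
      app: order-service
  timeOffset: "-5m"          # shift observed time back by 5 minutes
  duration: "10m"
```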
Dependency Failures
Add latency to external API
What happens when the payment API takes 5 seconds instead of 200ms?
Expected: Timeout after 3 seconds, retry once, circuit breaker after 5 failures
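A sketch of this latency injection with NetworkChaos, assuming the payment API is outside the cluster and reachable at the hypothetical hostname below (verify the `externalTargets` field against your Chaos Mesh version):

```yaml
# NetworkChaos: add 5s of latency on traffic to the external payment API.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-latency
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    labelSelectors:
      app: order-service
  delay:
    latency: "5s"
    jitter: "100ms"
  externalTargets:
    - api.payments.example.com   # hypothetical external payment endpoint
  duration: "2m"
```

If the timeout and retry policy behave as expected, p99 latency should plateau near the 3-second timeout rather than the injected 5 seconds.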
Return errors from external API
What happens when the payment API returns 500 errors?
Expected: Retry 3 times, then fail order with clear error message
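HTTP-level faults can be injected with Chaos Mesh's HTTPChaos, which proxies the target pod's HTTP traffic. A sketch that rewrites responses to 500s, assuming an in-cluster payment gateway on port 8080; the names, port, and the exact `replace` fields should all be verified against your Chaos Mesh version:

```yaml
# HTTPChaos: rewrite payment-gateway responses to HTTP 500 for 2 minutes.
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: payment-500s
  namespace: chaos-testing
spec:
  mode: all
  selector:
    labelSelectors:
      app: payment-gateway
  target: Response           # act on responses, not requests
  port: 8080
  replace:
    code: 500                # overwrite the status code
  duration: "2m"
```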
Chaos Tools
| Tool | Type | Platform |
|---|---|---|
| Chaos Monkey | Instance termination | Cloud |
| Litmus | Kubernetes chaos | Kubernetes |
| Gremlin | Full chaos platform | Any |
| Chaos Mesh | Kubernetes chaos | Kubernetes |
| toxiproxy | Network fault injection | Any |
| tc (traffic control) | Network simulation | Linux |
Game Days
Scheduled, team-wide chaos exercises:
Game Day Agenda (2-4 hours):

Pre-game (30 min):
- Review the experiment plan
- Verify monitoring dashboards
- Confirm rollback procedures
- Brief all participants

Experiments (1-2 hours):
- Run 3-5 planned experiments
- Document observations in real time
- Practice incident response procedures

Debrief (30-60 min):
- What failed unexpectedly?
- What worked better than expected?
- What monitoring gaps did we discover?
- Action items for resilience improvements
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Chaos without observability | Cannot measure impact | Observability first, then chaos |
| No rollback plan | Experiment becomes an incident | Automated rollback on thresholds |
| Starting in production at full scale | Unnecessary blast radius | Start in staging, then small production scope |
| One-off experiments | Regression goes undetected | Continuous automated chaos |
| No organizational buy-in | Chaos seen as reckless | Start with game days, build culture |
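To make the "continuous automated chaos" fix concrete: Chaos Mesh can run an experiment on a cron via its Schedule resource. A sketch with a hypothetical schedule and target; running on weekday mornings keeps humans around to respond:

```yaml
# Schedule: kill one order-service pod every weekday morning.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: continuous-pod-kill
  namespace: chaos-testing
spec:
  schedule: "0 10 * * 1-5"   # 10:00 on weekdays, while engineers are online
  type: PodChaos
  concurrencyPolicy: Forbid  # never let runs overlap
  historyLimit: 5
  podChaos:
    action: pod-kill
    mode: one
    selector:
      labelSelectors:
        app: order-service
```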
Chaos engineering is about building confidence. Every experiment that passes validates a resilience claim; every experiment that fails reveals a weakness you can fix before it causes a real outage.