# Chaos Testing Playbook

A practical guide to running chaos experiments in production. Covers experiment design, blast radius control, steady-state hypotheses, automation with tools like LitmusChaos, and the patterns that build confidence in system resilience.

Chaos testing is the practice of deliberately injecting failures into production systems to verify that they handle them gracefully. The goal is not to break things; it is to discover weaknesses before they become incidents. Running controlled experiments in production is the only way to truly validate resilience.
## Experiment Design

**Step 1: Define a steady-state hypothesis.** State what "normal" looks like in measurable terms:

> "Under normal conditions, p99 latency < 200ms and error rate < 0.1%."

**Step 2: Introduce a fault.**

> "Terminate 1 of 3 API server instances."

**Step 3: Observe the system's response.**

- Does auto-scaling replace the instance?
- Does the load balancer drain connections?
- Does latency stay within SLO?

**Step 4: Record the results.**

- Steady state maintained → resilience confirmed ✓
- Steady state violated → vulnerability discovered ✗

**Step 5: Improve and retest.** Fix the vulnerability, then re-run the experiment. The experiment becomes a regression test.
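The hypothesis from Step 1 can be encoded as an executable predicate that the monitoring loop evaluates during the experiment. A minimal Python sketch, with illustrative metric names and the thresholds from the example above:

```python
# Sketch: encode the steady-state hypothesis as an executable predicate.
# Metric names and thresholds are illustrative, not from a specific system.
STEADY_STATE = {
    "p99_latency_ms": lambda v: v < 200,   # p99 latency < 200ms
    "error_rate": lambda v: v < 0.001,     # error rate < 0.1%
}

def steady_state_holds(metrics: dict) -> bool:
    """Return True if every observed metric satisfies its hypothesis."""
    return all(check(metrics[name]) for name, check in STEADY_STATE.items())

print(steady_state_holds({"p99_latency_ms": 180, "error_rate": 0.0005}))  # True
print(steady_state_holds({"p99_latency_ms": 450, "error_rate": 0.0005}))  # False
```

Keeping the hypothesis as data rather than scattered `if` statements makes it easy to reuse the same check in Step 1 (baseline), Step 3 (during the fault), and the automated regression run in Step 5.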
## Experiment Library

```yaml
# Common chaos experiments, organized by blast radius
network_experiments:
  - name: "Network Partition"
    inject: "Block traffic between service A and service B"
    hypothesis: "Service A uses cached data, returns degraded response"
    blast_radius: "Single service pair"
  - name: "DNS Failure"
    inject: "Return NXDOMAIN for internal service discovery"
    hypothesis: "Services use cached DNS records, retry with backoff"
    blast_radius: "Single availability zone"
  - name: "Latency Injection"
    inject: "Add 500ms latency to database calls"
    hypothesis: "Circuit breaker trips, fallback serves cached data"
    blast_radius: "Single dependency"

compute_experiments:
  - name: "Instance Termination"
    inject: "Kill 1 of N instances (Chaos Monkey style)"
    hypothesis: "Load balancer routes to healthy instances, auto-scaling replaces"
    blast_radius: "Single instance"
  - name: "CPU Stress"
    inject: "Consume 90% CPU on one instance"
    hypothesis: "Horizontal scaling activates, requests rerouted"
    blast_radius: "Single instance"

state_experiments:
  - name: "Database Failover"
    inject: "Promote read replica to primary"
    hypothesis: "Application reconnects within 30 seconds"
    blast_radius: "Data tier"
  - name: "Cache Flush"
    inject: "Clear Redis/Memcached cache"
    hypothesis: "Cold cache filled from database, latency spike < 5 seconds"
    blast_radius: "Caching layer"
```
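Experiments like these can be automated with LitmusChaos, mentioned in the introduction. A representative `ChaosEngine` manifest for the "Instance Termination" experiment is sketched below; the application labels, namespace, and service account are hypothetical, and exact field names vary by LitmusChaos version, so consult the docs for your release:

```yaml
# Hypothetical LitmusChaos manifest: pod-delete against an "api-server" deployment.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-pod-delete        # hypothetical experiment name
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: "app=api-server" # hypothetical target label
    appkind: deployment
  engineState: active
  chaosServiceAccount: pod-delete-sa  # hypothetical service account
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION  # seconds of chaos
              value: "60"
            - name: CHAOS_INTERVAL        # seconds between pod kills
              value: "10"
            - name: FORCE                 # graceful deletion
              value: "false"
```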
## Blast Radius Control

```python
import time

class ChaosExperiment:
    """Controlled chaos experiment with safety guardrails."""

    def run(self, experiment: dict):
        # Pre-checks: never start chaos on an already-unhealthy system
        if not self.safety_checks_pass():
            return {"status": "aborted", "reason": "Safety checks failed"}

        # Record baseline metrics for the steady-state comparison
        baseline = self.record_steady_state()

        # Start the experiment with a kill switch armed
        with self.kill_switch() as kill:
            # Inject the fault
            self.inject_fault(experiment["fault"])

            # Monitor for steady-state violations, sampling once per second
            for _ in range(experiment["duration_seconds"]):
                metrics = self.collect_metrics()
                if self.steady_state_violated(metrics, baseline):
                    if metrics["error_rate"] > experiment["abort_threshold"]:
                        kill.activate()  # Immediately revert the fault
                        return {
                            "status": "aborted",
                            "reason": "Error rate exceeded abort threshold",
                            "vulnerability_discovered": True,
                        }
                time.sleep(1)

            # Revert the fault and collect results
            self.revert_fault(experiment["fault"])
        return self.analyze_results(baseline)

    def safety_checks_pass(self):
        """Abort if the system is already unhealthy or change-frozen."""
        return (
            self.current_error_rate() < 0.01
            and self.no_active_incidents()
            and self.within_business_hours()
            and self.change_freeze_not_active()
        )
```
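The `kill_switch()` helper used above is not defined in the snippet. One possible shape, sketched here as an assumption, is a context manager whose revert action is idempotent and is guaranteed to run even if the experiment code crashes:

```python
import contextlib

class KillSwitch:
    """Sketch of a kill switch: fires at most once, wraps a revert callable."""

    def __init__(self, revert):
        self._revert = revert   # callable that undoes the injected fault
        self._fired = False

    def activate(self):
        # Idempotent: safe to call from the monitor loop AND from cleanup
        if not self._fired:
            self._fired = True
            self._revert()

@contextlib.contextmanager
def kill_switch(revert):
    """Arm a kill switch; guarantee the fault is reverted on any exit path."""
    switch = KillSwitch(revert)
    try:
        yield switch
    finally:
        switch.activate()  # runs on normal exit, early return, or exception
```

Because activation is idempotent, an explicit `revert_fault` at the end of the experiment and the `finally` cleanup do not conflict; whichever runs first wins.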
## Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Chaos without hypothesis | Random breaking with no learning | Define what you expect BEFORE injecting |
| No kill switch | Cannot stop experiment if things go wrong | Auto-revert on threshold breach |
| Start with large blast radius | Major outage from first experiment | Start with smallest blast radius, expand |
| Only test in staging | Staging ≠ production (different load, config) | Controlled production experiments |
| One-time chaos | Regressions re-introduced silently | Automated chaos experiments in CI/CD |
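The "one-time chaos" fix in the last row can be as simple as wrapping the experiment library in a scheduled suite. A hypothetical harness, sketched below, assumes a runner object with the `run(experiment)` interface of the `ChaosExperiment` class above; the experiment dicts are illustrative:

```python
# Hypothetical regression harness: re-run library experiments on a schedule
# (e.g. from CI/CD) so that fixed vulnerabilities stay fixed.
EXPERIMENTS = [
    {"name": "Instance Termination", "fault": "kill-instance",
     "duration_seconds": 60, "abort_threshold": 0.05},
    {"name": "Latency Injection", "fault": "db-latency-500ms",
     "duration_seconds": 120, "abort_threshold": 0.05},
]

def run_chaos_suite(runner) -> dict:
    """Run every experiment; report pass/fail per experiment like a test suite."""
    results = {}
    for exp in EXPERIMENTS:
        outcome = runner.run(exp)
        # An experiment "passes" when no vulnerability was discovered
        results[exp["name"]] = not outcome.get("vulnerability_discovered", False)
    return results
```

Treating experiment outcomes as pass/fail lets a CI job fail the build when a previously fixed weakness reappears.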
Chaos testing is a scientific method: hypothesis, experiment, observation, conclusion. The goal is not to prove the system is broken; it is to build justified confidence that the system handles failure gracefully.