
Chaos Testing Playbook

A practical guide to running chaos experiments in production. Covers experiment design, blast radius control, steady-state hypothesis, automation with tools like LitmusChaos, and the patterns that build confidence in system resilience.

Chaos testing is the practice of deliberately injecting failures into production systems to verify they handle them gracefully. The goal is not to break things — it is to discover weaknesses before they become incidents. Controlled experiments in production are the most direct way to validate resilience, because staging rarely reproduces production load, configuration, or traffic patterns.


Experiment Design

Step 1: Define Steady-State Hypothesis
  "Under normal conditions, p99 latency < 200ms
   and error rate < 0.1%"

Step 2: Introduce a Fault
  "Terminate 1 of 3 API server instances"

Step 3: Observe System Response
  Does auto-scaling replace the instance?
  Does the load balancer drain connections?
  Does latency stay within SLO?

Step 4: Record Results
  If steady state maintained → resilience confirmed ✓
  If steady state violated → vulnerability discovered ✗

Step 5: Improve and Retest
  Fix the vulnerability, then re-run the experiment
  The experiment becomes a regression test
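
The five steps above can be sketched as a single experiment loop. This is a minimal sketch, not a real harness: the `measure`, `inject`, and `revert` callables are hypothetical hooks into monitoring and fault-injection tooling, and the thresholds come from the Step 1 hypothesis.

```python
# Sketch of the five-step chaos experiment loop. The measure/inject/revert
# callables are hypothetical stand-ins for real monitoring and fault tooling.
import time

def steady_state_ok(p99_ms: float, error_rate: float) -> bool:
    """Step 1: the steady-state hypothesis from the example above."""
    return p99_ms < 200 and error_rate < 0.001

def run_experiment(measure, inject, revert, duration_s: int = 60) -> dict:
    inject()  # Step 2: introduce the fault
    violations = 0
    try:
        # Step 3: observe the system response, once per second
        for _ in range(duration_s):
            p99_ms, error_rate = measure()
            if not steady_state_ok(p99_ms, error_rate):
                violations += 1
            time.sleep(1)
    finally:
        revert()  # always remove the fault, even if observation fails
    # Step 4: record results (Step 5 is fixing and re-running)
    return {"resilient": violations == 0, "violations": violations}
```

A violation here is a discovery, not a failure of the experiment: it feeds Step 5, where the fix is verified by re-running the same loop.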

Experiment Library

# Common chaos experiments organized by blast radius

network_experiments:
  - name: "Network Partition"
    inject: "Block traffic between service A and service B"
    hypothesis: "Service A uses cached data, returns degraded response"
    blast_radius: "Single service pair"
    
  - name: "DNS Failure"
    inject: "Return NXDOMAIN for internal service discovery"
    hypothesis: "Services use cached DNS records, retry with backoff"
    blast_radius: "Single availability zone"
    
  - name: "Latency Injection"
    inject: "Add 500ms latency to database calls"
    hypothesis: "Circuit breaker trips, fallback serves cached data"
    blast_radius: "Single dependency"

compute_experiments:
  - name: "Instance Termination"
    inject: "Kill 1 of N instances (Chaos Monkey style)"
    hypothesis: "Load balancer routes to healthy instances, auto-scaling replaces"
    blast_radius: "Single instance"
    
  - name: "CPU Stress"
    inject: "Consume 90% CPU on one instance"
    hypothesis: "Horizontal scaling activates, requests rerouted"
    blast_radius: "Single instance"

state_experiments:
  - name: "Database Failover"
    inject: "Promote read replica to primary"
    hypothesis: "Application reconnects within 30 seconds"
    blast_radius: "Data tier"
    
  - name: "Cache Flush"
    inject: "Clear Redis/Memcached cache"
    hypothesis: "Cold cache filled from database, latency spike < 5 seconds"
    blast_radius: "Caching layer"
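
A harness can validate each library entry and order experiments from smallest blast radius outward, matching the "start small, expand" principle. A minimal sketch follows; the field names come from the library above, but the numeric ranking of blast radii is an assumption, not part of the format.

```python
# Validate experiment entries and schedule them smallest blast radius first.
REQUIRED_KEYS = {"name", "inject", "hypothesis", "blast_radius"}

# Hypothetical ranking: lower number = smaller blast radius, run earlier.
RADIUS_RANK = {
    "Single instance": 0,
    "Single dependency": 1,
    "Single service pair": 2,
    "Caching layer": 3,
    "Single availability zone": 4,
    "Data tier": 5,
}

def validate(experiment: dict) -> dict:
    """Reject entries missing any of the four required fields."""
    missing = REQUIRED_KEYS - experiment.keys()
    if missing:
        raise ValueError(f"{experiment.get('name', '?')}: missing {sorted(missing)}")
    return experiment

def schedule(experiments: list[dict]) -> list[dict]:
    """Smallest blast radius first; unknown radii run last."""
    return sorted(
        (validate(e) for e in experiments),
        key=lambda e: RADIUS_RANK.get(e["blast_radius"], len(RADIUS_RANK)),
    )
```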

Blast Radius Control

import time


class ChaosExperiment:
    """Controlled chaos experiment with safety guardrails.

    Monitoring and fault-injection helpers (record_steady_state,
    inject_fault, revert_fault, collect_metrics, ...) are elided.
    """
    
    def run(self, experiment: dict):
        # Pre-checks
        if not self.safety_checks_pass():
            return {"status": "aborted", "reason": "Safety checks failed"}
        
        # Record baseline metrics
        baseline = self.record_steady_state()
        
        # Start experiment with kill switch
        with self.kill_switch() as kill:
            # Inject fault
            self.inject_fault(experiment["fault"])
            
            # Monitor for steady state violation
            for _ in range(experiment["duration_seconds"]):
                metrics = self.collect_metrics()
                
                if self.steady_state_violated(metrics, baseline):
                    if metrics["error_rate"] > experiment["abort_threshold"]:
                        kill.activate()  # Immediately revert
                        return {
                            "status": "aborted",
                            "reason": "Error rate exceeded abort threshold",
                            "vulnerability_discovered": True,
                        }
                
                time.sleep(1)
        
        # Revert fault and collect results
        self.revert_fault(experiment["fault"])
        return self.analyze_results(baseline)
    
    def safety_checks_pass(self):
        """Abort if system is already unhealthy."""
        return (
            self.current_error_rate() < 0.01
            and self.no_active_incidents()
            and self.within_business_hours()
            and self.change_freeze_not_active()
        )
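
The `kill_switch` context manager used above can be sketched with `contextlib`. This is an illustrative implementation, not the author's: it guarantees the revert runs exactly once, whether triggered manually via `kill.activate()` or by an exception escaping the block.

```python
# Minimal kill-switch sketch: revert exactly once, on manual activation
# or on any exception escaping the experiment block.
from contextlib import contextmanager

class KillSwitch:
    def __init__(self, revert):
        self._revert = revert
        self.activated = False

    def activate(self):
        """Immediately revert the fault; safe to call more than once."""
        if not self.activated:
            self.activated = True
            self._revert()

@contextmanager
def kill_switch(revert):
    kill = KillSwitch(revert)
    try:
        yield kill
    except Exception:
        kill.activate()  # revert before propagating the failure
        raise
```

Idempotent activation matters: the abort path and the normal revert path may both fire, and the fault must not be "un-reverted" or reverted twice.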

Anti-Patterns

Anti-Pattern                  | Consequence                                   | Fix
Chaos without hypothesis      | Random breaking with no learning              | Define what you expect BEFORE injecting
No kill switch                | Cannot stop experiment if things go wrong     | Auto-revert on threshold breach
Start with large blast radius | Major outage from first experiment            | Start with smallest blast radius, expand
Only test in staging          | Staging ≠ production (different load, config) | Controlled production experiments
One-time chaos                | Regressions re-introduced silently            | Automated chaos experiments in CI/CD
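
The last fix, automating chaos in CI/CD, can be as simple as writing the experiment as a test that runs on every pipeline. A pytest-style sketch, assuming a hypothetical `Cluster` stand-in for real infrastructure hooks:

```python
# Sketch: a chaos experiment as an automated regression test, so a fixed
# vulnerability cannot silently return. Cluster is a fake for illustration.

class Cluster:
    """Fake cluster with instant auto-scaling, for illustration only."""
    def __init__(self, instances: int):
        self.instances = instances

    def terminate_one(self):
        self.instances -= 1      # the injected fault
        self.instances += 1      # auto-scaling replaces the instance

def test_instance_termination_is_survivable():
    cluster = Cluster(instances=3)
    before = cluster.instances
    cluster.terminate_one()                # inject the fault
    assert cluster.instances >= before     # steady state: capacity restored
```

In a real pipeline the fake `Cluster` would be replaced by calls into actual fault-injection tooling against a production or production-like environment.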

Chaos testing is a scientific method: hypothesis, experiment, observation, conclusion. The goal is not to prove the system is broken — it is to build justified confidence that the system handles failure gracefully.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
