# Chaos Testing Playbook

A practical guide to running chaos experiments in production. Covers experiment design, blast radius control, steady-state hypotheses, automation with tools like LitmusChaos, and the patterns that build confidence in system resilience.

Chaos testing is the practice of deliberately injecting failures into production systems to verify that they handle them gracefully. The goal is not to break things; it is to discover weaknesses before they become incidents. Running controlled experiments in production is the only way to truly validate resilience.
## Experiment Design

**Step 1: Define a steady-state hypothesis.** State what "normal" looks like in measurable terms:

> "Under normal conditions, p99 latency < 200ms and error rate < 0.1%."

**Step 2: Introduce a fault.**

> "Terminate 1 of 3 API server instances."

**Step 3: Observe the system's response.**

- Does auto-scaling replace the instance?
- Does the load balancer drain connections?
- Does latency stay within SLO?

**Step 4: Record the results.**

- Steady state maintained → resilience confirmed ✓
- Steady state violated → vulnerability discovered ✗

**Step 5: Improve and retest.** Fix the vulnerability, then re-run the experiment. The experiment becomes a regression test.
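The hypothesis from Step 1 can be encoded as an executable predicate that the monitoring loop evaluates during the experiment. A minimal Python sketch, with illustrative metric names and the thresholds from the example above:

```python
# Sketch: encode the steady-state hypothesis as an executable predicate.
# Metric names and thresholds are illustrative, not from a specific system.
STEADY_STATE = {
    "p99_latency_ms": lambda v: v < 200,   # p99 latency < 200ms
    "error_rate": lambda v: v < 0.001,     # error rate < 0.1%
}

def steady_state_holds(metrics: dict) -> bool:
    """Return True if every observed metric satisfies its hypothesis."""
    return all(check(metrics[name]) for name, check in STEADY_STATE.items())

print(steady_state_holds({"p99_latency_ms": 180, "error_rate": 0.0005}))  # True
print(steady_state_holds({"p99_latency_ms": 450, "error_rate": 0.0005}))  # False
```

Keeping the hypothesis as data rather than scattered `if` statements makes it easy to reuse the same check in Step 1 (baseline), Step 3 (during the fault), and the automated regression run in Step 5.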
## Experiment Library

```yaml
# Common chaos experiments, organized by blast radius
network_experiments:
  - name: "Network Partition"
    inject: "Block traffic between service A and service B"
    hypothesis: "Service A uses cached data, returns degraded response"
    blast_radius: "Single service pair"
  - name: "DNS Failure"
    inject: "Return NXDOMAIN for internal service discovery"
    hypothesis: "Services use cached DNS records, retry with backoff"
    blast_radius: "Single availability zone"
  - name: "Latency Injection"
    inject: "Add 500ms latency to database calls"
    hypothesis: "Circuit breaker trips, fallback serves cached data"
    blast_radius: "Single dependency"

compute_experiments:
  - name: "Instance Termination"
    inject: "Kill 1 of N instances (Chaos Monkey style)"
    hypothesis: "Load balancer routes to healthy instances, auto-scaling replaces"
    blast_radius: "Single instance"
  - name: "CPU Stress"
    inject: "Consume 90% CPU on one instance"
    hypothesis: "Horizontal scaling activates, requests rerouted"
    blast_radius: "Single instance"

state_experiments:
  - name: "Database Failover"
    inject: "Promote read replica to primary"
    hypothesis: "Application reconnects within 30 seconds"
    blast_radius: "Data tier"
  - name: "Cache Flush"
    inject: "Clear Redis/Memcached cache"
    hypothesis: "Cold cache filled from database, latency spike < 5 seconds"
    blast_radius: "Caching layer"
```
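Experiments like these can be automated with LitmusChaos, mentioned in the introduction. A representative `ChaosEngine` manifest for the "Instance Termination" experiment is sketched below; the application labels, namespace, and service account are hypothetical, and exact field names vary by LitmusChaos version, so consult the docs for your release:

```yaml
# Hypothetical LitmusChaos manifest: pod-delete against an "api-server" deployment.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-pod-delete        # hypothetical experiment name
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: "app=api-server" # hypothetical target label
    appkind: deployment
  engineState: active
  chaosServiceAccount: pod-delete-sa  # hypothetical service account
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION  # seconds of chaos
              value: "60"
            - name: CHAOS_INTERVAL        # seconds between pod kills
              value: "10"
            - name: FORCE                 # graceful deletion
              value: "false"
```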
## Blast Radius Control

```python
import time

class ChaosExperiment:
    """Controlled chaos experiment with safety guardrails."""

    def run(self, experiment: dict):
        # Pre-checks: never start chaos on an already-unhealthy system
        if not self.safety_checks_pass():
            return {"status": "aborted", "reason": "Safety checks failed"}

        # Record baseline metrics for the steady-state comparison
        baseline = self.record_steady_state()

        # Start the experiment with a kill switch armed
        with self.kill_switch() as kill:
            # Inject the fault
            self.inject_fault(experiment["fault"])

            # Monitor for steady-state violations, sampling once per second
            for _ in range(experiment["duration_seconds"]):
                metrics = self.collect_metrics()
                if self.steady_state_violated(metrics, baseline):
                    if metrics["error_rate"] > experiment["abort_threshold"]:
                        kill.activate()  # Immediately revert the fault
                        return {
                            "status": "aborted",
                            "reason": "Error rate exceeded abort threshold",
                            "vulnerability_discovered": True,
                        }
                time.sleep(1)

            # Revert the fault and collect results
            self.revert_fault(experiment["fault"])
        return self.analyze_results(baseline)

    def safety_checks_pass(self):
        """Abort if the system is already unhealthy or change-frozen."""
        return (
            self.current_error_rate() < 0.01
            and self.no_active_incidents()
            and self.within_business_hours()
            and self.change_freeze_not_active()
        )
```
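The `kill_switch()` helper used above is not defined in the snippet. One possible shape, sketched here as an assumption, is a context manager whose revert action is idempotent and is guaranteed to run even if the experiment code crashes:

```python
import contextlib

class KillSwitch:
    """Sketch of a kill switch: fires at most once, wraps a revert callable."""

    def __init__(self, revert):
        self._revert = revert   # callable that undoes the injected fault
        self._fired = False

    def activate(self):
        # Idempotent: safe to call from the monitor loop AND from cleanup
        if not self._fired:
            self._fired = True
            self._revert()

@contextlib.contextmanager
def kill_switch(revert):
    """Arm a kill switch; guarantee the fault is reverted on any exit path."""
    switch = KillSwitch(revert)
    try:
        yield switch
    finally:
        switch.activate()  # runs on normal exit, early return, or exception
```

Because activation is idempotent, an explicit `revert_fault` at the end of the experiment and the `finally` cleanup do not conflict; whichever runs first wins.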
## Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Chaos without hypothesis | Random breaking with no learning | Define what you expect BEFORE injecting |
| No kill switch | Cannot stop experiment if things go wrong | Auto-revert on threshold breach |
| Start with large blast radius | Major outage from first experiment | Start with smallest blast radius, expand |
| Only test in staging | Staging ≠ production (different load, config) | Controlled production experiments |
| One-time chaos | Regressions re-introduced silently | Automated chaos experiments in CI/CD |
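The "one-time chaos" fix in the last row can be as simple as wrapping the experiment library in a scheduled suite. A hypothetical harness, sketched below, assumes a runner object with the `run(experiment)` interface of the `ChaosExperiment` class above; the experiment dicts are illustrative:

```python
# Hypothetical regression harness: re-run library experiments on a schedule
# (e.g. from CI/CD) so that fixed vulnerabilities stay fixed.
EXPERIMENTS = [
    {"name": "Instance Termination", "fault": "kill-instance",
     "duration_seconds": 60, "abort_threshold": 0.05},
    {"name": "Latency Injection", "fault": "db-latency-500ms",
     "duration_seconds": 120, "abort_threshold": 0.05},
]

def run_chaos_suite(runner) -> dict:
    """Run every experiment; report pass/fail per experiment like a test suite."""
    results = {}
    for exp in EXPERIMENTS:
        outcome = runner.run(exp)
        # An experiment "passes" when no vulnerability was discovered
        results[exp["name"]] = not outcome.get("vulnerability_discovered", False)
    return results
```

Treating experiment outcomes as pass/fail lets a CI job fail the build when a previously fixed weakness reappears.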
Chaos testing is a scientific method: hypothesis, experiment, observation, conclusion. The goal is not to prove the system is broken; it is to build justified confidence that the system handles failure gracefully.