Chaos Engineering Framework
Build confidence in system resilience through controlled failure experiments. Covers chaos experiment design, blast radius control, steady state hypothesis, game days, chaos in production, and the patterns that turn unknown failure modes into known, handled scenarios.
Chaos engineering is not about breaking things. It is about discovering how things are already broken — latent failures, untested assumptions, missing fallbacks — before they cause outages. Every production system has failure modes its operators have never seen. Chaos engineering makes the unknown known.
Principles of Chaos
1. Define steady state:
What does "normal" look like?
Metrics: Error rate < 0.1%, latency P99 < 500ms, throughput > 1000 rps
2. Hypothesize:
"If we kill service X, the system will gracefully degrade
and users will not see errors."
3. Introduce a real-world failure:
Kill a service, inject latency, corrupt data, exhaust CPU
4. Observe:
Did the system behave as hypothesized?
Were alerts triggered? Did fallbacks activate? Did users notice?
5. Learn:
If hypothesis confirmed: Good — document and move on
If hypothesis disproven: Found a bug! Fix it and re-test
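Steps 1, 2, and 4 can be made concrete in code. A minimal sketch using the example thresholds from step 1; the class and field names are illustrative, not prescriptive:

```python
from dataclasses import dataclass

@dataclass
class SteadyState:
    error_rate: float        # % of failed requests
    p99_latency_ms: float
    throughput_rps: float

def within_tolerance(observed: SteadyState) -> bool:
    """Check observed metrics against the steady-state thresholds from step 1."""
    return (
        observed.error_rate < 0.1
        and observed.p99_latency_ms < 500
        and observed.throughput_rps > 1000
    )

# Capture before, during, and after the experiment; refuse to start outside steady state
baseline = SteadyState(error_rate=0.05, p99_latency_ms=200, throughput_rps=1500)
assert within_tolerance(baseline), "System is not in steady state; do not inject failure"
```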
Experiment Design
class ChaosExperiment:
    def __init__(self, name, target, metrics, health, blast_radius="low"):
        self.name = name
        self.target = target              # service or component to disrupt
        self.metrics = metrics            # metrics client used by define_steady_state()
        self.health = health              # health-check client used by define_steady_state()
        self.blast_radius = blast_radius
        self.rollback_plan = None

    def define_steady_state(self):
        """Capture metrics that define 'normal'."""
        return {
            "error_rate": self.metrics.get_error_rate(),      # e.g., 0.05%
            "p99_latency": self.metrics.get_p99_latency(),     # e.g., 200ms
            "throughput": self.metrics.get_throughput(),       # e.g., 1500 rps
            "health_checks": self.health.get_all_status(),     # e.g., all green
        }

    def hypothesis(self):
        return (
            f"When {self.target} fails, the system will continue serving "
            f"requests with error rate < 1% and latency P99 < 2000ms. "
            f"Fallback behavior should activate within 30 seconds."
        )

    def run(self):
        """Execute the chaos experiment."""
        # 1. Capture steady state
        before = self.define_steady_state()

        # 2. Inject failure (inject_failure/observe/rollback/evaluate are
        #    implemented per experiment type)
        failure = self.inject_failure()

        # 3. Observe for duration
        observations = self.observe(duration_minutes=5)

        # 4. Check abort conditions (auto-rollback if exceeded)
        for obs in observations:
            if obs.error_rate > 5.0:  # Hard stop at 5% errors
                self.rollback(failure)
                return ExperimentResult(
                    status="ABORTED",
                    reason="Error rate exceeded 5%",
                )

        # 5. Rollback
        self.rollback(failure)

        # 6. Capture steady state after
        after = self.define_steady_state()

        # 7. Compare and report
        return ExperimentResult(
            hypothesis_confirmed=self.evaluate(before, observations, after),
            steady_state_before=before,
            steady_state_after=after,
            observations=observations,
        )
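A sketch of how such an experiment might be driven. `PrometheusMetrics` and `HealthChecker` are hypothetical clients; any objects exposing the methods used in `define_steady_state()` would do:

```python
# Hypothetical metrics and health-check clients (names are illustrative)
metrics = PrometheusMetrics(base_url="http://prometheus:9090")
health = HealthChecker(services=["checkout", "payments", "inventory"])

experiment = ChaosExperiment(
    name="payments-pod-kill",
    target="payments",
    metrics=metrics,
    health=health,
    blast_radius="low",   # e.g., one pod, one AZ, or a small slice of traffic
)

result = experiment.run()
if not result.hypothesis_confirmed:
    print("Hypothesis disproven: record the finding, fix the gap, re-run the experiment")
```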
Experiment Types
Instance failure:
Tool: AWS FIS, Gremlin, LitmusChaos
Test: Kill a container, VM, or pod
Validates: Auto-scaling, self-healing, load balancing
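A minimal pod-kill injector, assuming Kubernetes and `kubectl` access; the namespace and label selector are placeholders:

```python
import random
import subprocess

def kill_random_pod(namespace: str, selector: str) -> str:
    """Delete one randomly chosen pod matching the selector."""
    pods = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-l", selector, "-o", "name"],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    victim = random.choice(pods)  # e.g., "pod/payments-6f9c7d9b4-x2k8j"
    subprocess.run(["kubectl", "delete", "-n", namespace, victim], check=True)
    return victim

# kill_random_pod("prod", "app=payments")  # expect the ReplicaSet to replace it within seconds
```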
Network failure:
Tool: tc (traffic control), toxiproxy
Test: Add latency, packet loss, partition
Validates: Timeouts, retries, circuit breakers
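A sketch of latency and packet-loss injection with `tc netem`, run on the target host; the interface name and numbers are placeholders:

```python
import subprocess

IFACE = "eth0"  # placeholder: the interface carrying the traffic under test

def inject_network_chaos(delay_ms=200, jitter_ms=50, loss_pct=1):
    """Add delay, jitter, and packet loss to egress traffic via the netem qdisc."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
         "delay", f"{delay_ms}ms", f"{jitter_ms}ms", "loss", f"{loss_pct}%"],
        check=True,
    )

def rollback_network_chaos():
    """Remove the netem qdisc, restoring normal network behavior."""
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root", "netem"], check=True)
```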
Dependency failure:
Tool: Service mesh fault injection
Test: Database timeout, API 500, queue full
Validates: Fallbacks, graceful degradation, caching
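One way to inject a dependency failure, assuming an Istio mesh: a VirtualService fault that returns HTTP 500 for half of the requests to a hypothetical `payments` service.

```python
import subprocess

# Istio fault injection: 50% of requests to "payments" receive HTTP 500.
# Roll back with: kubectl delete virtualservice payments-fault
FAULT_MANIFEST = """
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments-fault
spec:
  hosts:
  - payments
  http:
  - fault:
      abort:
        percentage:
          value: 50
        httpStatus: 500
    route:
    - destination:
        host: payments
"""

subprocess.run(["kubectl", "apply", "-f", "-"], input=FAULT_MANIFEST, text=True, check=True)
```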
Resource exhaustion:
Tool: stress-ng, custom scripts
Test: CPU spike, memory pressure, disk full
Validates: Resource limits, OOM handling, alerting
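A sketch of CPU and memory pressure with `stress-ng` (must be installed on the target host; worker counts and durations are examples):

```python
import subprocess

def cpu_spike(workers=4, load_pct=80, seconds=120):
    """Drive `workers` CPU stressors at ~load_pct% for the given duration."""
    subprocess.run(
        ["stress-ng", "--cpu", str(workers), "--cpu-load", str(load_pct),
         "--timeout", f"{seconds}s"],
        check=True,
    )

def memory_pressure(workers=2, bytes_per_worker="1G", seconds=120):
    """Allocate and touch memory to exercise limits, OOM handling, and alerts."""
    subprocess.run(
        ["stress-ng", "--vm", str(workers), "--vm-bytes", bytes_per_worker,
         "--timeout", f"{seconds}s"],
        check=True,
    )
```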
DNS failure:
Tool: iptables, dnsmasq
Test: DNS resolution failure for downstream service
Validates: DNS caching, connection pooling, retries
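A sketch of DNS failure injection by dropping outbound DNS queries with `iptables` on the client host; the matching delete rule is the rollback:

```python
import subprocess

DNS_RULE = ["-p", "udp", "--dport", "53", "-j", "DROP"]

def break_dns():
    """Drop all outbound UDP DNS queries from this host."""
    subprocess.run(["iptables", "-A", "OUTPUT", *DNS_RULE], check=True)

def restore_dns():
    """Remove the drop rule, restoring DNS resolution."""
    subprocess.run(["iptables", "-D", "OUTPUT", *DNS_RULE], check=True)
```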
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Chaos without hypothesis | Random breaking with no learning | Define hypothesis + success criteria before every experiment |
| No blast radius control | Experiment takes down production | Start small, expand gradually |
| Skipping non-production validation | Issues are found in prod instead of staging | Staging → canary → production progression |
| No automated rollback | Experiment exceeds its intended blast radius | Hard abort conditions with auto-restore |
| Run once, never again | Systems change and old results go stale | Regular game days, chaos as part of CI/CD |
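The first, second, and fourth anti-patterns can be caught mechanically before any failure is injected. A small preflight sketch, using the field names from the ChaosExperiment sketch above:

```python
def preflight_check(experiment) -> None:
    """Refuse to run an experiment that is missing its safety rails."""
    if not experiment.hypothesis():
        raise ValueError("No hypothesis: define expected behavior and success criteria first")
    if experiment.blast_radius not in ("low", "medium"):
        raise ValueError("Blast radius too wide: start small and expand gradually")
    if experiment.rollback_plan is None:
        raise ValueError("No rollback plan: define automated restore before injecting failure")
```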
Chaos engineering is an investment in confidence. Every experiment that confirms your hypothesis validates your architecture. Every experiment that disproves it reveals a gap you can fix before your users find it.