Chaos Engineering Framework
Build confidence in system resilience through controlled failure experiments. Covers chaos experiment design, blast radius control, steady state hypothesis, game days, chaos in production, and the patterns that turn unknown failure modes into known, handled scenarios.
Chaos engineering is not about breaking things. It is about discovering how things are already broken — latent failures, untested assumptions, missing fallbacks — before they cause outages. Every production system has failure modes its operators have never seen. Chaos engineering makes the unknown known.
Principles of Chaos
1. Define steady state:
What does "normal" look like?
Metrics: Error rate < 0.1%, latency P99 < 500ms, throughput > 1000 rps
2. Hypothesize:
"If we kill service X, the system will gracefully degrade
and users will not see errors."
3. Introduce a real-world failure:
Kill a service, inject latency, corrupt data, exhaust CPU
4. Observe:
Did the system behave as hypothesized?
Were alerts triggered? Did fallbacks activate? Did users notice?
5. Learn:
If hypothesis confirmed: Good — document and move on
If hypothesis disproven: Found a bug! Fix it and re-test
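Steps 1, 2, and 4 can be made concrete in code. A minimal sketch using the example thresholds from step 1; the class and field names are illustrative, not prescriptive:

```python
from dataclasses import dataclass

@dataclass
class SteadyState:
    error_rate: float        # % of failed requests
    p99_latency_ms: float
    throughput_rps: float

def within_tolerance(observed: SteadyState) -> bool:
    """Check observed metrics against the steady-state thresholds from step 1."""
    return (
        observed.error_rate < 0.1
        and observed.p99_latency_ms < 500
        and observed.throughput_rps > 1000
    )

# Capture before, during, and after the experiment; refuse to start outside steady state
baseline = SteadyState(error_rate=0.05, p99_latency_ms=200, throughput_rps=1500)
assert within_tolerance(baseline), "System is not in steady state; do not inject failure"
```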
Experiment Design
class ChaosExperiment:
    def __init__(self, name, target, metrics, health, blast_radius="low"):
        self.name = name
        self.target = target              # service or component to disrupt
        self.metrics = metrics            # metrics client used by define_steady_state()
        self.health = health              # health-check client used by define_steady_state()
        self.blast_radius = blast_radius
        self.rollback_plan = None

    def define_steady_state(self):
        """Capture metrics that define 'normal'."""
        return {
            "error_rate": self.metrics.get_error_rate(),      # e.g., 0.05%
            "p99_latency": self.metrics.get_p99_latency(),     # e.g., 200ms
            "throughput": self.metrics.get_throughput(),       # e.g., 1500 rps
            "health_checks": self.health.get_all_status(),     # e.g., all green
        }

    def hypothesis(self):
        return (
            f"When {self.target} fails, the system will continue serving "
            f"requests with error rate < 1% and latency P99 < 2000ms. "
            f"Fallback behavior should activate within 30 seconds."
        )

    def run(self):
        """Execute the chaos experiment."""
        # 1. Capture steady state
        before = self.define_steady_state()

        # 2. Inject failure (inject_failure/observe/rollback/evaluate are
        #    implemented per experiment type)
        failure = self.inject_failure()

        # 3. Observe for duration
        observations = self.observe(duration_minutes=5)

        # 4. Check abort conditions (auto-rollback if exceeded)
        for obs in observations:
            if obs.error_rate > 5.0:  # Hard stop at 5% errors
                self.rollback(failure)
                return ExperimentResult(
                    status="ABORTED",
                    reason="Error rate exceeded 5%",
                )

        # 5. Rollback
        self.rollback(failure)

        # 6. Capture steady state after
        after = self.define_steady_state()

        # 7. Compare and report
        return ExperimentResult(
            hypothesis_confirmed=self.evaluate(before, observations, after),
            steady_state_before=before,
            steady_state_after=after,
            observations=observations,
        )
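A sketch of how such an experiment might be driven. `PrometheusMetrics` and `HealthChecker` are hypothetical clients; any objects exposing the methods used in `define_steady_state()` would do:

```python
# Hypothetical metrics and health-check clients (names are illustrative)
metrics = PrometheusMetrics(base_url="http://prometheus:9090")
health = HealthChecker(services=["checkout", "payments", "inventory"])

experiment = ChaosExperiment(
    name="payments-pod-kill",
    target="payments",
    metrics=metrics,
    health=health,
    blast_radius="low",   # e.g., one pod, one AZ, or a small slice of traffic
)

result = experiment.run()
if not result.hypothesis_confirmed:
    print("Hypothesis disproven: record the finding, fix the gap, re-run the experiment")
```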
Experiment Types
Instance failure:
Tool: AWS FIS, Gremlin, LitmusChaos
Test: Kill a container, VM, or pod
Validates: Auto-scaling, self-healing, load balancing
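A minimal pod-kill injector, assuming Kubernetes and `kubectl` access; the namespace and label selector are placeholders:

```python
import random
import subprocess

def kill_random_pod(namespace: str, selector: str) -> str:
    """Delete one randomly chosen pod matching the selector."""
    pods = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-l", selector, "-o", "name"],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    victim = random.choice(pods)  # e.g., "pod/payments-6f9c7d9b4-x2k8j"
    subprocess.run(["kubectl", "delete", "-n", namespace, victim], check=True)
    return victim

# kill_random_pod("prod", "app=payments")  # expect the ReplicaSet to replace it within seconds
```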
Network failure:
Tool: tc (traffic control), toxiproxy
Test: Add latency, packet loss, partition
Validates: Timeouts, retries, circuit breakers
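A sketch of latency and packet-loss injection with `tc netem`, run on the target host; the interface name and numbers are placeholders:

```python
import subprocess

IFACE = "eth0"  # placeholder: the interface carrying the traffic under test

def inject_network_chaos(delay_ms=200, jitter_ms=50, loss_pct=1):
    """Add delay, jitter, and packet loss to egress traffic via the netem qdisc."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
         "delay", f"{delay_ms}ms", f"{jitter_ms}ms", "loss", f"{loss_pct}%"],
        check=True,
    )

def rollback_network_chaos():
    """Remove the netem qdisc, restoring normal network behavior."""
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root", "netem"], check=True)
```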
Dependency failure:
Tool: Service mesh fault injection
Test: Database timeout, API 500, queue full
Validates: Fallbacks, graceful degradation, caching
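One way to inject a dependency failure, assuming an Istio mesh: a VirtualService fault that returns HTTP 500 for half of the requests to a hypothetical `payments` service.

```python
import subprocess

# Istio fault injection: 50% of requests to "payments" receive HTTP 500.
# Roll back with: kubectl delete virtualservice payments-fault
FAULT_MANIFEST = """
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments-fault
spec:
  hosts:
  - payments
  http:
  - fault:
      abort:
        percentage:
          value: 50
        httpStatus: 500
    route:
    - destination:
        host: payments
"""

subprocess.run(["kubectl", "apply", "-f", "-"], input=FAULT_MANIFEST, text=True, check=True)
```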
Resource exhaustion:
Tool: stress-ng, custom scripts
Test: CPU spike, memory pressure, disk full
Validates: Resource limits, OOM handling, alerting
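A sketch of CPU and memory pressure with `stress-ng` (must be installed on the target host; worker counts and durations are examples):

```python
import subprocess

def cpu_spike(workers=4, load_pct=80, seconds=120):
    """Drive `workers` CPU stressors at ~load_pct% for the given duration."""
    subprocess.run(
        ["stress-ng", "--cpu", str(workers), "--cpu-load", str(load_pct),
         "--timeout", f"{seconds}s"],
        check=True,
    )

def memory_pressure(workers=2, bytes_per_worker="1G", seconds=120):
    """Allocate and touch memory to exercise limits, OOM handling, and alerts."""
    subprocess.run(
        ["stress-ng", "--vm", str(workers), "--vm-bytes", bytes_per_worker,
         "--timeout", f"{seconds}s"],
        check=True,
    )
```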
DNS failure:
Tool: iptables, dnsmasq
Test: DNS resolution failure for downstream service
Validates: DNS caching, connection pooling, retries
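A sketch of DNS failure injection by dropping outbound DNS queries with `iptables` on the client host; the matching delete rule is the rollback:

```python
import subprocess

DNS_RULE = ["-p", "udp", "--dport", "53", "-j", "DROP"]

def break_dns():
    """Drop all outbound UDP DNS queries from this host."""
    subprocess.run(["iptables", "-A", "OUTPUT", *DNS_RULE], check=True)

def restore_dns():
    """Remove the drop rule, restoring DNS resolution."""
    subprocess.run(["iptables", "-D", "OUTPUT", *DNS_RULE], check=True)
```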
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Chaos without hypothesis | Random breaking with no learning | Define hypothesis + success criteria before every experiment |
| No blast radius control | Experiment takes down production | Start small, expand gradually |
| Skipping non-production validation | Issues are found in prod instead of staging | Staging → canary → production progression |
| No automated rollback | Experiment exceeds its intended blast radius | Hard abort conditions with auto-restore |
| Run once, never again | Systems change and old results go stale | Regular game days, chaos as part of CI/CD |
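The first, second, and fourth anti-patterns can be caught mechanically before any failure is injected. A small preflight sketch, using the field names from the ChaosExperiment sketch above:

```python
def preflight_check(experiment) -> None:
    """Refuse to run an experiment that is missing its safety rails."""
    if not experiment.hypothesis():
        raise ValueError("No hypothesis: define expected behavior and success criteria first")
    if experiment.blast_radius not in ("low", "medium"):
        raise ValueError("Blast radius too wide: start small and expand gradually")
    if experiment.rollback_plan is None:
        raise ValueError("No rollback plan: define automated restore before injecting failure")
```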
Chaos engineering is an investment in confidence. Every experiment that confirms your hypothesis validates your architecture. Every experiment that disproves it reveals a gap you can fix before your users find it.