
Chaos Engineering Framework

Build confidence in system resilience through controlled failure experiments. Covers chaos experiment design, blast radius control, steady state hypothesis, game days, chaos in production, and the patterns that turn unknown failure modes into known, handled scenarios.

Chaos engineering is not about breaking things. It is about discovering how things are already broken — latent failures, untested assumptions, missing fallbacks — before they cause outages. Every production system has failure modes its operators have never seen. Chaos engineering makes the unknown known.


Principles of Chaos

1. Define steady state:
   What does "normal" look like?
   Metrics: Error rate < 0.1%, latency P99 < 500ms, throughput > 1000 rps

2. Hypothesize:
   "If we kill service X, the system will gracefully degrade
   and users will not see errors."

3. Introduce a real-world failure:
   Kill a service, inject latency, corrupt data, exhaust CPU

4. Observe:
   Did the system behave as hypothesized?
   Were alerts triggered? Did fallbacks activate? Did users notice?

5. Learn:
   If hypothesis confirmed: Good — document and move on
   If hypothesis disproven: Found a bug! Fix it and re-test

Experiment Design

class ChaosExperiment:
    def __init__(self, name, target, metrics, health, blast_radius="low"):
        self.name = name
        self.target = target
        self.metrics = metrics  # metrics client used by define_steady_state()
        self.health = health    # health-check client
        self.blast_radius = blast_radius
        self.rollback_plan = None
    
    def define_steady_state(self):
        """Capture metrics that define 'normal'."""
        return {
            "error_rate": self.metrics.get_error_rate(),  # e.g., 0.05%
            "p99_latency": self.metrics.get_p99_latency(),  # e.g., 200ms
            "throughput": self.metrics.get_throughput(),  # e.g., 1500 rps
            "health_checks": self.health.get_all_status(),  # e.g., all green
        }
    
    def hypothesis(self):
        return (
            f"When {self.target} fails, the system will continue serving "
            f"requests with error rate < 1% and latency P99 < 2000ms. "
            f"Fallback behavior should activate within 30 seconds."
        )
    
    def run(self):
        """Execute the chaos experiment."""
        # 1. Capture steady state
        before = self.define_steady_state()
        
        # 2. Inject failure
        failure = self.inject_failure()
        
        # 3. Observe for duration
        observations = self.observe(duration_minutes=5)
        
        # 4. Check abort conditions (roll back if any observation breached them)
        for obs in observations:
            if obs.error_rate > 5.0:  # Hard stop at 5% errors
                self.rollback(failure)
                return ExperimentResult(
                    status="ABORTED",
                    reason="Error rate exceeded 5%",
                )
        
        # 5. Rollback
        self.rollback(failure)
        
        # 6. Capture steady state after
        after = self.define_steady_state()
        
        # 7. Compare and report
        return ExperimentResult(
            hypothesis_confirmed=self.evaluate(before, observations, after),
            steady_state_before=before,
            steady_state_after=after,
            observations=observations,
        )
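The `evaluate` step above is left abstract. One way it might compare steady states, sketched with a hypothetical `steady_state_holds` helper and tolerances derived from the hypothesis bounds (the helper name and sample numbers are illustrative, not part of the original):

```python
def steady_state_holds(before, after, tolerances):
    """True when every metric's drift from baseline stays within tolerance."""
    return all(
        abs(after[metric] - before[metric]) <= max_drift
        for metric, max_drift in tolerances.items()
    )

# Demo with bounds matching the hypothesis (error rate < 1%, P99 < 2000ms):
before = {"error_rate": 0.05, "p99_latency_ms": 200.0}
tolerances = {"error_rate": 0.95, "p99_latency_ms": 1800.0}

confirmed = steady_state_holds(
    before, {"error_rate": 0.4, "p99_latency_ms": 900.0}, tolerances
)
disproven = steady_state_holds(
    before, {"error_rate": 3.0, "p99_latency_ms": 900.0}, tolerances
)
```

A disproven hypothesis here is a success in its own right: the experiment found a gap before an outage did.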

Experiment Types

Instance failure:
  Tool: AWS FIS, Gremlin, LitmusChaos
  Test: Kill a container, VM, or pod
  Validates: Auto-scaling, self-healing, load balancing

Network failure:
  Tool: tc (traffic control), toxiproxy
  Test: Add latency, packet loss, partition
  Validates: Timeouts, retries, circuit breakers
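A network-failure experiment exercises exactly this kind of client-side defense. A minimal sketch of retry-with-backoff, with hypothetical names and a stubbed flaky dependency standing in for injected latency:

```python
import time

def call_with_retries(fn, attempts=3, base_delay=0.01):
    """Retry a flaky call with exponential backoff; re-raise after the
    final attempt so a higher-level fallback can take over."""
    for attempt in range(attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.01s, 0.02s, ...

# Demo: a dependency that times out twice before recovering -- the kind
# of behavior a latency-injection experiment should surface:
calls = {"count": 0}
def flaky_dependency():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated injected latency")
    return "ok"

result = call_with_retries(flaky_dependency)
```

If the experiment shows the system retrying forever or amplifying load, that is the bug the hypothesis was designed to catch.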

Dependency failure:
  Tool: Service mesh fault injection
  Test: Database timeout, API 500, queue full
  Validates: Fallbacks, graceful degradation, caching
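Dependency-failure experiments verify that degradation paths like this one actually fire. A minimal sketch (function and stub names are assumptions) of falling back to a last-known-good cache when the dependency fails:

```python
def get_with_fallback(key, fetch_live, cache):
    """Serve live data when the dependency is healthy; degrade to the
    last-known-good cached value when it fails."""
    try:
        value = fetch_live(key)
        cache[key] = value  # refresh last-known-good
        return value, "live"
    except ConnectionError:
        if key in cache:
            return cache[key], "cached"
        raise  # nothing cached -- surface the failure

# Demo: dependency healthy on the first call, failing on the second:
def failing_dependency(key):
    raise ConnectionError("simulated db timeout")

cache = {}
live = get_with_fallback("user:1", lambda k: {"name": "Ada"}, cache)
degraded = get_with_fallback("user:1", failing_dependency, cache)
```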

Resource exhaustion:
  Tool: stress-ng, custom scripts
  Test: CPU spike, memory pressure, disk full
  Validates: Resource limits, OOM handling, alerting

DNS failure:
  Tool: iptables, dnsmasq
  Test: DNS resolution failure for downstream service
  Validates: DNS caching, connection pooling, retries

Anti-Patterns

Chaos without hypothesis:
  Consequence: Random breaking with no learning
  Fix: Define hypothesis + success criteria before every experiment

No blast radius control:
  Consequence: Experiment takes down production
  Fix: Start small, expand gradually

Skip non-production first:
  Consequence: Find issues in prod instead of staging
  Fix: Staging → canary → production progression

No automated rollback:
  Consequence: Experiment exceeds blast radius
  Fix: Hard abort conditions with auto-restore

Run once, never again:
  Consequence: Systems change, old experiments stale
  Fix: Regular game days, chaos as CI/CD
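The auto-restore fix can be enforced structurally rather than by discipline. One sketch, using a hypothetical context manager so rollback runs even when the experiment aborts or crashes mid-observation:

```python
from contextlib import contextmanager

@contextmanager
def guaranteed_rollback(inject, restore):
    """Run an experiment with a hard guarantee that the injected fault
    is reverted, even if observation raises or aborts mid-way."""
    fault = inject()
    try:
        yield fault
    finally:
        restore(fault)

# Demo with stub inject/restore hooks (real hooks would call your chaos
# tooling, e.g. restart a killed instance or remove a latency toxic):
events = []
try:
    with guaranteed_rollback(lambda: events.append("inject"),
                             lambda fault: events.append("restore")):
        raise RuntimeError("abort: error rate exceeded 5%")
except RuntimeError:
    pass  # experiment aborted -- rollback still ran
```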

Chaos engineering is an investment in confidence. Every experiment that confirms your hypothesis validates your architecture. Every experiment that fails reveals a gap you can fix before your users find it.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
