
Chaos Engineering: Breaking Things on Purpose to Build Confidence

Implement chaos engineering practices that strengthen your systems without causing real outages. Covers experiment design, blast radius control, steady state hypothesis, game day facilitation, and the maturity model from ad-hoc chaos to continuous resilience verification.

Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions. It is not breaking things randomly. It is asking specific questions — “what happens when this database becomes slow?” — and running controlled experiments to find out.

The distinction matters because chaos without discipline is just bad operations. Chaos with discipline reveals the failures that are waiting to happen at 3 AM on Black Friday.


The Chaos Engineering Process

1. Define steady state
   "Our checkout flow processes 500 requests/min with p99 latency < 2s
    and error rate < 0.1%"

2. Form hypothesis
   "If we lose 30% of checkout-api pods, the system will
    auto-scale and maintain steady state within 2 minutes"

3. Design experiment
   - What: Kill 30% of checkout-api pods
   - Where: Staging environment first, then production
   - When: Business hours, low-traffic period
   - Duration: 5 minutes
   - Abort criteria: Error rate > 5% or latency > 10s

4. Run experiment
   Execute in controlled manner with team observing

5. Analyze results
   Did steady state hold? If not, what broke and why?

6. Fix and re-test
   Address weaknesses found, re-run experiment to verify
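Steps 1 and 5 hinge on a machine-checkable steady-state definition. A minimal sketch in Python (metric names and thresholds are illustrative, not any particular tool's API):

```python
from dataclasses import dataclass

@dataclass
class SteadyStateMetric:
    name: str
    threshold: float
    higher_is_better: bool  # True for success rates, False for latencies

    def holds(self, observed: float) -> bool:
        """Return True if the observed value satisfies the threshold."""
        if self.higher_is_better:
            return observed >= self.threshold
        return observed <= self.threshold

# Illustrative steady state for the checkout flow described above.
STEADY_STATE = [
    SteadyStateMetric("checkout_success_rate", 0.995, higher_is_better=True),
    SteadyStateMetric("checkout_p99_latency_s", 2.0, higher_is_better=False),
]

def steady_state_holds(observed: dict[str, float]) -> list[str]:
    """Return the names of violated metrics (empty list = steady state holds)."""
    return [m.name for m in STEADY_STATE if not m.holds(observed[m.name])]
```

An empty list means steady state holds; anything else names the violated metrics and should block (step 3's abort criteria) or fail (step 5's analysis) the experiment.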

Steady State Hypothesis

The steady state is the normal operating behavior of your system. You must define this before running any experiment, or you will not know whether the system degraded.

Service             Steady State Metric             Acceptable Deviation
API Gateway         p99 latency < 500ms             Up to 800ms during experiment
Checkout            Success rate > 99.9%            No lower than 99.5%
Search              Results returned in < 1s        Up to 3s during experiment
Payment processing  Transaction success > 99.95%    No deviation acceptable (abort)

# Experiment definition
experiment:
  name: "Pod failure resilience - checkout API"
  hypothesis: "Killing 30% of checkout-api pods will not degrade
               checkout success rate below 99.5%"
  steady_state:
    metrics:
      - name: "checkout_success_rate"
        query: "sum(rate(checkout_success_total[5m])) / sum(rate(checkout_total[5m]))"
        threshold: ">= 0.995"
      - name: "checkout_p99_latency"
        query: "histogram_quantile(0.99, rate(checkout_duration_seconds_bucket[5m]))"
        threshold: "<= 2.0"
  action:
    type: "pod-kill"
    target: "deployment/checkout-api"
    percentage: 30
    namespace: "production"
  duration: "5m"
  abort_conditions:
    - "checkout_success_rate < 0.95"
    - "checkout_p99_latency > 10.0"
  rollback:
    automatic: true
    on: "abort_condition_met"
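A definition like this is only safe if something actually enforces the abort conditions during the run. One way to sketch that watchdog (the polling loop, metric source, and rollback hook are assumptions, not part of any specific chaos tool):

```python
import time

# Abort thresholds taken from the experiment definition above.
ABORT_CONDITIONS = {
    "checkout_success_rate": lambda v: v < 0.95,   # abort below 95% success
    "checkout_p99_latency":  lambda v: v > 10.0,   # abort above 10s p99
}

def should_abort(observed: dict) -> list:
    """Return the names of abort conditions that have been tripped."""
    return [name for name, tripped in ABORT_CONDITIONS.items()
            if tripped(observed[name])]

def run_with_watchdog(fetch_metrics, rollback, duration_s=300, poll_s=10):
    """Poll metrics for the experiment duration; roll back on any trip.

    fetch_metrics() -> dict of current metric values (e.g. from Prometheus)
    rollback()      -> undo the chaos action immediately
    """
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        tripped = should_abort(fetch_metrics())
        if tripped:
            rollback()
            return ("aborted", tripped)
        time.sleep(poll_s)
    return ("completed", [])
```

run_with_watchdog returns ("completed", []) if the experiment ran its full duration, or ("aborted", tripped) after calling rollback() — the automatic rollback the YAML above promises.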

Experiment Categories

Category        What You Test                             Example Experiments
Infrastructure  Node, pod, and container failures         Kill nodes, drain nodes, kill pods
Network         Network partitions, latency, packet loss  Add 500ms latency between services, block traffic to database
Application     Service degradation, dependency failure   Shut down a downstream API, corrupt cache
Resource        CPU/memory/disk pressure                  Fill disk to 95%, spike CPU to 100%
State           Data corruption, clock skew               Advance system clock 24 hours, corrupt a config file

Starting Experiments (Beginner → Advanced)

Level 1: Staging only
  - Kill a single pod and verify auto-restart
  - Block access to a non-critical dependency
  - Simulate high CPU usage

Level 2: Production (controlled)
  - Kill pods during low-traffic period
  - Inject latency on internal service calls
  - Disable a cache and observe database behavior

Level 3: Production (realistic)
  - Simulate region failure (shift traffic off one region)
  - Inject packet loss on primary database connection
  - Kill a percentage of pods during peak traffic

Level 4: Automated & continuous
  - Run experiments automatically on every deployment
  - Continuous background chaos with automatic abort
  - Full-scale game days simulating major failures
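At every level, the pod-kill action reduces to picking a bounded subset of targets. A sketch of blast-radius-capped victim selection (the cap values and function name are illustrative):

```python
import math
import random

def pick_victims(pods, percentage, max_absolute=5, seed=None):
    """Select pods to kill: `percentage` of the fleet, hard-capped at
    `max_absolute` so a misconfigured experiment can't take out the
    whole deployment. Always leaves at least one pod untouched.
    """
    if not 0 < percentage < 100:
        raise ValueError("percentage must be between 0 and 100 (exclusive)")
    count = math.ceil(len(pods) * percentage / 100)
    count = min(count, max_absolute, len(pods) - 1)  # blast radius caps
    rng = random.Random(seed)
    return rng.sample(pods, count)
```

The point of the caps is that the blast radius is enforced in code, not in the runbook: even a typo like 300 instead of 30 raises an error instead of killing the fleet.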

Blast Radius Control

The defining characteristic of chaos engineering vs “just breaking stuff” is that the blast radius is controlled and the abort conditions are defined before the experiment starts.

BEFORE running any chaos experiment:

  ✅ Experiment runs in staging first
  ✅ Team is aware and observing
  ✅ Abort conditions are automated
  ✅ Rollback is tested and immediate
  ✅ Blast radius is limited (one service, one region, subset of pods)
  ✅ Customer-facing impact is acceptable (< defined threshold)
  ✅ Supporting team (SRE/on-call) is available and aware

  ❌ NEVER:
  ❌ Run chaos in production without fallback
  ❌ Run chaos on data stores without verified backups
  ❌ Run chaos during peak traffic without executive approval
  ❌ Run chaos that affects financial transactions without safeguards
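These preconditions can be enforced mechanically rather than by memory. A minimal sketch of a preflight gate that refuses to start while any check is false (check names and their status sources are hypothetical):

```python
def preflight(checks: dict) -> None:
    """Refuse to start the experiment unless every precondition is True.

    `checks` maps a human-readable precondition to its current status,
    ideally fed by automation (CI results, on-call roster, backup status)
    rather than hand-entered booleans.
    """
    failed = [name for name, ok in checks.items() if not ok]
    if failed:
        raise RuntimeError("preflight failed: " + "; ".join(failed))

# Example (hypothetical status sources):
# preflight({
#     "ran in staging first": True,
#     "team observing": True,
#     "abort conditions automated": True,
#     "rollback tested": True,
#     "blast radius limited": True,
# })
```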

Game Days: Structured Chaos

A game day is a scheduled event where a team deliberately introduces failures and practices responding. It is a fire drill for your systems and your people.

Game Day Agenda

30 min before:
  - Brief all participants on scenario and roles
  - Verify monitoring dashboards are visible
  - Confirm abort and rollback procedures
  - Announce game day to broader org (avoid false alarms)

During game day (2-4 hours):
  - Facilitator introduces failures one at a time
  - Team responds as if it is a real incident
  - Scribe documents timeline, decisions, and observations
  - Facilitator increases blast radius if team handles well
  - Emergency abort if real customer impact exceeds threshold

After game day:
  - Immediate debrief (30 min): what surprised us?
  - Documented findings within 24 hours
  - Action items with owners and due dates
  - Share findings with broader engineering org

Game Day Scenario Examples

Scenario                               What It Tests                        Expected Outcome
"Primary database is down"             Failover to replica, read-only mode  Auto-failover in < 30 seconds
"AWS us-east-1 is unavailable"         Multi-region recovery                Traffic shifts to us-west-2 in < 5 min
"Credentials have been leaked"         Secret rotation process              All secrets rotated in < 1 hour
"Payment processor is returning 503"   Graceful degradation                 Users see "service temporarily unavailable", not a crash
"Cache is completely empty"            Cold start / thundering herd         Database handles load without falling over

Tools

Tool                    Type                         Best For
Chaos Monkey (Netflix)  Random instance termination  Instance-level resilience (runs with Spinnaker)
Litmus                  Kubernetes-native chaos      Kubernetes experiments with CRDs
Gremlin                 Commercial platform          Enterprise teams wanting UI + support
Chaos Mesh              Kubernetes-native, CNCF      Comprehensive K8s chaos (network, IO, time)
Toxiproxy               Network chaos proxy          Simulating network conditions between services
tc (Linux)              Network manipulation         Low-level latency and packet loss injection
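For the low-level end of this table, netem latency injection reduces to a couple of tc invocations. A sketch that builds the commands without running them, so the injection and its rollback can be reviewed first (interface name and values are examples):

```python
def netem_inject(dev="eth0", delay_ms=500, jitter_ms=50, loss_pct=1):
    """Build the tc/netem command that adds delay, jitter, and packet
    loss on an interface. Review before running: this affects ALL
    traffic on `dev`, which is itself a blast-radius decision.
    """
    return ["tc", "qdisc", "add", "dev", dev, "root", "netem",
            "delay", f"{delay_ms}ms", f"{jitter_ms}ms",
            "loss", f"{loss_pct}%"]

def netem_clear(dev="eth0"):
    """Build the command that removes the netem qdisc (the rollback)."""
    return ["tc", "qdisc", "del", "dev", dev, "root"]

# To actually run (needs root on the target host):
# subprocess.run(netem_inject("eth0", delay_ms=500), check=True)
```

Keeping the rollback command next to the injection command is the tc equivalent of the automatic rollback in the experiment definition earlier: the undo exists before the chaos starts.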

Implementation Checklist

  • Start with steady state definition for your top 3 critical services
  • Run your first experiment in staging: kill a single pod, verify auto-restart
  • Document abort conditions for every experiment before running it
  • Schedule your first game day (pick a low-risk scenario)
  • Implement automated abort: if error rate spikes, chaos stops automatically
  • Include chaos experiments in post-incident action items (test the fix)
  • Progress through maturity levels: staging → production controlled → production automated
  • Train at least 3 engineers in game day facilitation
  • Share game day findings organization-wide (build chaos culture)
  • Integrate chaos tests into deployment pipeline for critical services
Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
