Chaos Engineering: Breaking Things on Purpose to Build Confidence
Implement chaos engineering practices that strengthen your systems without causing real outages. Covers experiment design, blast radius control, steady state hypothesis, game day facilitation, and the maturity model from ad-hoc chaos to continuous resilience verification.
Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions. It is not breaking things randomly. It is asking specific questions — “what happens when this database becomes slow?” — and running controlled experiments to find out.
The distinction matters because chaos without discipline is just bad operations. Chaos with discipline reveals the failures that are waiting to happen at 3 AM on Black Friday.
The Chaos Engineering Process
1. Define steady state
   "Our checkout flow processes 500 requests/min with p99 latency < 2s
   and error rate < 0.1%"
2. Form hypothesis
   "If we lose 30% of checkout-api pods, the system will
   auto-scale and maintain steady state within 2 minutes"
3. Design experiment
   - What: Kill 30% of checkout-api pods
   - Where: Staging environment first, then production
   - When: Business hours, low-traffic period
   - Duration: 5 minutes
   - Abort criteria: Error rate > 5% or latency > 10s
4. Run experiment
   Execute in a controlled manner with the team observing.
5. Analyze results
   Did steady state hold? If not, what broke and why?
6. Fix and re-test
   Address the weaknesses found, then re-run the experiment to verify the fix.
Steady State Hypothesis
The steady state is the normal operating behavior of your system. You must define this before running any experiment, or you will not know whether the system degraded.
| Service | Steady State Metric | Acceptable Deviation |
|---|---|---|
| API Gateway | p99 latency < 500ms | Up to 800ms during experiment |
| Checkout | Success rate > 99.9% | No lower than 99.5% |
| Search | Results returned in < 1s | Up to 3s during experiment |
| Payment processing | Transaction success > 99.95% | No deviation acceptable (abort) |
```yaml
# Experiment definition
experiment:
  name: "Pod failure resilience - checkout API"
  hypothesis: >-
    Killing 30% of checkout-api pods will not degrade
    checkout success rate below 99.5%
  steady_state:
    metrics:
      - name: "checkout_success_rate"
        query: "sum(rate(checkout_success_total[5m])) / sum(rate(checkout_total[5m]))"
        threshold: ">= 0.995"
      - name: "checkout_p99_latency"
        query: "histogram_quantile(0.99, rate(checkout_duration_seconds_bucket[5m]))"
        threshold: "<= 2.0"
  action:
    type: "pod-kill"
    target: "deployment/checkout-api"
    percentage: 30
    namespace: "production"
    duration: "5m"
  abort_conditions:
    - "checkout_success_rate < 0.95"
    - "checkout_p99_latency > 10.0"
  rollback:
    automatic: true
    on: "abort_condition_met"
```
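The definition above is tool-agnostic. If you run chaos with Chaos Mesh, the same pod-kill action could be expressed roughly as the manifest below; the resource name, namespaces, and label selectors are illustrative, and the field names should be checked against the Chaos Mesh version you have installed.

```yaml
# Sketch: Chaos Mesh PodChaos resource for the pod-kill action above.
# Namespace and labelSelectors are assumptions; adjust to your cluster.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: checkout-api-pod-kill
  namespace: chaos-testing
spec:
  action: pod-kill            # terminate the selected pods
  mode: fixed-percent         # act on a percentage of matching pods
  value: "30"                 # 30% of checkout-api pods
  selector:
    namespaces:
      - production
    labelSelectors:
      app: checkout-api
```

Note that pod-kill is a one-shot action in Chaos Mesh, so the 5-minute observation window and the abort checks still have to be driven by whatever orchestrates the experiment.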
Experiment Categories
| Category | What You Test | Example Experiments |
|---|---|---|
| Infrastructure | Node, pod, and container failures | Kill nodes, drain nodes, kill pods |
| Network | Network partitions, latency, packet loss | Add 500ms latency between services, block traffic to database |
| Application | Service degradation, dependency failure | Shut down a downstream API, corrupt cache |
| Resource | CPU/memory/disk pressure | Fill disk to 95%, spike CPU to 100% |
| State | Data corruption, clock skew | Advance system clock 24 hours, corrupt a config file |
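As a concrete example from the network category, adding 500 ms of latency between two services could look roughly like the Chaos Mesh NetworkChaos sketch below. The service labels and namespace are placeholders, and the schema should be verified against your Chaos Mesh release.

```yaml
# Sketch: inject 500ms of latency from checkout-api to payment-api (labels assumed).
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: checkout-to-payment-latency
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: checkout-api
  direction: to
  target:
    mode: all
    selector:
      namespaces:
        - staging
      labelSelectors:
        app: payment-api
  delay:
    latency: "500ms"
    jitter: "50ms"
  duration: "5m"
```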
Starting Experiments (Beginner → Advanced)
Level 1: Staging only
- Kill a single pod and verify auto-restart
- Block access to a non-critical dependency
- Simulate high CPU usage
Level 2: Production (controlled)
- Kill pods during low-traffic period
- Inject latency on internal service calls
- Disable a cache and observe database behavior
Level 3: Production (realistic)
- Simulate region failure (shift traffic off one region)
- Inject packet loss on primary database connection
- Kill a percentage of pods during peak traffic
Level 4: Automated & continuous
- Run experiments automatically on every deployment
- Continuous background chaos with automatic abort
- Full-scale game days simulating major failures
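At Level 4 the chaos itself is scheduled rather than run by hand. In Chaos Mesh this can be approximated with a Schedule resource that periodically kills a single pod; treat the example below as a sketch (the name, cadence, and selectors are assumptions) and pair it with the automated abort conditions described in the next section.

```yaml
# Sketch: continuously kill one random checkout-api pod every hour (Level 4 style chaos).
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: hourly-pod-kill-checkout
  namespace: chaos-testing
spec:
  schedule: "0 * * * *"       # once per hour
  concurrencyPolicy: Forbid   # never overlap experiments
  historyLimit: 5
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one                 # only a single pod at a time
    selector:
      namespaces:
        - production
      labelSelectors:
        app: checkout-api
```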
Blast Radius Control
The defining characteristic of chaos engineering vs “just breaking stuff” is that the blast radius is controlled and the abort conditions are defined before the experiment starts.
BEFORE running any chaos experiment:
✅ Experiment runs in staging first
✅ Team is aware and observing
✅ Abort conditions are automated
✅ Rollback is tested and immediate
✅ Blast radius is limited (one service, one region, subset of pods)
✅ Customer-facing impact is acceptable (< defined threshold)
✅ Supporting team (SRE/on-call) is available and aware
❌ NEVER:
❌ Run chaos in production without fallback
❌ Run chaos on data stores without verified backups
❌ Run chaos during peak traffic without executive approval
❌ Run chaos that affects financial transactions without safeguards
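One way to make "abort conditions are automated" concrete is to encode them as Prometheus alerting rules and have the chaos tooling (or the on-call runbook) stop the experiment when an alert fires. The rules below reuse the metrics from the experiment definition above; the group name and labels are placeholders.

```yaml
# Sketch: abort conditions expressed as Prometheus alerting rules.
groups:
  - name: chaos-abort-conditions
    rules:
      - alert: ChaosAbortCheckoutSuccessRate
        expr: |
          sum(rate(checkout_success_total[5m])) / sum(rate(checkout_total[5m])) < 0.95
        for: 1m
        labels:
          severity: critical
          chaos_action: abort     # consumed by whatever stops the experiment
        annotations:
          summary: "Checkout success rate below abort threshold during chaos experiment"
      - alert: ChaosAbortCheckoutLatency
        expr: |
          histogram_quantile(0.99, rate(checkout_duration_seconds_bucket[5m])) > 10
        for: 1m
        labels:
          severity: critical
          chaos_action: abort
        annotations:
          summary: "Checkout p99 latency above abort threshold during chaos experiment"
```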
Game Days: Structured Chaos
A game day is a scheduled event where a team deliberately introduces failures and practices responding. It is a fire drill for your systems and your people.
Game Day Agenda
30 min before:
- Brief all participants on scenario and roles
- Verify monitoring dashboards are visible
- Confirm abort and rollback procedures
- Announce game day to broader org (avoid false alarms)
During game day (2-4 hours):
- Facilitator introduces failures one at a time
- Team responds as if it is a real incident
- Scribe documents timeline, decisions, and observations
- Facilitator increases blast radius if team handles well
- Emergency abort if real customer impact exceeds threshold
After game day:
- Immediate debrief (30 min): what surprised us?
- Documented findings within 24 hours
- Action items with owners and due dates
- Share findings with broader engineering org
Game Day Scenario Examples
| Scenario | What It Tests | Expected Outcome |
|---|---|---|
| “Primary database is down” | Failover to replica, read-only mode | Auto-failover in < 30 seconds |
| “AWS us-east-1 is unavailable” | Multi-region recovery | Traffic shifts to us-west-2 in < 5 min |
| “Credentials have been leaked” | Secret rotation process | All secrets rotated in < 1 hour |
| “Payment processor is returning 503” | Graceful degradation | Users see “service temporarily unavailable”, not a crash |
| “Cache is completely empty” | Cold start / thundering herd | Database handles the load without falling over |
Tools
| Tool | Type | Best For |
|---|---|---|
| Chaos Monkey (Netflix) | Random instance termination | VM/instance-level resilience in Spinnaker-managed environments |
| Litmus | Kubernetes-native chaos | Kubernetes experiments with CRDs |
| Gremlin | Commercial platform | Enterprise teams wanting UI + support |
| Chaos Mesh | Kubernetes-native, CNCF | Comprehensive K8s chaos (network, IO, time) |
| Toxiproxy | Network chaos proxy | Simulating network conditions between services |
| tc (Linux) | Network manipulation | Low-level latency and packet loss injection |
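For comparison with the Chaos Mesh sketches above, a Litmus pod-delete experiment is declared through a ChaosEngine resource along the lines of the sketch below. The service account, labels, and environment values are assumptions, and Litmus also needs the corresponding ChaosExperiment CR and RBAC installed before this will run.

```yaml
# Sketch: Litmus ChaosEngine running the pod-delete experiment against checkout-api.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-api-chaos
  namespace: staging
spec:
  engineState: "active"
  appinfo:
    appns: "staging"
    applabel: "app=checkout-api"
    appkind: "deployment"
  chaosServiceAccount: pod-delete-sa   # assumes RBAC for pod-delete is set up
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"              # seconds of chaos
            - name: CHAOS_INTERVAL
              value: "10"              # seconds between pod deletions
            - name: FORCE
              value: "false"           # graceful deletion
```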
Implementation Checklist
- Start with steady state definition for your top 3 critical services
- Run your first experiment in staging: kill a single pod, verify auto-restart
- Document abort conditions for every experiment before running it
- Schedule your first game day (pick a low-risk scenario)
- Implement automated abort: if error rate spikes, chaos stops automatically
- Include chaos experiments in post-incident action items (test the fix)
- Progress through the maturity levels: staging → controlled production → realistic production → automated and continuous
- Train at least 3 engineers in game day facilitation
- Share game day findings organization-wide (build chaos culture)
- Integrate chaos tests into the deployment pipeline for critical services (see the pipeline sketch below)
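For that last item, one possible shape of a pipeline integration is a post-deploy job that applies a small experiment and then verifies steady state. The workflow below is a GitHub Actions sketch: the workflow name, file paths (chaos/staging-pod-kill.yaml, scripts/check_steady_state.sh), and cluster credential setup are hypothetical, not a drop-in configuration.

```yaml
# Sketch: run a small chaos experiment after each staging deploy (paths are hypothetical).
name: post-deploy-chaos
on:
  workflow_run:
    workflows: ["deploy-checkout-staging"]   # assumed name of the deploy workflow
    types: [completed]
jobs:
  chaos-smoke-test:
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Cluster credentials (kubeconfig) setup is omitted here.
      - name: Apply pod-kill experiment in staging
        run: kubectl apply -f chaos/staging-pod-kill.yaml
      - name: Wait for the experiment window
        run: sleep 300
      - name: Verify steady state held
        run: ./scripts/check_steady_state.sh   # hypothetical script querying Prometheus
      - name: Clean up experiment
        if: always()
        run: kubectl delete -f chaos/staging-pod-kill.yaml
```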