Canary Deployments
Deploy with confidence using canary releases that route a fraction of traffic to new versions. Covers traffic splitting, metrics comparison, automated rollback, progressive delivery controllers, and the patterns that catch regressions before they reach all users.
A canary deployment routes a small percentage of production traffic to a new version while the rest continues to use the stable version. If the canary shows problems — higher error rates, increased latency, degraded business metrics — it is automatically rolled back before the impact spreads. This is one of the safest ways to deploy changes to production.
Canary vs Blue-Green vs Rolling
Blue-Green:
- 100% of traffic flips from old → new instantly
- Risk: if the new version has bugs, all users are impacted

Rolling:
- Gradually replaces old instances with new ones
- Risk: mixed versions serve traffic during the rollout, and rollback is harder

Canary:
- 1% → 5% → 25% → 50% → 100% (progressive)
- Risk: contained; only the canary percentage of traffic is impacted, and rollback is automated
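The progressive sequence above can be sketched as a control loop: shift a step of traffic, let metrics accumulate, and either advance or roll back. This is an illustrative sketch, not any specific tool's API; `set_canary_weight` and `canary_is_healthy` are hypothetical hooks into the traffic-split layer and the metrics backend.

```python
import time

def canary_is_healthy() -> bool:
    """Hypothetical hook: query the metrics backend and compare
    canary vs stable (error rate, latency, success rate)."""
    return True

def set_canary_weight(percent: int) -> None:
    """Hypothetical hook: update the traffic split, e.g. via a
    service-mesh route or load-balancer weight."""

def progressive_rollout(steps=(1, 5, 25, 50, 100), soak_seconds=300) -> bool:
    """Walk the canary through each traffic step, rolling back on failure."""
    for percent in steps:
        set_canary_weight(percent)
        time.sleep(soak_seconds)      # let metrics accumulate at this step
        if not canary_is_healthy():
            set_canary_weight(0)      # roll back: all traffic to stable
            return False
    return True                       # canary reached 100%: promoted
```

The `soak_seconds` pause matters: advancing before enough requests have hit the canary means the health check is deciding on noise.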
Traffic Splitting
```yaml
# Istio canary traffic split
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
  - order-service
  http:
  - route:
    - destination:
        host: order-service
        subset: stable
      weight: 95
    - destination:
        host: order-service
        subset: canary
      weight: 5
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: order-service
spec:
  host: order-service
  subsets:
  - name: stable
    labels:
      version: v1.2.3
  - name: canary
    labels:
      version: v1.2.4
```
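Promoting the canary with this setup means repeatedly rewriting the two `weight` fields. A hedged sketch of a helper that builds the corresponding patch body; `build_weight_patch` is illustrative, not part of Istio, and actually applying it would go through a Kubernetes client or `kubectl patch`, omitted here.

```python
def build_weight_patch(canary_weight: int) -> dict:
    """Build a merge-patch body that shifts traffic between the
    stable and canary subsets of the VirtualService above."""
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary weight must be between 0 and 100")
    return {
        "spec": {
            "http": [{
                "route": [
                    {"destination": {"host": "order-service", "subset": "stable"},
                     "weight": 100 - canary_weight},
                    {"destination": {"host": "order-service", "subset": "canary"},
                     "weight": canary_weight},
                ]
            }]
        }
    }

# e.g. advance the canary to 25% of traffic:
patch = build_weight_patch(25)
```

Keeping the two weights derived from a single number guarantees they always sum to 100, which Istio requires for a route.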
Flagger (Automated Canary)
```yaml
# Flagger canary analysis
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: order-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  progressDeadlineSeconds: 600
  service:
    port: 8080
    trafficPolicy:
      tls:
        mode: ISTIO_MUTUAL
  analysis:
    interval: 1m
    threshold: 5       # max failed checks before rollback
    maxWeight: 50      # max canary traffic 50%
    stepWeight: 10     # increase by 10% each interval
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99        # 99% success rate minimum
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 200       # p99 latency < 200ms
      interval: 1m
    webhooks:
    - name: load-test
      url: http://flagger-loadtester/
      metadata:
        type: cmd
        cmd: "hey -z 1m -q 10 http://order-service-canary:8080/"
```
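The interaction between `interval`, `stepWeight`, `maxWeight`, and `threshold` can be approximated as a simple state machine: every interval, a passing metric check advances the traffic weight one step, a failing check increments a failure counter, and hitting the failure threshold triggers rollback. This is a simplified model for intuition only; Flagger's real controller has more states (initializing, promoting, finalising), so treat it as a sketch.

```python
def run_analysis(check_results, step_weight=10, max_weight=50, threshold=5):
    """Simplified model of an automated canary analysis loop.

    check_results: one bool per interval (True = all metric checks passed).
    Returns a (status, canary_weight) tuple.
    """
    weight, failed = 0, 0
    for passed in check_results:
        if passed:
            weight = min(weight + step_weight, max_weight)
            if weight == max_weight:
                return "promoted", weight    # analysis done: promote canary
        else:
            failed += 1
            if failed >= threshold:
                return "rolled_back", 0      # traffic returns to stable
    return "in_progress", weight
```

With the values from the manifest above, five consecutive passing intervals reach 50% and promote; five failed checks at any point roll back.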
Canary Metrics to Monitor
Primary (automated decision):
- Error rate (HTTP 5xx) — must not increase
- Latency p50, p95, p99 — must not degrade
- Request success rate — must stay above threshold
Secondary (human review):
- CPU/memory usage — resource regression
- Downstream error rate — cascading failures
- Business metrics — conversion rate, revenue per request
- Log error volume — new error patterns
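As an illustration of the primary metrics, a minimal sketch computing error rate and a naive p99 from raw counts and samples. In production these values come from a metrics backend such as Prometheus (typically via histograms, not raw samples); the function names here are illustrative.

```python
def error_rate(total_requests: int, errors_5xx: int) -> float:
    """Fraction of requests that returned HTTP 5xx."""
    return errors_5xx / total_requests if total_requests else 0.0

def p99(latencies_ms: list) -> float:
    """Naive p99 over raw latency samples (real systems use histograms)."""
    s = sorted(latencies_ms)
    return s[min(len(s) - 1, int(0.99 * len(s)))]

# Compare the canary against stable, not against an absolute number alone:
stable_err = error_rate(10_000, 20)   # 0.2%
canary_err = error_rate(500, 6)       # 1.2%: six times worse, despite
                                      # only six absolute errors
```

The comparison at the end is the point: a canary taking 5% of traffic produces few absolute errors, so only the relative rate against stable reveals the regression.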
Rollback Decision
```python
from typing import NamedTuple

class Metrics(NamedTuple):
    error_rate: float     # fraction of requests returning 5xx
    latency_p99: float    # milliseconds
    success_rate: float   # fraction of successful requests

class Thresholds(NamedTuple):
    max_latency_ms: float
    min_success_rate: float

class RollbackDecision(NamedTuple):
    action: str
    reason: str = ""

def should_rollback(canary_metrics, stable_metrics, thresholds):
    """Automated rollback decision."""
    checks = [
        # Error rate: canary must not be worse than stable (5% tolerance)
        canary_metrics.error_rate <= stable_metrics.error_rate * 1.05,
        # Latency: canary p99 must not exceed the absolute threshold
        canary_metrics.latency_p99 <= thresholds.max_latency_ms,
        # Success rate: must stay above the minimum
        canary_metrics.success_rate >= thresholds.min_success_rate,
    ]
    if not all(checks):
        return RollbackDecision(
            action="ROLLBACK",
            reason=f"Canary failed: error_rate={canary_metrics.error_rate:.2%}",
        )
    return RollbackDecision(action="CONTINUE")
```
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No automated rollback | Human must notice and react | Flagger/Argo Rollouts automated analysis |
| Canary too large (25%+) | Too much blast radius | Start at 1-5%, increase gradually |
| Wrong metrics | Canary looks fine but UX degraded | Include business metrics, not just infra |
| No baseline comparison | Alert on absolute thresholds only | Compare canary vs stable in real-time |
| Skip canary for “small” changes | Small change causes big outage | Every change goes through canary |
Canary deployments are the gold standard for safe production releases. The cost of a slower rollout is almost always less than the cost of a full production incident.