Canary Deployments
Deploy with confidence using canary releases that route a fraction of traffic to new versions. Covers traffic splitting, metrics comparison, automated rollback, progressive delivery controllers, and the patterns that catch regressions before they reach all users.
A canary deployment routes a small percentage of production traffic to a new version while the rest continues to use the stable version. If the canary shows problems — higher error rates, increased latency, degraded business metrics — it is automatically rolled back before the impact spreads. This is one of the safest ways to deploy changes to production.
Canary vs Blue-Green vs Rolling
Blue-Green:
- 100% of traffic flips from old → new instantly
- Risk: if the new version has bugs, all users are impacted

Rolling:
- Gradually replaces old instances with new ones
- Risk: mixed versions serve traffic during the rollout, and rollback is harder

Canary:
- 1% → 5% → 25% → 50% → 100% (progressive)
- Risk: contained; only the canary percentage of traffic is impacted, and rollback is automated
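The progressive sequence above can be sketched as a control loop: shift a step of traffic, let metrics accumulate, and either advance or roll back. This is an illustrative sketch, not any specific tool's API; `set_canary_weight` and `canary_is_healthy` are hypothetical hooks into the traffic-split layer and the metrics backend.

```python
import time

def canary_is_healthy() -> bool:
    """Hypothetical hook: query the metrics backend and compare
    canary vs stable (error rate, latency, success rate)."""
    return True

def set_canary_weight(percent: int) -> None:
    """Hypothetical hook: update the traffic split, e.g. via a
    service-mesh route or load-balancer weight."""

def progressive_rollout(steps=(1, 5, 25, 50, 100), soak_seconds=300) -> bool:
    """Walk the canary through each traffic step, rolling back on failure."""
    for percent in steps:
        set_canary_weight(percent)
        time.sleep(soak_seconds)      # let metrics accumulate at this step
        if not canary_is_healthy():
            set_canary_weight(0)      # roll back: all traffic to stable
            return False
    return True                       # canary reached 100%: promoted
```

The `soak_seconds` pause matters: advancing before enough requests have hit the canary means the health check is deciding on noise.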
Traffic Splitting
```yaml
# Istio canary traffic split
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
  - order-service
  http:
  - route:
    - destination:
        host: order-service
        subset: stable
      weight: 95
    - destination:
        host: order-service
        subset: canary
      weight: 5
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: order-service
spec:
  host: order-service
  subsets:
  - name: stable
    labels:
      version: v1.2.3
  - name: canary
    labels:
      version: v1.2.4
```
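Promoting the canary with this setup means repeatedly rewriting the two `weight` fields. A hedged sketch of a helper that builds the corresponding patch body; `build_weight_patch` is illustrative, not part of Istio, and actually applying it would go through a Kubernetes client or `kubectl patch`, omitted here.

```python
def build_weight_patch(canary_weight: int) -> dict:
    """Build a merge-patch body that shifts traffic between the
    stable and canary subsets of the VirtualService above."""
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary weight must be between 0 and 100")
    return {
        "spec": {
            "http": [{
                "route": [
                    {"destination": {"host": "order-service", "subset": "stable"},
                     "weight": 100 - canary_weight},
                    {"destination": {"host": "order-service", "subset": "canary"},
                     "weight": canary_weight},
                ]
            }]
        }
    }

# e.g. advance the canary to 25% of traffic:
patch = build_weight_patch(25)
```

Keeping the two weights derived from a single number guarantees they always sum to 100, which Istio requires for a route.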
Flagger (Automated Canary)
```yaml
# Flagger canary analysis
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: order-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  progressDeadlineSeconds: 600
  service:
    port: 8080
    trafficPolicy:
      tls:
        mode: ISTIO_MUTUAL
  analysis:
    interval: 1m
    threshold: 5       # max failed checks before rollback
    maxWeight: 50      # max canary traffic 50%
    stepWeight: 10     # increase by 10% each interval
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99        # 99% success rate minimum
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 200       # p99 latency < 200ms
      interval: 1m
    webhooks:
    - name: load-test
      url: http://flagger-loadtester/
      metadata:
        type: cmd
        cmd: "hey -z 1m -q 10 http://order-service-canary:8080/"
```
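The interaction between `interval`, `stepWeight`, `maxWeight`, and `threshold` can be approximated as a simple state machine: every interval, a passing metric check advances the traffic weight one step, a failing check increments a failure counter, and hitting the failure threshold triggers rollback. This is a simplified model for intuition only; Flagger's real controller has more states (initializing, promoting, finalising), so treat it as a sketch.

```python
def run_analysis(check_results, step_weight=10, max_weight=50, threshold=5):
    """Simplified model of an automated canary analysis loop.

    check_results: one bool per interval (True = all metric checks passed).
    Returns a (status, canary_weight) tuple.
    """
    weight, failed = 0, 0
    for passed in check_results:
        if passed:
            weight = min(weight + step_weight, max_weight)
            if weight == max_weight:
                return "promoted", weight    # analysis done: promote canary
        else:
            failed += 1
            if failed >= threshold:
                return "rolled_back", 0      # traffic returns to stable
    return "in_progress", weight
```

With the values from the manifest above, five consecutive passing intervals reach 50% and promote; five failed checks at any point roll back.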
Canary Metrics to Monitor
Primary (automated decision):
- Error rate (HTTP 5xx) — must not increase
- Latency p50, p95, p99 — must not degrade
- Request success rate — must stay above threshold
Secondary (human review):
- CPU/memory usage — resource regression
- Downstream error rate — cascading failures
- Business metrics — conversion rate, revenue per request
- Log error volume — new error patterns
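As an illustration of the primary metrics, a minimal sketch computing error rate and a naive p99 from raw counts and samples. In production these values come from a metrics backend such as Prometheus (typically via histograms, not raw samples); the function names here are illustrative.

```python
def error_rate(total_requests: int, errors_5xx: int) -> float:
    """Fraction of requests that returned HTTP 5xx."""
    return errors_5xx / total_requests if total_requests else 0.0

def p99(latencies_ms: list) -> float:
    """Naive p99 over raw latency samples (real systems use histograms)."""
    s = sorted(latencies_ms)
    return s[min(len(s) - 1, int(0.99 * len(s)))]

# Compare the canary against stable, not against an absolute number alone:
stable_err = error_rate(10_000, 20)   # 0.2%
canary_err = error_rate(500, 6)       # 1.2%: six times worse, despite
                                      # only six absolute errors
```

The comparison at the end is the point: a canary taking 5% of traffic produces few absolute errors, so only the relative rate against stable reveals the regression.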
Rollback Decision
```python
from typing import NamedTuple

class Metrics(NamedTuple):
    error_rate: float     # fraction of requests returning 5xx
    latency_p99: float    # milliseconds
    success_rate: float   # fraction of successful requests

class Thresholds(NamedTuple):
    max_latency_ms: float
    min_success_rate: float

class RollbackDecision(NamedTuple):
    action: str
    reason: str = ""

def should_rollback(canary_metrics, stable_metrics, thresholds):
    """Automated rollback decision."""
    checks = [
        # Error rate: canary must not be worse than stable (5% tolerance)
        canary_metrics.error_rate <= stable_metrics.error_rate * 1.05,
        # Latency: canary p99 must not exceed the absolute threshold
        canary_metrics.latency_p99 <= thresholds.max_latency_ms,
        # Success rate: must stay above the minimum
        canary_metrics.success_rate >= thresholds.min_success_rate,
    ]
    if not all(checks):
        return RollbackDecision(
            action="ROLLBACK",
            reason=f"Canary failed: error_rate={canary_metrics.error_rate:.2%}",
        )
    return RollbackDecision(action="CONTINUE")
```
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No automated rollback | Human must notice and react | Flagger/Argo Rollouts automated analysis |
| Canary too large (25%+) | Too much blast radius | Start at 1-5%, increase gradually |
| Wrong metrics | Canary looks fine but UX degraded | Include business metrics, not just infra |
| No baseline comparison | Alert on absolute thresholds only | Compare canary vs stable in real-time |
| Skip canary for “small” changes | Small change causes big outage | Every change goes through canary |
Canary deployments are the gold standard for safe production releases. The cost of a slower rollout is almost always less than the cost of a full production incident.