Verified by Garnet Grid

Canary Deployments

Deploy with confidence using canary releases that route a fraction of traffic to new versions. Covers traffic splitting, metrics comparison, automated rollback, progressive delivery controllers, and the patterns that catch regressions before they reach all users.

A canary deployment routes a small percentage of production traffic to a new version while the rest continues to use the stable version. If the canary shows problems, such as higher error rates, increased latency, or degraded business metrics, it is automatically rolled back before the impact spreads. Because only the canary's share of traffic is ever exposed to a bad release, this is among the lowest-risk ways to deploy changes to production.


Canary vs Blue-Green vs Rolling

Blue-Green:
  100% traffic flips from old → new instantly
  Risk: If new version has bugs, 100% impacted
  
Rolling:
  Gradually replaces old instances with new
  Risk: Mixed versions during rollout, harder to roll back
  
Canary:
  1% → 5% → 25% → 50% → 100% (progressive)
  Risk: Contained. Only canary % impacted. Auto-rollback.
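
To make the difference in blast radius concrete, here is a minimal sketch that estimates how much traffic is hitting the new version at the moment a bug is detected. The function name `exposed_percent` and its parameters are hypothetical, and the rolling case assumes load is spread evenly across instances.

```python
# Canary schedule taken from the comparison above (percent of traffic per step).
CANARY_STEPS = [1, 5, 25, 50, 100]


def exposed_percent(strategy: str, detected_at_step: int = 0,
                    replaced_fraction: float = 0.0) -> float:
    """Return the percent of traffic on the new version when a bug is detected.

    detected_at_step: index into CANARY_STEPS (canary only).
    replaced_fraction: share of instances already replaced (rolling only).
    """
    if strategy == "blue-green":
        return 100.0  # all traffic flips at once
    if strategy == "rolling":
        # Assumption: every instance receives an equal share of requests.
        return 100.0 * replaced_fraction
    if strategy == "canary":
        return float(CANARY_STEPS[detected_at_step])
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    print(exposed_percent("blue-green"))
    print(exposed_percent("rolling", replaced_fraction=0.25))
    print(exposed_percent("canary", detected_at_step=0))
```

A bug caught at the first canary step touches 1% of traffic, whereas the same bug under blue-green touches all of it until someone flips back.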

Traffic Splitting

# Istio canary traffic split
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
    - order-service
  http:
    - route:
        - destination:
            host: order-service
            subset: stable
          weight: 95
        - destination:
            host: order-service
            subset: canary
          weight: 5
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: order-service
spec:
  host: order-service
  subsets:
    - name: stable
      labels:
        version: v1.2.3
    - name: canary
      labels:
        version: v1.2.4
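
As a rough illustration of what a 95/5 weighted split means per request, the sketch below picks a subset at random in proportion to its weight. It is a simplified model, not how the Istio proxy is implemented; `ROUTES`, `pick_subset`, and `simulate` are hypothetical names.

```python
import random
from collections import Counter

# Weights mirror the VirtualService above: 95% stable, 5% canary.
ROUTES = {"stable": 95, "canary": 5}


def pick_subset(rng: random.Random, routes: dict = ROUTES) -> str:
    """Choose a destination subset with probability proportional to its weight."""
    subsets = list(routes)
    weights = list(routes.values())
    return rng.choices(subsets, weights=weights, k=1)[0]


def simulate(requests: int, seed: int = 0) -> Counter:
    """Route `requests` simulated requests and count hits per subset."""
    rng = random.Random(seed)
    return Counter(pick_subset(rng) for _ in range(requests))


if __name__ == "__main__":
    counts = simulate(10_000)
    for subset, hits in counts.items():
        print(f"{subset}: {hits / 10_000:.1%}")
```

Because the split is probabilistic, a low-traffic service may send very few requests to the canary in a given interval, which is one reason the analysis interval and load test matter.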

Flagger (Automated Canary)

# Flagger canary analysis
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: order-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  
  progressDeadlineSeconds: 600
  
  service:
    port: 8080
    trafficPolicy:
      tls:
        mode: ISTIO_MUTUAL
  
  analysis:
    interval: 1m
    threshold: 5         # Max 5 failed checks before rollback
    maxWeight: 50        # Max canary traffic 50%
    stepWeight: 10       # Increase by 10% each interval
    
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99        # 99% success rate minimum
        interval: 1m
      
      - name: request-duration
        thresholdRange:
          max: 200       # P99 latency < 200ms
        interval: 1m
    
    webhooks:
      - name: load-test
        url: http://flagger-loadtester/
        metadata:
          type: cmd
          cmd: "hey -z 1m -q 10 http://order-service-canary:8080/"
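
The interaction between `stepWeight`, `maxWeight`, and `threshold` can be sketched as a small loop. This is a simplified model under stated assumptions, not Flagger's actual controller logic: `run_analysis` and the `check` callback are hypothetical, and the model treats a run that never finishes as a rollback.

```python
from typing import Callable, List, Tuple


def run_analysis(
    check: Callable[[int], bool],
    step_weight: int = 10,
    max_weight: int = 50,
    threshold: int = 5,
    max_intervals: int = 100,
) -> Tuple[str, List[int]]:
    """Simulate a progressive analysis; return (outcome, weight per interval).

    check(weight) should return True when all metric checks pass at that
    canary weight. Each loop iteration stands for one analysis interval.
    """
    weight = step_weight
    failed_checks = 0
    history: List[int] = []
    for _ in range(max_intervals):
        history.append(weight)
        if check(weight):
            if weight >= max_weight:
                return "PROMOTE", history
            weight = min(weight + step_weight, max_weight)
        else:
            failed_checks += 1
            if failed_checks >= threshold:
                return "ROLLBACK", history
    return "ROLLBACK", history  # analysis did not finish in time


if __name__ == "__main__":
    print(run_analysis(lambda w: True))    # healthy canary
    print(run_analysis(lambda w: w < 30))  # regression appears at 30% traffic
```

With the values from the manifest above, a healthy canary in this model passes through 10, 20, 30, 40, and 50 percent before promotion, while a canary that starts failing at 30% is held there until the fifth failed check triggers a rollback.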

Canary Metrics to Monitor

Primary (automated decision):
  - Error rate (HTTP 5xx) — must not increase
  - Latency p50, p95, p99 — must not degrade
  - Request success rate — must stay above threshold

Secondary (human review):
  - CPU/memory usage — resource regression
  - Downstream error rate — cascading failures
  - Business metrics — conversion rate, revenue per request
  - Log error volume — new error patterns
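
The primary metrics can be derived from raw request samples. The sketch below assumes each sample is a `(status_code, latency_ms)` pair and uses the nearest-rank method for percentiles; `MetricsSnapshot`, `percentile`, and `summarize` are hypothetical names, and a real setup would normally query these values from a metrics backend instead.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MetricsSnapshot:
    error_rate: float    # fraction of requests that returned HTTP 5xx
    success_rate: float  # fraction of requests that did not return HTTP 5xx
    latency_p50: float   # milliseconds
    latency_p95: float
    latency_p99: float


def percentile(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile of an ascending, non-empty list (pct is 1-100)."""
    n = len(sorted_values)
    rank = max(1, (pct * n + 99) // 100)  # integer ceiling of pct * n / 100
    return sorted_values[rank - 1]


def summarize(samples: List[Tuple[int, float]]) -> MetricsSnapshot:
    """Summarize (status_code, latency_ms) samples into the primary metrics."""
    if not samples:
        raise ValueError("no samples to summarize")
    total = len(samples)
    errors = sum(1 for status, _ in samples if 500 <= status <= 599)
    latencies = sorted(latency for _, latency in samples)
    return MetricsSnapshot(
        error_rate=errors / total,
        success_rate=(total - errors) / total,
        latency_p50=percentile(latencies, 50),
        latency_p95=percentile(latencies, 95),
        latency_p99=percentile(latencies, 99),
    )


if __name__ == "__main__":
    demo = [(200, float(ms)) for ms in range(1, 100)] + [(503, 100.0)]
    print(summarize(demo))
```

Computing the same snapshot for both the canary and the stable version over the same time window gives the two inputs an automated comparison needs.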

Rollback Decision

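from dataclasses import dataclass


@dataclass
class RollbackDecision:
    action: str        # "ROLLBACK" or "CONTINUE"
    reason: str = ""


def should_rollback(canary_metrics, stable_metrics, thresholds):
    """Automated rollback decision based on baseline and threshold checks."""
    checks = {
        # Error rate: canary may be at most 5% (relative) worse than stable
        "error_rate": canary_metrics.error_rate <= stable_metrics.error_rate * 1.05,
        # Latency: canary p99 must not exceed threshold
        "latency_p99": canary_metrics.latency_p99 <= thresholds.max_latency_ms,
        # Success rate: must stay above minimum
        "success_rate": canary_metrics.success_rate >= thresholds.min_success_rate,
    }
    failed = [name for name, passed in checks.items() if not passed]

    if failed:
        # Name every failed check so the reason is accurate for latency-only failures
        return RollbackDecision(
            action="ROLLBACK",
            reason=f"Canary failed checks: {', '.join(failed)} "
                   f"(error_rate={canary_metrics.error_rate:.2%})",
        )

    return RollbackDecision(action="CONTINUE")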
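
One edge case in a purely relative comparison such as `canary <= stable * 1.05`: when the stable error rate is exactly 0, a single canary error fails the check. A possible guard, sketched here as an assumption and not as part of the function above, is to allow the larger of a relative margin and a small absolute floor. The name `error_rate_regressed` and the default values are hypothetical.

```python
def error_rate_regressed(
    canary_rate: float,
    stable_rate: float,
    relative_margin: float = 0.05,
    absolute_floor: float = 0.001,
) -> bool:
    """Return True when the canary error rate is meaningfully worse than stable.

    The allowed rate is the larger of (stable rate plus a relative margin) and
    an absolute floor, so a stable rate of 0 does not make one error fatal.
    """
    allowed = max(stable_rate * (1 + relative_margin), absolute_floor)
    return canary_rate > allowed


if __name__ == "__main__":
    print(error_rate_regressed(0.0005, 0.0))   # below the floor
    print(error_rate_regressed(0.02, 0.01))    # double the stable rate
```

The floor trades a little sensitivity for fewer false rollbacks on low-traffic services; whether that trade is acceptable depends on the service.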

Anti-Patterns

| Anti-Pattern | Consequence | Fix |
| --- | --- | --- |
| No automated rollback | Human must notice and react | Flagger/Argo Rollouts automated analysis |
| Canary too large (25%+) | Too much blast radius | Start at 1-5%, increase gradually |
| Wrong metrics | Canary looks fine but UX degraded | Include business metrics, not just infra |
| No baseline comparison | Alert on absolute thresholds only | Compare canary vs stable in real-time |
| Skip canary for “small” changes | Small change causes big outage | Every change goes through canary |

Canary deployments are widely regarded as the gold standard for safe production releases. The cost of a slower rollout is usually far less than the cost of a full production incident.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
