
Kubernetes Operators: Custom Resource Management

Build and deploy Kubernetes operators for custom resource management. Covers operator patterns, controller architecture, CRD design, reconciliation loops, and production best practices.

Kubernetes operators extend the Kubernetes API to manage custom resources using the same reconciliation patterns that manage built-in resources like Deployments and Services. Instead of writing scripts to manage your database clusters, message queues, or ML pipelines, you encode that operational knowledge into an operator that runs inside the cluster and continuously reconciles desired state with actual state.

This guide covers the architecture, design patterns, and production considerations for building and deploying Kubernetes operators.


Why Operators

| Without Operators | With Operators |
|---|---|
| Manual runbooks for database failover | Automatic failover via reconciliation loop |
| Scripts for backup scheduling | CRD defines backup policy, operator executes |
| Helm charts that deploy but don’t manage | Operator continually manages lifecycle |
| Human on-call for scaling decisions | Operator auto-scales based on metrics |

Operator Architecture

┌─────────────────────────────────────────┐
│  Kubernetes API Server                   │
│  ┌─────────────┐  ┌──────────────────┐  │
│  │ Custom       │  │ Built-in          │  │
│  │ Resources    │  │ Resources         │  │
│  │ (CRDs)       │  │ (Pods, Services)  │  │
│  └──────┬──────┘  └──────────────────┘  │
└─────────┼───────────────────────────────┘
          │ Watch events
          ▼
┌─────────────────┐
│   Controller     │ ← Your operator code
│  ┌────────────┐  │
│  │ Reconcile  │  │ ← Triggered on every change
│  │ Loop       │  │
│  └────────────┘  │
│  ┌────────────┐  │
│  │ Business   │  │ ← Domain-specific logic
│  │ Logic      │  │
│  └────────────┘  │
└─────────────────┘
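
The path from watch events to the reconcile loop in the diagram can be sketched with a deduplicating queue of object keys. This is a minimal, standard-library illustration under stated assumptions, not the work queue that controller frameworks ship: the `keyQueue` type and the `staging/cache` key are hypothetical. It shows why a burst of events for one object leads to a single reconcile call, which then reads the current state.

```go
package main

import "fmt"

// keyQueue is a deduplicating queue of "namespace/name" keys.
// It is a simplified stand-in for a controller framework's work queue.
type keyQueue struct {
	pending map[string]bool
	order   []string
}

func newKeyQueue() *keyQueue {
	return &keyQueue{pending: make(map[string]bool)}
}

// Add enqueues a key unless it is already waiting to be processed.
func (q *keyQueue) Add(key string) {
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// Drain calls reconcile once per pending key, in insertion order,
// and returns how many reconcile calls were made.
func (q *keyQueue) Drain(reconcile func(key string)) int {
	calls := 0
	for len(q.order) > 0 {
		key := q.order[0]
		q.order = q.order[1:]
		delete(q.pending, key)
		reconcile(key)
		calls++
	}
	return calls
}

func main() {
	q := newKeyQueue()
	// Three watch events for the same object collapse into one queue entry.
	q.Add("production/orders-db")
	q.Add("production/orders-db")
	q.Add("staging/cache")
	q.Add("production/orders-db")

	n := q.Drain(func(key string) {
		fmt.Println("reconciling", key)
	})
	fmt.Println("reconcile calls:", n)
}
```

The queue carries only keys, never event payloads, so the reconcile function has to fetch the object itself before acting on it.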

Custom Resource Definition

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.garnet.io
spec:
  group: garnet.io
  names:
    kind: Database
    plural: databases
    singular: database
    shortNames: ["db"]
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
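      storage: true
      # Enable the /status subresource so status is written separately from spec
      subresources:
        status: {}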
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required: ["engine", "version", "storage"]
              properties:
                engine:
                  type: string
                  enum: ["postgres", "mysql", "redis"]
                version:
                  type: string
                storage:
                  type: string
                  pattern: "^[0-9]+(Gi|Ti)$"
                replicas:
                  type: integer
                  minimum: 1
                  maximum: 7
                  default: 1
                backup:
                  type: object
                  properties:
                    enabled:
                      type: boolean
                      default: true
                    schedule:
                      type: string
                      default: "0 2 * * *"
                    retention:
                      type: string
                      default: "7d"
            status:
              type: object
              properties:
                phase:
                  type: string
                  enum: ["Pending", "Creating", "Running", "Failed", "Deleting"]
                endpoint:
                  type: string
                readyReplicas:
                  type: integer
                lastBackup:
                  type: string
                  format: date-time
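
The API server enforces the schema above on every write, so an operator does not need to repeat these checks. To make the rules concrete, here is a minimal Go sketch that mirrors the `spec` constraints (engine enum, storage pattern, replica range, default replica count). The `DatabaseSpec` struct, `applyDefaults`, and `validateSpec` are hypothetical names used for illustration only.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// DatabaseSpec mirrors a subset of the spec fields in the CRD schema.
type DatabaseSpec struct {
	Engine   string
	Version  string
	Storage  string
	Replicas int
}

// storagePattern is the same pattern the schema declares for spec.storage.
var storagePattern = regexp.MustCompile(`^[0-9]+(Gi|Ti)$`)

var allowedEngines = map[string]bool{"postgres": true, "mysql": true, "redis": true}

// applyDefaults fills in the schema default for replicas.
// Simplification: a zero value is treated as "field not set".
func applyDefaults(s DatabaseSpec) DatabaseSpec {
	if s.Replicas == 0 {
		s.Replicas = 1
	}
	return s
}

// validateSpec returns an error describing the first rule the spec violates.
func validateSpec(s DatabaseSpec) error {
	if !allowedEngines[s.Engine] {
		return fmt.Errorf("engine %q is not one of postgres, mysql, redis", s.Engine)
	}
	if s.Version == "" {
		return errors.New("version is required")
	}
	if !storagePattern.MatchString(s.Storage) {
		return fmt.Errorf("storage %q does not match %s", s.Storage, storagePattern)
	}
	if s.Replicas < 1 || s.Replicas > 7 {
		return fmt.Errorf("replicas %d is outside the range 1-7", s.Replicas)
	}
	return nil
}

func main() {
	good := applyDefaults(DatabaseSpec{Engine: "postgres", Version: "16.2", Storage: "100Gi"})
	fmt.Println(good.Replicas, validateSpec(good))

	// "100GB" is rejected because the pattern only allows Gi and Ti suffixes.
	bad := DatabaseSpec{Engine: "postgres", Version: "16.2", Storage: "100GB", Replicas: 3}
	fmt.Println(validateSpec(bad))
}
```

A mirror like this can be handy in unit tests, but the CRD schema remains the source of truth.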

Custom Resource Instance

apiVersion: garnet.io/v1alpha1
kind: Database
metadata:
  name: orders-db
  namespace: production
spec:
  engine: postgres
  version: "16.2"
  storage: 100Gi
  replicas: 3
  backup:
    enabled: true
    schedule: "0 */6 * * *"
    retention: "30d"
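
The schema types `retention` as a plain string, so the operator has to interpret values such as "30d" itself. Go's `time.ParseDuration` does not accept a day unit, which makes a small parser necessary. The sketch below assumes the format is a positive whole number followed by `d`; the source does not specify the format, and `parseRetention` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseRetention converts a value such as "30d" into a time.Duration.
// Assumption: retention is a positive whole number of days followed by "d".
func parseRetention(s string) (time.Duration, error) {
	if !strings.HasSuffix(s, "d") {
		return 0, fmt.Errorf("retention %q must end in \"d\"", s)
	}
	n, err := strconv.Atoi(strings.TrimSuffix(s, "d"))
	if err != nil || n <= 0 {
		return 0, fmt.Errorf("retention %q must be a positive number of days", s)
	}
	return time.Duration(n) * 24 * time.Hour, nil
}

func main() {
	for _, value := range []string{"30d", "7d", "2w"} {
		d, err := parseRetention(value)
		fmt.Println(value, d, err)
	}
}
```

If the format matters to users, a `pattern` constraint on `retention` in the CRD schema would reject malformed values before they reach the operator.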

Reconciliation Loop

The reconcile function is the heart of every operator:

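import (
    "context"
    "time"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
    "sigs.k8s.io/controller-runtime/pkg/log"

    // Hypothetical module path for the generated Database API types.
    garnetv1alpha1 "example.com/database-operator/api/v1alpha1"
)

func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    logger := log.FromContext(ctx)

    // 1. Fetch the custom resource
    db := &garnetv1alpha1.Database{}
    if err := r.Get(ctx, req.NamespacedName, db); err != nil {
        if apierrors.IsNotFound(err) {
            return ctrl.Result{}, nil // Resource deleted, nothing to do
        }
        return ctrl.Result{}, err
    }

    // 2. Handle deletion (finalizers)
    if !db.DeletionTimestamp.IsZero() {
        return r.handleDeletion(ctx, db)
    }

    // 3. Ensure finalizer is set; the update triggers another reconcile
    if !controllerutil.ContainsFinalizer(db, finalizerName) {
        controllerutil.AddFinalizer(db, finalizerName)
        return ctrl.Result{}, r.Update(ctx, db)
    }

    // 4. Reconcile desired state; each handler returns its own requeue result
    switch db.Status.Phase {
    case "":
        return r.createDatabase(ctx, db)
    case "Pending", "Creating":
        return r.checkCreationStatus(ctx, db)
    case "Running":
        return r.ensureDesiredState(ctx, db)
    case "Failed":
        return r.handleFailure(ctx, db)
    }

    // 5. Unrecognized phase: log it and check again after an interval
    logger.Info("unrecognized phase, requeueing", "phase", db.Status.Phase)
    return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}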
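
Steps 2 and 3 of the reconcile function come down to set operations on the object's `metadata.finalizers` list plus one rule: remove the finalizer only after cleanup has succeeded. The following is a minimal sketch of that rule using only the standard library. The `object` type, the finalizer name, and the `cleanup` callback are hypothetical stand-ins, and a real operator would persist each change through the API server.

```go
package main

import (
	"errors"
	"fmt"
)

const finalizerName = "garnet.io/database-cleanup" // hypothetical finalizer name

var errCleanup = errors.New("external cleanup not finished")

// object holds the two metadata fields that matter for finalizer handling.
type object struct {
	Finalizers []string
	Deleting   bool // stands in for a non-zero metadata.deletionTimestamp
}

func containsFinalizer(o *object, name string) bool {
	for _, f := range o.Finalizers {
		if f == name {
			return true
		}
	}
	return false
}

// addFinalizer is idempotent: adding an existing finalizer is a no-op.
func addFinalizer(o *object, name string) {
	if !containsFinalizer(o, name) {
		o.Finalizers = append(o.Finalizers, name)
	}
}

func removeFinalizer(o *object, name string) {
	kept := make([]string, 0, len(o.Finalizers))
	for _, f := range o.Finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	o.Finalizers = kept
}

// reconcile registers the finalizer on live objects. On objects marked for
// deletion it runs cleanup and drops the finalizer only if cleanup succeeds.
func reconcile(o *object, cleanup func() error) error {
	if !o.Deleting {
		addFinalizer(o, finalizerName)
		return nil
	}
	if !containsFinalizer(o, finalizerName) {
		return nil
	}
	if err := cleanup(); err != nil {
		return fmt.Errorf("keeping finalizer: %w", err)
	}
	removeFinalizer(o, finalizerName)
	return nil
}

func main() {
	db := &object{}
	_ = reconcile(db, nil) // first pass registers the finalizer
	fmt.Println("finalizers:", db.Finalizers)

	db.Deleting = true
	err := reconcile(db, func() error { return errCleanup })
	fmt.Println("after failed cleanup:", db.Finalizers, err)

	err = reconcile(db, func() error { return nil })
	fmt.Println("after cleanup:", db.Finalizers, err)
}
```

While the finalizer is present, Kubernetes keeps the object around after a delete request, which is what gives the operator time to release external resources.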

Key Reconciliation Principles

  1. Idempotent: Running reconcile multiple times produces the same result
  2. Level-triggered: React to current state, not events (don’t assume what happened)
  3. Convergent: Always move toward desired state, regardless of current state
  4. Owns resources: Use OwnerReferences so child resources are garbage-collected
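
The first three principles can be illustrated with a small in-memory sketch. The `reconcile` function below looks only at the observed set of replicas and the desired count, applies the difference, and reports what it did. The map of replica names is a hypothetical stand-in for resources in the cluster; nothing here calls the Kubernetes API.

```go
package main

import (
	"fmt"
	"sort"
)

// reconcile brings observed in line with the desired replica count and
// returns the actions it took. It inspects only current state, never
// the events that led to it.
func reconcile(name string, desired int, observed map[string]bool) []string {
	var actions []string
	want := make(map[string]bool, desired)
	for i := 0; i < desired; i++ {
		replica := fmt.Sprintf("%s-%d", name, i)
		want[replica] = true
		if !observed[replica] {
			observed[replica] = true
			actions = append(actions, "create "+replica)
		}
	}
	for replica := range observed {
		if !want[replica] {
			delete(observed, replica)
			actions = append(actions, "delete "+replica)
		}
	}
	sort.Strings(actions) // stable output regardless of map iteration order
	return actions
}

func main() {
	// Start from an arbitrary state: one expected replica and one stray.
	observed := map[string]bool{"orders-db-0": true, "orders-db-7": true}

	fmt.Println(reconcile("orders-db", 3, observed)) // converges
	fmt.Println(reconcile("orders-db", 3, observed)) // idempotent: no actions
}
```

The second call returns no actions because the state already matches. That property is what makes it safe for a controller to re-run reconcile after restarts, resyncs, or duplicate events.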

Framework Selection

| Framework | Language | Best For |
|---|---|---|
| Operator SDK | Go, Ansible, Helm | Production operators, Kubernetes-native |
| Kubebuilder | Go | Custom APIs and controllers |
| Kopf | Python | Rapid prototyping, simpler operators |
| KUDO | YAML | Stateful service operators |
| Metacontroller | Any (webhooks) | Teams without Go expertise |

Production Best Practices

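Status Reporting

A status block for the orders-db resource might look like the following. The conditions, observedGeneration, and currentVersion fields are not declared in the CRD schema shown earlier; they must be added to the status schema, because the API server prunes fields that a structural schema does not declare.
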
status:
  phase: Running
  conditions:
    - type: Ready
      status: "True"
      lastTransitionTime: "2025-03-01T10:00:00Z"
    - type: BackupComplete
      status: "True"
      lastTransitionTime: "2025-03-01T02:15:00Z"
    - type: ReplicationHealthy
      status: "True"
      lastTransitionTime: "2025-03-01T10:00:00Z"
  observedGeneration: 3
  endpoint: "orders-db.production.svc.cluster.local:5432"
  readyReplicas: 3
  currentVersion: "16.2"
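
A detail worth getting right with conditions is that `lastTransitionTime` should change only when `status` changes, not on every reconcile. The sketch below shows that rule with a hypothetical `Condition` struct and `setCondition` helper built on the standard library. For the real `metav1.Condition` type, the apimachinery `meta` package offers `SetStatusCondition`, which is the usual choice in an operator.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a reduced, hypothetical version of a status condition.
type Condition struct {
	Type               string
	Status             string // "True", "False", or "Unknown"
	LastTransitionTime time.Time
}

// setCondition adds or updates a condition and reports whether the list
// changed. LastTransitionTime moves only when Status changes, so it records
// when the state last flipped rather than when it was last checked.
func setCondition(conds []Condition, typ, status string, now time.Time) ([]Condition, bool) {
	for i := range conds {
		if conds[i].Type != typ {
			continue
		}
		if conds[i].Status == status {
			return conds, false
		}
		conds[i].Status = status
		conds[i].LastTransitionTime = now
		return conds, true
	}
	return append(conds, Condition{Type: typ, Status: status, LastTransitionTime: now}), true
}

func main() {
	start := time.Date(2025, 3, 1, 10, 0, 0, 0, time.UTC)
	conds, _ := setCondition(nil, "Ready", "True", start)

	// Re-asserting the same status an hour later leaves the timestamp alone.
	conds, changed := setCondition(conds, "Ready", "True", start.Add(time.Hour))
	fmt.Println(changed, conds[0].LastTransitionTime.Format(time.RFC3339))
}
```

The returned flag also lets the operator skip a status write when nothing changed, which avoids needless API calls.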

RBAC Configuration

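apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: database-operator
rules:
  # The operator reads Database objects and updates them to manage finalizers;
  # it does not create or delete them.
  - apiGroups: ["garnet.io"]
    resources: ["databases"]
    verbs: ["get", "list", "watch", "update", "patch"]
  # Status is written through the status subresource only.
  - apiGroups: ["garnet.io"]
    resources: ["databases/status"]
    verbs: ["get", "update", "patch"]
  # Child resources that the operator creates and manages.
  - apiGroups: [""]
    resources: ["pods", "services", "persistentvolumeclaims", "secrets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["apps"]
    resources: ["statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]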

Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Edge-triggered logic | Missing events means missed state changes | Level-triggered: always reconcile from current state |
| No finalizers | Resources leak when CR is deleted | Add finalizers for cleanup logic |
| Unbounded reconciliation | Operator storms the API server | Exponential backoff, rate limiting |
| No status reporting | Users can’t tell what the operator is doing | Rich status with conditions, phases, events |
| God operator | One operator manages everything | One operator per domain, clear boundaries |
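
The fix for unbounded reconciliation, exponential backoff, can be sketched as a pure function. This is a minimal illustration: the `backoff` helper and the one-second base and five-minute limit used in `main` are assumptions, not values taken from any framework. controller-runtime already rate-limits requeues when Reconcile returns an error, so a helper like this mainly applies when the operator picks its own `RequeueAfter` delays.

```go
package main

import (
	"fmt"
	"time"
)

// backoff returns the delay before retry number `failures` (1-based).
// The delay doubles from base on each failure and never exceeds limit.
func backoff(failures int, base, limit time.Duration) time.Duration {
	if failures < 1 {
		return 0
	}
	delay := base
	for i := 1; i < failures; i++ {
		// Stop doubling once the next step would reach the limit;
		// this also avoids overflow for large failure counts.
		if delay >= limit/2 {
			return limit
		}
		delay *= 2
	}
	if delay > limit {
		return limit
	}
	return delay
}

func main() {
	for _, failures := range []int{1, 2, 3, 4, 10, 50} {
		fmt.Println(failures, backoff(failures, time.Second, 5*time.Minute))
	}
}
```

Capping the delay keeps a long-failing resource from being retried so rarely that recovery goes unnoticed.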

Checklist

  • CRD designed with proper validation and defaults
  • Reconcile loop is idempotent and level-triggered
  • Finalizers handle cleanup on deletion
  • OwnerReferences set on child resources
  • Status subresource reports phase, conditions, and progress
  • RBAC follows least privilege
  • Integration tests with envtest and end-to-end tests on a kind cluster
  • Metrics exposed for Prometheus (reconcile latency, errors)
  • Leader election enabled for HA deployment
  • Graceful degradation when dependencies unavailable

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For Kubernetes consulting, visit garnetgrid.com.
:::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
