Monitoring as Code: Version-Controlled Dashboards and Alerts
Define monitoring dashboards, alerting rules, and SLO configurations as code that is version-controlled, reviewed, and deployed through CI/CD. Covers Terraform for monitoring resources, Grafana-as-code, Prometheus alerting rules, and the workflow that prevents dashboard sprawl and undocumented alert changes.
Monitoring systems accumulate entropy. Someone creates a dashboard during an incident and never cleans it up. Another person tweaks an alert threshold at 2 AM and never documents it. After a year, you have 200 dashboards (30 of them actively used), 150 alert rules (40% of which generate false alarms nobody has the confidence to delete), and no record of why any of them exist.
Monitoring as code solves this by treating monitoring configuration the same way you treat infrastructure: defined in code, reviewed in pull requests, deployed through pipelines, and auditable through version history.
What to Codify
| Monitoring Artifact | Tool | Format |
|---|---|---|
| Dashboards | Grafana, Datadog | JSON (Grafana), Terraform (Datadog) |
| Alert rules | Prometheus, CloudWatch | YAML (PrometheusRule), Terraform |
| SLO definitions | Datadog, Nobl9, custom | YAML / Terraform |
| On-call schedules | PagerDuty, OpsGenie | Terraform |
| Notification channels | Slack, PagerDuty | Terraform |
| Synthetic monitors | Datadog, Checkly | Terraform / JavaScript |
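In practice, all of these artifacts live together in one version-controlled monitoring directory. An illustrative layout (directory and file names are assumptions, chosen to match the paths used in the examples below):
monitoring/
├── dashboards/          # Grafana dashboard JSON
│   └── checkout-api.json
├── alerts/              # Prometheus alerting rules
│   └── checkout-api.yml
├── slos/                # SLO definitions
│   └── checkout-api.yml
└── terraform/           # PagerDuty services, escalation policies, schedules
    └── pagerduty.tf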
Grafana Dashboards as Code
{
"dashboard": {
"title": "Checkout API — Production",
"uid": "checkout-api-production",
"tags": ["checkout", "production", "team:payments"],
"panels": [
{
"title": "Request Rate (req/s)",
"type": "timeseries",
"targets": [
{
"expr": "sum(rate(http_requests_total{service=\"checkout-api\", env=\"production\"}[5m]))",
"legendFormat": "Requests/s"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps",
"thresholds": {
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 800 },
{ "color": "red", "value": 1200 }
]
}
}
}
},
{
"title": "Error Rate (%)",
"type": "stat",
"targets": [
{
"expr": "sum(rate(http_requests_total{service=\"checkout-api\", status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"checkout-api\"}[5m])) * 100"
}
]
}
]
}
}
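The dashboard above hard-codes service="checkout-api" and env="production" into every query. To keep one definition from being copy-pasted per service (see Anti-Patterns below), the same JSON can declare Grafana template variables and reference them as $service and $env in the panel expressions. A minimal sketch, assuming a Prometheus data source:
{
  "templating": {
    "list": [
      {
        "name": "service",
        "label": "Service",
        "type": "query",
        "query": "label_values(http_requests_total, service)",
        "refresh": 2
      },
      {
        "name": "env",
        "label": "Environment",
        "type": "query",
        "query": "label_values(http_requests_total{service=\"$service\"}, env)",
        "refresh": 2
      }
    ]
  }
}
Panel queries then read http_requests_total{service="$service", env="$env"}, so one file covers every service and environment pair.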
Provisioning Workflow
# CI/CD pipeline for Grafana dashboards
name: Deploy Monitoring
on:
  push:
    branches: [main]
    paths:
      - 'monitoring/**'
jobs:
  deploy-dashboards:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Validate dashboard JSON
        run: |
          for f in monitoring/dashboards/*.json; do
            jq empty "$f" || exit 1
          done
      - name: Deploy to Grafana
        env:
          # assumes the Grafana URL and a service-account token are stored as repository secrets
          GRAFANA_URL: ${{ secrets.GRAFANA_URL }}
          GRAFANA_API_KEY: ${{ secrets.GRAFANA_API_KEY }}
        run: |
          for f in monitoring/dashboards/*.json; do
            # "overwrite": true lets re-deploys replace an existing uid; --fail makes the job fail on non-2xx responses
            jq '. + {overwrite: true}' "$f" | \
              curl --fail -X POST "$GRAFANA_URL/api/dashboards/db" \
                -H "Authorization: Bearer $GRAFANA_API_KEY" \
                -H "Content-Type: application/json" \
                -d @-
          done
Prometheus Alerting Rules as Code
# monitoring/alerts/checkout-api.yml
groups:
- name: checkout-api
rules:
- alert: CheckoutHighErrorRate
expr: |
sum(rate(http_requests_total{service="checkout-api", status=~"5.."}[5m]))
/ sum(rate(http_requests_total{service="checkout-api"}[5m]))
> 0.01
for: 5m
labels:
severity: critical
team: payments
annotations:
summary: "Checkout API error rate > 1%"
description: "Error rate is {{ $value | humanizePercentage }} over last 5m"
runbook: "https://wiki.internal/runbooks/checkout-high-error-rate"
- alert: CheckoutHighLatency
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{service="checkout-api"}[5m])) by (le)
) > 2.0
for: 10m
labels:
severity: warning
team: payments
annotations:
summary: "Checkout API p99 latency > 2 seconds"
runbook: "https://wiki.internal/runbooks/checkout-high-latency"
SLO Definitions as Code
# monitoring/slos/checkout-api.yml
slos:
  - name: "Checkout API Availability"
    service: checkout-api
    indicator:
      type: request-based
      good_events: 'http_requests_total{service="checkout-api", status!~"5.."}'
      total_events: 'http_requests_total{service="checkout-api"}'
    objective: 99.9   # 99.9% availability
    window: 30d       # Rolling 30-day window
    budget_alerts:
      - consumed: 50  # Alert when 50% of error budget consumed
        severity: warning
      - consumed: 80  # Alert when 80% consumed
        severity: critical
  - name: "Checkout API Latency"
    service: checkout-api
    indicator:
      type: request-based
      good_events: 'http_request_duration_seconds_bucket{service="checkout-api", le="1.0"}'
      total_events: 'http_request_duration_seconds_count{service="checkout-api"}'  # count from the same histogram as good_events
    objective: 99.5   # 99.5% of requests complete under 1 second
    window: 30d
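The budget_alerts thresholds above translate into Prometheus burn-rate alerts. A hedged sketch of a fast-burn rule for the 99.9% availability SLO: the 14.4x multiplier is the conventional fast-burn value, meaning an error rate that would exhaust a 30-day budget in roughly two days; the window and threshold choices here are assumptions.
# monitoring/alerts/checkout-api-slo.yml (illustrative)
groups:
  - name: checkout-api-slo
    rules:
      - alert: CheckoutErrorBudgetFastBurn
        expr: |
          sum(rate(http_requests_total{service="checkout-api", status=~"5.."}[1h]))
          / sum(rate(http_requests_total{service="checkout-api"}[1h]))
          > (14.4 * 0.001)  # 14.4x the 0.1% error budget of a 99.9% SLO
        for: 5m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "Checkout API is burning its 30-day error budget at more than 14x the sustainable rate"
          runbook: "https://wiki.internal/runbooks/checkout-high-error-rate"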
Infrastructure Monitoring with Terraform
# monitoring/terraform/pagerduty.tf
resource "pagerduty_service" "checkout_api" {
name = "Checkout API"
description = "Payment processing service"
escalation_policy = pagerduty_escalation_policy.engineering.id
alert_creation = "create_alerts_and_incidents"
auto_resolve_timeout = 14400 # 4 hours
acknowledgement_timeout = 1800 # 30 minutes
}
resource "pagerduty_escalation_policy" "engineering" {
name = "Engineering On-Call"
num_loops = 2
rule {
escalation_delay_in_minutes = 15
target {
type = "schedule_reference"
id = pagerduty_schedule.primary_oncall.id
}
}
rule {
escalation_delay_in_minutes = 15
target {
type = "schedule_reference"
id = pagerduty_schedule.secondary_oncall.id
}
}
}
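The escalation policy references pagerduty_schedule.primary_oncall and pagerduty_schedule.secondary_oncall, which would be defined in the same module. A minimal sketch of the primary schedule (the user email, time zone, and weekly rotation length are assumptions):
# monitoring/terraform/schedules.tf (illustrative)
data "pagerduty_user" "alice" {
  email = "alice@example.com"
}

resource "pagerduty_schedule" "primary_oncall" {
  name      = "Engineering Primary On-Call"
  time_zone = "America/New_York"

  layer {
    name                         = "Weekly rotation"
    start                        = "2024-01-01T09:00:00-05:00"
    rotation_virtual_start       = "2024-01-01T09:00:00-05:00"
    rotation_turn_length_seconds = 604800 # hand off once per week
    users                        = [data.pagerduty_user.alice.id]
  }
}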
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Manual dashboard edits | Changes are untracked, unreviewable | All changes via Git + CI/CD |
| Copy-paste dashboards | 10 nearly-identical dashboards drift apart | Templated dashboards with variables |
| Alert without runbook | On-call doesn’t know what to do | Every alert links to a runbook |
| Orphaned dashboards | Hundreds of dashboards, most unused | Tags with team ownership, quarterly cleanup |
| Threshold changes at 2 AM | Undocumented, often wrong | All threshold changes via PR |
Implementation Checklist
- Store all dashboard definitions in Git alongside application code
- Deploy dashboards and alerts through CI/CD (not manual UI edits)
- Validate dashboard JSON / alert YAML in CI before deploying
- Tag every dashboard and alert with owning team
- Require a linked runbook URL on every alert annotation
- Define SLOs as code with error budget burn rate alerts
- Manage PagerDuty/OpsGenie schedules and escalation policies via Terraform
- Template dashboards with variables (service, environment) to prevent copy-paste
- Run quarterly monitoring cleanup: delete unused dashboards, review alert noise
- Version alert threshold changes: every change is a PR with justification