
Monitoring as Code: Version-Controlled Dashboards and Alerts

Define monitoring dashboards, alerting rules, and SLO configurations as code that is version-controlled, reviewed, and deployed through CI/CD. Covers Terraform for monitoring resources, Grafana-as-code, Prometheus alerting rules, and the workflow that prevents dashboard sprawl and undocumented alert changes.

Monitoring systems accumulate entropy. Someone creates a dashboard during an incident and never cleans it up. Another person tweaks an alert threshold at 2 AM and never documents it. After a year, you have 200 dashboards (30 actively used), 150 alert rules (40% generate nothing but false alarms, yet nobody is confident enough to delete them), and no record of why any of them exist.

Monitoring as code solves this by treating monitoring configuration the same way you treat infrastructure: defined in code, reviewed in pull requests, deployed through pipelines, and auditable through version history.
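
In practice, this usually means a dedicated monitoring/ directory in the repository. One possible layout (illustrative; the sections below assume roughly these paths):

monitoring/
├── dashboards/     # Grafana dashboard JSON, one file per dashboard
├── alerts/         # Prometheus alerting rules
│   └── tests/      # promtool unit tests for those rules
├── slos/           # SLO definitions
└── terraform/      # PagerDuty schedules, escalation policies, notification channels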


What to Codify

| Monitoring Artifact   | Tool                   | Format                              |
|-----------------------|------------------------|-------------------------------------|
| Dashboards            | Grafana, Datadog       | JSON (Grafana), Terraform (Datadog) |
| Alert rules           | Prometheus, CloudWatch | YAML (PrometheusRule), Terraform    |
| SLO definitions       | Datadog, Nobl9, custom | YAML / Terraform                    |
| On-call schedules     | PagerDuty, OpsGenie    | Terraform                           |
| Notification channels | Slack, PagerDuty       | Terraform                           |
| Synthetic monitors    | Datadog, Checkly       | Terraform / JavaScript              |

Grafana Dashboards as Code
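
Each dashboard is a JSON file checked into Git. The payload below is shaped for Grafana's POST /api/dashboards/db endpoint; the "overwrite": true flag lets a pipeline update an existing dashboard in place instead of failing on redeploy.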

{
  "dashboard": {
    "title": "Checkout API — Production",
    "uid": "checkout-api-production",
    "tags": ["checkout", "production", "team:payments"],
    "panels": [
      {
        "title": "Request Rate (req/s)",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"checkout-api\", env=\"production\"}[5m]))",
            "legendFormat": "Requests/s"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "reqps",
            "thresholds": {
              "steps": [
                { "color": "green", "value": null },
                { "color": "yellow", "value": 800 },
                { "color": "red", "value": 1200 }
              ]
            }
          }
        }
      },
      {
        "title": "Error Rate (%)",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"checkout-api\", status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"checkout-api\"}[5m])) * 100"
          }
        ]
      }
    ]
  },
  "overwrite": true
}
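
Hard-coding the service name into every query is how copy-paste dashboards start. Grafana's templating block, added at the same level as panels inside the dashboard object, turns one dashboard into a template for many services (a sketch; the variable name is illustrative):

"templating": {
  "list": [
    {
      "name": "service",
      "type": "query",
      "datasource": "Prometheus",
      "query": "label_values(http_requests_total, service)",
      "refresh": 2
    }
  ]
}

Panel queries then filter on service="$service" instead of a literal service name.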

Provisioning Workflow

# CI/CD pipeline for Grafana dashboards
name: Deploy Monitoring

on:
  push:
    branches: [main]
    paths:
      - 'monitoring/**'

jobs:
  deploy-dashboards:
    runs-on: ubuntu-latest
    env:
      # Supplied as repository secrets; the token needs dashboard write access
      GRAFANA_URL: ${{ secrets.GRAFANA_URL }}
      GRAFANA_API_KEY: ${{ secrets.GRAFANA_API_KEY }}
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Validate dashboard JSON
        run: |
          for f in monitoring/dashboards/*.json; do
            jq empty "$f" || exit 1
          done

      - name: Deploy to Grafana
        run: |
          for f in monitoring/dashboards/*.json; do
            # --fail makes the step exit non-zero on HTTP errors
            curl --fail -sS -X POST "$GRAFANA_URL/api/dashboards/db" \
              -H "Authorization: Bearer $GRAFANA_API_KEY" \
              -H "Content-Type: application/json" \
              -d @"$f"
          done
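
Being valid JSON is a low bar; a dashboard can parse cleanly and still violate house conventions. A small jq gate in the same workflow can enforce, for example, that every dashboard has a stable uid and a team: tag (this step and its rules are illustrative additions, not part of the pipeline above):

      - name: Enforce dashboard conventions
        run: |
          for f in monitoring/dashboards/*.json; do
            jq -e '.dashboard.uid and ((.dashboard.tags // []) | any(startswith("team:")))' "$f" > /dev/null \
              || { echo "$f is missing a uid or a team: tag"; exit 1; }
          done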

Prometheus Alerting Rules as Code

# monitoring/alerts/checkout-api.yml
groups:
  - name: checkout-api
    rules:
      - alert: CheckoutHighErrorRate
        expr: |
          sum(rate(http_requests_total{service="checkout-api", status=~"5.."}[5m]))
          / sum(rate(http_requests_total{service="checkout-api"}[5m]))
          > 0.01
        for: 5m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "Checkout API error rate > 1%"
          description: "Error rate is {{ $value | humanizePercentage }} over last 5m"
          runbook: "https://wiki.internal/runbooks/checkout-high-error-rate"

      - alert: CheckoutHighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="checkout-api"}[5m])) by (le)
          ) > 2.0
        for: 10m
        labels:
          severity: warning
          team: payments
        annotations:
          summary: "Checkout API p99 latency > 2 seconds"
          runbook: "https://wiki.internal/runbooks/checkout-high-latency"

SLO Definitions as Code

# monitoring/slos/checkout-api.yml
slos:
  - name: "Checkout API Availability"
    service: checkout-api
    indicator:
      type: request-based
      good_events: 'http_requests_total{service="checkout-api", status!~"5.."}'
      total_events: 'http_requests_total{service="checkout-api"}'
    objective: 99.9     # 99.9% availability
    window: 30d         # Rolling 30-day window
    budget_alerts:
      - consumed: 50    # Alert when 50% of error budget consumed
        severity: warning
      - consumed: 80    # Alert when 80% consumed
        severity: critical

  - name: "Checkout API Latency"
    service: checkout-api
    indicator:
      type: request-based
      good_events: 'http_request_duration_seconds_bucket{service="checkout-api", le="1.0"}'
      total_events: 'http_requests_total{service="checkout-api"}'
    objective: 99.5     # 99.5% of requests under 1 second
    window: 30d
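
A generic schema like this is only useful if something compiles it into real alerts; tools such as Nobl9 or Sloth do exactly that. On plain Prometheus, the hand-written equivalent of the critical budget alert is a multiwindow burn-rate rule. A sketch for the availability SLO above, using the standard fast-burn parameters (14.4x burn, 1h and 5m windows) from the Google SRE Workbook (the rule name is illustrative):

# Fires when the 30-day error budget is burning ~14.4x too fast,
# i.e. ~2% of the budget consumed per hour at a 99.9% objective
- alert: CheckoutErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{service="checkout-api", status=~"5.."}[1h]))
        / sum(rate(http_requests_total{service="checkout-api"}[1h]))
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(http_requests_total{service="checkout-api", status=~"5.."}[5m]))
        / sum(rate(http_requests_total{service="checkout-api"}[5m]))
    ) > (14.4 * 0.001)
  labels:
    severity: critical
    team: payments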

Infrastructure Monitoring with Terraform

# monitoring/terraform/pagerduty.tf

resource "pagerduty_service" "checkout_api" {
  name                    = "Checkout API"
  description             = "Payment processing service"
  escalation_policy       = pagerduty_escalation_policy.engineering.id
  alert_creation          = "create_alerts_and_incidents"
  auto_resolve_timeout    = 14400  # 4 hours
  acknowledgement_timeout = 1800   # 30 minutes
}

resource "pagerduty_escalation_policy" "engineering" {
  name      = "Engineering On-Call"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.primary_oncall.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.secondary_oncall.id
    }
  }
}
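
The escalation policy references two schedules that must exist in the same configuration. A minimal sketch of the primary one (the start date, time zone, and user email are placeholders):

# monitoring/terraform/schedules.tf

data "pagerduty_user" "alice" {
  email = "alice@example.com"  # placeholder on-call engineer
}

resource "pagerduty_schedule" "primary_oncall" {
  name      = "Primary On-Call"
  time_zone = "America/New_York"

  layer {
    name                         = "Weekly rotation"
    start                        = "2024-01-01T09:00:00-05:00"
    rotation_virtual_start       = "2024-01-01T09:00:00-05:00"
    rotation_turn_length_seconds = 604800  # hand off weekly
    users                        = [data.pagerduty_user.alice.id]
  }
}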

Anti-Patterns

| Anti-Pattern               | Problem                                    | Fix                                       |
|----------------------------|--------------------------------------------|-------------------------------------------|
| Manual dashboard edits     | Changes are untracked and unreviewable     | All changes via Git + CI/CD               |
| Copy-paste dashboards      | 10 nearly-identical dashboards drift apart | Templated dashboards with variables       |
| Alert without runbook      | On-call doesn't know what to do            | Every alert links to a runbook            |
| Orphaned dashboards        | Hundreds of dashboards, most unused        | Tags with team ownership, quarterly cleanup |
| Threshold changes at 2 AM  | Undocumented, often wrong                  | All threshold changes via PR              |

Implementation Checklist

  • Store all dashboard definitions in Git alongside application code
  • Deploy dashboards and alerts through CI/CD (not manual UI edits)
  • Validate dashboard JSON / alert YAML in CI before deploying
  • Tag every dashboard and alert with owning team
  • Require a linked runbook URL on every alert annotation
  • Define SLOs as code with error budget burn rate alerts
  • Manage PagerDuty/OpsGenie schedules and escalation policies via Terraform
  • Template dashboards with variables (service, environment) to prevent copy-paste
  • Run quarterly monitoring cleanup: delete unused dashboards, review alert noise
  • Version alert threshold changes: every change is a PR with justification