Monitoring as Code: Version-Controlled Dashboards and Alerts
Define monitoring dashboards, alerting rules, and SLO configurations as code that is version-controlled, reviewed, and deployed through CI/CD. Covers Terraform for monitoring resources, Grafana-as-code, Prometheus alerting rules, and the workflow that prevents dashboard sprawl and undocumented alert changes.
Monitoring systems accumulate entropy. Someone creates a dashboard during an incident and never cleans it up. Another person tweaks an alert threshold at 2 AM and never documents it. After a year, you have 200 dashboards (30 of them actively used), 150 alert rules (40% of which generate false alarms nobody has the confidence to delete), and no record of why any of them exist.
Monitoring as code solves this by treating monitoring configuration the same way you treat infrastructure: defined in code, reviewed in pull requests, deployed through pipelines, and auditable through version history.
What to Codify
| Monitoring Artifact | Tool | Format |
|---|---|---|
| Dashboards | Grafana, Datadog | JSON (Grafana), Terraform (Datadog) |
| Alert rules | Prometheus, CloudWatch | YAML (PrometheusRule), Terraform |
| SLO definitions | Datadog, Nobl9, custom | YAML / Terraform |
| On-call schedules | PagerDuty, OpsGenie | Terraform |
| Notification channels | Slack, PagerDuty | Terraform |
| Synthetic monitors | Datadog, Checkly | Terraform / JavaScript |
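In practice, all of these artifacts live together in one version-controlled monitoring directory. An illustrative layout (directory and file names are assumptions, chosen to match the paths used in the examples below):
monitoring/
├── dashboards/          # Grafana dashboard JSON
│   └── checkout-api.json
├── alerts/              # Prometheus alerting rules
│   └── checkout-api.yml
├── slos/                # SLO definitions
│   └── checkout-api.yml
└── terraform/           # PagerDuty services, escalation policies, schedules
    └── pagerduty.tf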
Grafana Dashboards as Code
{
"dashboard": {
"title": "Checkout API — Production",
"uid": "checkout-api-production",
"tags": ["checkout", "production", "team:payments"],
"panels": [
{
"title": "Request Rate (req/s)",
"type": "timeseries",
"targets": [
{
"expr": "sum(rate(http_requests_total{service=\"checkout-api\", env=\"production\"}[5m]))",
"legendFormat": "Requests/s"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps",
"thresholds": {
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 800 },
{ "color": "red", "value": 1200 }
]
}
}
}
},
{
"title": "Error Rate (%)",
"type": "stat",
"targets": [
{
"expr": "sum(rate(http_requests_total{service=\"checkout-api\", status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"checkout-api\"}[5m])) * 100"
}
]
}
]
}
}
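The dashboard above hard-codes service="checkout-api" and env="production" into every query. To keep one definition from being copy-pasted per service (see Anti-Patterns below), the same JSON can declare Grafana template variables and reference them as $service and $env in the panel expressions. A minimal sketch, assuming a Prometheus data source:
{
  "templating": {
    "list": [
      {
        "name": "service",
        "label": "Service",
        "type": "query",
        "query": "label_values(http_requests_total, service)",
        "refresh": 2
      },
      {
        "name": "env",
        "label": "Environment",
        "type": "query",
        "query": "label_values(http_requests_total{service=\"$service\"}, env)",
        "refresh": 2
      }
    ]
  }
}
Panel queries then read http_requests_total{service="$service", env="$env"}, so one file covers every service and environment pair.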
Provisioning Workflow
# CI/CD pipeline for Grafana dashboards
name: Deploy Monitoring
on:
  push:
    branches: [main]
    paths:
      - 'monitoring/**'
jobs:
  deploy-dashboards:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Validate dashboard JSON
        run: |
          for f in monitoring/dashboards/*.json; do
            jq empty "$f" || exit 1
          done
      - name: Deploy to Grafana
        env:
          # assumes the Grafana URL and a service-account token are stored as repository secrets
          GRAFANA_URL: ${{ secrets.GRAFANA_URL }}
          GRAFANA_API_KEY: ${{ secrets.GRAFANA_API_KEY }}
        run: |
          for f in monitoring/dashboards/*.json; do
            # "overwrite": true lets re-deploys replace an existing uid; --fail makes the job fail on non-2xx responses
            jq '. + {overwrite: true}' "$f" | \
              curl --fail -X POST "$GRAFANA_URL/api/dashboards/db" \
                -H "Authorization: Bearer $GRAFANA_API_KEY" \
                -H "Content-Type: application/json" \
                -d @-
          done
Prometheus Alerting Rules as Code
# monitoring/alerts/checkout-api.yml
groups:
- name: checkout-api
rules:
- alert: CheckoutHighErrorRate
expr: |
sum(rate(http_requests_total{service="checkout-api", status=~"5.."}[5m]))
/ sum(rate(http_requests_total{service="checkout-api"}[5m]))
> 0.01
for: 5m
labels:
severity: critical
team: payments
annotations:
summary: "Checkout API error rate > 1%"
description: "Error rate is {{ $value | humanizePercentage }} over last 5m"
runbook: "https://wiki.internal/runbooks/checkout-high-error-rate"
- alert: CheckoutHighLatency
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{service="checkout-api"}[5m])) by (le)
) > 2.0
for: 10m
labels:
severity: warning
team: payments
annotations:
summary: "Checkout API p99 latency > 2 seconds"
runbook: "https://wiki.internal/runbooks/checkout-high-latency"
SLO Definitions as Code
# monitoring/slos/checkout-api.yml
slos:
  - name: "Checkout API Availability"
    service: checkout-api
    indicator:
      type: request-based
      good_events: 'http_requests_total{service="checkout-api", status!~"5.."}'
      total_events: 'http_requests_total{service="checkout-api"}'
    objective: 99.9   # 99.9% availability
    window: 30d       # Rolling 30-day window
    budget_alerts:
      - consumed: 50  # Alert when 50% of error budget consumed
        severity: warning
      - consumed: 80  # Alert when 80% consumed
        severity: critical
  - name: "Checkout API Latency"
    service: checkout-api
    indicator:
      type: request-based
      good_events: 'http_request_duration_seconds_bucket{service="checkout-api", le="1.0"}'
      total_events: 'http_request_duration_seconds_count{service="checkout-api"}'  # count from the same histogram as good_events
    objective: 99.5   # 99.5% of requests complete under 1 second
    window: 30d
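The budget_alerts thresholds above translate into Prometheus burn-rate alerts. A hedged sketch of a fast-burn rule for the 99.9% availability SLO: the 14.4x multiplier is the conventional fast-burn value, meaning an error rate that would exhaust a 30-day budget in roughly two days; the window and threshold choices here are assumptions.
# monitoring/alerts/checkout-api-slo.yml (illustrative)
groups:
  - name: checkout-api-slo
    rules:
      - alert: CheckoutErrorBudgetFastBurn
        expr: |
          sum(rate(http_requests_total{service="checkout-api", status=~"5.."}[1h]))
          / sum(rate(http_requests_total{service="checkout-api"}[1h]))
          > (14.4 * 0.001)  # 14.4x the 0.1% error budget of a 99.9% SLO
        for: 5m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "Checkout API is burning its 30-day error budget at more than 14x the sustainable rate"
          runbook: "https://wiki.internal/runbooks/checkout-high-error-rate"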
Infrastructure Monitoring with Terraform
# monitoring/terraform/pagerduty.tf
resource "pagerduty_service" "checkout_api" {
name = "Checkout API"
description = "Payment processing service"
escalation_policy = pagerduty_escalation_policy.engineering.id
alert_creation = "create_alerts_and_incidents"
auto_resolve_timeout = 14400 # 4 hours
acknowledgement_timeout = 1800 # 30 minutes
}
resource "pagerduty_escalation_policy" "engineering" {
name = "Engineering On-Call"
num_loops = 2
rule {
escalation_delay_in_minutes = 15
target {
type = "schedule_reference"
id = pagerduty_schedule.primary_oncall.id
}
}
rule {
escalation_delay_in_minutes = 15
target {
type = "schedule_reference"
id = pagerduty_schedule.secondary_oncall.id
}
}
}
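The escalation policy references pagerduty_schedule.primary_oncall and pagerduty_schedule.secondary_oncall, which would be defined in the same module. A minimal sketch of the primary schedule (the user email, time zone, and weekly rotation length are assumptions):
# monitoring/terraform/schedules.tf (illustrative)
data "pagerduty_user" "alice" {
  email = "alice@example.com"
}

resource "pagerduty_schedule" "primary_oncall" {
  name      = "Engineering Primary On-Call"
  time_zone = "America/New_York"

  layer {
    name                         = "Weekly rotation"
    start                        = "2024-01-01T09:00:00-05:00"
    rotation_virtual_start       = "2024-01-01T09:00:00-05:00"
    rotation_turn_length_seconds = 604800 # hand off once per week
    users                        = [data.pagerduty_user.alice.id]
  }
}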
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Manual dashboard edits | Changes are untracked, unreviewable | All changes via Git + CI/CD |
| Copy-paste dashboards | 10 nearly-identical dashboards drift apart | Templated dashboards with variables |
| Alert without runbook | On-call doesn’t know what to do | Every alert links to a runbook |
| Orphaned dashboards | Hundreds of dashboards, most unused | Tags with team ownership, quarterly cleanup |
| Threshold changes at 2 AM | Undocumented, often wrong | All threshold changes via PR |
Implementation Checklist
- Store all dashboard definitions in Git alongside application code
- Deploy dashboards and alerts through CI/CD (not manual UI edits)
- Validate dashboard JSON / alert YAML in CI before deploying
- Tag every dashboard and alert with owning team
- Require a linked runbook URL on every alert annotation
- Define SLOs as code with error budget burn rate alerts
- Manage PagerDuty/OpsGenie schedules and escalation policies via Terraform
- Template dashboards with variables (service, environment) to prevent copy-paste
- Run quarterly monitoring cleanup: delete unused dashboards, review alert noise
- Version alert threshold changes: every change is a PR with justification