Infrastructure as Code Testing Strategies
How to test infrastructure as code before deployment. Covers unit testing Terraform, policy-as-code with OPA, integration testing, drift detection, and CI/CD for infrastructure.
Infrastructure as Code (IaC) changed how we provision infrastructure. But most teams still deploy IaC changes without testing — they run terraform plan, eyeball the diff, and hope for the best. This is the equivalent of deploying application code without running tests. At scale, untested infrastructure changes cause outages that are harder to diagnose and longer to resolve than application bugs.
Testing IaC requires different strategies than testing application code. You can’t unit test a VPC the same way you unit test a function. But you can validate configurations, simulate plans, enforce policies, and integration test in ephemeral environments before anything touches production.
The IaC Testing Pyramid
| Layer | What It Tests | Speed | Cost |
|---|---|---|---|
| Static Analysis | Syntax, formatting, security rules | Seconds | Free |
| Unit Tests | Module logic, variable validation | Seconds | Free |
| Policy Tests | Compliance rules, guardrails | Seconds | Free |
| Plan Tests | Expected resource changes | Minutes | Free |
| Integration Tests | Actual infrastructure behavior | 10-30 min | Cloud costs |
| E2E Tests | Full stack deployment | 30-60 min | Cloud costs |
Layer 1: Static Analysis
Run these on every commit. They catch 40% of issues before any cloud API is called.
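# Terraform
terraform fmt -check -recursive
terraform init -backend=false   # validate needs providers and modules installed, but no remote state
terraform validate
tflint --recursive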
# Security scanning
tfsec .
checkov -d .
These tools catch:
- Insecure defaults (S3 bucket without encryption, security group open to 0.0.0.0/0)
- Syntax errors and deprecated features
- Missing required tags
- Resource naming convention violations
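As a rough illustration of wiring these checks into a pre-commit hook or CI step, the sketch below runs every command from the list above and reports all failures at once instead of stopping at the first one. The names `STATIC_CHECKS` and `run_static_checks` are hypothetical, and the sketch assumes the tools are installed and on `PATH`.

```python
import subprocess
from typing import Callable, Sequence

# The same commands listed in the section above.
STATIC_CHECKS: list[list[str]] = [
    ["terraform", "fmt", "-check", "-recursive"],
    ["terraform", "validate"],
    ["tflint", "--recursive"],
    ["tfsec", "."],
    ["checkov", "-d", "."],
]


def _exit_code(cmd: Sequence[str]) -> int:
    """Run one command and return its exit code (127 if the tool is missing)."""
    try:
        return subprocess.run(list(cmd)).returncode
    except FileNotFoundError:
        return 127


def run_static_checks(
    checks: Sequence[Sequence[str]] = STATIC_CHECKS,
    run: Callable[[Sequence[str]], int] = _exit_code,
) -> list[str]:
    """Run every check, even after a failure, and return the failed command lines."""
    failed = []
    for cmd in checks:
        if run(cmd) != 0:
            failed.append(" ".join(cmd))
    return failed
```

A CI step could call `run_static_checks()` and exit non-zero when the returned list is not empty. Collecting all failures gives the author one complete report per commit rather than a fix-and-retry loop.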
Layer 2: Policy-as-Code with OPA
Open Policy Agent (OPA) lets you write compliance rules that block non-compliant infrastructure before it’s created.
# policy/security.rego
# The package name must match the query path used below (data.terraform.deny)
package terraform

# Rego v1 syntax; works on OPA 0.59 and later
import rego.v1

# Deny public S3 buckets
deny contains msg if {
    resource := input.planned_values.root_module.resources[_]
    resource.type == "aws_s3_bucket"
    resource.values.acl == "public-read"
    msg := sprintf("S3 bucket '%s' cannot be public", [resource.address])
}

# Require encryption on all RDS instances
deny contains msg if {
    resource := input.planned_values.root_module.resources[_]
    resource.type == "aws_db_instance"
    not resource.values.storage_encrypted
    msg := sprintf("RDS instance '%s' must have encryption enabled", [resource.address])
}

# Enforce instance size limits
deny contains msg if {
    resource := input.planned_values.root_module.resources[_]
    resource.type == "aws_instance"
    allowed := {"t3.micro", "t3.small", "t3.medium", "t3.large", "t3.xlarge"}
    not allowed[resource.values.instance_type]
    msg := sprintf("Instance '%s' uses disallowed type '%s'",
        [resource.address, resource.values.instance_type])
}
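Save the plan, export it as JSON, then run OPA against the JSON. The file written by terraform plan -out is binary, so it has to go through terraform show -json first. The --fail-defined flag makes opa eval exit non-zero when any deny message is produced, which lets the pipeline fail on violations:
terraform plan -out=tfplan
terraform show -json tfplan > plan.json
opa eval --fail-defined --data policy/ --input plan.json "data.terraform.deny[msg]"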
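One limitation worth knowing: the rules above read only `input.planned_values.root_module.resources`, so resources declared inside modules are not checked. In the plan JSON those resources sit under nested `child_modules` entries. The Python sketch below shows that shape and a recursive traversal; a Rego policy could likely do the same with the built-in `walk`. The function names and the sample plan are made up for illustration.

```python
from typing import Any, Iterator


def iter_planned_resources(module: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield every resource in a module and, recursively, in its child modules."""
    yield from module.get("resources", [])
    for child in module.get("child_modules", []):
        yield from iter_planned_resources(child)


def unencrypted_rds(plan: dict[str, Any]) -> list[str]:
    """Return addresses of aws_db_instance resources without storage_encrypted."""
    root = plan.get("planned_values", {}).get("root_module", {})
    return [
        r["address"]
        for r in iter_planned_resources(root)
        if r.get("type") == "aws_db_instance"
        and not r.get("values", {}).get("storage_encrypted")
    ]


# Hand-written sample in the shape of `terraform show -json` output (not real plan data).
sample_plan = {
    "planned_values": {
        "root_module": {
            "resources": [
                {
                    "address": "aws_db_instance.ok",
                    "type": "aws_db_instance",
                    "values": {"storage_encrypted": True},
                }
            ],
            "child_modules": [
                {
                    "address": "module.db",
                    "resources": [
                        {
                            "address": "module.db.aws_db_instance.main",
                            "type": "aws_db_instance",
                            "values": {"storage_encrypted": False},
                        }
                    ],
                }
            ],
        }
    }
}

print(unencrypted_rds(sample_plan))  # prints ['module.db.aws_db_instance.main']
```

A root-only check would miss the violating instance in this sample because it lives in `module.db`.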
Layer 3: Integration Testing
For critical infrastructure, spin up real resources in an isolated test account, validate they work, then tear everything down.
# test_network.py (pytest, using Terratest-style patterns)
import json
import subprocess
from uuid import uuid4

import pytest

MODULE_DIR = "modules/network"


class TestNetworkModule:
    # Class scope: apply once, run all tests, then destroy once
    @pytest.fixture(autouse=True, scope="class")
    def setup_teardown(self):
        # Use the same unique environment name for apply and destroy
        env_var = f"-var=env=test-{uuid4().hex[:8]}"
        subprocess.run(["terraform", "init", "-input=false"], cwd=MODULE_DIR, check=True)
        try:
            # Apply infrastructure
            subprocess.run(["terraform", "apply", "-auto-approve", env_var],
                           cwd=MODULE_DIR, check=True)
            yield
        finally:
            # Destroy after the tests, even if apply failed partway through
            subprocess.run(["terraform", "destroy", "-auto-approve", env_var],
                           cwd=MODULE_DIR, check=True)

    def test_vpc_has_correct_cidr(self):
        output = subprocess.run(
            ["terraform", "output", "-json"],
            cwd=MODULE_DIR, capture_output=True, text=True, check=True,
        )
        outputs = json.loads(output.stdout)
        assert outputs["vpc_cidr"]["value"] == "10.0.0.0/16"

    def test_private_subnets_not_publicly_accessible(self):
        # Use AWS SDK to verify subnet routing
        pytest.skip("placeholder: subnet routing check not implemented in this example")
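The placeholder subnet test above could be filled in along these lines. This is a sketch under assumptions, not the author's implementation. It treats a subnet as public when any of its route tables has a route to an internet gateway, and it assumes route tables in the shape returned by boto3's `describe_route_tables`. Both function names are hypothetical.

```python
from typing import Any


def has_internet_gateway_route(route_tables: list[dict[str, Any]]) -> bool:
    """Return True if any route in the given route tables targets an internet gateway."""
    for table in route_tables:
        for route in table.get("Routes", []):
            gateway_id = str(route.get("GatewayId") or "")
            if gateway_id.startswith("igw-"):
                return True
    return False


def fetch_route_tables(subnet_id: str) -> list[dict[str, Any]]:
    """Fetch route tables explicitly associated with a subnet (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the pure check above has no AWS dependency

    ec2 = boto3.client("ec2")
    response = ec2.describe_route_tables(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )
    # Caveat: a subnet with no explicit association uses the VPC's main route table,
    # which this filter does not return; a full check would fall back to that table.
    return response["RouteTables"]
```

In the test, `assert not has_internet_gateway_route(fetch_route_tables(subnet_id))` would then run for each private subnet ID exposed by the module's outputs. Keeping the routing check separate from the AWS call means it can be verified against hand-written fixtures without cloud access.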
Cost control: Integration tests run in a dedicated test account with aggressive auto-cleanup. Set a maximum test duration (30 minutes) with automatic terraform destroy on timeout.
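One way to enforce the time limit is a small wrapper around the test run. It is needed because fixture teardown inside the test process never runs if that process is killed on timeout. The sketch below is illustrative only: `run_with_guaranteed_destroy` is a hypothetical helper, and the destroy command should carry the same `-var` flags that were used for apply.

```python
import subprocess
from typing import Callable, Sequence

MAX_TEST_SECONDS = 30 * 60  # the 30-minute ceiling mentioned above


def run_with_guaranteed_destroy(
    test_cmd: Sequence[str],
    workdir: str,
    destroy_cmd: Sequence[str] = ("terraform", "destroy", "-auto-approve"),
    timeout: float = MAX_TEST_SECONDS,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Run the test command under a time limit and always attempt a destroy afterwards.

    Returns True only if the test command finished in time with exit code 0.
    """
    try:
        result = run(list(test_cmd), cwd=workdir, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child process at this point.
        return False
    finally:
        # Runs on success, failure, and timeout alike.
        run(list(destroy_cmd), cwd=workdir, check=False)
```

If the tests already cleaned up after themselves, the extra destroy is a no-op against empty state. A scheduled sweep of the test account is still a sensible backstop in case the CI runner itself dies.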
Drift Detection
Infrastructure drift — when reality diverges from your IaC definitions — is the silent killer. Common causes: manual console changes, out-of-band scripts, and cloud provider auto-updates.
Detection Protocol:
# Run daily via CI/CD
terraform plan -detailed-exitcode
# Exit code 0: No changes (in sync)
# Exit code 1: Error
# Exit code 2: Changes detected (drift!)
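A scheduled job has to tell these three exit codes apart, because a plain pass/fail check would report drift (2) and a broken run (1) as the same failure. A minimal sketch of that mapping follows; `PlanStatus`, `classify_exit_code`, and `check_drift` are hypothetical names.

```python
import subprocess
from enum import Enum


class PlanStatus(Enum):
    IN_SYNC = 0
    ERROR = 1
    DRIFT = 2


def classify_exit_code(code: int) -> PlanStatus:
    """Map a `terraform plan -detailed-exitcode` exit code to a status."""
    try:
        return PlanStatus(code)
    except ValueError:
        # Anything outside 0/1/2 (for example, killed by a signal) is treated as an error.
        return PlanStatus.ERROR


def check_drift(workdir: str) -> PlanStatus:
    """Run the plan in `workdir` and classify the result."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
    )
    return classify_exit_code(proc.returncode)
```

Exit code 2 only means the plan is non-empty. It indicates drift when the job runs against configuration that has already been applied, such as the main branch after a successful deploy.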
When drift is detected:
- Alert the infrastructure team immediately
- Determine if drift is intentional (emergency fix) or accidental
- Either update IaC to match reality or revert the drift
- Document the root cause and prevent recurrence
CI/CD Pipeline for Infrastructure
Commit → Lint → Validate → Policy Check → Plan → Review → Apply → Verify
Non-negotiable rules:
- Plan output must be reviewed by a human before apply (for production)
- Policy check failures block the pipeline (no exceptions)
- Apply to staging before production (always)
- Keep plan and apply in the same pipeline run (prevent plan staleness)
- Lock state files during applies (prevent concurrent modifications)
The teams that test their infrastructure as rigorously as their application code deploy with confidence. Everyone else deploys with anxiety — and anxiety scales poorly.