How to Automate Infrastructure Testing
Test your infrastructure like application code. Covers Terraform testing, policy-as-code, drift detection, chaos engineering basics, and CI/CD integration.
“It works in staging” is the infrastructure equivalent of “it works on my machine.” Infrastructure testing prevents the 2 AM incident caused by a misconfigured security group, an accidentally deleted database, or a networking change that slipped through code review.
Application teams wouldn’t dream of deploying code without tests. Yet infrastructure, which has a far larger blast radius, is often deployed with nothing more than a manual terraform apply and a prayer. This guide covers four layers of infrastructure testing, in the order the sections below present them: static analysis, policy enforcement, integration testing, and drift detection.
The Infrastructure Testing Pyramid
         ┌─────────────┐
         │    Chaos    │      ← Production resilience
         │ Engineering │        (expensive, high value)
        ┌┴─────────────┴┐
        │  Integration  │     ← Deploy and verify
        │     Tests     │       (Terratest, real resources)
       ┌┴───────────────┴┐
       │   Policy-as-    │    ← Guardrails
       │      Code       │      (OPA, Sentinel)
      ┌┴─────────────────┴┐
      │      Static       │   ← Fast feedback
      │     Analysis      │     (validate, lint, scan)
      └───────────────────┘
Each layer catches different classes of issues. Static analysis catches syntax errors and known misconfigurations in seconds. Policy-as-code enforces organizational rules. Integration tests verify real infrastructure behavior. Chaos engineering validates resilience in production.
Layer 1: Static Analysis
Static analysis is the fastest feedback loop — it runs in seconds without deploying any infrastructure. Every PR should pass these checks before human review begins.
# Terraform validate: syntax and internal consistency
terraform init -backend=false     # Init without a backend (faster; enough for validate)
terraform validate                # Check syntax and references
terraform fmt -check -recursive   # Enforce consistent formatting

# Terraform plan with safety checks
# (plan needs a full `terraform init` when the configuration declares a backend)
terraform plan -out=plan.tfplan

# Automated plan analysis: alert on destructive changes
terraform show -json plan.tfplan | \
  jq -r '.resource_changes[] | select(.change.actions | index("delete")) | .address'
# ^^ If this outputs anything, a resource is being DESTROYED (or destroyed and recreated)
# Block the PR and require manual approval
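The same destructive-change check can be sketched in Go for teams that prefer a compiled gate over a jq one-liner. This is a minimal illustration, not part of the original pipeline: the `plan` struct mirrors only the `resource_changes[].address` and `change.actions` fields of `terraform show -json` output, and `destructiveAddresses` and `samplePlan` are hypothetical names with hand-written sample data.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// plan mirrors only the fields of `terraform show -json` output used here.
type plan struct {
	ResourceChanges []struct {
		Address string `json:"address"`
		Change  struct {
			Actions []string `json:"actions"`
		} `json:"change"`
	} `json:"resource_changes"`
}

// destructiveAddresses returns the address of every resource whose planned
// actions include "delete" (plain destroys and destroy-and-recreate replacements).
func destructiveAddresses(planJSON []byte) ([]string, error) {
	var p plan
	if err := json.Unmarshal(planJSON, &p); err != nil {
		return nil, fmt.Errorf("parse plan JSON: %w", err)
	}
	var out []string
	for _, rc := range p.ResourceChanges {
		for _, action := range rc.Change.Actions {
			if action == "delete" {
				out = append(out, rc.Address)
				break
			}
		}
	}
	return out, nil
}

// samplePlan is a hand-written, trimmed-down stand-in for real plan output.
const samplePlan = `{"resource_changes":[
  {"address":"aws_db_instance.main","change":{"actions":["delete"]}},
  {"address":"aws_instance.web","change":{"actions":["delete","create"]}},
  {"address":"aws_s3_bucket.logs","change":{"actions":["update"]}}
]}`

func main() {
	addrs, err := destructiveAddresses([]byte(samplePlan))
	if err != nil {
		panic(err)
	}
	fmt.Println(strings.Join(addrs, "\n"))
}
```

In CI, the JSON would come from `terraform show -json plan.tfplan` rather than the embedded sample, and a non-empty result would fail the job or request manual approval.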
Security Scanners
| Tool | What It Checks | Speed | Integration |
|---|---|---|---|
| tfsec | Terraform security misconfigurations | Very fast (seconds) | GitHub Actions, pre-commit hook |
| Checkov | Terraform, CloudFormation, Kubernetes | Fast (seconds) | GitHub Actions, Bridgecrew platform |
| Trivy | IaC, containers, filesystems | Fast | GitHub Actions, CI/CD |
| KICS | Terraform, Ansible, Docker, K8s | Fast | CI/CD |
| Infracost | Cost estimation from plan files | Fast | GitHub Actions (PR comment) |
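# Run multiple scanners in CI
tfsec ./terraform/ --minimum-severity HIGH
# Checkov severity thresholds need severity data from the Bridgecrew/Prisma Cloud
# platform (API key); without one, pass specific check IDs to --hard-fail-on instead
checkov -d ./terraform/ --framework terraform --hard-fail-on HIGH
infracost breakdown --path ./terraform/ --format json > cost.json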
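The `cost.json` report only helps if something reads it. As a rough sketch, a budget gate could look like the following. It assumes the Infracost JSON report exposes a top-level `totalMonthlyCost` field encoded as a decimal string; verify that against the report your Infracost version produces. `overBudget` and the 500 limit are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// overBudget reports whether the estimated monthly cost exceeds limit.
// ASSUMPTION: the report carries a top-level "totalMonthlyCost" string field.
func overBudget(report []byte, limit float64) (bool, float64, error) {
	var r struct {
		TotalMonthlyCost string `json:"totalMonthlyCost"`
	}
	if err := json.Unmarshal(report, &r); err != nil {
		return false, 0, fmt.Errorf("parse cost report: %w", err)
	}
	cost, err := strconv.ParseFloat(r.TotalMonthlyCost, 64)
	if err != nil {
		return false, 0, fmt.Errorf("parse totalMonthlyCost %q: %w", r.TotalMonthlyCost, err)
	}
	return cost > limit, cost, nil
}

func main() {
	// Hand-written sample; a real run would read cost.json from disk.
	sample := []byte(`{"totalMonthlyCost":"742.50"}`)
	over, cost, err := overBudget(sample, 500)
	if err != nil {
		panic(err)
	}
	fmt.Printf("estimated monthly cost %.2f, over budget: %t\n", cost, over)
}
```

Returning an error when the field is missing or unparsable keeps the gate from silently passing on a malformed report.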
Layer 2: Policy-as-Code
Policy-as-code allows you to codify organizational rules (no public S3 buckets, all resources must be tagged, no instances larger than m5.2xlarge) and enforce them automatically in the CI/CD pipeline.
OPA (Open Policy Agent) — Terraform Plan Validation
# policy/security.rego: OPA policy for Terraform plans
package terraform.security

# rego.v1 syntax (contains/if) works on OPA 0.59+ and is required by OPA 1.x
import rego.v1

# Deny public S3 buckets
deny contains msg if {
	resource := input.resource_changes[_]
	resource.type == "aws_s3_bucket"
	resource.change.after.acl in {"public-read", "public-read-write"}
	msg := sprintf("S3 bucket '%s' must not be public", [resource.address])
}

# Deny unencrypted instances
deny contains msg if {
	resource := input.resource_changes[_]
	resource.type == "aws_instance"
	resource.change.after != null # skip instances that are being destroyed
	not resource.change.after.root_block_device[0].encrypted
	msg := sprintf("EC2 '%s' must have encrypted root volume", [resource.address])
}

# Require tags on all taggable resources
deny contains msg if {
	resource := input.resource_changes[_]
	resource.change.after.tags
	not resource.change.after.tags.Environment
	msg := sprintf("Resource '%s' must have 'Environment' tag", [resource.address])
}

# Deny oversized instances (cost control)
deny contains msg if {
	resource := input.resource_changes[_]
	resource.type == "aws_instance"
	forbidden := {"m5.4xlarge", "m5.8xlarge", "m5.12xlarge", "m5.16xlarge", "m5.24xlarge"}
	forbidden[resource.change.after.instance_type]
	msg := sprintf("EC2 '%s' uses oversized instance type '%s' (requires approval)",
		[resource.address, resource.change.after.instance_type])
}
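# Run OPA in CI/CD pipeline
terraform plan -out=plan.tfplan
terraform show -json plan.tfplan > plan.json
opa eval --fail-defined --data policy/ --input plan.json "data.terraform.security.deny[_]"
# Plain `opa eval` exits 0 even when rules fire. With --fail-defined it exits
# non-zero whenever the query returns a result, so any deny message fails the pipeline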
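When the pipeline needs the violation messages themselves (for a PR comment or a log summary), the JSON that `opa eval` prints by default can be parsed directly. The sketch below assumes the default output shape, `result[].expressions[].value`, where the value is either the whole `deny` set or a single message. `denyMessages` and the sample payload are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// evalOutput mirrors the parts of `opa eval` JSON output used here.
type evalOutput struct {
	Result []struct {
		Expressions []struct {
			Value json.RawMessage `json:"value"`
		} `json:"expressions"`
	} `json:"result"`
}

// denyMessages collects every deny message from opa eval output. It accepts a
// value that is a list of strings (whole set) or a single string (one element).
func denyMessages(evalJSON []byte) ([]string, error) {
	var out evalOutput
	if err := json.Unmarshal(evalJSON, &out); err != nil {
		return nil, fmt.Errorf("parse opa output: %w", err)
	}
	var msgs []string
	for _, r := range out.Result {
		for _, e := range r.Expressions {
			var many []string
			if err := json.Unmarshal(e.Value, &many); err == nil {
				msgs = append(msgs, many...)
				continue
			}
			var one string
			if err := json.Unmarshal(e.Value, &one); err != nil {
				return nil, fmt.Errorf("unexpected deny value %s", e.Value)
			}
			msgs = append(msgs, one)
		}
	}
	return msgs, nil
}

func main() {
	// Hand-written sample of what a failing evaluation might print.
	sample := []byte(`{"result":[{"expressions":[{"value":[
	  "S3 bucket 'aws_s3_bucket.assets' must not be public",
	  "Resource 'aws_instance.web' must have 'Environment' tag"]}]}]}`)
	msgs, err := denyMessages(sample)
	if err != nil {
		panic(err)
	}
	for _, m := range msgs {
		fmt.Println("DENY:", m)
	}
}
```

A gate built on this would exit non-zero when the returned slice is non-empty.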
Common Policy Categories
| Category | Example Rules |
|---|---|
| Security | No public endpoints, encryption required, MFA on root |
| Cost | Max instance size, require spot for dev, reserved for prod |
| Compliance | Required tags, specific regions only, logging enabled |
| Operational | Naming conventions, backup policies, DR configuration |
Layer 3: Integration Testing (Terratest)
Integration tests deploy real infrastructure, verify it works correctly, then destroy it. They are slower (minutes) but catch issues that static analysis cannot — like networking misconfigurations, IAM permission errors, and service-to-service connectivity problems.
package test

import (
	"testing"
	"time"

	http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestWebServer(t *testing.T) {
	t.Parallel()

	opts := &terraform.Options{
		TerraformDir: "../modules/web-server",
		Vars: map[string]interface{}{
			"instance_type": "t3.micro",
			"environment":   "test",
		},
	}

	// ALWAYS clean up, even if the test fails
	defer terraform.Destroy(t, opts)
	terraform.InitAndApply(t, opts)

	// Verify the web server responds with status 200 and the body "OK"
	url := terraform.Output(t, opts, "url")
	http_helper.HttpGetWithRetry(t, url, nil, 200, "OK", 10, 5*time.Second)

	// Verify the module exposes a security group ID (existence only, not its rules)
	sgID := terraform.Output(t, opts, "security_group_id")
	assert.NotEmpty(t, sgID)

	// Verify the instance is in the private subnet.
	// NOTE: raw AWS subnet IDs look like "subnet-0abc..." and never contain
	// "private"; this check only works if the module's subnet_id output embeds
	// that word, so adapt it to what your module actually outputs.
	subnetID := terraform.Output(t, opts, "subnet_id")
	assert.Contains(t, subnetID, "private")
}
Terratest Best Practices
| Practice | Why | Implementation |
|---|---|---|
| Always use defer Destroy | Prevents resource leaks on test failure | First line after Options |
| Use parallel tests | Reduce total test time | t.Parallel() + unique resource names |
| Test in a dedicated AWS account | Prevent interference with real environments | Separate “testing” account |
| Use small instances | Minimize cost during testing | t3.micro/t3.small |
| Set timeouts | Prevent tests from running forever | go test -timeout 30m (the default is 10m); bound retries with terraform.Options{...MaxRetries: 3} |
| Test destruction | Verify clean teardown | Check resources are actually deleted |
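The "unique resource names" part of the parallel-tests practice is what keeps concurrent runs from colliding on names such as bucket or security group names. Terratest ships its own `random.UniqueId()` helper for this. The standard-library sketch below only illustrates the idea, and `uniqueName` is a hypothetical helper.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// uniqueName appends 8 random hex characters to prefix so that parallel test
// runs in the same account do not fight over the same resource name.
func uniqueName(prefix string) string {
	b := make([]byte, 4)
	if _, err := rand.Read(b); err != nil {
		panic(err) // no entropy available; nothing sensible to fall back to
	}
	return fmt.Sprintf("%s-%s", prefix, hex.EncodeToString(b))
}

func main() {
	// Each call yields a different suffix, e.g. to pass as a Terraform variable.
	fmt.Println(uniqueName("web-test"))
	fmt.Println(uniqueName("web-test"))
}
```

In a Terratest test, the generated name would typically be passed through `Vars` so the module uses it for every named resource.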
Layer 4: Drift Detection
Drift occurs when the actual infrastructure state diverges from what’s defined in code. Common causes: console changes, manual fixes during incidents, and automated processes (auto-scaling, self-healing).
# Terraform drift detection using detailed exit codes
terraform plan -detailed-exitcode
# Exit code 0 = no changes (infrastructure matches code)
# Exit code 1 = error (Terraform failed)
# Exit code 2 = changes detected (DRIFT!)
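A drift checker has to treat the three exit codes differently; in particular, exit code 1 is a failed check, not a clean one. A minimal Go sketch of that mapping follows. `classify`, `planExitCode`, and the status names are hypothetical, and `planExitCode` assumes a `terraform` binary on the PATH, so `main` only demonstrates the classification.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

type driftStatus string

const (
	inSync    driftStatus = "in-sync"
	planError driftStatus = "error"
	drifted   driftStatus = "drift"
)

// classify maps a `terraform plan -detailed-exitcode` exit code to a status.
// Anything other than 0 or 2 is treated as an error.
func classify(exitCode int) driftStatus {
	switch exitCode {
	case 0:
		return inSync
	case 2:
		return drifted
	default:
		return planError
	}
}

// planExitCode runs the plan in dir and returns Terraform's exit code.
// A non-nil error means the command could not be run at all.
func planExitCode(dir string) (int, error) {
	cmd := exec.Command("terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color")
	cmd.Dir = dir
	err := cmd.Run()
	if err == nil {
		return 0, nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode(), nil
	}
	return -1, err
}

func main() {
	for _, code := range []int{0, 1, 2} {
		fmt.Printf("exit %d => %s\n", code, classify(code))
	}
}
```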
Automated Drift Detection Workflow
# .github/workflows/drift-detection.yml
name: Infrastructure Drift Detection
on:
  schedule:
    - cron: '0 8 * * *' # Daily at 8 AM UTC
jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [staging, production]
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false # Call terraform directly so the shell sees its raw exit code
      - name: Terraform Init
        run: terraform init
        working-directory: terraform/${{ matrix.environment }}
      - name: Detect Drift
        id: drift
        run: |
          EXIT_CODE=0
          terraform plan -detailed-exitcode -input=false 2>&1 || EXIT_CODE=$?
          if [ "$EXIT_CODE" = "2" ]; then
            echo "drift_detected=true" >> "$GITHUB_OUTPUT"
            echo "⚠️ DRIFT DETECTED in ${{ matrix.environment }}"
            # Send alert to Slack/Teams
            curl -X POST -H 'Content-Type: application/json' "${{ secrets.SLACK_WEBHOOK }}" \
              -d "{\"text\":\"⚠️ Infrastructure drift detected in ${{ matrix.environment }}! Run terraform plan to review.\"}"
          elif [ "$EXIT_CODE" != "0" ]; then
            # Exit code 1 means the plan itself failed; do not report it as "no drift"
            echo "terraform plan failed with exit code $EXIT_CODE"
            exit "$EXIT_CODE"
          fi
        working-directory: terraform/${{ matrix.environment }}
How to Handle Drift
| Drift Type | Response | Example |
|---|---|---|
| Intentional (manual fix during incident) | Update IaC to match reality | Emergency security group change → add to Terraform |
| Accidental (console click mistake) | Reapply IaC to correct | Someone changed a setting in the console |
| Auto-scaling (expected variation) | Ignore in drift detection | Instance count changes within ASG bounds |
| External system (CloudFormation, CDK) | Exclude from detection or import | Resources managed by another tool |
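One way to apply the "ignore" and "exclude" rows is to filter the plan before alerting. The Go sketch below is an illustration only: it reads the `resource_changes` array from `terraform show -json` output and skips resource types that are expected to vary. `unexpectedDrift`, the ignore list, and the sample data are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// resourceChange mirrors the plan JSON fields this filter needs.
type resourceChange struct {
	Address string `json:"address"`
	Type    string `json:"type"`
	Change  struct {
		Actions []string `json:"actions"`
	} `json:"change"`
}

// hasRealAction reports whether actions contain anything besides no-op/read.
func hasRealAction(actions []string) bool {
	for _, a := range actions {
		if a != "no-op" && a != "read" {
			return true
		}
	}
	return false
}

// unexpectedDrift returns addresses of resources with pending changes,
// skipping any resource type listed in ignoredTypes.
func unexpectedDrift(planJSON []byte, ignoredTypes map[string]bool) ([]string, error) {
	var p struct {
		ResourceChanges []resourceChange `json:"resource_changes"`
	}
	if err := json.Unmarshal(planJSON, &p); err != nil {
		return nil, fmt.Errorf("parse plan JSON: %w", err)
	}
	var out []string
	for _, rc := range p.ResourceChanges {
		if ignoredTypes[rc.Type] || !hasRealAction(rc.Change.Actions) {
			continue
		}
		out = append(out, rc.Address)
	}
	return out, nil
}

// sampleDrift is hand-written sample data, not real Terraform output.
const sampleDrift = `{"resource_changes":[
  {"address":"aws_autoscaling_group.web","type":"aws_autoscaling_group","change":{"actions":["update"]}},
  {"address":"aws_security_group.web","type":"aws_security_group","change":{"actions":["update"]}},
  {"address":"aws_vpc.main","type":"aws_vpc","change":{"actions":["no-op"]}}
]}`

func main() {
	ignored := map[string]bool{"aws_autoscaling_group": true} // expected variation
	addrs, err := unexpectedDrift([]byte(sampleDrift), ignored)
	if err != nil {
		panic(err)
	}
	fmt.Println(addrs)
}
```

Ignoring a whole resource type is coarse. Terraform's own `lifecycle { ignore_changes = [...] }` setting is the more targeted way to exclude a single attribute, such as an auto-scaling group's desired capacity.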
CI/CD Pipeline for Infrastructure
The complete pipeline integrates all four testing layers:
# .github/workflows/infrastructure.yml
name: Infrastructure CI/CD
on:
  pull_request:
    paths: ['terraform/**']
  push:
    branches: [main]
    paths: ['terraform/**']
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform fmt -check -recursive
      - run: terraform init -backend=false
      - run: terraform validate
  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: tfsec
        uses: aquasecurity/tfsec-action@v1.0.0
      - name: checkov
        uses: bridgecrewio/checkov-action@v12
  policy-check:
    needs: [validate]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false # Keep wrapper text out of the redirected plan.json
      - uses: open-policy-agent/setup-opa@v2 # opa is not preinstalled on the runner
      - run: terraform init && terraform plan -out=plan.tfplan
      - run: terraform show -json plan.tfplan > plan.json
      # --fail-defined exits non-zero when any deny message is returned
      - run: opa eval --fail-defined --data policy/ --input plan.json "data.terraform.security.deny[_]"
  plan:
    needs: [validate, security-scan, policy-check]
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    permissions:
      contents: read
      pull-requests: write # Needed to post the plan comment
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - id: plan
        run: terraform plan -no-color -out=plan.tfplan
      - name: Post Plan to PR
        uses: actions/github-script@v7
        env:
          PLAN: ${{ steps.plan.outputs.stdout }} # stdout is exposed by the setup-terraform wrapper
        with:
          script: |
            const output = `#### Terraform Plan 📖\n\`\`\`\n${process.env.PLAN}\n\`\`\``;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });
  apply:
    needs: [validate, security-scan, policy-check]
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    environment: production # Manual approval, if the environment has required reviewers
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform apply -auto-approve
Infrastructure Testing Checklist
- terraform validate runs in CI on every PR
- terraform fmt enforced (no style drift in code)
- Security scanning (tfsec + Checkov) blocks PRs on high/critical findings
- Policy-as-code (OPA/Sentinel) enforces organizational guardrails
- Integration tests (Terratest) for critical modules (networking, security, data)
- Drift detection running daily with alerts to on-call team
- Plan output posted as PR comments for reviewer visibility
- Apply only from main branch (no manual terraform apply in production)
- State file encrypted, access-controlled, and backed up
- Blast radius limited (separate state files per environment and component)
- Cost estimation integrated into PR workflow (Infracost)
- Destructive changes require explicit manual approval
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For infrastructure advisory, visit garnetgrid.com. :::