How to Automate Infrastructure Testing

“It works in staging” is the infrastructure equivalent of “it works on my machine.” Infrastructure testing prevents the 2 AM incident caused by a misconfigured security group, an accidentally deleted database, or a networking change that slipped through code review.

Application teams wouldn’t dream of deploying code without tests. Yet infrastructure, which has a far larger blast radius, is often deployed with nothing more than a manual terraform apply and a prayer. This guide covers four layers of infrastructure testing, in the order a change meets them: static analysis, policy enforcement, integration testing, and drift detection.

The Infrastructure Testing Pyramid

         ┌───────────┐
         │   Chaos   │  ← Production resilience
         │Engineering│     (expensive, high value)
        ─┼───────────┼─
        │Integration │  ← Deploy and verify
        │  Tests     │     (Terratest, real resources)
       ─┼────────────┼─
       │ Policy-as-  │  ← Guardrails
       │   Code      │     (OPA, Sentinel)
      ─┼─────────────┼─
      │   Static     │  ← Fast feedback
      │  Analysis    │     (validate, lint, scan)
     ─┴──────────────┴─

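Each layer catches a different class of issue. Static analysis catches syntax errors and known misconfigurations in seconds. Policy-as-code enforces organizational rules. Integration tests verify real infrastructure behavior. Chaos engineering validates resilience in production; it tops the pyramid but is not covered in detail here. Drift detection, covered as Layer 4, is not a pyramid tier: it runs on a schedule against infrastructure that is already deployed.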

Layer 1: Static Analysis

Static analysis is the fastest feedback loop — it runs in seconds without deploying any infrastructure. Every PR should pass these checks before human review begins.

# Terraform validate — syntax and internal consistency
terraform init -backend=false      # Init without backend (faster)
terraform validate                  # Check syntax and references
terraform fmt -check -recursive     # Enforce consistent formatting

# Terraform plan with safety checks
# (unlike validate, plan needs a full `terraform init` with backend access)
terraform init
terraform plan -out=plan.tfplan

# Automated plan analysis: list resources that would be destroyed or replaced
terraform show -json plan.tfplan | \
  jq -r '.resource_changes[] | select(.change.actions | index("delete")) | .address'
# ^^ If this prints anything, a resource is being DESTROYED (or destroyed and recreated)
# Block the PR and require manual approval
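
The same check can be expressed as a small program when the pipeline needs structured results rather than raw jq output. The following is a minimal sketch, assuming only the `resource_changes[].address` and `resource_changes[].change.actions` fields of the plan JSON; the function name `destructiveAddresses` and the inline `samplePlan` are hypothetical and exist only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// plan mirrors only the fields of `terraform show -json` output used here.
type plan struct {
	ResourceChanges []struct {
		Address string `json:"address"`
		Change  struct {
			Actions []string `json:"actions"`
		} `json:"change"`
	} `json:"resource_changes"`
}

// destructiveAddresses returns the address of every resource whose planned
// actions include "delete" (plain destroys and destroy-and-recreate replacements).
func destructiveAddresses(planJSON []byte) ([]string, error) {
	var p plan
	if err := json.Unmarshal(planJSON, &p); err != nil {
		return nil, fmt.Errorf("parsing plan JSON: %w", err)
	}
	var out []string
	for _, rc := range p.ResourceChanges {
		for _, action := range rc.Change.Actions {
			if action == "delete" {
				out = append(out, rc.Address)
				break
			}
		}
	}
	return out, nil
}

// samplePlan is a hand-written, heavily trimmed stand-in for real plan output.
const samplePlan = `{"resource_changes":[
  {"address":"aws_db_instance.main","change":{"actions":["delete"]}},
  {"address":"aws_instance.web","change":{"actions":["delete","create"]}},
  {"address":"aws_s3_bucket.logs","change":{"actions":["update"]}}
]}`

func main() {
	addrs, err := destructiveAddresses([]byte(samplePlan))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, a := range addrs {
		fmt.Println("would destroy:", a)
	}
}
```

In a real pipeline the input would come from `terraform show -json plan.tfplan` instead of the inline sample, and a non-empty result would translate into a failing exit status.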

Security Scanners

Tool	What It Checks	Speed	Integration
tfsec	Terraform security misconfigurations	Very fast (seconds)	GitHub Actions, pre-commit hook
Checkov	Terraform, CloudFormation, Kubernetes	Fast (seconds)	GitHub Actions, Bridgecrew platform
Trivy	IaC, containers, filesystems	Fast	GitHub Actions, CI/CD
KICS	Terraform, Ansible, Docker, K8s	Fast	CI/CD
Infracost	Cost estimation from plan files	Fast	GitHub Actions (PR comment)

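# Run multiple scanners in CI
tfsec ./terraform/ --minimum-severity HIGH
# Severity-based thresholds in Checkov rely on platform severity data (API key);
# without it, list specific check IDs in --hard-fail-on instead
checkov -d ./terraform/ --framework terraform --hard-fail-on HIGH
infracost breakdown --path ./terraform/ --format json > cost.json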
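
A cost report only protects the budget if something reads it. The sketch below shows one way a CI step might gate on `cost.json`. It assumes the report has a top-level `totalMonthlyCost` field encoded as a string; that field name, the `overBudget` helper, and the budget figure are assumptions for illustration, so check them against the report your Infracost version actually writes.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// costReport keeps only the field this sketch reads. The field name and its
// string encoding are assumptions about the cost report format.
type costReport struct {
	TotalMonthlyCost string `json:"totalMonthlyCost"`
}

// overBudget reports whether the estimated monthly cost exceeds budget,
// and returns the parsed cost alongside.
func overBudget(report []byte, budget float64) (bool, float64, error) {
	var r costReport
	if err := json.Unmarshal(report, &r); err != nil {
		return false, 0, fmt.Errorf("parsing cost report: %w", err)
	}
	cost, err := strconv.ParseFloat(r.TotalMonthlyCost, 64)
	if err != nil {
		return false, 0, fmt.Errorf("reading totalMonthlyCost %q: %w", r.TotalMonthlyCost, err)
	}
	return cost > budget, cost, nil
}

func main() {
	// Hand-written sample; a real run would read cost.json from disk.
	sample := []byte(`{"currency":"USD","totalMonthlyCost":"742.50"}`)
	over, cost, err := overBudget(sample, 500)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("estimated monthly cost: %.2f, over budget: %t\n", cost, over)
}
```

Returning an error when the field is missing or malformed, rather than treating it as zero, keeps a format change from silently disabling the gate.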

Layer 2: Policy-as-Code

Policy-as-code allows you to codify organizational rules (no public S3 buckets, all resources must be tagged, no instances larger than m5.2xlarge) and enforce them automatically in the CI/CD pipeline.

OPA (Open Policy Agent) — Terraform Plan Validation

# policy/security.rego: OPA policy for Terraform plans

package terraform.security

# Rego v1 syntax (`contains`, `if`, `in`): the default from OPA 1.0,
# and available in OPA 0.59+ through this import
import rego.v1

# Deny public S3 buckets. AWS provider v4+ sets ACLs on aws_s3_bucket_acl,
# so both resource types are checked.
deny contains msg if {
    resource := input.resource_changes[_]
    resource.type in {"aws_s3_bucket", "aws_s3_bucket_acl"}
    resource.change.after.acl in {"public-read", "public-read-write"}
    msg := sprintf("S3 bucket '%s' must not be public", [resource.address])
}

# Deny unencrypted instances
deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_instance"
    # `after` is null for resources being destroyed; skip those
    resource.change.after != null
    not resource.change.after.root_block_device[0].encrypted
    msg := sprintf("EC2 '%s' must have encrypted root volume", [resource.address])
}

# Require an Environment tag on every resource that sets tags
deny contains msg if {
    resource := input.resource_changes[_]
    resource.change.after.tags
    not resource.change.after.tags.Environment
    msg := sprintf("Resource '%s' must have 'Environment' tag", [resource.address])
}

# Deny oversized instances (cost control); this list covers the m5 family only
deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_instance"
    forbidden := {"m5.4xlarge", "m5.8xlarge", "m5.12xlarge", "m5.16xlarge", "m5.24xlarge"}
    resource.change.after.instance_type in forbidden
    msg := sprintf("EC2 '%s' uses oversized instance type '%s'; requires approval",
                   [resource.address, resource.change.after.instance_type])
}

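# Run OPA in CI/CD pipeline
terraform plan -out=plan.tfplan
terraform show -json plan.tfplan > plan.json
opa eval --fail-defined --data policy/ --input plan.json "data.terraform.security.deny[msg]"
# --fail-defined makes opa exit non-zero when the query has results, so any
# deny message fails the pipeline; without it, opa prints the results and exits 0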
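
For reporting, it can help to turn the JSON that `opa eval` prints into a plain list of violation messages. The sketch below assumes the default JSON output shape when the whole set is queried as `data.terraform.security.deny`, that is, a `result` array whose `expressions[].value` holds the messages. The `denyMessages` helper and the sample output are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// evalOutput mirrors the parts of `opa eval --format json` output used here.
type evalOutput struct {
	Result []struct {
		Expressions []struct {
			Value []string `json:"value"`
		} `json:"expressions"`
	} `json:"result"`
}

// denyMessages collects every message from the evaluated deny set.
// An empty or undefined result yields no messages.
func denyMessages(opaJSON []byte) ([]string, error) {
	var e evalOutput
	if err := json.Unmarshal(opaJSON, &e); err != nil {
		return nil, fmt.Errorf("parsing opa output: %w", err)
	}
	var msgs []string
	for _, r := range e.Result {
		for _, ex := range r.Expressions {
			msgs = append(msgs, ex.Value...)
		}
	}
	return msgs, nil
}

// sampleOutput is a hand-written example of the assumed output shape.
const sampleOutput = `{"result":[{"expressions":[{"value":[
  "S3 bucket 'aws_s3_bucket.assets' must not be public",
  "Resource 'aws_instance.web' must have 'Environment' tag"
],"text":"data.terraform.security.deny"}]}]}`

func main() {
	msgs, err := denyMessages([]byte(sampleOutput))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d policy violation(s)\n", len(msgs))
	for _, m := range msgs {
		fmt.Println(" -", m)
	}
}
```

A caller would typically fail the build when the returned slice is non-empty and post the messages as a PR comment.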

Common Policy Categories

Category	Example Rules
Security	No public endpoints, encryption required, MFA on root
Cost	Max instance size, require spot for dev, reserved for prod
Compliance	Required tags, specific regions only, logging enabled
Operational	Naming conventions, backup policies, DR configuration

Layer 3: Integration Testing (Terratest)

Integration tests deploy real infrastructure, verify it works correctly, then destroy it. They are slower (minutes) but catch issues that static analysis cannot — like networking misconfigurations, IAM permission errors, and service-to-service connectivity problems.

package test

import (
    "testing"
    "time"

    "github.com/gruntwork-io/terratest/modules/aws"
    http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestWebServer(t *testing.T) {
    t.Parallel()

    // Assumption: the module deploys to this region; adjust to match yours
    awsRegion := "us-east-1"

    opts := &terraform.Options{
        TerraformDir: "../modules/web-server",
        Vars: map[string]interface{}{
            "instance_type": "t3.micro",
            "environment":   "test",
        },
    }

    // ALWAYS clean up, even if the test fails
    defer terraform.Destroy(t, opts)

    terraform.InitAndApply(t, opts)

    // Verify the web server returns HTTP 200 with the exact body "OK",
    // retrying up to 10 times, 5 seconds apart
    url := terraform.Output(t, opts, "url")
    http_helper.HttpGetWithRetry(t, url, nil, 200, "OK", 10, 5*time.Second)

    // Verify the module exports a security group ID
    sgId := terraform.Output(t, opts, "security_group_id")
    assert.NotEmpty(t, sgId)

    // Verify the instance is in a private subnet. Subnet IDs look like
    // "subnet-0abc...", so ask AWS whether the subnet routes to an internet
    // gateway instead of matching on the ID string.
    subnetId := terraform.Output(t, opts, "subnet_id")
    assert.False(t, aws.IsPublicSubnet(t, subnetId, awsRegion))
}

Terratest Best Practices

Practice	Why	Implementation
Always use `defer Destroy`	Prevents resource leaks on test failure	First line after Options
Use parallel tests	Reduce total test time	`t.Parallel()` + unique resource names
Test in a dedicated AWS account	Prevent interference with real environments	Separate “testing” account
Use small instances	Minimize cost during testing	t3.micro/t3.small
Set timeouts	Prevent tests from running forever	`go test -timeout 30m` (Go's default is 10 minutes); `MaxRetries` in `terraform.Options` only controls retries
Test destruction	Verify clean teardown	Check resources are actually deleted
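
Parallel tests only stay independent if each run names its resources differently. Terratest ships `random.UniqueId()` for this purpose; the sketch below shows the same idea with only the standard library, so it runs without the Terratest module. The `uniqueName` helper and the `terratest-web` prefix are hypothetical.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// uniqueName appends a random 6-hex-character suffix to prefix so that
// parallel test runs do not collide on resource names.
func uniqueName(prefix string) (string, error) {
	buf := make([]byte, 3) // 3 random bytes encode to 6 hex characters
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("generating suffix: %w", err)
	}
	return prefix + "-" + hex.EncodeToString(buf), nil
}

func main() {
	name, err := uniqueName("terratest-web")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Pass the name into the module, e.g. through a "name" entry in Vars.
	fmt.Println(name)
}
```

A six-character suffix keeps names short enough for services with tight length limits while making collisions between concurrent runs unlikely.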

Layer 4: Drift Detection

Drift occurs when the actual infrastructure state diverges from what’s defined in code. Common causes: console changes, manual fixes during incidents, and automated processes (auto-scaling, self-healing).

# Terraform drift detection using detailed exit codes
terraform plan -detailed-exitcode
# Exit code 0 = no changes (infrastructure matches code)
# Exit code 1 = error (Terraform failed)
# Exit code 2 = changes pending (DRIFT, provided everything in code has already been applied)
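
A wrapper around the plan command can map these exit codes to named states so that alerts and dashboards do not depend on magic numbers. This is a minimal sketch: the `planStatus` type and helper names are hypothetical, and the actual `terraform` invocation is left as a comment so the example runs without Terraform installed.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// planStatus names the three outcomes of `terraform plan -detailed-exitcode`.
type planStatus string

const (
	statusClean planStatus = "clean" // exit code 0: no changes
	statusError planStatus = "error" // exit code 1 (or anything unexpected)
	statusDrift planStatus = "drift" // exit code 2: changes detected
)

// classifyExitCode maps a plan exit code to a named status. Unknown codes
// are treated as errors so that failures are never mistaken for a clean run.
func classifyExitCode(code int) planStatus {
	switch code {
	case 0:
		return statusClean
	case 2:
		return statusDrift
	default:
		return statusError
	}
}

// exitCodeOf extracts the process exit code from the error returned by
// (*exec.Cmd).Run. It returns -1 if the command never ran to completion.
func exitCodeOf(err error) int {
	if err == nil {
		return 0
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode()
	}
	return -1
}

func main() {
	// Real use (not executed here):
	//   err := exec.Command("terraform", "plan", "-detailed-exitcode").Run()
	//   status := classifyExitCode(exitCodeOf(err))
	for _, code := range []int{0, 1, 2} {
		fmt.Printf("exit %d => %s\n", code, classifyExitCode(code))
	}
}
```

Treating every code other than 0 and 2 as an error matters: a plan that fails on expired credentials should raise an alert, not report a clean environment.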

Automated Drift Detection Workflow

# .github/workflows/drift-detection.yml
name: Infrastructure Drift Detection
on:
  schedule:
    - cron: '0 8 * * *'  # Daily at 8 AM UTC

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [staging, production]
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3

      - name: Terraform Init
        run: terraform init
        working-directory: terraform/${{ matrix.environment }}

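      - name: Detect Drift
        id: drift
        run: |
          # Default to 0 so a clean plan is not read as an unset value
          EXIT_CODE=0
          terraform plan -detailed-exitcode 2>&1 || EXIT_CODE=$?
          if [ "$EXIT_CODE" = "2" ]; then
            echo "drift_detected=true" >> "$GITHUB_OUTPUT"
            echo "⚠️ DRIFT DETECTED in ${{ matrix.environment }}"

            # Send alert to Slack/Teams
            curl -X POST -H 'Content-Type: application/json' "${{ secrets.SLACK_WEBHOOK }}" \
              -d "{\"text\":\"⚠️ Infrastructure drift detected in ${{ matrix.environment }}! Run terraform plan to review.\"}"
          elif [ "$EXIT_CODE" != "0" ]; then
            # Exit code 1 means the plan itself failed; do not report a clean run
            echo "terraform plan failed with exit code $EXIT_CODE"
            exit "$EXIT_CODE"
          fi
        working-directory: terraform/${{ matrix.environment }}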

How to Handle Drift

Drift Type	Response	Example
Intentional (manual fix during incident)	Update IaC to match reality	Emergency security group change → add to Terraform
Accidental (console click mistake)	Reapply IaC to correct	Someone changed a setting in the console
Auto-scaling (expected variation)	Ignore in drift detection	Instance count changes within ASG bounds
External system (CloudFormation, CDK)	Exclude from detection or import	Resources managed by another tool
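
The "ignore" and "exclude" responses in the table can be automated by filtering the plan before alerting. The sketch below keeps only changes whose resource type is not on an ignore list. It assumes the `type`, `address`, and `change.actions` fields of the plan JSON; the `actionableDrift` helper, the ignore list, and the sample data are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// driftPlan mirrors only the plan JSON fields this filter needs.
type driftPlan struct {
	ResourceChanges []struct {
		Address string `json:"address"`
		Type    string `json:"type"`
		Change  struct {
			Actions []string `json:"actions"`
		} `json:"change"`
	} `json:"resource_changes"`
}

// hasChange reports whether the actions describe a real change,
// as opposed to "no-op" or a data source "read".
func hasChange(actions []string) bool {
	for _, a := range actions {
		if a != "no-op" && a != "read" {
			return true
		}
	}
	return false
}

// actionableDrift returns the addresses of changed resources whose type
// is not listed in ignoredTypes.
func actionableDrift(planJSON []byte, ignoredTypes map[string]bool) ([]string, error) {
	var p driftPlan
	if err := json.Unmarshal(planJSON, &p); err != nil {
		return nil, fmt.Errorf("parsing plan JSON: %w", err)
	}
	var out []string
	for _, rc := range p.ResourceChanges {
		if ignoredTypes[rc.Type] || !hasChange(rc.Change.Actions) {
			continue
		}
		out = append(out, rc.Address)
	}
	return out, nil
}

// sampleDrift is a hand-written, trimmed stand-in for real plan output.
const sampleDrift = `{"resource_changes":[
  {"address":"aws_autoscaling_group.web","type":"aws_autoscaling_group","change":{"actions":["update"]}},
  {"address":"aws_security_group.web","type":"aws_security_group","change":{"actions":["update"]}},
  {"address":"aws_vpc.main","type":"aws_vpc","change":{"actions":["no-op"]}}
]}`

func main() {
	ignored := map[string]bool{"aws_autoscaling_group": true}
	addrs, err := actionableDrift([]byte(sampleDrift), ignored)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, a := range addrs {
		fmt.Println("actionable drift:", a)
	}
}
```

Ignoring a whole resource type is coarse, since it also hides accidental changes to that type. Where only one attribute varies, such as an Auto Scaling group's desired capacity, narrowing the exclusion to that attribute is the safer choice.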

CI/CD Pipeline for Infrastructure

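The pipeline below wires static analysis, security scanning, and policy checks into every pull request and gates apply behind an approval step. Terratest suites and the scheduled drift workflow from Layer 4 run separately, because they are slower or time-triggered: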

# .github/workflows/infrastructure.yml
name: Infrastructure CI/CD
on:
  pull_request:
    paths: ['terraform/**']
  push:
    branches: [main]
    paths: ['terraform/**']

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform fmt -check -recursive
      - run: terraform init -backend=false
      - run: terraform validate

  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: tfsec
        uses: aquasecurity/tfsec-action@v1.0.0
      - name: checkov
        uses: bridgecrewio/checkov-action@v12

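  policy-check:
    needs: [validate]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false  # keep redirected JSON output free of wrapper text
      # Install the opa CLI on the runner
      - uses: open-policy-agent/setup-opa@v2
      - run: terraform init && terraform plan -out=plan.tfplan
      - run: terraform show -json plan.tfplan > plan.json
      # --fail-defined exits non-zero when any deny message is produced
      - run: opa eval --fail-defined --data policy/ --input plan.json "data.terraform.security.deny[msg]"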

  plan:
    needs: [validate, security-scan, policy-check]
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    permissions:
      contents: read
      pull-requests: write  # needed to post the plan comment
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform plan -no-color -out=plan.tfplan
      # Render the saved plan as text so the comment step can read it
      - run: terraform show -no-color plan.tfplan > plan.txt
      - name: Post Plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('plan.txt', 'utf8');
            const output = `#### Terraform Plan 📖\n\`\`\`\n${plan}\n\`\`\``;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });

  apply:
    needs: [validate, security-scan, policy-check]
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment: production  # Pauses for approval when the environment has required reviewers
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform apply -auto-approve

Infrastructure Testing Checklist

Source: This guide is derived from operational intelligence at Garnet Grid Consulting. For infrastructure advisory, visit garnetgrid.com.