
How to Automate Infrastructure Testing

Test your infrastructure like application code. Covers Terraform testing, policy-as-code, drift detection, chaos engineering basics, and CI/CD integration.

“It works in staging” is the infrastructure equivalent of “it works on my machine.” Infrastructure testing prevents the 2 AM incident caused by a misconfigured security group, an accidentally deleted database, or a networking change that slipped through code review.

Application teams wouldn’t dream of deploying code without tests. Yet infrastructure, which has a far larger blast radius, is often deployed with nothing more than a manual terraform apply and a hopeful prayer. This guide covers four layers of infrastructure testing, in the order they appear below: static analysis, policy enforcement, integration testing, and drift detection.


The Infrastructure Testing Pyramid

         ┌───────────┐
         │   Chaos   │  ← Production resilience
         │Engineering│     (expensive, high value)
        ─┼───────────┼─
        │Integration │  ← Deploy and verify
        │  Tests     │     (Terratest, real resources)
       ─┼────────────┼─
       │ Policy-as-  │  ← Guardrails
       │   Code      │     (OPA, Sentinel)
      ─┼─────────────┼─
      │   Static     │  ← Fast feedback
      │  Analysis    │     (validate, lint, scan)
     ─┴──────────────┴─

Each layer catches different classes of issues. Static analysis catches syntax errors and known misconfigurations in seconds. Policy-as-code enforces organizational rules. Integration tests verify real infrastructure behavior. Chaos engineering validates resilience in production.


Layer 1: Static Analysis

Static analysis is the fastest feedback loop — it runs in seconds without deploying any infrastructure. Every PR should pass these checks before human review begins.

# Terraform validate — syntax and internal consistency
terraform init -backend=false      # Init without backend (faster)
terraform validate                  # Check syntax and references
terraform fmt -check -recursive     # Enforce consistent formatting

# Terraform plan with safety checks
# (plan reads real state: if a backend is configured, run a full terraform init first)
terraform plan -out=plan.tfplan

# Automated plan analysis: list every resource the plan would delete
terraform show -json plan.tfplan | \
  jq -r '.resource_changes[] | select(.change.actions | index("delete")) | .address'
# ^^ If this outputs anything, a resource is being DESTROYED
#    (replacements are listed too, because they include a delete action)
# Block the PR and require manual approval
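
The same destructive-change check can be written as a small Go program when jq is not available on the runner or when the result should feed other tooling. This is a minimal sketch, not a definitive implementation: it reads only the resource_changes[].address and resource_changes[].change.actions fields of the plan JSON, and the function name destructiveAddresses and the exit code 2 ("needs manual approval") are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// planJSON mirrors only the fields of `terraform show -json` output used here.
type planJSON struct {
	ResourceChanges []struct {
		Address string `json:"address"`
		Change  struct {
			Actions []string `json:"actions"`
		} `json:"change"`
	} `json:"resource_changes"`
}

// destructiveAddresses returns the address of every resource whose planned
// actions include "delete" (plain destroys as well as replacements).
func destructiveAddresses(data []byte) ([]string, error) {
	var p planJSON
	if err := json.Unmarshal(data, &p); err != nil {
		return nil, fmt.Errorf("decode plan JSON: %w", err)
	}
	var out []string
	for _, rc := range p.ResourceChanges {
		for _, action := range rc.Change.Actions {
			if action == "delete" {
				out = append(out, rc.Address)
				break
			}
		}
	}
	return out, nil
}

func main() {
	// Usage: terraform show -json plan.tfplan | go run main.go
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read stdin:", err)
		os.Exit(1)
	}
	addrs, err := destructiveAddresses(data)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, a := range addrs {
		fmt.Println("DESTROY:", a)
	}
	if len(addrs) > 0 {
		os.Exit(2) // hypothetical convention: 2 means "needs manual approval"
	}
}
```

A CI step can treat exit code 2 as "pause for approval" and exit code 1 as a hard failure.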

Security Scanners

| Tool | What It Checks | Speed | Integration |
|---|---|---|---|
| tfsec | Terraform security misconfigurations | Very fast (seconds) | GitHub Actions, pre-commit hook |
| Checkov | Terraform, CloudFormation, Kubernetes | Fast (seconds) | GitHub Actions, Bridgecrew platform |
| Trivy | IaC, containers, filesystems | Fast | GitHub Actions, CI/CD |
| KICS | Terraform, Ansible, Docker, K8s | Fast | CI/CD |
| Infracost | Cost estimation from plan files | Fast | GitHub Actions (PR comment) |

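# Run multiple scanners in CI
tfsec ./terraform/ --minimum-severity HIGH
# Note: Checkov's severity-based thresholds rely on severity data from its
# platform integration (API key); without it, list check IDs instead
checkov -d ./terraform/ --framework terraform --hard-fail-on HIGH
infracost breakdown --path ./terraform/ --format json > cost.json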

Layer 2: Policy-as-Code

Policy-as-code allows you to codify organizational rules (no public S3 buckets, all resources must be tagged, no instances larger than m5.2xlarge) and enforce them automatically in the CI/CD pipeline.

OPA (Open Policy Agent) — Terraform Plan Validation

# policy/security.rego: OPA policy for Terraform plans

package terraform.security

# Use the "contains"/"if" rule syntax that OPA 1.0 and later require
import rego.v1

# Deny public S3 buckets
deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_s3_bucket"
    resource.change.after.acl == "public-read"
    msg := sprintf("S3 bucket '%s' must not be public", [resource.address])
}

# Deny unencrypted instances
deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_instance"
    resource.change.after != null  # skip instances that are being destroyed
    not resource.change.after.root_block_device[0].encrypted
    msg := sprintf("EC2 '%s' must have encrypted root volume", [resource.address])
}

# Require an Environment tag on every resource that exposes a tags attribute
deny contains msg if {
    resource := input.resource_changes[_]
    resource.change.after.tags
    not resource.change.after.tags.Environment
    msg := sprintf("Resource '%s' must have 'Environment' tag", [resource.address])
}

# Deny oversized instances (cost control)
deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_instance"
    forbidden := {"m5.4xlarge", "m5.8xlarge", "m5.12xlarge", "m5.16xlarge", "m5.24xlarge"}
    forbidden[resource.change.after.instance_type]
    msg := sprintf("EC2 '%s' uses oversized instance type '%s' (requires approval)",
                   [resource.address, resource.change.after.instance_type])
}
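
# Run OPA in CI/CD pipeline
terraform plan -out=plan.tfplan
terraform show -json plan.tfplan > plan.json
opa eval --fail-defined --data policy/ --input plan.json "data.terraform.security.deny[_]"
# --fail-defined makes opa exit non-zero when the query returns any result,
# so the pipeline fails if a deny rule fires (without it, opa eval exits 0
# even when rules fire)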
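
To print the violations in a readable form and control the exit code yourself, the JSON that opa eval writes by default can be post-processed. The sketch below assumes that output has the shape {"result":[{"expressions":[{"value": ...}]}]}, where each value is either an array of messages (query data.terraform.security.deny) or a single message (query ending in [_]). The function name denyMessages is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// evalOutput mirrors only the parts of `opa eval` JSON output used here.
type evalOutput struct {
	Result []struct {
		Expressions []struct {
			Value json.RawMessage `json:"value"`
		} `json:"expressions"`
	} `json:"result"`
}

// denyMessages collects deny messages from opa eval output. Each expression
// value may be a JSON array of strings or a single string.
func denyMessages(data []byte) ([]string, error) {
	var out evalOutput
	if err := json.Unmarshal(data, &out); err != nil {
		return nil, fmt.Errorf("decode opa output: %w", err)
	}
	var msgs []string
	for _, r := range out.Result {
		for _, e := range r.Expressions {
			var many []string
			if err := json.Unmarshal(e.Value, &many); err == nil {
				msgs = append(msgs, many...)
				continue
			}
			var one string
			if err := json.Unmarshal(e.Value, &one); err != nil {
				return nil, fmt.Errorf("unexpected expression value %s: %w", e.Value, err)
			}
			msgs = append(msgs, one)
		}
	}
	return msgs, nil
}

func main() {
	// Usage: opa eval --data policy/ --input plan.json \
	//          "data.terraform.security.deny" | go run main.go
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read stdin:", err)
		os.Exit(1)
	}
	msgs, err := denyMessages(data)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, m := range msgs {
		fmt.Println("POLICY VIOLATION:", m)
	}
	if len(msgs) > 0 {
		os.Exit(1) // fail the pipeline when any deny rule fired
	}
}
```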

Common Policy Categories

| Category | Example Rules |
|---|---|
| Security | No public endpoints, encryption required, MFA on root |
| Cost | Max instance size, require spot for dev, reserved for prod |
| Compliance | Required tags, specific regions only, logging enabled |
| Operational | Naming conventions, backup policies, DR configuration |

Layer 3: Integration Testing (Terratest)

Integration tests deploy real infrastructure, verify it works correctly, then destroy it. They are slower (minutes) but catch issues that static analysis cannot — like networking misconfigurations, IAM permission errors, and service-to-service connectivity problems.

package test

import (
    "testing"
    "time"

    "github.com/gruntwork-io/terratest/modules/aws"
    http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

// awsRegion is an assumption for this example; it must match the region
// the module deploys into.
const awsRegion = "us-east-1"

func TestWebServer(t *testing.T) {
    t.Parallel()

    opts := &terraform.Options{
        TerraformDir: "../modules/web-server",
        Vars: map[string]interface{}{
            "instance_type": "t3.micro",
            "environment":   "test",
        },
    }

    // ALWAYS clean up, even if the test fails
    defer terraform.Destroy(t, opts)

    terraform.InitAndApply(t, opts)

    // Verify the web server returns HTTP 200 with the body "OK"
    // (up to 10 attempts, 5 seconds apart, to allow for boot time)
    url := terraform.Output(t, opts, "url")
    http_helper.HttpGetWithRetry(t, url, nil, 200, "OK", 10, 5*time.Second)

    // Verify the module exports a security group ID
    sgId := terraform.Output(t, opts, "security_group_id")
    assert.NotEmpty(t, sgId)

    // Verify the instance is in a private subnet. Subnet IDs look like
    // "subnet-0abc123", so inspect the subnet's routing, not the ID string.
    subnetId := terraform.Output(t, opts, "subnet_id")
    assert.False(t, aws.IsPublicSubnet(t, subnetId, awsRegion))
}

Terratest Best Practices

| Practice | Why | Implementation |
|---|---|---|
| Always use defer Destroy | Prevents resource leaks on test failure | First line after Options |
| Use parallel tests | Reduce total test time | t.Parallel() + unique resource names |
| Test in a dedicated AWS account | Prevent interference with real environments | Separate “testing” account |
| Use small instances | Minimize cost during testing | t3.micro/t3.small |
| Set timeouts | Prevent tests from running forever | go test -timeout 30m (Go's default is 10m; MaxRetries in terraform.Options only limits retries) |
| Test destruction | Verify clean teardown | Check resources are actually deleted |
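
Parallel tests collide when two runs try to create resources with the same name, which is why the table pairs t.Parallel() with unique resource names. Terratest ships a helper for this in its random module (random.UniqueId()). The standard-library sketch below shows the idea without that dependency; the function name uniqueName and the 8-character suffix length are assumptions for illustration.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"os"
)

// uniqueName appends a random 8-character hex suffix to prefix so that
// parallel test runs do not fight over the same resource name.
func uniqueName(prefix string) (string, error) {
	b := make([]byte, 4) // 4 random bytes encode to 8 hex characters
	if _, err := rand.Read(b); err != nil {
		return "", fmt.Errorf("generate random suffix: %w", err)
	}
	return prefix + "-" + hex.EncodeToString(b), nil
}

func main() {
	name, err := uniqueName("web-test")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Pass the result to Terraform as a variable, for example a
	// hypothetical "name" input on the module under test.
	fmt.Println(name)
}
```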

Layer 4: Drift Detection

Drift occurs when the actual infrastructure state diverges from what’s defined in code. Common causes: console changes, manual fixes during incidents, and automated processes (auto-scaling, self-healing).

# Terraform drift detection using detailed exit codes
terraform plan -detailed-exitcode
# Exit code 0 = no changes (infrastructure matches code)
# Exit code 1 = error (Terraform failed)
# Exit code 2 = changes detected (DRIFT!)
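
The three exit codes map naturally onto a small wrapper that a scheduler or internal tool can call. The following is a sketch under stated assumptions: the status strings ("in-sync", "drift", "error") and the function names are invented for illustration, and terraform is assumed to be on PATH with the working directory already initialized.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// classify maps an exit code from `terraform plan -detailed-exitcode`
// to a status string: 0 = no changes, 2 = changes present, anything else = error.
func classify(exitCode int) string {
	switch exitCode {
	case 0:
		return "in-sync"
	case 2:
		return "drift"
	default:
		return "error"
	}
}

// runPlan runs the plan in dir and returns terraform's exit code.
// A non-nil error means terraform could not be started at all.
func runPlan(dir string) (int, error) {
	cmd := exec.Command("terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color")
	cmd.Dir = dir
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	err := cmd.Run()
	if err == nil {
		return 0, nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode(), nil
	}
	return -1, err
}

func main() {
	dir := "."
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}
	code, err := runPlan(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, "could not run terraform:", err)
		os.Exit(1)
	}
	status := classify(code)
	fmt.Printf("%s: %s\n", dir, status)
	if status != "in-sync" {
		os.Exit(code)
	}
}
```

Keeping "error" distinct from "drift" matters: a failed plan says nothing about whether the infrastructure matches the code.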

Automated Drift Detection Workflow

# .github/workflows/drift-detection.yml
name: Infrastructure Drift Detection
on:
  schedule:
    - cron: '0 8 * * *'  # Daily at 8 AM UTC

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [staging, production]
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3

      - name: Terraform Init
        run: terraform init
        working-directory: terraform/${{ matrix.environment }}

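      - name: Detect Drift
        id: drift
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
        run: |
          EXIT_CODE=0
          terraform plan -detailed-exitcode -input=false 2>&1 || EXIT_CODE=$?
          if [ "$EXIT_CODE" = "2" ]; then
            echo "drift_detected=true" >> "$GITHUB_OUTPUT"
            echo "⚠️ DRIFT DETECTED in ${{ matrix.environment }}"

            # Send alert to Slack/Teams
            curl -X POST -H 'Content-Type: application/json' "$SLACK_WEBHOOK" \
              -d "{\"text\":\"⚠️ Infrastructure drift detected in ${{ matrix.environment }}! Run terraform plan to review.\"}"
          elif [ "$EXIT_CODE" != "0" ]; then
            # Exit code 1: the plan itself failed, so fail the job
            # instead of silently reporting "no drift"
            exit "$EXIT_CODE"
          fi
        working-directory: terraform/${{ matrix.environment }}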

How to Handle Drift

| Drift Type | Response | Example |
|---|---|---|
| Intentional (manual fix during incident) | Update IaC to match reality | Emergency security group change → add to Terraform |
| Accidental (console click mistake) | Reapply IaC to correct | Someone changed a setting in the console |
| Auto-scaling (expected variation) | Ignore in drift detection | Instance count changes within ASG bounds |
| External system (CloudFormation, CDK) | Exclude from detection or import | Resources managed by another tool |
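
The "ignore" and "exclude" rows can be applied mechanically by filtering the plan JSON before alerting. The sketch below is deliberately coarse and makes several assumptions: it skips whole resource types (the ignoredTypes set and the function name actionableDrift are hypothetical), whereas attribute-level exceptions are usually better expressed in Terraform itself with a lifecycle ignore_changes block.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// plan mirrors only the fields of `terraform show -json` output used here.
type plan struct {
	ResourceChanges []struct {
		Address string `json:"address"`
		Type    string `json:"type"`
		Change  struct {
			Actions []string `json:"actions"`
		} `json:"change"`
	} `json:"resource_changes"`
}

// isNoChange reports whether the planned actions leave the resource untouched.
func isNoChange(actions []string) bool {
	return len(actions) == 1 && (actions[0] == "no-op" || actions[0] == "read")
}

// actionableDrift returns the addresses of resources with pending changes,
// skipping every resource whose type is listed in ignoredTypes.
func actionableDrift(data []byte, ignoredTypes map[string]bool) ([]string, error) {
	var p plan
	if err := json.Unmarshal(data, &p); err != nil {
		return nil, fmt.Errorf("decode plan JSON: %w", err)
	}
	var out []string
	for _, rc := range p.ResourceChanges {
		if ignoredTypes[rc.Type] || isNoChange(rc.Change.Actions) {
			continue
		}
		out = append(out, rc.Address)
	}
	return out, nil
}

func main() {
	// Usage: go run main.go plan.json
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: drift-filter <plan.json>")
		os.Exit(1)
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Hypothetical ignore list: auto-scaling groups drift by design.
	ignored := map[string]bool{"aws_autoscaling_group": true}
	addrs, err := actionableDrift(data, ignored)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, a := range addrs {
		fmt.Println("DRIFT:", a)
	}
	if len(addrs) > 0 {
		os.Exit(2)
	}
}
```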

CI/CD Pipeline for Infrastructure

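The pipeline below wires static analysis, security scanning, and policy checks into every pull request, then plans on the PR and applies from main. Integration tests and drift detection are not part of it; they are slower or scheduled, so they usually run in their own workflows.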

# .github/workflows/infrastructure.yml
name: Infrastructure CI/CD
on:
  pull_request:
    paths: ['terraform/**']
  push:
    branches: [main]
    paths: ['terraform/**']

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform fmt -check -recursive
      - run: terraform init -backend=false
      - run: terraform validate

  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: tfsec
        uses: aquasecurity/tfsec-action@v1.0.0
      - name: checkov
        uses: bridgecrewio/checkov-action@v12

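  policy-check:
    needs: [validate]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false  # keep wrapper output out of plan.json
      - uses: open-policy-agent/setup-opa@v2  # installs the opa CLI
      - run: terraform init && terraform plan -out=plan.tfplan
      - run: terraform show -json plan.tfplan > plan.json
      # --fail-defined exits non-zero when any deny rule fires
      - run: opa eval --fail-defined --data policy/ --input plan.json "data.terraform.security.deny[_]"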

  plan:
    needs: [validate, security-scan, policy-check]
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    permissions:
      contents: read
      pull-requests: write  # needed to comment on the PR
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - id: plan
        run: terraform plan -no-color -out=plan.tfplan
      - name: Post Plan to PR
        uses: actions/github-script@v7
        env:
          # setup-terraform's wrapper exposes the command's stdout as a step output
          PLAN: ${{ steps.plan.outputs.stdout }}
        with:
          script: |
            const output = `#### Terraform Plan 📖\n\`\`\`\n${process.env.PLAN}\n\`\`\``;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });

  apply:
    needs: [validate, security-scan, policy-check]
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment: production  # Requires manual approval
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform apply -auto-approve

Infrastructure Testing Checklist

  • terraform validate runs in CI on every PR
  • terraform fmt enforced (no style drift in code)
  • Security scanning (tfsec + Checkov) blocks PRs on high/critical findings
  • Policy-as-code (OPA/Sentinel) enforces organizational guardrails
  • Integration tests (Terratest) for critical modules (networking, security, data)
  • Drift detection running daily with alerts to on-call team
  • Plan output posted as PR comments for reviewer visibility
  • Apply only from main branch (no manual terraform apply in production)
  • State file encrypted, access-controlled, and backed up
  • Blast radius limited (separate state files per environment and component)
  • Cost estimation integrated into PR workflow (Infracost)
  • Destructive changes require explicit manual approval

:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For infrastructure advisory, visit garnetgrid.com. :::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
