Infrastructure as Code Testing Strategies

Infrastructure as Code (IaC) changed how we provision infrastructure. But most teams still deploy IaC changes without testing — they run terraform plan, eyeball the diff, and hope for the best. This is the equivalent of deploying application code without running tests. At scale, untested infrastructure changes cause outages that are harder to diagnose and longer to resolve than application bugs.

Testing IaC requires different strategies than testing application code. You can’t unit test a VPC the same way you unit test a function. But you can validate configurations, simulate plans, enforce policies, and integration test in ephemeral environments before anything touches production.

The IaC Testing Pyramid

Layer	What It Tests	Speed	Cost
Static Analysis	Syntax, formatting, security rules	Seconds	Free
Unit Tests	Module logic, variable validation	Seconds	Free
Policy Tests	Compliance rules, guardrails	Seconds	Free
Plan Tests	Expected resource changes	Minutes	Free
Integration Tests	Actual infrastructure behavior	10-30 min	Cloud costs
E2E Tests	Full stack deployment	30-60 min	Cloud costs

Layer 1: Static Analysis

Run these on every commit. They catch 40% of issues before any cloud API is called.

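# Terraform
terraform fmt -check -recursive
terraform init -backend=false   # validate needs an initialized directory; this skips backend/state access
terraform validate
tflint --recursive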

# Security scanning
tfsec .
checkov -d .

These tools catch:

Insecure defaults (S3 bucket without encryption, security group open to 0.0.0.0/0)
Syntax errors and deprecated features
Missing required tags
Resource naming convention violations
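One way to wire these checks into a commit hook or CI step is a small wrapper that runs every tool and reports all failures at once, rather than stopping at the first. The sketch below is a minimal illustration, not a recommended harness: the function names and the injectable runner are assumptions made for the example, and it assumes the tools are installed and the working directory is already initialized.

```python
import subprocess
from typing import Callable, Sequence

# Commands taken from the list above; each must exit 0 for the commit to pass.
STATIC_CHECKS: list[list[str]] = [
    ["terraform", "fmt", "-check", "-recursive"],
    ["terraform", "validate"],
    ["tflint", "--recursive"],
    ["tfsec", "."],
    ["checkov", "-d", "."],
]


def run_command(cmd: Sequence[str]) -> int:
    """Run one command in the current directory and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_static_checks(
    checks: Sequence[Sequence[str]] = STATIC_CHECKS,
    runner: Callable[[Sequence[str]], int] = run_command,  # hypothetical hook for testing
) -> list[str]:
    """Run every check without short-circuiting and return the ones that failed."""
    failed = []
    for cmd in checks:
        if runner(cmd) != 0:
            failed.append(" ".join(cmd))
    return failed
```

Calling run_static_checks() with no arguments runs the real tools; a CI job would fail the build when the returned list is non-empty. Running all checks before failing gives the author one complete list of problems per push.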

Layer 2: Policy-as-Code with OPA

Open Policy Agent (OPA) lets you write compliance rules that block non-compliant infrastructure before it’s created.

# policy/security.rego
# The package name must match the query used below (data.terraform.deny).
package terraform

# Enables the "contains"/"if" rule syntax on older OPA releases; it is the default in OPA 1.0.
import rego.v1

# Deny public S3 buckets
deny contains msg if {
    resource := input.planned_values.root_module.resources[_]
    resource.type == "aws_s3_bucket"
    resource.values.acl == "public-read"
    msg := sprintf("S3 bucket '%s' cannot be public", [resource.address])
}

# Require encryption on all RDS instances
deny contains msg if {
    resource := input.planned_values.root_module.resources[_]
    resource.type == "aws_db_instance"
    not resource.values.storage_encrypted
    msg := sprintf("RDS instance '%s' must have encryption enabled", [resource.address])
}

# Enforce instance size limits
deny contains msg if {
    resource := input.planned_values.root_module.resources[_]
    resource.type == "aws_instance"
    allowed := {"t3.micro", "t3.small", "t3.medium", "t3.large", "t3.xlarge"}
    not allowed[resource.values.instance_type]
    msg := sprintf("Instance '%s' uses disallowed type '%s'",
        [resource.address, resource.values.instance_type])
}

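Run OPA against the JSON form of a saved plan. terraform plan -out writes a binary plan file, so convert it with terraform show -json before handing it to OPA:

terraform plan -out=tfplan
terraform show -json tfplan > plan.json
opa eval --data policy/ --input plan.json --fail-defined "data.terraform.deny[msg]"

The --fail-defined flag makes opa eval exit non-zero whenever a deny message is produced, which is what lets CI block the change.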
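It helps to know the shape of the plan JSON the policies read. In Terraform's JSON output, planned_values.root_module.resources lists only resources declared in the root module; resources created inside modules are nested under child_modules. The Python sketch below walks that structure recursively to show what a policy would need to cover. It illustrates the data layout and is not a substitute for the Rego rules; the function names and the sample plan fragment are hypothetical.

```python
from typing import Any, Iterator


def iter_resources(module: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield resources from a module and, recursively, from its child modules."""
    yield from module.get("resources", [])
    for child in module.get("child_modules", []):
        yield from iter_resources(child)


def unencrypted_rds(plan: dict[str, Any]) -> list[str]:
    """Addresses of aws_db_instance resources without storage_encrypted set."""
    root = plan.get("planned_values", {}).get("root_module", {})
    return [
        r["address"]
        for r in iter_resources(root)
        if r.get("type") == "aws_db_instance"
        and not r.get("values", {}).get("storage_encrypted")
    ]


# Hypothetical, hand-written plan fragment for illustration only.
sample_plan = {
    "planned_values": {
        "root_module": {
            "resources": [
                {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
                 "values": {"acl": "private"}},
            ],
            "child_modules": [
                {
                    "address": "module.db",
                    "resources": [
                        {"address": "module.db.aws_db_instance.main",
                         "type": "aws_db_instance",
                         "values": {"storage_encrypted": False}},
                        {"address": "module.db.aws_db_instance.replica",
                         "type": "aws_db_instance",
                         "values": {"storage_encrypted": True}},
                    ],
                },
            ],
        }
    }
}

violations = unencrypted_rds(sample_plan)
```

Here the only unencrypted database lives inside module.db, so a rule that inspects root_module.resources alone would not see it.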

Layer 3: Integration Testing

For critical infrastructure, spin up real resources in an isolated test account, validate they work, then tear everything down.

# test_network.py (pytest driving the Terraform CLI, in the spirit of Terratest)
import json
import subprocess
from uuid import uuid4

import pytest

MODULE_DIR = "modules/network"


class TestNetworkModule:
    @pytest.fixture(autouse=True, scope="class")
    def setup_teardown(self):
        # Unique env name so concurrent runs do not collide
        env_var = f"-var=env=test-{uuid4().hex[:8]}"
        subprocess.run(["terraform", "init", "-input=false"], cwd=MODULE_DIR, check=True)
        try:
            # Apply infrastructure once for all tests in this class
            subprocess.run(["terraform", "apply", "-auto-approve", env_var],
                           cwd=MODULE_DIR, check=True)
            yield
        finally:
            # Destroy even if apply or a test failed; destroy needs the same variables
            subprocess.run(["terraform", "destroy", "-auto-approve", env_var],
                           cwd=MODULE_DIR, check=True)

    def test_vpc_has_correct_cidr(self):
        output = subprocess.run(
            ["terraform", "output", "-json"],
            cwd=MODULE_DIR, capture_output=True, text=True, check=True
        )
        outputs = json.loads(output.stdout)
        assert outputs["vpc_cidr"]["value"] == "10.0.0.0/16"

    def test_private_subnets_not_publicly_accessible(self):
        # Placeholder: use the AWS SDK to verify subnet routing
        pytest.skip("subnet routing check not implemented yet")
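The subnet routing test above is left as a placeholder. One possible approach is to fetch the VPC's route tables with the AWS SDK and assert that no private subnet is associated with a table that routes 0.0.0.0/0 to an internet gateway. The sketch below covers only the decision logic as pure functions over data shaped like the EC2 DescribeRouteTables response (Routes, Associations, GatewayId, SubnetId); the function names and the sample IDs are hypothetical, and the SDK call itself is left out.

```python
from typing import Any


def has_internet_gateway_route(route_table: dict[str, Any]) -> bool:
    """True if the table sends 0.0.0.0/0 to an internet gateway (igw-...)."""
    return any(
        route.get("DestinationCidrBlock") == "0.0.0.0/0"
        and route.get("GatewayId", "").startswith("igw-")
        for route in route_table.get("Routes", [])
    )


def publicly_routed_subnets(
    route_tables: list[dict[str, Any]], subnet_ids: set[str]
) -> set[str]:
    """Subset of subnet_ids explicitly associated with an internet-routed table.

    Subnets with no explicit association fall back to the VPC's main route
    table; this sketch does not handle that case.
    """
    exposed = set()
    for table in route_tables:
        if not has_internet_gateway_route(table):
            continue
        for assoc in table.get("Associations", []):
            if assoc.get("SubnetId") in subnet_ids:
                exposed.add(assoc["SubnetId"])
    return exposed


# Hypothetical route tables: one public (igw route), one private (NAT route).
sample_tables = [
    {
        "Routes": [
            {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
            {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0abc"},
        ],
        "Associations": [{"SubnetId": "subnet-public"}],
    },
    {
        "Routes": [
            {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
            {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0abc"},
        ],
        "Associations": [{"SubnetId": "subnet-private"}],
    },
]
```

In the real test, the route tables would come from the SDK and the private subnet IDs from terraform output; the assertion is then that publicly_routed_subnets returns an empty set.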

Cost control: Integration tests run in a dedicated test account with aggressive auto-cleanup. Set a maximum test duration (30 minutes) with automatic terraform destroy on timeout.
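The timeout-plus-destroy rule can be sketched as a small wrapper around the test run. This is a minimal illustration under stated assumptions: the function name, the injectable run parameter, and the example commands are hypothetical, and an in-process timeout does nothing if the CI runner itself dies, which is why account-level cleanup is still needed.

```python
import subprocess
from typing import Callable, Sequence

MAX_TEST_SECONDS = 30 * 60  # the 30-minute cap from the text


def run_with_guaranteed_destroy(
    test_cmd: Sequence[str],
    destroy_cmd: Sequence[str],
    timeout: float = MAX_TEST_SECONDS,
    run: Callable[..., object] = subprocess.run,  # injectable for testing
) -> bool:
    """Run test_cmd with a time limit; always run destroy_cmd afterwards.

    Returns True if the tests finished successfully within the limit.
    """
    try:
        run(list(test_cmd), check=True, timeout=timeout)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False
    finally:
        # Runs on success, failure, and timeout alike.
        run(list(destroy_cmd), check=True)
```

A call might look like run_with_guaranteed_destroy(["pytest", "tests/"], ["terraform", "destroy", "-auto-approve"]). If the destroy itself fails, the exception propagates so the leak is visible instead of silently ignored.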

Drift Detection

Infrastructure drift — when reality diverges from your IaC definitions — is the silent killer. Common causes: manual console changes, out-of-band scripts, and cloud provider auto-updates.

Detection Protocol:

# Run daily via CI/CD
terraform plan -detailed-exitcode
# Exit code 0: No changes (in sync)
# Exit code 1: Error
# Exit code 2: Changes detected (drift!)
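A scheduled job needs to turn those exit codes into an action. The sketch below maps them to a status in Python; the enum and function names are hypothetical. Keep in mind that exit code 2 means the plan is non-empty, which also happens when the configuration has merged changes that were never applied, so a drift alert should be read with that in mind.

```python
import subprocess
from enum import Enum


class DriftStatus(Enum):
    IN_SYNC = 0
    ERROR = 1
    DRIFT = 2


def classify_plan_exit_code(code: int) -> DriftStatus:
    """Map a `terraform plan -detailed-exitcode` exit code to a status."""
    try:
        return DriftStatus(code)
    except ValueError:
        # Anything unexpected (for example, killed by a signal) counts as an error.
        return DriftStatus.ERROR


def check_drift(workdir: str) -> DriftStatus:
    """Run the plan in workdir and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
    )
    return classify_plan_exit_code(result.returncode)
```

The daily job would call check_drift on each workspace directory and alert on DriftStatus.DRIFT, while treating DriftStatus.ERROR as a failure of the check itself, not as evidence that the infrastructure is in sync.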

When drift is detected:

Alert the infrastructure team immediately
Determine if drift is intentional (emergency fix) or accidental
Either update IaC to match reality or revert the drift
Document the root cause and prevent recurrence

CI/CD Pipeline for Infrastructure

Commit → Lint → Validate → Policy Check → Plan → Review → Apply → Verify

Non-negotiable rules:

plan output must be reviewed by a human before apply (for production)
Policy check failures block the pipeline (no exceptions)
Apply to staging before production (always)
Keep plan and apply in the same pipeline run, and apply the saved plan file that was reviewed (prevents plan staleness)
Lock state files during applies (prevent concurrent modifications)
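The gating behavior behind these rules can be sketched as a loop that runs stages in order, stops at the first failure, and refuses to pass the review stage for production without approval. This is a toy model for illustration, not a pipeline implementation; the stage names mirror the flow above, and the function signature is an assumption.

```python
from typing import Callable

# Stage names follow the pipeline flow above (the commit is the trigger).
STAGES = ["lint", "validate", "policy_check", "plan", "review", "apply", "verify"]


def run_pipeline(
    steps: dict[str, Callable[[], bool]],
    production: bool,
    approved: bool,
) -> list[str]:
    """Run stages in order and return the ones that completed.

    Stops at the first failing or missing stage. For production, the
    review stage only passes when a human has approved the plan.
    """
    completed = []
    for name in STAGES:
        if name == "review":
            if production and not approved:
                break
            completed.append(name)
            continue
        step = steps.get(name)
        if step is None or not step():  # a missing stage counts as a failure
            break
        completed.append(name)
    return completed
```

Because the loop breaks instead of skipping, a failed policy check means plan and apply never run, which matches the "no exceptions" rule.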

The teams that test their infrastructure as rigorously as their application code deploy with confidence. Everyone else deploys with anxiety — and anxiety scales poorly.
