Infrastructure as Code: Terraform Patterns That Scale
Write Terraform code that multiple teams can maintain without stepping on each other. Covers module design, state management, workspace strategies, drift detection, CI/CD integration, and the organizational patterns that prevent Terraform from becoming a bottleneck.
Terraform starts simple: write some HCL, run terraform apply, infrastructure appears. At 50 resources, it is manageable. At 500 resources, it is a single state file that takes 10 minutes to plan, where one developer’s change blocks everyone else, and nobody is sure which resources are managed by Terraform and which were created manually in the console.
This guide covers how to structure Terraform for organizations where multiple teams need to manage infrastructure without creating a central bottleneck or an unmaintainable monolith.
State Management: The Foundation
State is the core concept of Terraform. Get it wrong and everything else falls apart.
Remote State Configuration
# backend.tf: always use remote state in production
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "services/checkout-api/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock" # Prevents concurrent applies
    encrypt        = true
  }
}
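Recent Terraform releases can also lock state natively in S3, without a DynamoDB table. The variant below is a sketch of that option, assuming Terraform 1.10 or later and the same bucket and key as in the configuration above; verify it against the S3 backend documentation for your version before adopting it.

```hcl
# backend.tf: variant using S3-native locking instead of a DynamoDB table.
# Assumption: Terraform 1.10 or later is in use.
terraform {
  required_version = ">= 1.10.0"

  backend "s3" {
    bucket       = "company-terraform-state"
    key          = "services/checkout-api/terraform.tfstate"
    region       = "us-east-1"
    use_lockfile = true # Lock file is written alongside the state object
    encrypt      = true
  }
}
```

Either locking mechanism serves the same purpose: a second apply against the same state fails fast instead of racing the first.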
State Splitting Strategy
| Strategy | When to Use | Risk |
|---|---|---|
| Monolithic (one state) | < 50 resources, single team | Slow plans, single point of failure |
| Per-environment | Multiple environments (staging/prod) | Still large per-environment |
| Per-service | Microservices, team ownership | More state files to manage |
| Per-layer | Separate network, compute, data | Cross-layer dependencies |
Recommended: Per-service + Per-environment
terraform/
├── modules/                          # Shared, reusable modules
│   ├── vpc/
│   ├── ecs-service/
│   └── rds/
├── environments/
│   ├── staging/
│   │   ├── networking/               # VPC, subnets, NAT
│   │   │   ├── main.tf
│   │   │   └── terraform.tfstate     # Separate state
│   │   ├── checkout-api/             # Service infrastructure
│   │   │   ├── main.tf
│   │   │   └── terraform.tfstate     # Separate state
│   │   └── shared-database/
│   │       ├── main.tf
│   │       └── terraform.tfstate
│   └── production/
│       ├── networking/
│       ├── checkout-api/
│       └── shared-database/
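Splitting state means a service can no longer reference networking resources directly. One common way to bridge the gap is the terraform_remote_state data source, sketched here for the staging checkout-api directory. The state key and the private_subnet_ids output name are assumptions for illustration; the networking configuration would need to declare an output with that name.

```hcl
# environments/staging/checkout-api/network.tf (hypothetical file name)
# Read outputs published by the networking state instead of sharing one state.
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "company-terraform-state"
    key    = "staging/networking/terraform.tfstate" # Assumed key layout
    region = "us-east-1"
  }
}

locals {
  # Assumed output name; it must exist in the networking configuration.
  private_subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}
```

Only values exposed as outputs cross the boundary, which keeps the dependency between layers explicit and narrow.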
Module Design
Good modules are reusable, versioned, and have clear interfaces.
Module Interface Rules
# modules/ecs-service/variables.tf
variable "service_name" {
  description = "Name of the ECS service"
  type        = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]+$", var.service_name))
    error_message = "Service name must start with a lowercase letter and contain only lowercase letters, digits, and hyphens."
  }
}

variable "container_image" {
  description = "Docker image URI (e.g., 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:v1.2.3)"
  type        = string
}

variable "cpu" {
  description = "CPU units (1 vCPU = 1024 units)"
  type        = number
  default     = 256
}

variable "memory" {
  description = "Memory in MiB"
  type        = number
  default     = 512
}

variable "environment" {
  description = "Environment name (staging, production)"
  type        = string

  validation {
    condition     = contains(["staging", "production"], var.environment)
    error_message = "Environment must be 'staging' or 'production'."
  }
}

variable "tags" {
  description = "Tags to apply to all resources"
  type        = map(string)
  default     = {}
}
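Inside the module, these inputs can be combined once and reused on every resource. The fragment below is a minimal sketch of that idea; the file name, tag keys, and output name are assumptions, not part of the module shown above.

```hcl
# modules/ecs-service/tags.tf (hypothetical file)
locals {
  # Caller-supplied tags first, then the tags the module always sets,
  # so the module's values win on key conflicts.
  common_tags = merge(var.tags, {
    Service     = var.service_name
    Environment = var.environment
    ManagedBy   = "terraform"
  })
}

# Expose the merged tags so callers can reuse them on related resources.
output "tags" {
  description = "Tags applied to all resources created by this module"
  value       = local.common_tags
}
```

Resources in the module would then set tags = local.common_tags, which keeps tagging consistent without each caller repeating the same keys.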
Module Versioning
# Pin module versions: never use unversioned source references

# ✅ Good: pinned version
module "checkout_api" {
  source = "git::https://github.com/company/terraform-modules.git//ecs-service?ref=v2.3.1"

  service_name    = "checkout-api"
  container_image = "123456789.dkr.ecr.us-east-1.amazonaws.com/checkout:v1.5.0"
  environment     = "production"
}

# ❌ Bad: unpinned (tracks the default branch, breaks without warning)
module "checkout_api" {
  source = "git::https://github.com/company/terraform-modules.git//ecs-service"
}
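If modules are published to a module registry instead of being referenced through Git, the version argument accepts semver constraints. This is a sketch only: the registry address is hypothetical, and the version argument applies to registry sources, not to git:: sources.

```hcl
module "checkout_api" {
  source  = "app.terraform.io/company/ecs-service/aws" # Hypothetical registry address
  version = "~> 2.3.1" # Allows 2.3.1 and later 2.3.x patches, but not 2.4.0

  service_name    = "checkout-api"
  container_image = "123456789.dkr.ecr.us-east-1.amazonaws.com/checkout:v1.5.0"
  environment     = "production"
}
```

A patch-level constraint like this picks up bug fixes on the next terraform init -upgrade while keeping feature and breaking releases behind an explicit version bump.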
Terraform CI/CD Integration
Never run terraform apply from a laptop in production. All infrastructure changes should flow through CI/CD.
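# GitHub Actions: Terraform CI/CD
# Cloud credentials (for example, an OIDC role) must be configured before init; omitted here.
name: Terraform
on:
  pull_request:
    paths: ['terraform/**']
  push:
    branches: [main]
    paths: ['terraform/**']
jobs:
  plan:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    permissions:
      contents: read
      pull-requests: write # Needed to post the plan as a PR comment
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false # Keep stdout clean for redirection
      - name: Terraform Init
        run: terraform init
        working-directory: terraform/environments/production/checkout-api
      - name: Terraform Plan
        # Save the binary plan, then render it as text for the PR comment
        run: |
          terraform plan -out=tfplan -no-color
          terraform show -no-color tfplan > tfplan.txt
        working-directory: terraform/environments/production/checkout-api
      - name: Post plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            const plan = require('fs').readFileSync(
              'terraform/environments/production/checkout-api/tfplan.txt', 'utf8');
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: `## Terraform Plan\n\`\`\`\n${plan}\n\`\`\``
            });
  apply:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform Init
        run: terraform init
        working-directory: terraform/environments/production/checkout-api
      - name: Terraform Apply
        run: terraform apply -auto-approve
        working-directory: terraform/environments/production/checkout-api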
Drift Detection
Infrastructure drift happens when someone changes a resource outside of Terraform — through the cloud console, CLI, or another tool. If you do not detect drift, your Terraform state becomes a lie.
# Run plan regularly to detect drift
# Schedule this in CI (daily or on-commit)
terraform plan -detailed-exitcode
# Exit codes:
# 0 = No changes (configuration, state, and real infrastructure agree)
# 1 = Error
# 2 = Changes detected (drift, if the configuration itself has not changed)
| Drift Response | When to Use |
|---|---|
| Auto-remediate | Non-critical resources, well-tested modules |
| Alert and investigate | Production resources, security-sensitive configs |
| Import into state | Legitimate change made outside Terraform |
| Ignore | Never. Drift always gets worse. |
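For the "Import into state" response, one option is a declarative import block, which brings a manually created resource under management during a normal plan and apply. The sketch below assumes Terraform 1.5 or later; the bucket name and resource address are hypothetical.

```hcl
# Adopt a bucket that was created by hand in the console.
import {
  to = aws_s3_bucket.checkout_assets
  id = "company-checkout-assets" # Hypothetical bucket name
}

# The configuration must describe the resource being imported.
resource "aws_s3_bucket" "checkout_assets" {
  bucket = "company-checkout-assets"
}
```

Because the import appears in the plan output, it goes through the same PR review as any other change. The import block can be deleted once the apply has succeeded.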
Common Mistakes
| Mistake | Consequence | Prevention |
|---|---|---|
| Hardcoded values | Cannot reuse across environments | Use variables with validation |
| No state locking | Concurrent applies corrupt state | DynamoDB lock table (AWS), GCS lock (GCP) |
| Giant state files | 10-minute plans, blast radius of entire infrastructure | Split state per-service or per-layer |
| No module versioning | Module changes break consumers | Pin versions, use semver |
| Manual console changes | Drift between state and reality | Run daily drift detection |
| Secrets in .tf files | Credentials in version control | Use data sources, SSM Parameter Store, Vault |
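As an illustration of the last row, a secret can be read from SSM Parameter Store at plan time instead of being written into a .tf file. The parameter path below is hypothetical, and the sketch assumes the AWS provider is already configured.

```hcl
# Look up the secret by name; only the path lives in version control.
data "aws_ssm_parameter" "db_password" {
  name            = "/checkout-api/production/db-password" # Hypothetical path
  with_decryption = true
}

locals {
  # The provider marks this value as sensitive, so it is redacted in plan output.
  db_password = data.aws_ssm_parameter.db_password.value
}
```

The value read by a data source is still recorded in the state file, so this approach relies on the state bucket being encrypted and access to it being restricted.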
Implementation Checklist
- Use remote state with locking (S3 + DynamoDB or GCS + built-in locking)
- Split state files: per-service or per-layer, never one giant state
- Create reusable modules with clear interfaces (variables, outputs, validation)
- Pin module versions — never use unversioned source references
- Run terraform plan on every PR, post results as PR comment
- Run terraform apply only from CI/CD, never from laptops (production)
- Schedule daily drift detection and alert on any unexpected changes
- Never store secrets in .tf files; use Parameter Store or Vault
- Tag all resources with owner, service, environment, and managed-by
- Review state files quarterly: remove dead resources, consolidate where useful