Infrastructure Drift Detection
Detect and remediate unauthorized changes to infrastructure. Covers drift detection tools, reconciliation strategies, policy enforcement, and the patterns that ensure your actual infrastructure matches your declared state.
Infrastructure drift occurs when the actual state of your infrastructure diverges from the state declared in code. Someone SSH’d into a server and changed a config file. A security group rule was added by hand during an incident and never reverted. A developer clicked through the cloud console to test something and forgot to clean up. Drift is inevitable; detecting and fixing it is an engineering discipline.
Drift Detection
Sources of Drift:
Manual Changes (most common):
☐ Console/portal clicks that bypass IaC
☐ SSH into servers for "quick fixes"
☐ Manual security group or IAM changes
☐ Emergency changes during incidents
Automation Gaps:
☐ Terraform apply failed partway through
☐ Resources created outside of Terraform
☐ State file out of sync with reality
External Forces:
☐ Cloud provider changes defaults
☐ Auto-scaling creates resources not in state
☐ Third-party integrations modify resources
Detection approaches:
1. terraform plan (diffs the declared configuration against real infrastructure)
2. AWS Config rules (continuous monitoring)
3. Cloud Custodian (policy-based scanning)
4. Firefly / env0 / Spacelift (drift as a service)
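The first approach is the easiest to script. The sketch below assumes the Terraform CLI is on the PATH and that the working directory has already been initialized; the function names `interpret_plan_exit_code` and `check_workspace` are hypothetical. It relies on `terraform plan -detailed-exitcode`, which exits with 0 when nothing would change, 1 on error, and 2 when the plan contains changes.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_EXIT_STATUS = {0: "in_sync", 1: "error", 2: "changes_present"}


def interpret_plan_exit_code(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` exit code to a status label."""
    return PLAN_EXIT_STATUS.get(code, "unknown")


def check_workspace(workdir: str) -> str:
    """Run a plan in `workdir` and report its status.

    Hypothetical helper: requires terraform on PATH and an initialized directory.
    """
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(proc.returncode)
```

A "changes_present" result means the plan is non-empty, which covers unapplied code changes as well as drift. Running the check against the revision that was last applied is one way to make sure a non-empty plan really reflects drift.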
Terraform Drift Detection
class DriftDetector:
    """Automated drift detection using Terraform."""

    def detect_drift(self, workspace: str):
        """Run terraform plan and parse for drift."""
        result = self.run_terraform_plan(workspace)
        changes = []

        for resource in result.resource_changes:
            # Any action other than "no-op" means the plan would modify
            # the resource, i.e. reality and declared state disagree.
            if resource.change.actions != ["no-op"]:
                changes.append({
                    "resource": resource.address,
                    "type": resource.type,
                    "action": resource.change.actions,
                    "before": resource.change.before,
                    "after": resource.change.after,
                    "drift_fields": self.diff_fields(
                        resource.change.before,
                        resource.change.after,
                    ),
                })

        if changes:
            severity = self.classify_severity(changes)
            self.alert(
                channel="#infrastructure",
                message=f"Drift detected in {workspace}: "
                        f"{len(changes)} resources changed",
                severity=severity,
                changes=changes,
            )

            if severity == "critical":
                # Auto-remediate critical drift
                self.auto_remediate(workspace, changes)
            else:
                # Create ticket for non-critical drift
                self.create_ticket(workspace, changes)

        return changes

    def classify_severity(self, changes):
        """Classify drift severity by resource type."""
        # Security-sensitive resource types: drift here is treated as critical.
        critical_types = [
            "aws_security_group_rule",
            "aws_iam_policy",
            "aws_s3_bucket_policy",
            "aws_kms_key",
        ]
        # A single critical resource makes the whole batch critical.
        for change in changes:
            if change["type"] in critical_types:
                return "critical"
        return "warning"
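The class above leaves its helpers undefined. As a rough sketch of two of them, the code below parses the JSON that `terraform show -json <planfile>` emits (a top-level `resource_changes` list whose entries carry `address`, `type`, and `change.actions`/`before`/`after`) and computes which top-level attributes differ. The function names `changed_resources` and `diff_fields` and the sample data are illustrative assumptions, not the author's implementation.

```python
import json
from typing import Any, Optional

_MISSING = object()  # sentinel so a missing key differs from an explicit null


def diff_fields(before: Optional[dict], after: Optional[dict]) -> list[str]:
    """Return the sorted top-level attribute names whose values differ."""
    before = before or {}
    after = after or {}
    keys = before.keys() | after.keys()
    return sorted(
        k for k in keys if before.get(k, _MISSING) != after.get(k, _MISSING)
    )


def changed_resources(plan_json: str) -> list[dict[str, Any]]:
    """Extract every resource whose planned actions are not a no-op."""
    plan = json.loads(plan_json)
    changes = []
    for rc in plan.get("resource_changes", []):
        change = rc["change"]
        if change["actions"] == ["no-op"]:
            continue
        changes.append({
            "resource": rc["address"],
            "type": rc["type"],
            "action": change["actions"],
            "drift_fields": diff_fields(change.get("before"), change.get("after")),
        })
    return changes


# Made-up plan: one untouched bucket, one security group rule opened by hand.
sample_plan = json.dumps({
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
         "change": {"actions": ["no-op"],
                    "before": {"acl": "private"}, "after": {"acl": "private"}}},
        {"address": "aws_security_group_rule.ssh", "type": "aws_security_group_rule",
         "change": {"actions": ["update"],
                    "before": {"cidr_blocks": ["0.0.0.0/0"], "from_port": 22},
                    "after": {"cidr_blocks": ["10.0.0.0/8"], "from_port": 22}}},
    ]
})

drifted = changed_resources(sample_plan)
```

In a plan, `before` is what currently exists and `after` is what the configuration declares, so in the sample the widened `cidr_blocks` value is the drift and the narrower one is the intended state. The comparison only looks at top-level attributes; nested blocks would need a recursive diff.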
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No drift detection at all | Infrastructure diverges silently | Scheduled drift scans (daily minimum) |
| Detect but never remediate | Drift accumulates, state becomes fiction | Auto-remediate critical drift, ticket the rest |
| Console access without guardrails | Every console user can cause drift | Read-only console, SCPs preventing manual changes |
| No incident drift tracking | Emergency changes become permanent drift | Post-incident drift review: revert or codify |
| Ignore state file health | Corrupt state = no drift detection | State file versioning, locking, regular validation |
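The "daily minimum" in the first row is only useful if someone notices when scans stop running. A minimal sketch of such a check follows, assuming scan completion times are recorded per workspace somewhere; the `overdue_workspaces` function, the 24-hour threshold constant, and the workspace names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold matching the "daily minimum" in the table above.
MAX_SCAN_AGE = timedelta(hours=24)


def overdue_workspaces(last_scans: dict[str, datetime], now: datetime) -> list[str]:
    """Return workspaces whose most recent drift scan is older than MAX_SCAN_AGE."""
    return sorted(ws for ws, ts in last_scans.items() if now - ts > MAX_SCAN_AGE)


# Illustrative data: "staging" was last scanned three days before `now`.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
last_scans = {
    "prod": now - timedelta(hours=6),
    "staging": now - timedelta(days=3),
}
stale = overdue_workspaces(last_scans, now)
```

Alerting on the scan's absence, rather than only on its findings, guards against the silent failure mode where the detector itself has stopped.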
Infrastructure drift is the gap between what you think you have and what you actually have. The smaller that gap, the safer your infrastructure. Detect drift at least daily, remediate critical drift immediately and ticket the rest, and prevent it at the source with policy enforcement.