Infrastructure Drift Detection

Infrastructure drift occurs when the actual state of your infrastructure diverges from the state declared in code. Someone SSH’d in and changed a config file. A security group rule was added by hand during an incident and never reverted. A developer clicked through the cloud console to test something and forgot to clean up. Drift is inevitable; detecting and fixing it is an engineering discipline.

Drift Detection

Sources of Drift:

Manual Changes (most common):
  ☐ Console/portal clicks that bypass IaC
  ☐ SSH into servers for "quick fixes"
  ☐ Manual security group or IAM changes
  ☐ Emergency changes during incidents

Automation Gaps:
  ☐ Terraform apply failed partway through
  ☐ Resources created outside of Terraform
  ☐ State file out of sync with reality

External Forces:
  ☐ Cloud provider changes defaults
  ☐ Auto-scaling creates resources not in state
  ☐ Third-party integrations modify resources
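
Resources created outside of Terraform never appear in a plan, because Terraform only compares what it already tracks. One way to surface them is a set comparison between the IDs recorded in state and the IDs found in the cloud account. The sketch below assumes both ID sets have already been collected by some other means; the function name and the "unmanaged"/"missing" labels are illustrative, not part of any tool.

```python
def compare_inventory(state_ids: set[str], actual_ids: set[str]) -> dict[str, list[str]]:
    """Compare resource IDs tracked in state with IDs found in the account.

    unmanaged: exists in the account but not in state (created out of band)
    missing:   recorded in state but no longer present (deleted out of band)
    """
    return {
        "unmanaged": sorted(actual_ids - state_ids),
        "missing": sorted(state_ids - actual_ids),
    }


# Hypothetical security group IDs, for illustration only.
report = compare_inventory(
    state_ids={"sg-1", "sg-2"},
    actual_ids={"sg-2", "sg-3"},
)
print(report)  # {'unmanaged': ['sg-3'], 'missing': ['sg-1']}
```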

Detection approaches:
  1. terraform plan (shows the diff; add -refresh-only to isolate changes made outside Terraform)
  2. AWS Config rules (continuous monitoring)
  3. Cloud Custodian (policy-based scanning)
  4. Firefly / env0 / Spacelift (drift as a service)
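
As a minimal sketch of the first approach, a scheduled job can use `terraform plan -detailed-exitcode`, which exits with 0 when the plan is empty, 1 on error, and 2 when the plan contains changes. The function names, status labels, and `workdir` parameter below are illustrative assumptions. Note that a plain plan reports un-applied code changes as well as out-of-band drift, so exit code 2 means "changes present", not necessarily drift.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_STATUS = {0: "in_sync", 1: "error", 2: "changes"}


def interpret_plan_exit_code(code: int) -> str:
    """Map a plan exit code to a status label (labels are our own naming)."""
    return PLAN_STATUS.get(code, "error")


def check_workspace(workdir: str) -> str:
    """Run a plan in workdir and return its status label.

    Assumes the terraform binary is on PATH and that `terraform init`
    has already been run in workdir.
    """
    proc = subprocess.run(
        [
            "terraform",
            f"-chdir={workdir}",
            "plan",
            "-detailed-exitcode",
            "-input=false",
            "-no-color",
        ],
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(proc.returncode)
```

A cron job or CI pipeline could call `check_workspace` once per workspace and alert on any result other than "in_sync".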

Terraform Drift Detection

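class DriftDetector:
    """Automated drift detection using Terraform.

    The helpers run_terraform_plan, diff_fields, alert, auto_remediate,
    and create_ticket are assumed to be implemented elsewhere.
    """

    def detect_drift(self, workspace: str) -> list[dict]:
        """Run terraform plan and parse the result for drift."""
        # Expected to return the parsed JSON plan. A plain plan also lists
        # code changes that were never applied, not only out-of-band drift.
        result = self.run_terraform_plan(workspace)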
        
        changes = []
        for resource in result.resource_changes:
            if resource.change.actions != ["no-op"]:
                changes.append({
                    "resource": resource.address,
                    "type": resource.type,
                    "action": resource.change.actions,
                    "before": resource.change.before,
                    "after": resource.change.after,
                    "drift_fields": self.diff_fields(
                        resource.change.before,
                        resource.change.after,
                    ),
                })
        
        if changes:
            severity = self.classify_severity(changes)
            self.alert(
                channel="#infrastructure",
                message=f"Drift detected in {workspace}: "
                        f"{len(changes)} resources changed",
                severity=severity,
                changes=changes,
            )
            
            if severity == "critical":
                # Auto-remediate critical drift
                self.auto_remediate(workspace, changes)
            else:
                # Create ticket for non-critical drift
                self.create_ticket(workspace, changes)
        
        return changes
    
    def classify_severity(self, changes):
        """Classify drift severity by resource type."""
        # Security-sensitive resource types: drift in any of them is critical.
        critical_types = {
            "aws_security_group_rule",
            "aws_iam_policy",
            "aws_s3_bucket_policy",
            "aws_kms_key",
        }
        if any(change["type"] in critical_types for change in changes):
            return "critical"
        return "warning"
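
The class above leaves its parsing and diffing helpers undefined. One possible shape for them is sketched below, assuming the plan was saved to a file and rendered with `terraform show -json`, whose output carries a `resource_changes` list with `address`, `type`, and a `change` object holding `actions`, `before`, and `after`. The function names are hypothetical, and comparing only top-level attributes is a simplifying assumption; nested blocks would need a recursive diff.

```python
import json
from typing import Any, Optional


def diff_fields(before: Optional[dict], after: Optional[dict]) -> list[str]:
    """Return the sorted top-level attribute names whose values differ.

    before or after may be None when a resource is created or destroyed.
    """
    before = before or {}
    after = after or {}
    return sorted(
        key
        for key in before.keys() | after.keys()
        if before.get(key) != after.get(key)
    )


def changed_resources(plan_json: str) -> list[dict[str, Any]]:
    """Extract every entry whose planned action is not a no-op."""
    plan = json.loads(plan_json)
    changes = []
    for rc in plan.get("resource_changes", []):
        change = rc["change"]
        if change["actions"] == ["no-op"]:
            continue
        changes.append({
            "resource": rc["address"],
            "type": rc["type"],
            "action": change["actions"],
            "drift_fields": diff_fields(change.get("before"), change.get("after")),
        })
    return changes


# Hand-written sample in the shape of the JSON plan output (not real data).
sample = json.dumps({
    "resource_changes": [
        {
            "address": "aws_s3_bucket_policy.logs",
            "type": "aws_s3_bucket_policy",
            "change": {
                "actions": ["update"],
                "before": {"bucket": "logs", "policy": "a"},
                "after": {"bucket": "logs", "policy": "b"},
            },
        },
        {
            "address": "aws_instance.web",
            "type": "aws_instance",
            "change": {"actions": ["no-op"], "before": {}, "after": {}},
        },
    ]
})
print(changed_resources(sample)[0]["drift_fields"])  # ['policy']
```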

Anti-Patterns

Anti-Pattern	Consequence	Fix
No drift detection at all	Infrastructure diverges silently	Scheduled drift scans (daily minimum)
Detect but never remediate	Drift accumulates, state becomes fiction	Auto-remediate critical drift, ticket the rest
Console access without guardrails	Every console user can cause drift	Read-only console, SCPs preventing manual changes
No incident drift tracking	Emergency changes become permanent drift	Post-incident drift review: revert or codify
Ignore state file health	Corrupt state = no drift detection	State file versioning, locking, regular validation

Infrastructure drift is the gap between what you think you have and what you actually have. The smaller that gap, the safer your infrastructure. Detect drift daily, remediate immediately, and prevent it at the source with policy enforcement.
