Event-Driven Automation: Reacting to Changes Instead of Polling
Design automation systems that respond to events in real time rather than running on fixed schedules. Covers event sources, webhook patterns, event buses, EventBridge (formerly CloudWatch Events) rules, dead letter queues, and building reactive pipelines that scale without cron-job sprawl.
Most automation starts with cron. A script runs every 5 minutes, checks if something changed, and acts if it did. This works until you have 200 cron jobs, half of which run for no reason because nothing changed, and the other half miss events because they happened between polling intervals.
Event-driven automation inverts the model. Instead of asking “has anything changed?” every N minutes, you subscribe to change events and react immediately. The infrastructure scales naturally because compute only runs when there is work to do.
Event Sources in Practice
Infrastructure Events
Cloud platforms emit events for virtually every state change:
- AWS EventBridge: EC2 state changes, S3 object creation, ECS task status, CloudFormation stack updates
- Azure Event Grid: Resource group modifications, blob storage events, service health changes
- GCP Eventarc: Cloud Storage, Pub/Sub, Cloud Audit Logs, Cloud Build
Example: Auto-tag any new EC2 instance that lacks a cost-center tag. The EventBridge rule pattern below matches any instance entering the `running` state:
```json
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {
    "state": ["running"]
  }
}
```
When a matching event fires, a Lambda function checks whether the instance has the required tags and applies defaults if any are missing. No polling. No cron. Instant compliance.
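A minimal sketch of that Lambda handler, assuming boto3; the tag key and default value are illustrative:

```python
import boto3

ec2 = boto3.client("ec2")
DEFAULT_TAGS = {"cost-center": "unallocated"}  # assumed org default

def handler(event, context):
    instance_id = event["detail"]["instance-id"]
    # Fetch the tags currently attached to this instance
    resp = ec2.describe_tags(
        Filters=[{"Name": "resource-id", "Values": [instance_id]}]
    )
    existing = {tag["Key"] for tag in resp["Tags"]}
    missing = [
        {"Key": k, "Value": v} for k, v in DEFAULT_TAGS.items() if k not in existing
    ]
    if missing:
        ec2.create_tags(Resources=[instance_id], Tags=missing)
```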
Application Events
- Webhooks: Stripe payment events, GitHub push/PR events, Slack interactions
- Database CDC: PostgreSQL logical replication, DynamoDB Streams, MongoDB Change Streams (see the consumer sketch after this list)
- Message Queues: RabbitMQ, Kafka, SQS — application-generated domain events
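As one concrete CDC shape: DynamoDB Streams can invoke a Lambda with batches of change records. A minimal sketch of such a consumer; `react_to_new_item` is a hypothetical app-provided function:

```python
def handle_change_records(event, context):
    # DynamoDB Streams delivers batches of INSERT / MODIFY / REMOVE records
    for record in event["Records"]:
        if record["eventName"] == "INSERT":
            item = record["dynamodb"]["NewImage"]  # DynamoDB attribute-value format
            react_to_new_item(item)  # hypothetical app-provided reaction
```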
Observability Events
- Alert triggers: PagerDuty, Datadog, Prometheus Alertmanager
- Log patterns: CloudWatch Logs subscription filters, Loki alert rules
- Metric thresholds: Auto-scaling on CPU, queue depth, error rate
Webhook Architecture
Webhooks are the simplest form of event-driven automation. A source system sends an HTTP POST to your endpoint when something happens.
Building Reliable Webhook Receivers
```python
import hashlib
import hmac
import json
import os
import time

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
WEBHOOK_SECRET = os.environ["STRIPE_WEBHOOK_SECRET"]  # assumed env var

def verify_stripe_signature(payload: bytes, header: str, tolerance: int = 300) -> bool:
    # Stripe signs payloads as `Stripe-Signature: t=<ts>,v1=<hex hmac>`;
    # the signed message is "<ts>.<raw body>" keyed with the endpoint secret
    try:
        parts = dict(p.split("=", 1) for p in header.split(","))
        timestamp, signature = parts["t"], parts["v1"]
        if abs(time.time() - int(timestamp)) > tolerance:  # limit replay window
            return False
        signed = f"{timestamp}.".encode() + payload
        expected = hmac.new(WEBHOOK_SECRET.encode(), signed, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, signature)
    except (KeyError, ValueError):
        return False

@app.post("/webhooks/stripe")
async def handle_stripe(request: Request):
    # 1. Verify signature before trusting anything in the payload
    payload = await request.body()
    sig = request.headers.get("Stripe-Signature", "")
    if not verify_stripe_signature(payload, sig):
        raise HTTPException(status_code=401)
    # 2. Parse event
    event = json.loads(payload)
    # 3. Idempotency check: providers retry, so dedupe on the event ID
    if await already_processed(event["id"]):  # app-provided (sketch below)
        return {"status": "duplicate"}
    # 4. Hand off to a worker; never block the response on business logic
    await queue.enqueue("process_stripe_event", event)  # app-provided task queue
    # 5. Acknowledge immediately
    return {"status": "accepted"}
```
Webhook Best Practices
- Verify signatures — Every reputable webhook provider signs payloads. Always verify.
- Respond fast — Return a 2xx quickly; many providers time out after a few seconds. Do the real work asynchronously.
- Handle retries — Providers retry on failure. Use idempotency keys; a sketch of the `already_processed` helper follows this list.
- Store raw payloads — Log the raw webhook body before processing. Invaluable for debugging.
- Version your handlers — Webhook schemas change. Handle unknown fields gracefully.
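The `already_processed` helper in the receiver above is left to the application. One possible sketch backed by Redis; the key prefix and TTL are illustrative:

```python
import redis.asyncio as redis

r = redis.Redis()  # assumes a reachable Redis instance

async def already_processed(event_id: str, ttl_seconds: int = 86400) -> bool:
    # SET with NX succeeds only for the first writer; duplicates get None back
    first_writer = await r.set(f"webhook:{event_id}", "1", nx=True, ex=ttl_seconds)
    return not first_writer
```

The atomic SET NX avoids the check-then-write race you would get from a separate GET followed by a SET.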
Event Bus Patterns
For internal event-driven automation, an event bus decouples producers from consumers:
```
[Service A] --publish--> [Event Bus] --subscribe--> [Automation 1]
[Service B] --publish--> [Event Bus] --subscribe--> [Automation 2]
[Service C] --publish--> [Event Bus] --subscribe--> [Automation 3]
```
AWS EventBridge Rules
```json
{
  "source": ["custom.deployment"],
  "detail-type": ["DeploymentCompleted"],
  "detail": {
    "environment": ["production"],
    "status": ["success"]
  }
}
```
Target: Lambda function that posts to Slack, updates a status page, and triggers smoke tests.
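The producer side is a single API call. A minimal sketch with boto3; note that the PutEvents request uses PascalCase keys, while the rule pattern above matches on the lowercase fields of the delivered event:

```python
import json

import boto3

events = boto3.client("events")  # the boto3 service name for EventBridge

def publish_deployment_completed(environment: str, status: str) -> None:
    # Deliver a custom event to the default bus; matching rules fan it out
    events.put_events(Entries=[{
        "Source": "custom.deployment",
        "DetailType": "DeploymentCompleted",
        "Detail": json.dumps({"environment": environment, "status": status}),
    }])
```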
Dead Letter Queues
Events that fail processing must not be lost:
```yaml
EventRule:
  Type: AWS::Events::Rule
  Properties:
    Targets:
      - Arn: !GetAtt ProcessFunction.Arn
        Id: process-function  # CloudFormation requires a target Id
        DeadLetterConfig:
          Arn: !GetAtt FailedEventsQueue.Arn
        RetryPolicy:
          MaximumRetryAttempts: 3
          MaximumEventAgeInSeconds: 86400
```
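Once events land in the DLQ, something has to look at them. A minimal sketch of draining the queue for inspection, assuming SQS; the queue URL and `inspect_failed_event` are illustrative:

```python
import boto3

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/failed-events"  # illustrative

def drain_dlq(max_messages: int = 10) -> None:
    resp = sqs.receive_message(QueueUrl=DLQ_URL, MaxNumberOfMessages=max_messages)
    for msg in resp.get("Messages", []):
        inspect_failed_event(msg["Body"])  # hypothetical: re-emit, archive, or page
        sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
```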
Replacing Cron with Events
Before: Polling Pattern
```bash
# Check for new S3 files every 5 minutes
*/5 * * * * python check_new_uploads.py
```
Problems:
- 5-minute latency on detection
- Runs 288 times per day even if zero uploads
- Must track “already processed” state
- Fails silently if the host is down
After: Event-Driven
```python
# Triggered when S3 emits an "Object Created" event (e.g. PutObject) via EventBridge
def handle_new_upload(event):
    bucket = event["detail"]["bucket"]["name"]
    key = event["detail"]["object"]["key"]
    process_file(bucket, key)        # app-provided processing logic
    notify_team(f"Processed {key}")  # app-provided notification helper
```
Benefits:
- Sub-second latency
- Runs only when files are uploaded
- No state tracking needed (each event is self-contained)
- Retry and dead-letter built into the platform
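The wiring on the AWS side is a single EventBridge rule. A sketch of the pattern, assuming the bucket has EventBridge notifications enabled; the bucket name is illustrative:

```json
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {
      "name": ["uploads-bucket"]
    }
  }
}
```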
When to Keep Cron
Not everything should be event-driven:
- Daily reports/summaries: No triggering event, time-based by nature
- Data cleanup/archival: Periodic maintenance
- Health checks: Regular heartbeat verification
- Batch aggregation: Collecting data before processing
The rule: if there is a clear triggering event, use events. If the trigger is “it’s Tuesday at 3am,” use cron.
Building Reactive Pipelines
Chain events into pipelines where the output of one automation triggers the next:
```
[Code Push]
  → [CI Build]
    → [Tests Pass Event]
      → [Deploy to Staging]
        → [Smoke Test Pass Event]
          → [Deploy to Production]
            → [Post-Deploy Event]
              → [Slack Notification + Metrics Reset]
```
Pipeline Design Principles
- Each stage is independently retriable — Failure at stage 3 does not require re-running stages 1-2
- Events carry context — Each event includes enough data for the next stage to operate without querying back
- Stages are idempotent — Re-processing the same event produces the same result
- Timeouts exist everywhere — A stage that does not emit a completion event within N minutes triggers an alert
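A sketch of an event envelope that satisfies these principles; the field names are illustrative, not a standard:

```python
import time
import uuid

def make_stage_event(event_type: str, parent: dict | None, payload: dict) -> dict:
    # Correlation ID and lineage travel with the event, so downstream stages
    # can act without querying back and consumers can spot repeat visits
    return {
        "id": str(uuid.uuid4()),
        "type": event_type,
        "timestamp": time.time(),
        "correlation_id": parent["correlation_id"] if parent else str(uuid.uuid4()),
        "lineage": parent["lineage"] + [parent["type"]] if parent else [],
        "payload": payload,
    }
```

A consumer that finds its own event type in `lineage` can refuse to emit, which breaks the circular A→B→A loops listed in the anti-patterns below.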
Observability for Event-Driven Systems
Event-driven systems are harder to debug because there is no linear request flow. Invest in:
- Event tracing: Correlation IDs that follow events through the entire pipeline
- Event logs: Every event received, processed, or failed — with the full payload
- Lag monitoring: Time between event emission and processing completion
- Dead letter monitoring: Alerts when the DLQ receives items
```python
import logging
import time

logger = logging.getLogger("events")

# Always log the event lifecycle
logger.info("event_received", extra={
    "event_id": event["id"],
    "event_type": event["type"],
    "correlation_id": event.get("correlation_id"),
    "age_seconds": time.time() - event["timestamp"],  # assumes epoch-seconds timestamps
})
```
Anti-Patterns
| Anti-Pattern | Risk | Fix |
|---|---|---|
| Fire and forget | Lost events, silent failures | Acknowledge only after processing |
| Unbounded fan-out | One event triggers thousands of actions | Rate limit consumers, batch where possible |
| Circular events | A→B→A infinite loop | Include event lineage, detect cycles |
| Tight coupling to event schema | Breaking changes cascade | Version events, use schema registry |
| No dead letter queue | Failed events vanish | Always configure DLQ with alerting |
Event-driven automation is not a silver bullet — it adds complexity in exchange for responsiveness and scalability. Start by replacing your most painful cron jobs. Prove the pattern works. Then expand.