DNS Architecture and Resilience
Design DNS infrastructure for reliability, performance, and security. Covers DNS resolution, record types, failover strategies, DNSSEC, DNS load balancing, GeoDNS, and the patterns that prevent DNS from being the single point of failure.
DNS is the internet’s phone book, and its most overlooked single point of failure. When DNS goes down, nothing works: no web traffic, no API calls, no email, no service discovery. Yet most teams spend more time choosing a CSS framework than designing DNS resilience. DNS problems are invisible until they are catastrophic.
DNS Resolution Path
User types: app.example.com
1. Browser cache → Found? Return IP
2. OS cache → Found? Return IP
3. Recursive resolver (ISP/Google/Cloudflare) → Found in its cache? Return IP
   On a cache miss, the recursive resolver walks the hierarchy on the user's behalf:
4. Root nameserver → "Ask the .com TLD servers"
5. .com TLD nameserver → "Ask ns1.example.com"
6. Authoritative nameserver for example.com → "app.example.com = 203.0.113.10"
Total time: roughly 50-200ms (uncached) / <1ms (answered from a local cache)
TTL: How long resolvers cache the answer
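To make the caching behavior concrete, here is a minimal sketch of a TTL-based cache sitting in front of an authoritative lookup. The names `TTLCache` and `resolve` and the in-memory `zone` dictionary are illustrative assumptions, not part of any real resolver; the dictionary simply stands in for the root → TLD → authoritative walk.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Toy resolver cache: each answer expires TTL seconds after it was stored."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # name -> (ip, expires_at)

    def get(self, name: str) -> Optional[str]:
        entry = self._entries.get(name)
        if entry is None:
            return None
        ip, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # expired: force a fresh lookup
            return None
        return ip

    def put(self, name: str, ip: str, ttl: int) -> None:
        self._entries[name] = (ip, self._clock() + ttl)


def resolve(
    name: str,
    cache: TTLCache,
    authoritative: Dict[str, Tuple[str, int]],
) -> Tuple[str, str]:
    """Return (ip, source), where source is 'cache' or 'authoritative'."""
    cached = cache.get(name)
    if cached is not None:
        return cached, "cache"
    # Hypothetical stand-in for the full root -> TLD -> authoritative walk.
    ip, ttl = authoritative[name]
    cache.put(name, ip, ttl)
    return ip, "authoritative"


zone = {"app.example.com": ("203.0.113.10", 60)}
cache = TTLCache()
print(resolve("app.example.com", cache, zone))  # first lookup: authoritative
print(resolve("app.example.com", cache, zone))  # within 60 s: served from cache
```

The injectable `clock` is a design choice that makes expiry easy to exercise without sleeping; a real resolver would also cache negative answers, which this sketch leaves out.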
Record Types
A Record: app.example.com → 203.0.113.10 (IPv4)
AAAA Record: app.example.com → 2001:db8::1 (IPv6)
CNAME Record: www.example.com → app.example.com (Alias)
MX Record: example.com → mail.example.com (Email)
TXT Record: example.com → "v=spf1..." (Verification)
SRV Record: _https._tcp.example.com → app.example.com:443 (Service)
NS Record: example.com → ns1.cloudflare.com (Nameserver)
CAA Record: example.com → letsencrypt.org (CA Authorization)
TTL Guidelines:
Static IPs (databases, internal): 3600s (1 hour)
CDN/Load Balancer endpoints: 300s (5 minutes)
Failover records: 60s (1 minute)
During migration: 30s (30 seconds)
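The TTL guidelines above imply some simple arithmetic when planning a migration: the lower TTL only helps once every cache has seen it, so it has to be published at least one old-TTL period before the cutover. The sketch below works through that timing; the function names are hypothetical, and it assumes resolvers honor TTLs exactly, which some do not.

```python
from datetime import datetime, timedelta


def latest_ttl_change(cutover: datetime, old_ttl_seconds: int) -> datetime:
    """Latest moment to publish the lower TTL so caches pick it up before cutover."""
    return cutover - timedelta(seconds=old_ttl_seconds)


def stale_window_end(cutover: datetime, migration_ttl_seconds: int) -> datetime:
    """Time by which caches that honor the TTL have dropped the old IP."""
    return cutover + timedelta(seconds=migration_ttl_seconds)


cutover = datetime(2025, 1, 15, 12, 0, 0)
print(latest_ttl_change(cutover, 3600))  # 2025-01-15 11:00:00
print(stale_window_end(cutover, 30))     # 2025-01-15 12:00:30
```

With a 1-hour TTL in place, the 30-second TTL must be live by 11:00 for a noon cutover; after the cutover, well-behaved caches stop serving the old IP within about 30 seconds.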
Failover Strategies
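# Health-checked DNS failover
from dataclasses import dataclass


@dataclass
class HealthCheck:
    """Probe settings; a stand-in for the DNS provider's health-check type."""
    protocol: str
    path: str
    interval: int   # seconds between probes
    threshold: int  # consecutive failures before the endpoint is marked unhealthy


class DNSFailover:
    """Automatically remove unhealthy endpoints from DNS."""

    def __init__(self, dns):
        # dns: provider client exposing create_record(); illustrative, not a specific SDK
        self.dns = dns

    def configure_failover(self):
        # Primary: US-East data center
        self.dns.create_record(
            name="app.example.com",
            type="A",
            value="203.0.113.10",
            routing_policy="failover",
            failover_role="primary",
            health_check=HealthCheck(
                protocol="HTTPS",
                path="/health",
                interval=10,  # Check every 10 seconds
                threshold=3,  # 3 consecutive failures = unhealthy
            ),
        )
        # Secondary: EU-West data center, served only while the primary is unhealthy
        self.dns.create_record(
            name="app.example.com",
            type="A",
            value="198.51.100.20",
            routing_policy="failover",
            failover_role="secondary",
            health_check=HealthCheck(
                protocol="HTTPS",
                path="/health",
                interval=10,
                threshold=3,
            ),
        )

    # GeoDNS: Route users to nearest data center
    def configure_geodns(self):
        # Some providers (Route 53, for example) return no answer for locations
        # that match no record, so a default record is usually needed as well.
        self.dns.create_record(
            name="app.example.com",
            type="A",
            value="203.0.113.10",
            routing_policy="geolocation",
            geo_location="North America",
        )
        self.dns.create_record(
            name="app.example.com",
            type="A",
            value="198.51.100.20",
            routing_policy="geolocation",
            geo_location="Europe",
        )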
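The failover configuration above leaves the decision logic to the DNS provider. As a rough illustration of what that logic amounts to, the sketch below tracks consecutive probe failures and picks which IP to answer with. `EndpointHealth`, `answer`, and `worst_case_failover_seconds` are hypothetical names, and real providers typically probe from several locations and add their own propagation delay on top.

```python
class EndpointHealth:
    """Tracks consecutive failed probes for one endpoint."""

    def __init__(self, ip: str, threshold: int = 3) -> None:
        self.ip = ip
        self.threshold = threshold
        self.consecutive_failures = 0

    def record_probe(self, ok: bool) -> None:
        # A single success resets the counter; failures accumulate.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < self.threshold


def answer(primary: EndpointHealth, secondary: EndpointHealth) -> str:
    """Serve the primary while it is healthy, otherwise the secondary."""
    if primary.healthy:
        return primary.ip
    # If both are unhealthy this still returns the secondary: some answer
    # is usually better than none.
    return secondary.ip


def worst_case_failover_seconds(interval: int, threshold: int, ttl: int) -> int:
    """Detection time plus the time cached answers keep pointing at the dead IP."""
    return interval * threshold + ttl


primary = EndpointHealth("203.0.113.10")
secondary = EndpointHealth("198.51.100.20")
for _ in range(3):
    primary.record_probe(ok=False)
print(answer(primary, secondary))              # 198.51.100.20
print(worst_case_failover_seconds(10, 3, 60))  # 90
```

With a 10-second interval, a threshold of 3, and a 60-second TTL, clients can keep hitting the failed primary for up to about 90 seconds, which is why failover records need short TTLs.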
DNSSEC
Without DNSSEC:
User → "What's app.example.com?" → Resolver → Answer: 203.0.113.10
Attacker forges the response → Answer: 192.0.2.66 (attacker's IP)
User unknowingly connects to attacker ← DNS spoofing
With DNSSEC:
User → "What's app.example.com?" → Resolver → Answer: 203.0.113.10
Resolver verifies: Does the signature (RRSIG) validate against the zone's key (DNSKEY), and does that key match the DS record in the parent zone?
If yes: ✓ Trust the answer
If no: ✗ Reject (potential spoofing)
Chain of trust: Root → .com → example.com
Each parent zone publishes a signed DS record (a hash of the child zone's key)
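The chain of trust can be sketched as a walk from the root down to the zone, checking each child's key against the digest its parent publishes. This is a deliberately simplified toy model: a real DS digest covers the owner name plus the full DNSKEY record data, and a validating resolver also verifies RRSIG signatures with public-key cryptography, both of which are omitted here. All names and key bytes are made up for illustration.

```python
import hashlib
from typing import Dict, List


def ds_digest(public_key: bytes) -> str:
    """Simplified DS record: SHA-256 of the child zone's key."""
    return hashlib.sha256(public_key).hexdigest()


def chain_is_valid(
    zone_keys: Dict[str, bytes],
    ds_in_parent: Dict[str, str],
    chain: List[str],
) -> bool:
    """Check every child's key against the digest published by its parent."""
    for zone in chain[1:]:  # chain[0] is the root, trusted as the anchor
        if ds_in_parent.get(zone) != ds_digest(zone_keys[zone]):
            return False
    return True


keys = {".": b"root-key", "com.": b"com-key", "example.com.": b"example-key"}
ds_records = {
    "com.": ds_digest(keys["com."]),                  # published in the root zone
    "example.com.": ds_digest(keys["example.com."]),  # published in the .com zone
}
chain = [".", "com.", "example.com."]
print(chain_is_valid(keys, ds_records, chain))  # True

keys["example.com."] = b"attacker-key"          # key swapped without updating DS
print(chain_is_valid(keys, ds_records, chain))  # False
```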
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Single DNS provider | Provider outage = total outage | Multi-provider DNS (Route53 + Cloudflare) |
| High TTL during migration | Users reach old IP for hours | Lower TTL to 30-60s at least one old-TTL period before migration |
| No health checks | DNS points to dead backend | Health-checked DNS with auto-failover |
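| No DNSSEC | Vulnerable to DNS spoofing | Sign the authoritative zone with DNSSEC and publish the DS record in the parent zone |
| CNAME at apex | Violates the DNS standard (a CNAME cannot coexist with the apex SOA, NS, or MX records), may break email | Provider-specific ALIAS/ANAME record at apex instead |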
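Two of these anti-patterns are easy to catch mechanically. The following is a minimal sketch of a zone linter that flags a CNAME at the apex and TTLs above a chosen ceiling. The record tuple layout, the `lint_zone` name, and the 3600-second default (borrowed from the upper end of the TTL guidelines) are assumptions for illustration, not a standard tool.

```python
from typing import List, Tuple

Record = Tuple[str, str, int]  # (name, type, ttl_seconds)


def lint_zone(apex: str, records: List[Record], max_ttl: int = 3600) -> List[str]:
    """Return human-readable warnings for a list of zone records."""
    problems = []
    for name, rtype, ttl in records:
        if rtype == "CNAME" and name == apex:
            problems.append(f"{name}: CNAME at apex")
        if ttl > max_ttl:
            problems.append(f"{name} {rtype}: TTL {ttl}s exceeds {max_ttl}s")
    return problems


records = [
    ("example.com", "CNAME", 300),     # flagged: CNAME at the zone apex
    ("app.example.com", "A", 86400),   # flagged: one-day TTL
    ("www.example.com", "CNAME", 300), # fine: CNAME on a subdomain
]
for problem in lint_zone("example.com", records):
    print(problem)
# example.com: CNAME at apex
# app.example.com A: TTL 86400s exceeds 3600s
```

A check like this could run in CI against exported zone data, so that a risky record is caught before it is published rather than during an incident.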
DNS is not glamorous, but it is foundational. When DNS is wrong, nothing works and nothing tells you why. Invest in DNS resilience the same way you invest in database resilience.