DNS Architecture and Resilience

DNS is the internet’s phone book — and its most overlooked single point of failure. When DNS goes down, nothing works: no web traffic, no API calls, no email, no service discovery. Yet most teams spend more time choosing a CSS framework than designing DNS resilience. DNS problems are invisible until they are catastrophic.

DNS Resolution Path

User types: app.example.com

1. Browser cache → Found? Return IP
2. OS cache → Found? Return IP
3. Recursive resolver (ISP/Google/Cloudflare) → Found? Return IP. On a miss, the resolver itself performs steps 4-6
4. Root nameserver → "Ask .com TLD"
5. .com TLD nameserver → "Ask ns1.example.com"
6. Authoritative nameserver for example.com → "app.example.com = 203.0.113.10"

Total time: 50-200ms (uncached) / <1ms (cached)
TTL: How long resolvers cache the answer
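
The steps above can be sketched as a toy recursive resolver. This is a minimal simulation, not a real DNS client: the `SERVERS` table, the `CachingResolver` class, and the referral format are all hypothetical stand-ins for the root, TLD, and authoritative servers, and no network traffic is involved.

```python
import time

# Hypothetical in-memory "internet": each server maps a name suffix to either
# a referral (the next server to ask) or a final answer with a TTL in seconds.
SERVERS = {
    "root": {"com": ("referral", "tld-com")},
    "tld-com": {"example.com": ("referral", "ns1.example.com")},
    "ns1.example.com": {"app.example.com": ("answer", "203.0.113.10", 300)},
}


class CachingResolver:
    """Toy recursive resolver: answer from cache, else walk root -> TLD -> authoritative."""

    def __init__(self, servers, clock=time.monotonic):
        self.servers = servers
        self.clock = clock
        self.cache = {}  # name -> (ip, expires_at)
        self.upstream_queries = 0

    def resolve(self, name):
        cached = self.cache.get(name)
        if cached and cached[1] > self.clock():
            return cached[0]  # cache hit: no upstream traffic
        server = "root"
        while True:
            self.upstream_queries += 1
            record = self._lookup(server, name)
            if record[0] == "referral":
                server = record[1]  # follow the referral one level down
            else:
                _, ip, ttl = record
                self.cache[name] = (ip, self.clock() + ttl)
                return ip

    def _lookup(self, server, name):
        # Match the longest suffix of `name` that this server knows about.
        labels = name.split(".")
        for i in range(len(labels)):
            suffix = ".".join(labels[i:])
            if suffix in self.servers[server]:
                return self.servers[server][suffix]
        raise LookupError(f"{server} has no data for {name}")
```

The first `resolve("app.example.com")` costs three upstream queries (root, TLD, authoritative); repeat calls are served from the cache until the TTL expires, which is why cached lookups are so much faster.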

Record Types

A Record:     app.example.com          → 203.0.113.10        (IPv4)
AAAA Record:  app.example.com          → 2001:db8::1         (IPv6)
CNAME Record: www.example.com          → app.example.com     (Alias)
MX Record:    example.com              → mail.example.com    (Email)
TXT Record:   example.com              → "v=spf1..."         (Verification)
SRV Record:   _https._tcp.example.com  → app.example.com:443 (Service)
NS Record:    example.com              → ns1.cloudflare.com  (Nameserver)
CAA Record:   example.com              → letsencrypt.org     (CA Authorization)
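
To illustrate how an alias differs from an address record, here is a minimal sketch of a lookup that follows CNAME records until it reaches the requested type. The `ZONE` dictionary and the `lookup` helper are hypothetical and only mirror a few rows of the list above; real resolvers handle many more cases.

```python
# Hypothetical zone data mirroring the list above: (name, type) -> value.
ZONE = {
    ("app.example.com", "A"): "203.0.113.10",
    ("app.example.com", "AAAA"): "2001:db8::1",
    ("www.example.com", "CNAME"): "app.example.com",
}


def lookup(zone, name, rtype, max_hops=8):
    """Return the value for (name, rtype), following CNAME aliases."""
    for _ in range(max_hops):
        if (name, rtype) in zone:
            return zone[(name, rtype)]
        alias = zone.get((name, "CNAME"))
        if alias is None:
            return None  # no such record
        name = alias  # restart the lookup at the alias target
    raise RuntimeError("CNAME chain too long (possible loop)")
```

Here `lookup(ZONE, "www.example.com", "A")` follows the alias to `app.example.com` and returns its IPv4 address. The hop limit guards against alias loops.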

TTL Guidelines:
  Static IPs (databases, internal):     3600s (1 hour)
  CDN/Load Balancer endpoints:           300s (5 minutes)
  Failover records:                       60s (1 minute)
  During migration:                       30s (30 seconds)
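
The migration guideline implies a timing constraint: resolvers may keep the old answer for up to the old TTL, so the TTL has to be lowered at least that long before the IP changes. A minimal sketch of that arithmetic follows; `migration_schedule` is a hypothetical helper, and it assumes resolvers honor TTLs, which not all of them do.

```python
from datetime import datetime, timedelta


def migration_schedule(cutover, old_ttl, migration_ttl=30):
    """Return when to lower the TTL and when old answers should have expired.

    Resolvers may hold the old record for up to `old_ttl` seconds, so the
    TTL must be lowered at least that long before the IP changes.
    """
    return {
        "lower_ttl_by": cutover - timedelta(seconds=old_ttl),
        "cutover": cutover,
        # After cutover, caches filled with the low TTL drain within migration_ttl.
        "old_ip_drained_by": cutover + timedelta(seconds=migration_ttl),
    }
```

For a record that currently has a 3600s TTL and a cutover planned for 12:00, this gives 11:00 as the latest time to lower the TTL.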

Failover Strategies

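# Health-checked DNS failover
# Sketch only: `dns` is assumed to be a provider client exposing create_record();
# the parameter names are illustrative, not a specific SDK's API.
from dataclasses import dataclass


@dataclass
class HealthCheck:
    """Probe settings used to decide whether an endpoint is healthy."""
    protocol: str
    path: str
    interval: int   # Seconds between probes
    threshold: int  # Consecutive failures before marking unhealthy


class DNSFailover:
    """Automatically remove unhealthy endpoints from DNS."""

    def __init__(self, dns):
        self.dns = dns  # Provider-specific DNS client (assumed)

    def configure_failover(self):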
        # Primary: US-East data center
        self.dns.create_record(
            name="app.example.com",
            type="A",
            value="203.0.113.10",
            routing_policy="failover",
            failover_role="primary",
            health_check=HealthCheck(
                protocol="HTTPS",
                path="/health",
                interval=10,  # Check every 10 seconds
                threshold=3,  # 3 failures = unhealthy
            ),
        )
        
        # Secondary: EU-West data center
        self.dns.create_record(
            name="app.example.com",
            type="A",
            value="198.51.100.20",
            routing_policy="failover",
            failover_role="secondary",
            health_check=HealthCheck(
                protocol="HTTPS",
                path="/health",
                interval=10,
                threshold=3,
            ),
        )
    
    # GeoDNS: Route users to nearest data center
    def configure_geodns(self):
        self.dns.create_record(
            name="app.example.com",
            type="A",
            value="203.0.113.10",
            routing_policy="geolocation",
            geo_location="North America",
        )
        self.dns.create_record(
            name="app.example.com",
            type="A",
            value="198.51.100.20",
            routing_policy="geolocation",
            geo_location="Europe",
        )
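
How quickly failover takes effect depends on both the health check and the record TTL. The sketch below models the `interval` and `threshold` settings from the configuration above; `EndpointHealth`, `pick_answer`, and `worst_case_failover_seconds` are hypothetical names, and serving the primary when both endpoints look unhealthy is an assumed design choice, not documented provider behavior.

```python
class EndpointHealth:
    """Mark an endpoint unhealthy after `threshold` consecutive failed probes."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record_probe(self, ok):
        # A single successful probe resets the failure streak.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1

    @property
    def healthy(self):
        return self.consecutive_failures < self.threshold


def pick_answer(primary_ip, primary, secondary_ip, secondary):
    """Failover routing: serve the primary unless only the secondary is healthy."""
    if primary.healthy or not secondary.healthy:
        return primary_ip
    return secondary_ip


def worst_case_failover_seconds(interval, threshold, ttl):
    """Detection time (interval * threshold) plus time for cached answers to expire."""
    return interval * threshold + ttl
```

With a 10-second interval, a threshold of 3, and a 60-second TTL, clients can keep reaching a dead primary for up to 10 * 3 + 60 = 90 seconds.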

DNSSEC

Without DNSSEC:
  User → "What's app.example.com?" → Resolver → Answer: 203.0.113.10
  Attacker injects a forged answer: 192.0.2.66 (attacker's IP)
  User unknowingly connects to attacker  ← DNS spoofing
  
With DNSSEC:
  User → "What's app.example.com?" → Resolver → Answer: 203.0.113.10
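  Resolver verifies: Does the RRSIG signature validate against the zone's DNSKEY, and does that key match the DS record in the parent zone?
  If yes: ✓ Trust the answer
  If no: ✗ Reject with SERVFAIL (potential spoofing)
  
  Chain of trust: Root → .com → example.com
  Each parent zone signs a DS record (a digest of the child zone's key) for the level below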
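
The chain of trust can be illustrated with a deliberately simplified model. Real DNSSEC verifies RRSIG signatures with public-key cryptography and computes DS digests over the owner name plus the DNSKEY data; the sketch below models only the parent-to-child digest links, and `ds_digest`, `validate_chain`, and the byte-string keys are hypothetical.

```python
import hashlib


def ds_digest(dnskey: bytes) -> str:
    """Simplified DS record: a SHA-256 digest of the child zone's key."""
    return hashlib.sha256(dnskey).hexdigest()


def validate_chain(trust_anchor, chain):
    """Walk root -> TLD -> zone, checking each key against the digest its parent published.

    `chain` is a list of (zone_name, dnskey, ds_for_child) tuples ordered from
    the root down; the last entry's ds_for_child is None.
    Returns (True, None) on success or (False, failing_zone) on the first mismatch.
    """
    expected = trust_anchor
    for zone, dnskey, ds_for_child in chain:
        if ds_digest(dnskey) != expected:
            return False, zone  # key does not match what the parent vouched for
        expected = ds_for_child
    return True, None
```

A validating resolver starts from a trust anchor for the root it already holds. If an attacker substitutes a different key for `example.com`, the digest published in `.com` no longer matches and validation fails at that zone.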

Anti-Patterns

Anti-Pattern	Consequence	Fix
Single DNS provider	Provider outage = total outage	Multi-provider DNS (Route53 + Cloudflare)
High TTL during migration	Users reach old IP for hours	Lower TTL to 30-60s at least one old-TTL period before migration
No health checks	DNS points to dead backend	Health-checked DNS with auto-failover
No DNSSEC	Vulnerable to DNS spoofing	Enable DNSSEC on authoritative zone
CNAME at apex	Conflicts with the apex SOA/NS/MX records; can break email	ALIAS/ANAME record (provider-specific) or A/AAAA records at apex
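
Several of these anti-patterns can be caught mechanically before a change ships. The following is a minimal sketch of such a check; the `lint_zone` function, the record dictionary shape, and the 60-second migration threshold are assumptions made for illustration, not any provider's format.

```python
def lint_zone(apex, records, providers, migrating=False):
    """Return a list of warnings for the anti-patterns in the table above.

    `records` is a list of dicts with keys: name, type, ttl, and optionally
    health_check (bool). `providers` is the set of DNS providers serving the zone.
    """
    warnings = []
    if len(set(providers)) < 2:
        warnings.append("single DNS provider")
    for r in records:
        if r["type"] == "CNAME" and r["name"] == apex:
            warnings.append("CNAME at apex")
        if migrating and r["ttl"] > 60:
            warnings.append(f"high TTL during migration: {r['name']} ({r['ttl']}s)")
        if r["type"] in ("A", "AAAA") and not r.get("health_check", False):
            warnings.append(f"no health check: {r['name']}")
    return warnings
```

Run against exported zone data in CI, a check like this flags risky records before they reach production. DNSSEC status is left out because it is a zone-level setting rather than a per-record one.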

DNS is not glamorous, but it is foundational. When DNS is wrong, nothing works and nothing tells you why. Invest in DNS resilience the same way you invest in database resilience.
